Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0
322 stars 33 forks

Dreamerv3 in easy environments #229

Closed miniwa closed 7 months ago

miniwa commented 8 months ago

Hi,

Using your implementation of PPO, I can train a policy on the gym CartPole-v1 environment to consistently get 500 (the maximum possible) reward in about a minute, on my CPU without any GPU acceleration.

I tried the same environment with the XS DreamerV3 model, using sheeprl exp=dreamer_v3 algo=dreamer_v3_XS env=gym env.id=CartPole-v1 fabric.accelerator=gpu on an NVIDIA L4. After 1 hour of training and 140,000 steps, the model averages around 144 reward in training.

Is the algorithm simply not suited for easy tasks, or is there a configuration issue going on? If this is a config issue, which variables make the biggest impact? My algo.mlp_keys.encoder configuration is empty; is this a problem?

Thank you for your hard work so far. I'm excited to see where this project goes, as your results are already impressive.

belerico commented 8 months ago

Hi @miniwa, sorry for the late response! Sadly, we've never run Dreamer-V3 on CartPole, since we've always tested it on more difficult environments.

The fact that dreamer-v3 is not learning in CartPole is quite surprising and we can definitely investigate.

An empty algo.encoder.cnn_keys means that the algorithm does not use pixel-based observations. You could try using them by specifying the name of the pixel observations to use: in the specific case of CartPole you can decide how to name them (rgb, for example), since they are not returned by default by the environment but are instead generated through the render function.

Regarding the parameters that matter most, those are algo.learning_starts, algo.per_rank_batch_size, and algo.per_rank_sequence_length. Can you try lowering algo.per_rank_sequence_length to 8 or 16 and/or algo.learning_starts to 128/256?
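
For reference, something like your original command with those values lowered (just a sketch to start from, not a tuned configuration):

python sheeprl.py exp=dreamer_v3 \
env=gym env.id=CartPole-v1 \
fabric.accelerator=gpu \
algo=dreamer_v3_XS \
algo.learning_starts=256 \
algo.per_rank_sequence_length=16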

jmribeiro commented 8 months ago

Hi @belerico !

I was also about to open an issue regarding Dreamer on feature-vector-based (partially observable) environments where no CNN is needed (and, as a matter of fact, also on gridworlds where the observation space consists of matrices that are not RGB images, for example with 1s where the agents are and 0s otherwise).

What would need to be adapted in the code? I would be willing to help you guys!

belerico commented 8 months ago

Hi @jmribeiro, for observation spaces that are matrices but not images, the first thing that comes to my mind is to flatten them and use any algorithm that accepts vector-based observations; a wrapper is needed for this (see the sketch below). Do you have any particular request? What is your use case?
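
A minimal sketch of such a wrapper, assuming a Gymnasium-style environment with a Box observation space (this is just an illustration of the idea, not code from sheeprl):

import numpy as np
import gymnasium as gym
from gymnasium import spaces


class FlattenMatrixObs(gym.ObservationWrapper):
    """Flatten a matrix (e.g. grid) observation into a 1D vector."""

    def __init__(self, env: gym.Env):
        super().__init__(env)
        # Flatten the bounds of the original Box space to build the new 1D space
        low = np.asarray(env.observation_space.low, dtype=np.float32).reshape(-1)
        high = np.asarray(env.observation_space.high, dtype=np.float32).reshape(-1)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        # Called on every observation returned by the wrapped environment
        return np.asarray(obs, dtype=np.float32).reshape(-1)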

jmribeiro commented 8 months ago

Hi @belerico, the goal was not to learn the environment with any other algorithm, but with DreamerV3.

The Conv2D layers are not a part of Dreamer itself, right? They are used to extract a feature-vector from the images. For environments which are not RGB, such as CartPole, the Conv2D layers could be dropped altogether, passing the obs vector directly to the MLP. These observations would still be reconstructed and the algorithm would stay the same.

The same goes for environments with different "image" shapes --- the Conv2D could be adapted to handle new input shape/kernel sizes/num kernels.

Do you think this is easy to do with the codebase?

belerico commented 8 months ago

Hi @jmribeiro,

The Conv2D layers are not a part of Dreamer itself, right? They are used to extract a feature-vector from the images. For environments which are not RGB, such as CartPole, the Conv2D layers could be dropped altogether, passing the obs vector directly to the MLP. These observations would still be reconstructed and the algorithm would stay the same.

Dreamer-V3 accepts both image and vector observations, with the user deciding which one to use by setting the algo.encoder.cnn_keys, algo.encoder.mlp_keys, algo.decoder.cnn_keys and algo.decoder.mlp_keys accordingly (more information can be found in the corresponding how-tos, 1 and 2). So for CartPole, which by default returns vector-based obs, you could run an experiment with:

python sheeprl.py exp=dreamer_v3 env=gym env.id=CartPole-v1 fabric.accelerator=gpu fabric.precision=bf16-mixed algo=dreamer_v3_S algo.cnn_keys.encoder=\[\] algo.mlp_keys.encoder=\["vector"\] algo.cnn_keys.decoder=\[\] algo.mlp_keys.decoder=\["vector"\]

The same goes for environments with different "image" shapes --- the Conv2D could be adapted to handle new input shape/kernel sizes/num kernels.

This is different, because right now the Conv2D layers are only used for image observations of a predefined shape: we accept 2D or 3D images and by default treat observations with those shapes as images, while 1D observations are treated as vectors; if we have something larger than 3D we don't do anything right now, as can be seen here, which is where we transform a Box-based env into a Dict-based env.
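
A rough sketch of that dispatch, with hypothetical names (not the actual sheeprl code):

def obs_kind_for_shape(shape):
    # Decide how a Box observation of the given shape is treated:
    # 1D -> vector (MLP) observation, 2D/3D -> image (CNN) observation,
    # anything larger than 3D is currently not handled.
    ndim = len(shape)
    if ndim == 1:
        return "mlp"
    if ndim in (2, 3):
        return "cnn"
    return None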

Do you think this is easy to do with the codebase?

It can be done, but we have to think about a couple of things:

- Which kind of observations do we want to support?
- How can we specify that some 2D/3D observations have to be treated not as images but as vector-based observations?
- What happens to observations larger than 3D?
- Which kind of models should be employed to handle those observations?

jmribeiro commented 8 months ago

Hi @belerico

Here it goes:

Which kind of observations do we want to support?

The observations are custom-made for an environment called "Level-Based Foraging".

An agent has a 5x5 field-of-view window centered around itself. Each channel contains specific information regarding objects in its surroundings. Shape: 5 channels x 5 width x 5 height. Example for agent #0: [image]

How can we specify that some 2D/3D observations have to be treated not as images but as vector-based observations?

I believe the only issues I could find in the code are hard-coded assumptions, such as treating all image observations as RGB arrays (and sometimes dividing by 255), and fixed padding/strides in some classes.

Right now I'm learning this environment with a DQN as follows (with an extra channel due to an extra teammate in the environment).

[image]

What happens to observations larger than 3D?

This should not happen, at least in my use case.

Which kind of models should be employed to handle those observations?

Conv2D are enough.
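
For concreteness, the observation space described above could be declared roughly like this (a guess at the bounds; the exact per-channel contents depend on the environment):

import numpy as np
from gymnasium import spaces

# 5x5 field of view; 5 channels, plus one extra channel for the extra teammate
obs_space = spaces.Box(low=0.0, high=np.inf, shape=(6, 5, 5), dtype=np.float32)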

belerico commented 8 months ago

Hi @jmribeiro, sorry for the late response! Let's move your new request to a different issue so that we can follow up from there!

ajlangley commented 8 months ago

Has there been any update on the original issue in this thread? I trained DreamerV3 XS and it gets up to 500 reward briefly on CartPole-v1 after about 25K steps, but then becomes unstable and the reward declines to around 200, where the algorithm then converges.

Here is the command I ran:

sheeprl exp=dreamer_v3 env=gym env.id=CartPole-v1 algo.actor.optimizer.lr=0.00001 algo.critic.optimizer.lr=0.00001 algo=dreamer_v3_XS algo.learning_starts=256 algo.cnn_keys.encoder=[] algo.mlp_keys.encoder=["observation"] algo.cnn_keys.decoder=[] env.num_envs=1 metric.log_every=1000 algo.per_rank_sequence_length=16

belerico commented 8 months ago

Hi @ajlangley, the update is the following: we have spotted a bug in our code regarding how we save the done flag in the replay buffer.

What we do is the following: we compute the done flag as the logical OR of the terminated flag (i.e. the MDP has reached a final state) and the truncated flag (i.e. the env is stopped for some other reason outside the MDP definition, e.g. a maximum number of timesteps has been reached).

Given this done flag we then reset the environments accordingly and update what we had already saved in the buffer.

Dreamer-V3 learns the continuation flag from what we have saved in the buffer under the dones key, computing continue_targets = 1 - data["dones"], but this prevents the agent from bootstrapping correctly when, for example, the terminated flag is False while the truncated flag is True.

If we consider the CartPole example above, we want the agent to bootstrap when it reaches a 500 reward and the MDP has not reached a final state, i.e. terminated=False and truncated=True.
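
A toy sketch of the difference (not the exact sheeprl code), with per-step flags stored as floats:

import torch

# one step per row: running, truncated by the time limit, truly terminated
terminated = torch.tensor([[0.0], [0.0], [1.0]])
truncated = torch.tensor([[0.0], [1.0], [0.0]])

# buggy: done = terminated OR truncated, so a time-limit truncation is treated
# as a real terminal state and the value function cannot bootstrap there
dones = torch.maximum(terminated, truncated)
buggy_continue_targets = 1 - dones      # [[1.], [0.], [0.]]

# fixed: the continuation target depends only on real terminations
continue_targets = 1 - terminated       # [[1.], [1.], [0.]]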

We are now fixing this, but if you want to try it out I'll prepare a branch with those modifications. I'm trying it right now and it's learning quite well.

cc @miniwa

belerico commented 8 months ago

I'm running an experiment with the following:

python sheeprl.py exp=dreamer_v3 \
env=gym env.id=CartPole-v1 \
env.num_envs=4 \
fabric.accelerator=gpu \
fabric.precision=bf16-mixed \
algo=dreamer_v3_S \
algo.learning_starts=1024 \
algo.cnn_keys.encoder=\[\] \
algo.mlp_keys.encoder=\["vector"\] \
algo.cnn_keys.decoder=\[\] \
algo.mlp_keys.decoder=\["vector"\] \
algo.per_rank_sequence_length=64 \
algo.train_every=1 \
algo.per_rank_gradient_steps=2

The algo.train_every=1, env.num_envs=4 and algo.per_rank_gradient_steps=2 settings give you a replay ratio of 0.5, i.e. 1 gradient step every 2 policy steps, as specified here.
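
For what it's worth, the arithmetic behind that number, assuming the replay ratio is simply gradient steps divided by policy steps:

num_envs = 4
train_every = 1
per_rank_gradient_steps = 2

policy_steps_per_update = num_envs * train_every                  # 4 policy steps per training call
replay_ratio = per_rank_gradient_steps / policy_steps_per_update  # 2 / 4 = 0.5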

cc @miniwa @ajlangley

ajlangley commented 8 months ago

Thanks @belerico! That totally makes sense. I'll try it out over the weekend if I can.

As a side note, I had actually never considered using the terminated vs truncated flags this way. :)

belerico commented 7 months ago

Hi @ajlangley @miniwa, from this branch (but the main branch should work equally well) I can train a simple agent on CartPole-v1:

python sheeprl.py exp=dreamer_v3 \                   
env=gym env.id=CartPole-v1 \
env.num_envs=4 \
fabric.accelerator=gpu \
fabric.precision=16-mixed \
algo=dreamer_v3_S \
algo.learning_starts=1024 \
algo.cnn_keys.encoder=\[\] \
algo.mlp_keys.encoder=\["vector"\] \
algo.cnn_keys.decoder=\[\] \
algo.mlp_keys.decoder=\["vector"\] \
algo.per_rank_sequence_length=64 \
algo.replay_ratio=0.5 \
algo.world_model.decoupled_rssm=False \
algo.world_model.learnable_initial_recurrent_state=False