Closed: @miniwa closed this issue 7 months ago.
Hi @miniwa, sorry for the late response! Sadly, we've never tried to run dreamer-v3 on CartPole, as we've always run it on more difficult environments.
The fact that dreamer-v3 is not learning in CartPole is quite surprising and we can definitely investigate.
An empty `algo.encoder.cnn_keys` means that the algorithm does not use pixel-based observations. You could try to use them by specifying the names of the pixel observations to use: in the specific case of CartPole you can decide how to name them (`rgb`, for example), since they are not returned by default by the environment but are instead generated through the `render` function.
Regarding the parameters that matter most, those are `algo.learning_starts`, `algo.per_rank_batch_size` and `algo.per_rank_sequence_length`. Can you try lowering `algo.per_rank_sequence_length` to 8 or 16 and/or `algo.learning_starts` to 128/256?
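For example, starting from the command in the original report, both suggestions could be applied with something like the following (a sketch only; every other option keeps its default):

```
sheeprl exp=dreamer_v3 algo=dreamer_v3_XS env=gym env.id=CartPole-v1 fabric.accelerator=gpu algo.learning_starts=256 algo.per_rank_sequence_length=16
```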
Hi @belerico !
I was also about to open an issue regarding Dreamer on feature-vector-based (partially observable) environments where no CNN is needed (and, as a matter of fact, to also handle gridworlds where the observation space consists of matrices that are not RGB, for example with 1s where the agents are and 0s otherwise).
What would be necessary to adapt from the code? I would be willing to help you guys!
Hi @jmribeiro, for observation spaces that are matrices but not images, the first thing that comes to mind is to flatten them and use any algorithm that accepts vector-based observations. In this case a wrapper is needed. Do you have any particular request? What is your use case?
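A minimal sketch of such a wrapper, assuming a gymnasium environment (this is not sheeprl code, just an illustration of the flattening idea):

```python
import gymnasium as gym
import numpy as np


class FlattenMatrixObs(gym.ObservationWrapper):
    """Flatten a matrix-shaped Box observation into a 1D vector so that
    any vector-based algorithm can consume it."""

    def __init__(self, env: gym.Env):
        super().__init__(env)
        # The flattened space is a 1D Box with the same bounds
        self.observation_space = gym.spaces.flatten_space(env.observation_space)

    def observation(self, obs):
        return gym.spaces.flatten(self.env.observation_space, obs).astype(np.float32)
```

Note that gymnasium already ships `gym.wrappers.FlattenObservation`, which does essentially the same thing.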
Hi @belerico,
The goal was not to learn an environment with any other algorithm, but with DreamerV3.
The Conv2D layers are not a part of Dreamer itself, right? They are used to extract a feature-vector from the images. For environments which are not RGB, such as CartPole, the Conv2D layers could be dropped altogether, passing the obs vector directly to the MLP. These observations would still be reconstructed and the algorithm would stay the same.
The same goes for environments with different "image" shapes --- the Conv2D could be adapted to handle new input shape/kernel sizes/num kernels.
Do you think this is easy to do with the codebase?
Hi @jmribeiro,
> The Conv2D layers are not a part of Dreamer itself, right? They are used to extract a feature-vector from the images. For environments which are not RGB, such as CartPole, the Conv2D layers could be dropped altogether, passing the obs vector directly to the MLP. These observations would still be reconstructed and the algorithm would stay the same.
Dreamer-V3 accepts both image and vector observations, with the user deciding which ones to use by setting `algo.encoder.cnn_keys`, `algo.encoder.mlp_keys`, `algo.decoder.cnn_keys` and `algo.decoder.mlp_keys` accordingly (more information can be found in the corresponding how-tos, 1 and 2). So for CartPole, which by default gives vector-based obs, you could run an experiment with:
```
python sheeprl.py exp=dreamer_v3 env=gym env.id=CartPole-v1 fabric.accelerator=gpu fabric.precision=bf16-mixed algo=dreamer_v3_S algo.cnn_keys.encoder=\[\] algo.mlp_keys.encoder=\["vector"\] algo.cnn_keys.decoder=\[\] algo.mlp_keys.decoder=\["vector"\]
```
> The same goes for environments with different "image" shapes --- the Conv2D could be adapted to handle new input shape/kernel sizes/num kernels.
This is different, because right now the Conv2D layers are only used for image observations of a predefined shape, i.e. we accept 2D or 3D images and by default we treat observations with those shapes as images, while 1D observations are treated as vectors; if we have something larger than 3D we don't do anything right now, as can be seen here, which is where we transform a Box-based env into a Dict-based env.
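In pseudocode, the shape-based dispatch described above looks roughly like this (an illustrative sketch, not the actual sheeprl code):

```python
import numpy as np

def classify_observation(obs: np.ndarray) -> str:
    """Route an observation by its number of dimensions, as described above."""
    if obs.ndim == 1:
        return "vector (MLP encoder/decoder)"
    if obs.ndim in (2, 3):
        return "image (CNN encoder/decoder)"
    return "unsupported (> 3D observations are currently not handled)"

print(classify_observation(np.zeros(4)))           # CartPole-style vector
print(classify_observation(np.zeros((5, 5, 5))))   # gridworld-style 3D observation
```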
> Do you think this is easy to do with the codebase?
It can be done, but we have to think about a couple of things:
- Which kind of observations do we want to support?
- How can we specify that some 2D/3D observations have to be treated not as images but as vector-based observations?
- What happens to observations larger than 3D?
- Which kind of models should be employed to handle those observations?
Hi @belerico
Here it goes:
> Which kind of observations do we want to support?
The observations are custom-made for an environment called "Level-Based Foraging".
An agent has a field-of-view window of size 5x5 centered on itself. Each channel contains specific information regarding objects in its surroundings. Shape: 5 channels x 5 width x 5 height. Example for agent #0:
> How can we specify that some 2D/3D observations have to be treated not as images but as vector-based observations?
I believe the only issues in the code I could find are hard-coded properties, such as assuming all observations are RGB arrays (and sometimes dividing by 255) and fixed padding/strides in some classes.
Right now I'm learning this environment with a DQN as follows (with an extra channel due to an extra teammate in the environment).
> What happens to observations larger than 3D?
This should not happen, at least in my use case.
> Which kind of models should be employed to handle those observations?
Conv2D are enough.
Hi @jmribeiro, sorry for the late response! Let us move your new request to a different issue so that we can follow up from there!
Has there been any update on the original issue in this thread? I trained DreamerV3 XS and it gets up to 500 reward briefly on CartPole-v1 after about 25K steps, but then becomes unstable and the reward declines to around 200, where the algorithm then converges.
Here is the command I ran:
```
sheeprl exp=dreamer_v3 env=gym env.id=CartPole-v1 algo.actor.optimizer.lr=0.00001 algo.critic.optimizer.lr=0.00001 algo=dreamer_v3_XS algo.learning_starts=256 algo.cnn_keys.encoder=[] algo.mlp_keys.encoder=["observation"] algo.cnn_keys.decoder=[] env.num_envs=1 metric.log_every=1000 algo.per_rank_sequence_length=16
```
Hi @ajlangley, the update is the following: we have spotted a bug in our code regarding how we save the `done` flag in the replay buffer.
What we do is the following: we compute the `done` flag as the logical OR between the `terminated` flag (i.e. the MDP reaches a final state) and the `truncated` flag (i.e. the env is stopped for whatever reason outside the MDP definition, e.g. a maximum number of timesteps has been performed).
Given this done flag we then reset the environments accordingly and update what we had already saved in the buffer.
Dreamer-V3 learns the continuation flag from what we have saved in the buffer under the `dones` key by computing `continue_targets = 1 - data["dones"]`, but this prevents the agent from bootstrapping correctly when, for example, the terminated flag is False while the truncated flag is True.
If we consider the CartPole example above, we want the agent to bootstrap when it reaches a 500 reward and the MDP has not reached a final state, i.e. `terminated=False` and `truncated=True`.
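A small illustrative sketch of the difference (not the actual sheeprl code; the array names are made up for the example):

```python
import numpy as np

terminated = np.array([0, 0, 1, 0], dtype=np.float32)  # the MDP reached a terminal state
truncated = np.array([0, 1, 0, 0], dtype=np.float32)   # the env was stopped, e.g. by a time limit

# Buggy behaviour: truncation is folded into the done flag, so a truncated
# step is treated as a real episode end and the agent does not bootstrap.
dones = np.clip(terminated + truncated, 0.0, 1.0)
buggy_continue_targets = 1 - dones        # [1, 0, 0, 1]

# Intended behaviour: only true terminations zero the continue target,
# so on truncation the agent still bootstraps from its value estimate.
fixed_continue_targets = 1 - terminated   # [1, 1, 0, 1]
```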
We are now fixing this, but if you want to try it out I'll prepare a branch with those modifications. I'm trying it right now and it's learning quite well.
cc @miniwa
I'm running an experiment with the following:
```
python sheeprl.py exp=dreamer_v3 \
env=gym env.id=CartPole-v1 \
env.num_envs=4 \
fabric.accelerator=gpu \
fabric.precision=bf16-mixed \
algo=dreamer_v3_S \
algo.learning_starts=1024 \
algo.cnn_keys.encoder=\[\] \
algo.mlp_keys.encoder=\["vector"\] \
algo.cnn_keys.decoder=\[\] \
algo.mlp_keys.decoder=\["vector"\] \
algo.per_rank_sequence_length=64 \
algo.train_every=1 \
algo.per_rank_gradient_steps=2
```
The `algo.train_every=1`, `env.num_envs=4` and `algo.per_rank_gradient_steps=2` settings let you have a replay ratio of 0.5, i.e. 1 gradient step every 2 policy steps, as specified here.
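The arithmetic, assuming the replay ratio is defined as gradient steps per policy step (my reading of the settings above):

```python
per_rank_gradient_steps = 2
train_every = 1   # assumed: training happens once per environment step
num_envs = 4      # each environment step collects one transition per env

policy_steps_per_training = train_every * num_envs                 # 4 policy steps
replay_ratio = per_rank_gradient_steps / policy_steps_per_training
print(replay_ratio)  # 0.5 -> one gradient step every two policy steps
```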
cc @miniwa @ajlangley
Thanks @belerico! That totally makes sense. I'll try it out over the weekend if I can.
As a side note, I had actually never considered using the terminated vs truncated flags this way. :)
Hi @ajlangley @miniwa, from this branch (but the main branch should work just as well) I can train a simple agent on CartPole-v1:
```
python sheeprl.py exp=dreamer_v3 \
env=gym env.id=CartPole-v1 \
env.num_envs=4 \
fabric.accelerator=gpu \
fabric.precision=16-mixed \
algo=dreamer_v3_S \
algo.learning_starts=1024 \
algo.cnn_keys.encoder=\[\] \
algo.mlp_keys.encoder=\["vector"\] \
algo.cnn_keys.decoder=\[\] \
algo.mlp_keys.decoder=\["vector"\] \
algo.per_rank_sequence_length=64 \
algo.replay_ratio=0.5 \
algo.world_model.decoupled_rssm=False \
algo.world_model.learnable_initial_recurrent_state=False
```
Hi,
Using your implementation of PPO, I can train a policy on the gym CartPole-v1 environment to consistently get 500 (the maximum possible) reward in about a minute, on my CPU without any GPU acceleration.
I tried the same environment with the XS DreamerV3 model, using
```
sheeprl exp=dreamer_v3 algo=dreamer_v3_XS env=gym env.id=CartPole-v1 fabric.accelerator=gpu
```
on an Nvidia L4. After 1 hour of training and 140,000 steps, the model averages around 144 reward in training. Is the algorithm simply not suited for easy tasks, or is there a configuration issue going on? If this is a config issue, which variables make the biggest impact? My `algo.mlp_keys.encoder` configuration is empty; is this a problem?
Thank you for your hard work so far. I'm excited to see where this project goes in the future, seeing as your results so far are impressive.