Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

Issues with running mujoco walker 2d #314

Open ruiiu opened 2 months ago

ruiiu commented 2 months ago

Hi, when I run the command python sheeprl.py exp=dreamer_v3 env=mujoco env.id=Walker2d-v4 algo.cnn_keys.encoder=[rgb], this is the output:

Rank-0: policy_step=788, reward_env_2=0.31653469800949097
Rank-0: policy_step=828, reward_env_2=-1.8613801002502441
Rank-0: policy_step=828, reward_env_3=-0.6631340384483337
Rank-0: policy_step=840, reward_env_0=-3.4890027046203613
Rank-0: policy_step=876, reward_env_3=-4.6154303550720215
Rank-0: policy_step=880, reward_env_1=10.097464561462402
Rank-0: policy_step=888, reward_env_2=-6.006372928619385
Rank-0: policy_step=916, reward_env_0=2.8062071800231934
Rank-0: policy_step=928, reward_env_3=2.518906831741333
Rank-0: policy_step=944, reward_env_1=0.48591500520706177
Rank-0: policy_step=952, reward_env_2=0.014924541115760803
Rank-0: policy_step=964, reward_env_0=2.63313364982605
Rank-0: policy_step=996, reward_env_1=1.226755142211914
Rank-0: policy_step=1020, reward_env_2=1.3471245765686035
Rank-0: policy_step=1024, reward_env_0=-1.6578210592269897
Rank-0: policy_step=1024, reward_env_3=-6.501708507537842

It gets stuck at policy_step=1024, although there are no errors. No videos are saved and no checkpoint is produced either. I didn't change anything; I just cloned the repo and ran it.

Btw, what is the difference between the policy step and the environment step in the original Dreamer V3 paper? How do I convert between them?

Thanks very much.

michele-milesi commented 2 months ago

Hi @ruiiu, by default the selected accelerator is the CPU. If you did not change it and you have a GPU on which to train your agent, I suggest running the following command:

python sheeprl.py exp=dreamer_v3 env=mujoco env.id=Walker2d-v4 algo.cnn_keys.encoder=[rgb] fabric.accelerator=cuda

The difference between policy and environment steps is the following: the policy steps count the number of times the actor selects actions while interacting with the environments. At each iteration, the policy steps are incremented by num_envs * world_size, where num_envs is the number of environments and world_size is the number of devices used for training (you can set the latter with the fabric.devices=<world_size> parameter). The environment steps, instead, are the steps actually performed by the environments; they can differ from the policy steps because of settings such as action_repeat. The action_repeat parameter specifies that every time the actor selects an action, that action is repeated action_repeat times in the environment. For example, suppose you are training with a single device, a single environment, and action_repeat=2: after 500 policy steps, the number of environment steps will be 1000.
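For reference, here is a minimal Python sketch of that relationship (the helper names are illustrative, not part of sheeprl's API):

# Illustrative helpers, not part of sheeprl's API.

def policy_steps_per_iteration(num_envs: int, world_size: int) -> int:
    # At each iteration the actor selects one action per environment on every device.
    return num_envs * world_size

def env_steps_from_policy_steps(policy_steps: int, action_repeat: int) -> int:
    # Every selected action is repeated action_repeat times in the environment.
    return policy_steps * action_repeat

# Example from above: 1 device, 1 environment, action_repeat = 2
print(env_steps_from_policy_steps(500, action_repeat=2))  # -> 1000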

I hope the difference between policy and environment steps is clearer now.