Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

DreamerV3: Hardware Resources Underutilized? #288

Closed: defrag-bambino closed this issue 6 days ago

defrag-bambino commented 1 month ago

Hi,

when I run DreamerV3 experiments, especially ones with a replay_ratio > 1.0, training takes quite a long time. During these runs my hardware resources are barely used (e.g. only 1-2 CPU cores at around 50% each), so there is clearly more computational power available. Is there anything I can do to make SheepRL use more of the available hardware? I am already running multiple environments in parallel, and I also tried increasing num_threads, but that seems to have no effect.

Here is a simple example training command:

sheeprl fabric.accelerator=cuda fabric.strategy=ddp fabric.devices=1 fabric.precision=16-mixed exp=dreamer_v3 algo=dreamer_v3_S env=gym env.id=CartPole-v1 algo.total_steps=10000 algo.cnn_keys.encoder=\[\] algo.mlp_keys.encoder=\["vector"\] algo.cnn_keys.decoder=\[\] algo.mlp_keys.decoder=\["vector"\] env.num_envs=12 num_threads=16 checkpoint.every=1000 metric.log_every=100 algo.replay_ratio=10.0

Training this for around 8000 steps, at which point it reached the ~500 reward threshold, took about 3 hours. The log data lists a Time/sps_train of ~0.046 (which I assume is environment steps per second).
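In case it matters: if num_threads just maps to something like torch.set_num_threads under the hood (I have not checked), that would explain why it has no visible effect here, since that setting only controls CPU intra-op parallelism and not the GPU work. A minimal sketch of that assumption:

```python
import torch

# Hypothetical illustration: if num_threads only sets the CPU intra-op thread
# pool, it affects CPU-bound tensor ops but leaves CUDA kernels untouched.
torch.set_num_threads(16)
print(torch.get_num_threads())  # -> 16, yet GPU-bound training speed is unchanged
```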

Thanks in advance for this great library!

belerico commented 1 month ago

Hi @defrag-bambino, the slowdown when raising the replay ratio is expected: the higher the replay ratio, the more gradient steps the agent computes per policy step. Since the training steps happen mainly on the GPU, I would look at the GPU stats rather than the CPU stats (the CPU is mainly used for saving experiences in the buffer and, in this case, for running a fairly simple environment).
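As a rough back-of-the-envelope (assuming replay_ratio is simply the ratio of gradient steps to policy steps, ignoring any learning-start offsets and the exact per-environment bookkeeping):

```python
# Rough sketch, not sheeprl's exact accounting: with a replay ratio of 10,
# every policy step triggers on the order of 10 gradient updates.
replay_ratio = 10.0
policy_steps = 8_000                      # roughly what you trained for
gradient_steps = replay_ratio * policy_steps
print(gradient_steps)                     # ~80,000 gradient updates, all GPU-bound
```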

Furthermore, I suggest not using fabric.strategy=ddp when running on a single device.
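On a single device the default strategy is all you need. As a generic illustration of what those CLI flags roughly map to in Lightning Fabric (not sheeprl's exact internal setup):

```python
from lightning.fabric import Fabric

# Single-GPU run: no DDP strategy needed, the default is fine.
fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")
fabric.launch()
```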

Another suggestion to speed up training is to use this branch, where we have introduced compilation through torch.compile, which should speed up your training on the right GPU.
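As a generic illustration of what that compilation does (not the exact integration on that branch):

```python
import torch
import torch.nn as nn

# Toy model standing in for the world model / actor / critic; the branch applies
# torch.compile to the real modules, this just shows the general idea.
model = nn.Sequential(nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 64)).cuda()
compiled_model = torch.compile(model)  # first call pays the compilation cost

x = torch.randn(32, 64, device="cuda")
y = compiled_model(x)                  # subsequent calls run the optimized kernels
```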

If you try out that branch, could you kindly report your findings in this issue?

Thank you

belerico commented 1 week ago

Hi @defrag-bambino, has this fixed your issue? Are there any other considerations you would like to share?

defrag-bambino commented 6 days ago

Yes, this is OK for now! Thanks