alex-petrenko / sample-factory

High throughput synchronous and asynchronous reinforcement learning
https://samplefactory.dev
MIT License

Reproducing Atari FPS #280

Open djbyrne opened 1 year ago

djbyrne commented 1 year ago

Hi guys, really liking the repo and found the paper very insightful! Very excited to see the potential of single node RL experimentation :smile:

I am trying to reproduce the throughput shown in the paper (~45k FPS for System 1 and ~130k FPS for System 2). However, I am currently plateauing at ~20k on a machine that surpasses System 2.

Would it be possible to share the optimal config for reproducing the max throughput?

Thanks so much,

Donal

alex-petrenko commented 1 year ago

Hi @djbyrne !

First of all, see this section in the documentation: https://www.samplefactory.dev/09-environment-integrations/vizdoom/#reproducing-paper-results

It's on VizDoom but I bet you can use similar configurations to reach very high throughput. Specifically, the last one:

python -m sf_examples.vizdoom.train_vizdoom --env=doom_benchmark --algo=APPO --env_frameskip=4 --use_rnn=True --num_workers=72 --num_envs_per_worker=24 --num_policies=1 --batch_size=8192 --wide_aspect_ratio=False --experiment=doom_battle_appo_w72_v24 --policy_workers_per_policy=2

Replace Doom-related params with Atari, and you should be good to go.
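
For Atari, a rough starting point might look something like this (the sf_examples.atari.train_atari entry point and the atari_breakout env name are my guess from memory, so double-check them against the Atari integration docs):

python -m sf_examples.atari.train_atari --env=atari_breakout --algo=APPO --num_workers=72 --num_envs_per_worker=16 --worker_num_splits=2 --num_policies=1 --policy_workers_per_policy=2 --batch_size=4096 --experiment=atari_breakout_appo_w72_v16

Adjust num_workers to your own machine, of course.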

The most important parameters for throughput:

num_workers: this should ideally match the number of logical CPUs on your machine

num_envs_per_worker: usually in the 10-20 range, but if you see CPU utilization below 100%, increase it a bit more

worker_num_splits=2 to enable double buffering
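
If you're not sure how many logical CPUs you have, a quick way to check (this just prints the logical core count Python sees):

python -c "import os; print(os.cpu_count())"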

alex-petrenko commented 1 year ago

You would also need to increase the batch size to accommodate that much data. Start in the 2048-4096 range and go from there.

alex-petrenko commented 1 year ago

That said, there's actually a better way to work with Atari: https://www.samplefactory.dev/09-environment-integrations/envpool/

Envpool is a C++ vectorized env runner that supports Atari and some other envs. It is even faster than running many envs in Python multiprocessing. You need very different parameters for envpool, because it's essentially one very big vectorized environment rather than hundreds of individual envs.

Here's my guess:

num_workers: 1-4?
num_envs_per_worker: 1 or 2 if you use double buffering
worker_num_splits: 1 or 2 for double buffering
env_agents=64 - how many envs the vector contains... I'm not sure what it should be; try as many as you have CPU cores and go from there!
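
Putting those together, a first attempt might look something like this (the sf_examples.envpool.atari.train_envpool_atari entry point and the env name are pure guesses on my part, check the envpool integration docs linked above for the exact names):

python -m sf_examples.envpool.atari.train_envpool_atari --env=atari_breakout --algo=APPO --num_workers=2 --num_envs_per_worker=2 --worker_num_splits=2 --env_agents=64 --batch_size=4096 --experiment=atari_breakout_envpool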
djbyrne commented 1 year ago

Hey @alex-petrenko thank you for the insight! Apologies, I did not think to look at the other environments for this config :see_no_evil:

I will run with what you have given above :smile:

Yes, I have worked with envpool before; this is what I will try next. Have you done a benchmark comparison between envpool and standard Atari on Sample Factory yet? I would imagine it gets a similar speed-up to the one seen in the Sebulba PodRacer architecture, as it also uses a C++-based implementation for vectorising the environments.

alex-petrenko commented 1 year ago

I haven't really done comparisons, but I know Costa did. He has some implementations here that you could harvest for parameters: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_envpool.py

There's also some info in their paper and repo: https://arxiv.org/abs/2206.10558 https://github.com/vwxyzjn/envpool-cleanrl

My guess is that you should be able to get 100K+ easily with or without envpool, because you're probably going to be bottlenecked by the convnet backprop.