alex-petrenko / sample-factory

High throughput synchronous and asynchronous reinforcement learning
https://samplefactory.dev
MIT License

[Q?] How to max fps? #29

Closed jarlva closed 4 years ago

jarlva commented 4 years ago

Hey all, I have a system with 13 GB RAM, two CPU cores and a GPU. While training, top shows only ~25% overall CPU utilization (~55% per process in the process columns) and about the same for the GPU. FPS is around 2700. The gym environment is similar to CartPole; args are below. If I push num_envs_per_worker above 40 it logs a lot of "Waiting for trajectory buffer" messages. I also tried increasing batch_size to 1024.

What's the best way (methodology) to max fps?

    '--num_envs_per_worker=40',
    '--policy_workers_per_policy=2',
    '--batch_size=512',
    '--env=gym_Myrl-v0',
    '--experiment=' + timestamp_dir,
    '--recurrence=1',
    '--num_batches_per_iteration=4',
    '--hidden_size=256', #default 256
    '--encoder_type=mlp',
    '--encoder_subtype=mlp_mujoco',
    '--reward_scale=0.1',
    '--save_every_sec=30',
    '--experiment_summaries_interval=3',
    '--ppo_epochs=4',
    '--max_policy_lag=140',  # orig:100000
    '--seed=5', 
    '--use_rnn=False',
    '--with_vtrace=False',
    '--algo=APPO',

[screenshot: top output]

On another machine with 32 cores, 64 GB RAM and a 1070 Ti GPU I'm seeing: "Learner 0 accumulated too much experience, stop experience collection! Learner is likely a bottleneck in your experiment" (repeated 50 times).

I tried increasing --learner_main_loop_num_cores anywhere from 2 to 10. Any ideas what is happening?

alex-petrenko commented 4 years ago

I haven't worked with this particular environment so I can't say what kind of performance you can expect. Can you please run the sampler to determine the max performance of your environment on the hardware you're using? Example: python -m run_algorithm --algo=DUMMY_SAMPLER --env=doom_benchmark --num_workers=20 --num_envs_per_worker=1 --experiment=dummy_sampler --sample_env_frames=5000000
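For your custom env on the 2-core machine, the equivalent would be something along the lines of python -m run_algorithm --algo=DUMMY_SAMPLER --env=gym_Myrl-v0 --num_workers=2 --num_envs_per_worker=10 --experiment=dummy_sampler --sample_env_frames=1000000 (the same command, just adapted to your env; the frame count and envs-per-worker values here are arbitrary).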

40 envs per core is probably a bit too many; I usually had the best results with 10-20 envs per core. It's hard to diagnose "Waiting for trajectory buffer" on its own; are there other errors? You might be running out of memory or something.

In terms of CPU utilization, I'd recommend using htop and looking at utilization per core rather than per process. From what you attached I can see that the total CPU usage is close to 200%, which should be the maximum for a 2-core system. Could you attach an htop screenshot?

With a 32-core machine, this message is probably correct. It is not an error, just a warning. You're collecting experience a bit faster than the learner can process it. Increasing batch size may help (but it can also reduce sample efficiency). Looking at the profiling output at the end of the program may help too. Try using CPU device instead of GPU just to see if it makes any difference.

jarlva commented 4 years ago

OK, got some info. Using a 12-core machine, 32 GB RAM and a 1070 Ti, with the latest git updates. I'm getting lots of errors with the following args. Please let me know if more info is needed.

        '--num_envs_per_worker=20',
        '--policy_workers_per_policy=2',
        '--batch_size=512',
        '--env=gym_Myrl-v0',
        '--experiment=' + timestamp_dir,
        str('--train_dir=' + root_dir + '/train_dir'),
        '--recurrence=1',
        '--num_batches_per_iteration=4',
        '--hidden_size=512',
        '--encoder_type=mlp',
        '--encoder_subtype=mlp_mujoco',
        '--reward_scale=0.1',
        '--save_every_sec=30',
        '--experiment_summaries_interval=3',
        '--ppo_epochs=4',
        '--max_policy_lag=140',
        '--seed=5', 
        '--use_rnn=False',
        '--with_vtrace=False',
        '--algo=APPO',

[2020-07-15 17:12:00,893][03365] Initializing vector env runner 10...
[2020-07-15 17:12:00,898][03365] Unknown exception in rollout worker
Traceback (most recent call last):
  File "/content/sample-factory/algorithms/appo/actor_worker.py", line 824, in _run
    self._handle_reset()
  File "/content/sample-factory/algorithms/appo/actor_worker.py", line 723, in _handle_reset
    for split_idx, env_runner in enumerate(self.env_runners):
TypeError: 'NoneType' object is not iterable
[2020-07-15 17:12:00,945][03366] Initializing vector env runner 11...
[2020-07-15 17:12:00,952][03366] Unknown exception in rollout worker
Traceback (most recent call last):
  File "/content/sample-factory/algorithms/appo/actor_worker.py", line 824, in _run
    self._handle_reset()
  File "/content/sample-factory/algorithms/appo/actor_worker.py", line 723, in _handle_reset
    for split_idx, env_runner in enumerate(self.env_runners):
TypeError: 'NoneType' object is not iterable
[2020-07-15 17:12:00,968][03106] Worker 0 is stuck or failed (0.748). Reset!
[2020-07-15 17:12:01,005][03106] Worker 1 is stuck or failed (0.703). Reset!
[2020-07-15 17:12:01,013][03367] Initializing vector env runner 0...
[2020-07-15 17:12:01,018][03367] Initializing envs for env runner 0...
[2020-07-15 17:12:01,045][03106] Worker 2 is stuck or failed (0.658). Reset!

HTOP:

[screenshot: htop output]

Ran the following. It was 100% CPU-bound with no GPU load. No errors.

python -m run_algorithm --algo=DUMMY_SAMPLER --env=doom_benchmark --num_workers=20 --num_envs_per_worker=1 --experiment=dummy_sampler --sample_env_frames=5000000

Sampling FPS: (1 sec: 11454.9, 10 sec: 9728.8, 60 sec: 10633.9, 300 sec: 10512.4, 600 sec: 10512.4). 
Total frames collected: 1006216
alex-petrenko commented 4 years ago

The exception itself is not very informative; it looks like something actually happened before this line that caused env_runners to be set to None.

Can you please send the full log file from the beginning? Preferably reduce the number of workers; that way you will have less log output to deal with.

The sampler is supposed to be CPU-bound because it only samples your environment, with no learning. This script gives you an optimistic estimate, an upper bound on training performance with your environment on your system. The actual training will of course be slower, but ~10000 FPS is the fastest you can hope for on this system with any RL algorithm.

jarlva commented 4 years ago

Reduced num_workers to 10 and I'm back to the previous message: "Learner 0 accumulated too much experience, stop experience collection! Learner is likely a bottleneck in your experiment" (repeated 50 times). The training completed successfully.

From htop (below), one process is at 100%; I assume that's the learner? I set '--learner_main_loop_num_cores=4' without any noticeable change. Is there a way to dedicate/allocate more resources to the learner?

Args:

        '--num_envs_per_worker=10',
        '--policy_workers_per_policy=2',
        '--batch_size=256',
        '--env=gym_Myrl-v0',
        '--experiment=' + timestamp_dir,
        str('--train_dir=' + root_dir + '/train_dir'),
        '--recurrence=1',
        '--num_batches_per_iteration=4',
        '--hidden_size=256', #default 256
        '--reward_scale=0.1',
        '--save_every_sec=30',
        '--experiment_summaries_interval=3',
        '--ppo_epochs=4',
        '--max_policy_lag=140',  # orig:100000
        '--seed=5', 
        '--use_rnn=False',
        '--with_vtrace=False',
        '--algo=APPO',
        '--encoder_type=mlp',
        '--encoder_subtype=mlp_mujoco',

Log:

[2020-07-16 06:06:39,297][05236] Env runner 0, CPU aff. [0], rollouts 1910: timing wait_actor: 0.0106, waiting: 103.9107, reset: 0.0837, save_policy_outputs: 5.2160, env_step: 6.1193, overhead: 11.7368, complete_rollouts: 0.0802, enqueue_policy_requests: 0.8717, one_step: 0.0016, work: 31.6930, wait_buffers: 2.4094
[2020-07-16 06:06:39,307][05237] Env runner 1, CPU aff. [1], rollouts 1840: timing wait_actor: 0.0214, waiting: 103.8715, reset: 0.0398, save_policy_outputs: 5.0395, env_step: 6.1603, overhead: 11.5000, complete_rollouts: 0.9234, enqueue_policy_requests: 0.9175, one_step: 0.0021, work: 31.7465, wait_buffers: 2.1630
[2020-07-16 06:06:39,538][05233] Policy worker avg. requests 5.88, timing: init: 3.3960, wait_policy_total: 98.6753, wait_policy: 0.0002, handle_policy_step: 34.7921, one_step: 0.0037, deserialize: 1.6098, obs_to_device: 1.8058, stack: 5.3684, forward: 16.7451, postprocess: 6.0988, weight_update: 0.0005
[2020-07-16 06:06:39,560][05234] Policy worker avg. requests 5.96, timing: init: 3.4001, wait_policy_total: 98.0908, wait_policy: 0.0051, handle_policy_step: 35.0198, one_step: 0.0000, deserialize: 1.6538, obs_to_device: 1.8364, stack: 5.4664, forward: 16.7415, postprocess: 6.2013, weight_update: 0.0008
[2020-07-16 06:06:39,639][05226] GPU learner timing: extract: 0.5028, buffers: 0.1809, calc_gae: 2.6278, batching: 2.7622, buff_ready: 0.3820, tensors_gpu_float: 0.4104, squeeze: 0.0282, prepare: 6.4896, batcher_mem: 2.7325
[2020-07-16 06:06:39,973][05226] Train loop timing: init: 1.6184, train_wait: 0.0000, epoch_init: 19.7844, minibatch_init: 11.4171, forward_head: 4.8619, bptt_initial: 0.8734, bptt_forward_core: 0.2109, bptt: 0.4994, tail: 11.6709, losses: 10.0997, clip: 10.4050, update: 69.1529, after_optimizer: 0.1725, train: 133.1607
[2020-07-16 06:06:40,149][05205] Collected {0: 693248}, FPS: 5087.8
[2020-07-16 06:06:40,651][05205] Done!

[screenshot: htop output]

alex-petrenko commented 4 years ago

Thank you for providing the profiling output. It is clear that the learner is the slowest component (by far) in your setup. Learning took 133 seconds (train: 133.1607) and the total amount of work on the average actor is only about 31 seconds (work: 31.6930).

There is no easy way to speed up the learner; its bottleneck is SGD, done on the GPU by default. If you're aiming for peak throughput, here are the parameters I would suggest:

1) --ppo_epochs=4 => 1. Currently you are doing 4 forward/backward passes on the learner for every batch of experience. This can improve sample efficiency on many tasks, but it is also computationally expensive. Reducing the number of epochs trades sample efficiency for throughput; whether that's worth it depends on the task. Use a hyperparameter search to find the value that gives the best wall-time performance. In all experiments in the paper we used 1 epoch, not 4.

2) --max_grad_norm=0.0. On many tasks gradient clipping is not required, and it is quite a heavy operation; disabling it should lead to a ~10% speedup of the learner.

3) --num_batches_per_iteration=4 => 1. If you use multiple minibatches per iteration of learning, SampleFactory will randomize and shuffle these minibatches, which takes time. With a single minibatch this step is skipped, for obvious reasons.

4) --batch_size=256 => 1024 (or more). Batch size greatly affects learner throughput, but it can affect sample efficiency as well; this depends heavily on the particular task, particularly on the variance of your gradients. Again, you can use hyperparameter search to find the sweet spot between sample efficiency and wall-time performance.

--learner_main_loop_num_cores=4 helps to speed up some of the PyTorch operations done on the CPU, batching in particular. But this is only relevant when the observation size is big, in which case the main loop can become the bottleneck. In your case the train loop is a lot slower than the main loop, so this setting has no effect; I would keep it at 1. (The changes above are consolidated in the sketch below.)
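Put together, and assuming everything else in your args list stays as is, the changes above would be:

        '--ppo_epochs=1',                  # was 4
        '--num_batches_per_iteration=1',   # was 4
        '--batch_size=1024',               # or more; was 256
        '--max_grad_norm=0.0',             # skip gradient clipping
        '--learner_main_loop_num_cores=1',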

alex-petrenko commented 4 years ago

It is worth mentioning that nvidia-smi probably won't report 100% GPU utilization even if your GPU is busy 100% of the time. The reason is that the majority of GPU cores aren't doing anything (because your policy network is very small), so you're constrained by GPU core frequency rather than max GPU throughput. Honestly, I have never seen 100% utilization with a small MLP policy, but we haven't experimented with those all that much.

jarlva commented 4 years ago

With your recommendations the system jumped 5x, to ~20k FPS! No errors. A few observations:

  1. Although I limited the number of workers to 9 (out of 12 available cores), htop shows that all 12 cores are busy.

  2. Pushing num_envs_per_worker above 14 (16, for example), or increasing num_workers from 9 to 10, produces the reset errors from earlier. So it seems that on machines with many cores it makes sense to limit num_workers; in this case 9 out of 12 cores (3/4)?

  3. Increasing batch_size from 1024 to 2048 did not make a difference in my case.

Maybe down the road, if I may suggest it: a simple mechanism based on the hardware for the system to self-configure these settings (cores/envs/workers), for optimal results and simplicity.

Args:

        '--num_envs_per_worker=14',
        '--batch_size=1024',
        # '--env=gym_CartPole-v1',
        '--env=gym_Myrl-v0',
        '--experiment=' + timestamp_dir,
        str('--train_dir=' + root_dir + '/train_dir'),
        '--recurrence=1',
        '--num_batches_per_iteration=1', # 4
        '--hidden_size=256',
        '--reward_scale=0.1',
        '--save_every_sec=30',
        '--experiment_summaries_interval=3',
        '--ppo_epochs=1',   # 4
        '--max_policy_lag=140',
        '--seed=5', 
        '--use_rnn=False',
        '--with_vtrace=False',
        '--algo=APPO',
        '--encoder_type=mlp',
        '--encoder_subtype=mlp_mujoco',
        '--num_workers=9',
        '--max_grad_norm=0.0',
        '--learner_main_loop_num_cores=2'

[screenshot]

alex-petrenko commented 4 years ago

Although I limited the number of workers to 9 (out of 12 available cores), htop shows that all 12 cores are busy.

num_workers sets the number of rollout worker processes. Besides those, there are other processes (policy workers, the learner, the main process). Also, on a system like this you usually won't see some cores fully busy and others fully unoccupied; the OS is free to schedule the processes as it pleases. This shouldn't concern you.

My suggestion is to set num_workers to the number of CPU cores you have, unless you have a very good reason not to. This way we can actually assign each worker to an individual core, which reduces the amount of context switching. We also use lower-priority processes (note the blue color in htop), so you can usually keep working on your PC even during training.

Pushing num_envs_per_worker above 14 (16, for example), or increasing num_workers from 9 to 10, produces the reset errors from earlier. So it seems that on machines with many cores it makes sense to limit num_workers; in this case 9 out of 12 cores (3/4)?

It would be interesting to see the full log output of the program. There is no limit in SampleFactory on the number of workers or envs per worker; in some of our experiments we used 112 workers with 30 envs on each worker without any problems. It is most likely something related to your setup (running out of memory or some other resource, maybe?)

Maybe down the road, if I may suggest it: a simple mechanism based on the hardware for the system to self-configure these settings (cores/envs/workers), for optimal results and simplicity.

Good suggestion! We were considering this, but an automatic algorithm like that is not trivial: there are many parameters, and some of them trade sample efficiency for pure speed. Personally, I used hyperparameter search in combination with analysis of the profiling output. But I agree, an automatic mechanism would be nice.

jarlva commented 4 years ago

If I may suggest one more thing: most of my work is in Windows, because the tool that creates the data is Windows-only. So I have to keep rebooting between Windows and Linux, which is frustrating. I believe the only missing piece is faster-fifo; everything else works natively on Windows (pytorch/cuda/pip/etc). Windows Subsystem for Linux 2 is not available yet due to some issues with the new Windows 2004 release. I believe that running Sample Factory on Windows would increase its popularity, and it's low-hanging fruit since faster-fifo is the only barrier.

jarlva commented 4 years ago

(I'll address the above next time I'm on a high-core Linux machine.)

If I may suggest one more thing: most of my work is in Windows, because the tool that generates the data is Windows-only. So I keep rebooting between Windows and Linux, or rebuilding a rented remote VM. The hardware changes each time I rent, so I have to re-optimize the parameters for it, which is frustrating because it shifts the focus away from actual work. Windows Subsystem for Linux 2 is not viable yet due to some issues with the new Windows 2004 version (investigated and tried).

I believe the only barrier to running Sample Factory on Windows is faster-fifo, which is low-hanging fruit. Everything else already works (pytorch/cuda/conda/pip/etc). Running Sample Factory on Windows would increase its popularity.

alex-petrenko commented 4 years ago

If this is indeed the case (faster-fifo is the only barrier), then I can suggest that you write a Python module called faster_fifo that just routes all the calls (get(), put(), etc.) to the standard multiprocessing.Queue. The interface is literally exactly the same, so this should take you just a few minutes.

The only method that is different is get_many(); you can just replace it with something like:

from queue import Empty

def get_many(q):
    # drain all messages currently available in a standard multiprocessing.Queue
    msgs = []
    while True:
        try:
            msgs.append(q.get_nowait())
        except Empty:
            break
    return msgs

Bottom line: it's very easy to implement a dummy version of faster-fifo for Windows, it's just not going to be "faster". The performance drop should be negligible in your configuration, though. Keep in mind that SampleFactory was never tested on Windows, so you might run into other issues.
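For reference, such a dummy module could look roughly like the sketch below. This is just an illustration: the constructor arguments and the get_many() keyword names are placeholders rather than the exact faster-fifo signatures, and SampleFactory may call other queue methods that would also need to be forwarded.

# faster_fifo.py - dummy Windows stand-in that forwards everything to multiprocessing.Queue
import multiprocessing
from queue import Empty


class Queue:
    def __init__(self, *args, **kwargs):
        # the real faster-fifo sizes its buffer in bytes; the stdlib queue has no equivalent, so ignore the args
        self._q = multiprocessing.Queue()

    def put(self, *args, **kwargs):
        return self._q.put(*args, **kwargs)

    def get(self, *args, **kwargs):
        return self._q.get(*args, **kwargs)

    def get_nowait(self):
        return self._q.get_nowait()

    def get_many(self, block=True, timeout=1.0, max_messages_to_get=10 ** 9):
        # wait for the first message, then drain whatever else is immediately available
        msgs = [self._q.get(block=block, timeout=timeout)]
        while len(msgs) < max_messages_to_get:
            try:
                msgs.append(self._q.get_nowait())
            except Empty:
                break
        return msgs

    def empty(self):
        return self._q.empty()

    def qsize(self):
        return self._q.qsize()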

jarlva commented 4 years ago

I'm not at that level as a programmer :) I'd be happy to compile/test if I can get the code and basic instructions on how to do it.

alex-petrenko commented 4 years ago

Closing this. Feel free to add Windows support as a separate feature request ;)