facebookresearch / nocturne

A data-driven, fast driving simulator for multi-agent coordination under partial observability.
MIT License

[Question] Optimal hyperparameters and scripts to reach 2000 steps/sec training speed #58

Open wenjie-mo opened 1 year ago

wenjie-mo commented 1 year ago

Question

Hello, I am wondering which script and hyperparameters achieve the 2000+ steps/sec training speed mentioned in the paper. So far I have tried the following:

  1. run_sample_factory.py algorithm=APPO. Problem: with the sample_factory library, the parameters lr_schedule and max_entropy_coeff are missing, and I am not sure what the optimal values are.

  2. run_rllib.py. Problem: every worker fails with the same runtime error, attached below:

image

  3. nocturne_runner.py. Problem: the training speed is not that fast (around 100 steps/sec at around 40 fps). I have tried #38 and the fps improved to around 80, but the steps/sec stayed about the same.

My settings:
Code: newest code from main branch
OS: Ubuntu 20.04
GPU: RTX 3080 with CUDA 11.6
sample_factory: I have tried the latest version and aed6cc92a7eb3510c4d4bcfac083ced07b5222f9 (as mentioned in the paper)

Please let me know if I did anything wrong when running the scripts. Thanks so much for answering!

eugenevinitsky commented 1 year ago

Hi! Sorry you've been having trouble. Let me answer each one piece by piece. First off, that 2k number corresponds to environment stepping time (i.e. no RL algo in the loop), so during training you'll see an FPS that differs significantly from one algorithm to another, depending on the type of policy used and whether the environment calculates a per-agent FPS or an overall "amount of experience generated per second in total". As for each one in particular:

  1. For the first one, we didn't freeze our sample factory version, and the newest one has an additional hparam that we didn't have in our version. This is fixed here https://github.com/facebookresearch/nocturne/pull/59 and will be merged shortly. If you run that PR on the machine you have, you should see about 10k-20k fps.

  2. I'm looking into this one; it usually means something went wrong with setting the config.

  3. For this one, you need to increase the value of n_training_threads. The environment is running without any vectorization by default. Hope that helps.
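To make the distinction between raw environment stepping speed and the two ways of counting FPS concrete, here is a minimal sketch. It does not use the Nocturne API: DummyMultiAgentEnv, measure_throughput, and the 90-step episode length are hypothetical stand-ins chosen for illustration. It steps an environment with no RL algorithm in the loop and reports both an environment-level and a per-agent-summed rate.

```python
import time


class DummyMultiAgentEnv:
    """Hypothetical stand-in for a multi-agent simulator (NOT the Nocturne API)."""

    def __init__(self, num_agents=8, episode_length=90):
        self.num_agents = num_agents
        self.episode_length = episode_length  # arbitrary value for illustration
        self.t = 0

    def reset(self):
        self.t = 0
        return {agent_id: 0.0 for agent_id in range(self.num_agents)}

    def step(self, actions):
        self.t += 1
        obs = {agent_id: float(self.t) for agent_id in actions}
        done = self.t >= self.episode_length
        return obs, done


def measure_throughput(env, num_steps=1000):
    """Step the env with fixed actions (no policy, no learner) and time it."""
    obs = env.reset()
    start = time.perf_counter()
    for _ in range(num_steps):
        actions = {agent_id: 0 for agent_id in obs}  # constant dummy action
        obs, done = env.step(actions)
        if done:
            obs = env.reset()
    elapsed = max(time.perf_counter() - start, 1e-9)
    env_steps_per_sec = num_steps / elapsed
    # Counting one transition per agent gives "total experience per second".
    agent_steps_per_sec = env_steps_per_sec * env.num_agents
    return env_steps_per_sec, agent_steps_per_sec


if __name__ == "__main__":
    env_sps, agent_sps = measure_throughput(DummyMultiAgentEnv(num_agents=8))
    print(f"env steps/sec: {env_sps:.0f}, agent steps/sec: {agent_sps:.0f}")
```

With 8 agents, the second number is exactly 8 times the first, so two tools can report very different "FPS" for the same run. Adding policy inference and gradient updates to the loop will lower both numbers relative to this pure stepping measurement.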

wenjie-mo commented 1 year ago

Hi Eugene, thanks so much for the reply and clarification! I will try out these solutions soon and let you know if they all work!

wenjie-mo commented 1 year ago

Hi, sorry, I accidentally closed the issue. I would like to keep it open for tracking purposes. Thanks!