facebookresearch / BenchMARL

A collection of MARL benchmarks based on TorchRL
https://benchmarl.readthedocs.io/
MIT License

Question about Multi-Processing Training #129

Closed Shua-Kang closed 2 days ago

Shua-Kang commented 5 days ago

Hi, thank you for your great work!

I would like to know whether BenchMARL supports multi-CPU/GPU training. Other similar libraries, such as MARLlib, use Ray for parallel training, but I could not find support for this in BenchMARL. If it is not supported, is there any plan to implement multi-process training in the future?

matteobettini commented 5 days ago

Hello,

Thanks for reaching out!

When you say multiple devices for training, do you mean for collection or for the actual gradient updates?

When it comes to collection, it should be simple to allow collecting from multiple processes for non-vectorized environments. It might be as simple as changing the SerialEnv in this line https://github.com/facebookresearch/BenchMARL/blob/dc793b5dce2eeae94be03e7ca72b10b4b16c66db/benchmarl/experiment/experiment.py#L421 to ParallelEnv. Of course this is not very useful in vectorized environments like VMAS, as for those it is faster to use a huge batch size on one GPU.
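For illustration, here is a minimal TorchRL sketch of the difference (not BenchMARL code; the Pendulum env is just a stand-in for a real task):

```python
# Minimal TorchRL sketch of the SerialEnv -> ParallelEnv swap discussed above.
# Pendulum is a stand-in; BenchMARL would build its own task env instead.
from torchrl.envs import ParallelEnv, SerialEnv
from torchrl.envs.libs.gym import GymEnv

def make_env():
    return GymEnv("Pendulum-v1")

serial_env = SerialEnv(4, make_env)      # 4 envs stepped in one process
parallel_env = ParallelEnv(4, make_env)  # 4 envs, one worker process each
# (On some platforms ParallelEnv must run under `if __name__ == "__main__":`
# because of multiprocessing spawn semantics.)

rollout = parallel_env.rollout(max_steps=10)
print(rollout.shape)  # torch.Size([4, 10])
```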

When it comes to training, I never thought about it. The way I remember RLlib doing it was with multiple collection workers that feed one trainer. We could envision the gradient update being split across different devices, but until now I have not seen a use case for it. Happy to think about it. How does RLlib do it?

Shua-Kang commented 4 days ago

Thank you for your reply!

I'm referring to both aspects. It seems that, to adjust this setting from the command-line arguments, I can override this line:

https://github.com/facebookresearch/BenchMARL/blob/dc793b5dce2eeae94be03e7ca72b10b4b16c66db/benchmarl/conf/experiment/base_experiment.yaml#L55

by using --experiment.on_policy_n_envs_per_worker=20.
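For reference, the same option can also be set from the Python API, following the Experiment pattern in the BenchMARL README (the task and algorithm choices here are just examples):

```python
# Setting on_policy_n_envs_per_worker from the Python API instead of the CLI,
# following the Experiment pattern from the BenchMARL README.
from benchmarl.algorithms import MappoConfig
from benchmarl.environments import VmasTask
from benchmarl.experiment import Experiment, ExperimentConfig
from benchmarl.models.mlp import MlpConfig

experiment_config = ExperimentConfig.get_from_yaml()
experiment_config.on_policy_n_envs_per_worker = 20  # more collection envs

experiment = Experiment(
    task=VmasTask.BALANCE.get_from_yaml(),
    algorithm_config=MappoConfig.get_from_yaml(),
    model_config=MlpConfig.get_from_yaml(),
    critic_model_config=MlpConfig.get_from_yaml(),
    seed=0,
    config=experiment_config,
)
experiment.run()
```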

I'm not fully familiar with the specific details of how MARLlib implements multi-GPU training. I just saw that in MARLlib you can set the number of GPUs to use: https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/ray/ray.yaml#L30

It seems they directly use the implementation from Ray. However, when I use MARLlib, using more GPUs does not make training faster. I think the reason is that the current models, like MLP or GRU, are relatively small, so the bottleneck is likely in the data collection process.

matteobettini commented 4 days ago

I see

by using --experiment.on_policy_n_envs_per_worker=20.

Yes, by changing that you can use more workers for collection. With VMAS and other vectorized envs, these will be the environments in the batch. With normal environments, these will be workers that collect serially in the same process.

To enable multi-process collection, we just have to allow users to change the SerialEnv in the snippet I linked above to ParallelEnv. This is a change we can make, and it has been on the to-do list (#94) for a while.
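A minimal sketch of what exposing that choice could look like (the use_parallel_collection flag is purely illustrative, not an existing BenchMARL option):

```python
# Hypothetical: selecting the batched-env class from a config flag.
# "use_parallel_collection" and "n_envs_per_worker" are illustrative names.
from torchrl.envs import ParallelEnv, SerialEnv

def make_batched_env(env_fn, n_envs_per_worker, use_parallel_collection):
    env_class = ParallelEnv if use_parallel_collection else SerialEnv
    return env_class(n_envs_per_worker, env_fn)
```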

I'm not fully familiar with the specific details of how MARLlib implements multi-GPU training. I just saw that in MARLlib you can set the number of GPUs to use: https://github.com/Replicable-MARL/MARLlib/blob/368c6173577d0f9c0ad70fb5b4b6afa12c864c15/marllib/marl/ray/ray.yaml#L30

It seems they directly use the implementation from Ray. However, when I use MARLlib, using more GPUs does not make training faster. I think the reason is that the current models, like MLP or GRU, are relatively small, so the bottleneck is likely in the data collection process.

num_gpus in Ray dictates the total number of GPUs (for training and collection); it is unclear to me what Ray does when multiple GPUs are allocated just to training.

Here https://docs.ray.io/en/latest/rllib/rllib-training.html#specifying-resources it says

num_gpus – Number of GPUs to allocate to the algorithm process. Note that not all algorithms can take advantage of GPUs. Support for multi-GPU is currently only available for tf-[PPO/IMPALA/DQN/PG]. This can be fractional (e.g., 0.3 GPUs).

and here https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.learners.html it says

num_gpus_per_learner – Number of GPUs allocated per Learner worker. If num_learners=0, any value greater than 0 runs the training on a single GPU on the main process, while a value of 0 runs the training on main process CPUs. If num_gpus_per_learner is > 0, then you shouldn’t change num_cpus_per_learner (from its default value of 1).

If you ask me to interpret this, I would say that Ray only ever uses at most 1 GPU for training, but I might be wrong.
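For concreteness, this is roughly where those knobs sit in RLlib's config API (an RLlib sketch, not BenchMARL; values are arbitrary and the method names vary across Ray versions):

```python
# Rough RLlib sketch of the resource knobs quoted above (values arbitrary).
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Pendulum-v1")
    .resources(num_gpus=1)            # GPUs for the algorithm/trainer process
    .rollouts(num_rollout_workers=4)  # parallel collection workers
)
algo = config.build()
```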

Anyway, for BenchMARL, I do not envision training on multiple processes as of yet, but collecting in multiple processes is definitely possible and will be implemented.

Shua-Kang commented 2 days ago

Thank you for your reply and explanation!

I am currently designing a new multi-agent environment. At the moment, I am integrating it with PettingZoo and testing different algorithms in BenchMARL. Once my environment is finished, I think I will also contribute it to BenchMARL. Hope that won't take much time. :D

Thank you again for this great work. I have seen that many other multi-agent libraries are no longer maintained, so I believe more and more people will benefit from your work.