facebookresearch / torchbeast

A PyTorch Platform for Distributed RL
Apache License 2.0

Using more than 2 GPUs with Polybeast #31

Open Batom3 opened 3 years ago

Batom3 commented 3 years ago

Hi! First of all, thanks for the repository. As far as we understand, IMPALA can be distributed across more than 2 GPUs, but the example in the repo uses at most 2. We have access to more GPUs on a single machine and want to use all of them to get maximal throughput. What would be the best way to do that (more learners, etc.), and what would we have to add or change in the code?

heiner commented 3 years ago

That's a great question, and doing so is not currently easy. One way to go about this would be to add PyTorch DistributedDataParallel logic to the learner. Another good (perhaps easier) way to use several GPUs is to run several experiments and have an additional "evolution" controller on top, which copies hyperparameters from one agent to the other.

In practice, we've used our GPU fleet to run more experiments in parallel instead. Later this year we also hope to share an update to TorchBeast that allows using more GPUs for a single learner, but it isn't quite ready yet.

ege-k commented 2 years ago

Hi,

Thanks for the answer. After my colleague posted this question, we actually managed to add DataParallel logic to the learner and got a roughly 2.5x-3x speedup (distributing the learner across 3 GPUs while using another GPU for the actor). We would be happy to open a pull request if this would be useful to the community.
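For reference, here's a minimal sketch of the kind of change we made (hypothetical and simplified, not the exact patch; the GPU assignment and the stand-in network are assumptions for illustration):

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Stand-in for the polybeast learner network (assumption for illustration)."""

    def __init__(self, num_actions=6):
        super().__init__()
        self.policy = nn.Linear(128, num_actions)

    def forward(self, x):
        return self.policy(x)


learner_gpus = [0, 1, 2]  # assumption: three GPUs for the learner
actor_gpu = 3             # assumption: one GPU reserved for the actor model

# Split each learner batch across learner_gpus with nn.DataParallel.
model = nn.DataParallel(TinyNet(), device_ids=learner_gpus)
model.to(torch.device(f"cuda:{learner_gpus[0]}"))

# Keep the actor model on its own GPU and sync weights as in the single-GPU code.
actor_model = TinyNet().to(torch.device(f"cuda:{actor_gpu}"))
actor_model.load_state_dict(model.module.state_dict())
```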

Having done that, we are looking for other ways to speed up training further, and would greatly appreciate feedback from your end on the following ideas:

  1. Switching from DataParallel to DistributedDataParallel, like you suggested. I haven't used DistributedDataParallel before, but from the docs, it seems that I would have to wrap the train() function from polybeast_learner.py with DDP. Could you confirm this intuition?
  2. Also distributing the actor model across multiple GPUs. We have already tried this, but currently it does not work due to dynamic batching. AFAIK, DataParallel should handle batch sizes that aren't evenly divisible, but in our case it seems to split the batch incorrectly, leading to errors.
  3. Increasing num_actors. It seems like the current bottleneck for us is actually the number of actors. Is there a way to decide on the optimal value for this hyperparameter (in terms of runtime), or should I just try a bunch of different values and pick the fastest? For reference, our machine has 128 cores. Also, is it possible with the current code to run agents across multiple machines? Finally, does this hyperparameter affect the learning dynamics?

Thanks a lot for your time.

heiner commented 2 years ago

Hey @ege-k,

That's great and we're more than happy to have a pull request for this.

As for your questions:

DataParallel vs DistributedDataParallel: I forget the exact details of these implementations, but I believe the main difference is that the latter uses multiprocessing to get around contention on the Python GIL. This issue is more acute for RL agents, and slightly harder to deal with, since in this case the data itself is produced by the model and hence cannot just be loaded from disk to be ready when needed. Getting this to work in TorchBeast likely requires not using the top-level abstractions but the underlying tools.
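A minimal sketch of that "underlying tools" route, assuming one learner process per GPU and a standard torch.distributed process group (this is not part of TorchBeast, just an illustration):

```python
import os

import torch
import torch.distributed as dist


def init_learner_process(rank, world_size):
    # One process per learner GPU; NCCL is the usual backend for GPU tensors.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumption: single machine
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def allreduce_gradients(model, world_size):
    # Average gradients across learner processes after backward() and before
    # optimizer.step() -- roughly what DistributedDataParallel would do for us,
    # minus its gradient bucketing and compute/communication overlap.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
```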

Distributing the actor model: I'd assume a ratio of 1:3 or 1:4 for actor GPUs to learner GPUs is ideal in a typical setting. Once you want to use many more learner GPUs, distributing the actor model makes sense. This could be done by having different learners with different addresses and telling each actor which one to use. Dynamic batching would still happen, but only on that learner.
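As a purely hypothetical illustration of the "different learners with different addresses" idea (the addresses and whatever flag you'd pass them through are assumptions, not the actual polybeast interface):

```python
# Assumption: two learner endpoints are already running at these addresses.
learner_addresses = [
    "localhost:4431",
    "localhost:4432",
]


def learner_for_actor(actor_id):
    # Round-robin assignment: each learner then dynamically batches only
    # the requests of the actors assigned to it.
    return learner_addresses[actor_id % len(learner_addresses)]


for actor_id in range(8):
    print(f"actor {actor_id} -> learner at {learner_for_actor(actor_id)}")
```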

Your third question is the hardest one. Unfortunately, in RL often "everything depends on everything", so I cannot rule out that the number of actors influences the learning dynamics and therefore also changes the optimal hyperparameter setting. It certainly would if you also change the batch size, which is likely required in order to find the best throughput. I don't know of a better way than trying various settings -- aim to slightly overshoot, as modern Linux kernels are quite efficient at thread/process scheduling, so context switching doesn't generate much waste.
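If it helps, here's a hedged sketch of such a sweep (the entry point and flag names are from memory, so double-check them against your actual launch command):

```python
import subprocess

# Try a few --num_actors settings with short runs and compare the
# steps-per-second reported in the logs of each run.
candidate_num_actors = [32, 64, 96, 128, 160]

for n in candidate_num_actors:
    print(f"=== num_actors={n} ===")
    subprocess.run(
        [
            "python", "-m", "torchbeast.polybeast",
            "--num_actors", str(n),
            "--total_steps", "200000",  # short run, throughput only
        ],
        check=True,
    )
```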

As for the second part of your question: TorchBeast can be fully distributed if you find a way to tell each node the names of its peers. E.g., if you know your setup and have fixed IP addresses, you could hardcode them. Often that's not the case, and you'll need some other means of communicating the names/addresses. E.g., you could use a shared file system (lots of issues around that, but it can work "in practice"), a real name service, or a lock service on top of something like etcd.
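For the shared-file-system route, a sketch of what that rendezvous might look like (the path, port, and peer count are made up for illustration; concurrent appends over e.g. NFS are exactly the kind of issue I mean):

```python
import socket
import time
from pathlib import Path

# Assumptions for illustration: a path on a shared filesystem, a fixed port,
# and a known number of participating nodes.
PEERS_FILE = Path("/shared/my_experiment/peers.txt")
EXPECTED_PEERS = 4
PORT = 4431


def register_and_wait():
    # Each node appends its own address, then polls until all peers appear.
    PEERS_FILE.parent.mkdir(parents=True, exist_ok=True)
    my_address = f"{socket.gethostname()}:{PORT}"
    with PEERS_FILE.open("a") as f:
        f.write(my_address + "\n")
    while True:
        peers = PEERS_FILE.read_text().splitlines()
        if len(peers) >= EXPECTED_PEERS:
            return peers
        time.sleep(1.0)
```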

BTW, we are working on somewhat alternative designs currently and might have an update on that in a few weeks. Feel free to drop me an email if you would like to get an early idea of what we want to do.

kshitijkg commented 2 years ago

Are there any updates on this @heiner?

I was also wondering if it would be possible for @ege-k @Batom3 to share their code with me?

heiner commented 2 years ago

Hey everyone!

I've since left Facebook, but my amazing colleagues have written https://github.com/facebookresearch/moolib, which you should check out for a multi-GPU IMPALA.