facebookresearch / torchbeast

A PyTorch Platform for Distributed RL
Apache License 2.0

Distributed Training #13

Closed vwxyzjn closed 4 years ago

vwxyzjn commented 4 years ago

Hi, I think your implementation of IMPALA is really well done. The code is concise, clear, and understandable.

I do have a question regarding distributed training. In https://github.com/facebookresearch/torchbeast#running-polybeast, the instructions seem to assume that the script will be run on a single machine. In the TF implementation, we can configure a multi-machine setup using ClusterSpec, as shown here: https://github.com/deepmind/scalable_agent/blob/6c0c8a701990fab9053fb338ede9c915c18fa2b1/experiment.py#L479.
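For reference, this is roughly what I mean; a minimal TF1-style sketch (host names and ports here are placeholders, not taken from the scalable_agent code):

```python
# Rough sketch of a multi-machine cluster in TensorFlow 1.x.
# The job names and host:port values are made up for illustration.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "learner": ["learner-host:8001"],
    "actor": ["actor-host-0:8002", "actor-host-1:8002"],
})
# Each process starts a server for its own job/task and can then
# reach the other jobs in the cluster over the network.
server = tf.train.Server(cluster, job_name="learner", task_index=0)
```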

I was wondering if there's any way to do the same with `torchbeast`.

Thanks a lot.

heiner commented 4 years ago

Hey Costa,

Thanks for your interest and kind comments!

You are right that this example runs PolyBeast on a single machine. In order to run it across machines, you'd have to change the code a little bit, I'm afraid: right now it finds the environment servers via their pipe names, which look like unix://path/to/a/file. You'd instead have to use IP:port addresses like 127.0.0.1:12345. That change would happen e.g. here: https://github.com/facebookresearch/torchbeast/blob/master/torchbeast/polybeast.py#L448
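Something along these lines might work as a starting point (an untested sketch; the `--actor_addresses` flag is hypothetical and not part of torchbeast):

```python
# Untested sketch: build TCP host:port addresses instead of per-actor
# unix:// pipe names, so actors on other machines can reach the learner.
# The --actor_addresses flag below is hypothetical.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--actor_addresses",
    default="127.0.0.1:12345",
    help="Comma-separated list of ip:port addresses, one per actor host.",
)

def server_addresses(flags):
    # Instead of deriving one unix socket name per actor from a pipes
    # basename, return network addresses that work across machines.
    return flags.actor_addresses.split(",")

if __name__ == "__main__":
    flags = parser.parse_args(
        ["--actor_addresses", "10.0.0.1:12345,10.0.0.2:12345"]
    )
    print(server_addresses(flags))
```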

Good luck :)