Closed Eric-Zhang1990 closed 5 years ago
Hi @Eric-Zhang1990, we have an example of how to train distributed PyTorch on FfDL: https://github.com/IBM/FfDL/blob/master/etc/examples/c10d-native-parallelism/model-files/train_dist_parallel.py#L187-L227
On FfDL, all learners share a working directory under the /job/
path. We use that path to discover each learner container's IP address and connect the learners with the `gloo`/`nccl`/`mpi` backends. Then, during model training, our example averages the gradients across learners at the end of each batch.
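The pattern described above can be sketched roughly as follows. This is not the FfDL example itself, just a minimal single-process illustration: the rendezvous address is hard-coded here, whereas on FfDL each learner would discover its peers via the shared /job/ directory, and rank/world size would come from the job spec.

```python
import torch
import torch.distributed as dist

def average_gradients(model, world_size):
    """All-reduce each parameter's gradient and divide by the learner
    count, so every learner finishes the batch with identical,
    averaged gradients."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size

# Rendezvous: hard-coded TCP address purely for illustration; on FfDL
# the peers and ranks would be derived from files under /job/.
dist.init_process_group(backend="gloo",
                        init_method="tcp://127.0.0.1:29501",
                        rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
average_gradients(model, dist.get_world_size())
dist.destroy_process_group()
```

With more than one learner, each process would call `init_process_group` with its own rank and the shared world size, and `all_reduce` would then sum gradients across all of them before the division.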
@Tomcli Thank you for your kind reply.
@Tomcli @sboagibm Sorry to bother you again. I am still confused about multiple learners. My understanding is that multiple learners mean distributed training, but when I run a PyTorch job with 3 learners (each learner has 1 GPU), I get 3 training results, and they run independently.
That means I am just running the same job on different servers, not doing distributed training, right? Thank you.