IBM / FfDL

Fabric for Deep Learning (FfDL, pronounced fiddle) is a Deep Learning Platform offering TensorFlow, Caffe, PyTorch etc. as a Service on Kubernetes
https://developer.ibm.com/code/patterns/deploy-and-use-a-multi-framework-deep-learning-platform-on-kubernetes/
Apache License 2.0

distributed training questions #168

Closed (Eric-Zhang1990 closed this issue 5 years ago)

Eric-Zhang1990 commented 5 years ago

@Tomcli @sboagibm Sorry to bother you. I am still confused about multiple learners. [screenshot] My understanding is that multiple learners mean distributed training, but when I run a PyTorch job with 3 learners (each learner has 1 GPU), I get 3 training results, and the learners run independently. [screenshot]

This means I am just running the same job on different servers, not doing distributed training, right? Thank you.

Tomcli commented 5 years ago

Hi @Eric-Zhang1990, we have an example of how to run distributed PyTorch training on FfDL: https://github.com/IBM/FfDL/blob/master/etc/examples/c10d-native-parallelism/model-files/train_dist_parallel.py#L187-L227

On FfDL, all the learners share a working directory under the /job/ path. We use that path to discover all the learner containers' IPs and connect them with the 'gloo'/'nccl'/'mpi' backends. Then during model training, we average the gradients at the end of each batch in our example.
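For reference, here is a minimal sketch (not the FfDL example itself) of the two pieces described above: rendezvous through a shared file on the common volume, and per-batch gradient averaging. The `LEARNER_ID`/`NUM_LEARNERS` environment variables and the `sharedfile` name are illustrative assumptions; see the linked `train_dist_parallel.py` for the actual implementation.

```python
import os
import torch.distributed as dist

def init_learners(shared_dir="/job", backend="gloo"):
    # Assumption for illustration: each learner reads its 1-based ID and the
    # learner count from env vars, then rendezvous via a file on the shared
    # /job/ volume that all learner containers mount.
    rank = int(os.environ.get("LEARNER_ID", "1")) - 1
    world_size = int(os.environ.get("NUM_LEARNERS", "1"))
    dist.init_process_group(
        backend=backend,
        init_method=f"file://{shared_dir}/sharedfile",  # file-based rendezvous
        rank=rank,
        world_size=world_size,
    )
    return rank, world_size

def average_gradients(model, world_size):
    # Sum each parameter's gradient across all learners, then divide by the
    # learner count so every learner applies the same averaged update.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```

In the training loop, `average_gradients(model, world_size)` would be called after `loss.backward()` and before `optimizer.step()`, which is what makes the learners train one shared model rather than three independent copies.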

Eric-Zhang1990 commented 5 years ago

@Tomcli Thank you for your kind reply.