astooke / Synkhronos

Extension to Theano for multi-GPU data parallelism
MIT License

multi-node support #12

Open astooke opened 7 years ago

astooke commented 7 years ago

Starting a new issue in reference to question: (https://github.com/astooke/Synkhronos/issues/11#issuecomment-326628646)

I have not experimented with running Synkhronos multi-node; currently it's built only for single-node. Running multi-node would require another layer to coordinate and communicate among nodes. That certainly sounds possible, with a separate instance of the current Synkhronos running on each node. I haven't put a lot of thought into this yet, because my current research is well-suited to running single-node.
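To make the "another layer" idea concrete, here is a minimal sketch of a two-level reduction, assuming each node first combines the gradients of its own GPUs (the part a single-node instance already handles) and a separate coordination step then combines the per-node results. It is a pure-Python simulation; `local_reduce` and `hierarchical_mean` are hypothetical names and not part of Synkhronos, and no real inter-node communication happens here.

```python
def local_reduce(gpu_grads):
    """Sum gradients element-wise across the GPUs of one node.

    Hypothetical stand-in for the reduction a single-node instance performs.
    """
    return [sum(vals) for vals in zip(*gpu_grads)]


def hierarchical_mean(nodes):
    """Average gradients over all GPUs on all nodes in two stages.

    `nodes` is a list of nodes; each node is a list of per-GPU gradients,
    and each gradient is a list of floats of the same length.
    """
    # Stage 1: every node reduces over its own GPUs.
    node_sums = [local_reduce(gpu_grads) for gpu_grads in nodes]
    # Stage 2: the coordination layer reduces over nodes (simulated in-process).
    global_sum = [sum(vals) for vals in zip(*node_sums)]
    total_gpus = sum(len(gpu_grads) for gpu_grads in nodes)
    return [v / total_gpus for v in global_sum]


if __name__ == "__main__":
    two_nodes = [
        [[1.0, 2.0], [3.0, 4.0]],  # node 0, two GPUs
        [[5.0, 6.0], [7.0, 8.0]],  # node 1, two GPUs
    ]
    print(hierarchical_mean(two_nodes))
```

In a real setup, stage 2 would be replaced by actual network communication between the per-node instances; the sketch only shows how the two levels compose.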

Apparently the new version of NCCL, version 2, supports inter-node communication. I have not tried it yet (Synkhronos is currently built on version 1). Synkhronos uses NCCL through libgpuarray and pygpu...I'm not sure what the compatibility status is through that chain.

Note that a key to scaling well to 256 GPUs in the large minibatch ResNet paper (https://arxiv.org/pdf/1706.02677.pdf) is to start communicating gradients as they are computed layer-by-layer, simultaneously with performing the rest of the backpropagation.
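The overlap of communication with backpropagation can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the paper or Synkhronos implements it: the backward pass is faked by `fake_layer_gradient`, the all-reduce is simulated by an in-process mean (`all_reduce_mean`), and a background thread plays the role of the communication stream. All names are hypothetical.

```python
import queue
import threading


def fake_layer_gradient(worker, layer):
    # Hypothetical stand-in for one layer's backward pass on one worker.
    return [float(worker + layer)] * 3


def all_reduce_mean(per_worker_grads):
    # Simulated all-reduce: element-wise mean across workers.
    n = len(per_worker_grads)
    return [sum(vals) / n for vals in zip(*per_worker_grads)]


def backward_with_overlap(n_workers, n_layers):
    pending = queue.Queue()  # gradients waiting to be communicated
    reduced = {}             # layer index -> averaged gradient

    def communicator():
        # Runs concurrently with the backward loop below.
        while True:
            item = pending.get()
            if item is None:  # sentinel: backward pass is finished
                break
            layer, grads = item
            reduced[layer] = all_reduce_mean(grads)

    comm = threading.Thread(target=communicator)
    comm.start()
    # Backpropagation visits layers from last to first.
    for layer in reversed(range(n_layers)):
        grads = [fake_layer_gradient(w, layer) for w in range(n_workers)]
        # Hand off immediately; the loop moves on to the next layer
        # while the communicator reduces this one.
        pending.put((layer, grads))
    pending.put(None)
    comm.join()
    return reduced


if __name__ == "__main__":
    print(backward_with_overlap(n_workers=2, n_layers=3))
```

The point of the structure is that the reduction for the last layers can already be in flight while the earlier layers are still being differentiated, instead of waiting for the whole backward pass to finish.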

I'd be curious to hear if you try anything!

Have you tried any other packages / libraries for running multi-GPU? e.g. TensorFlow, PyTorch, Chainer? And how does using them compare to Synkhronos?

nouiz commented 7 years ago

For libgpuarray, the new version 0.7 (and 0.7.1, to be released today to fix a compilation crash with CUDA 8) needs NCCL2. So someone can start exploring how to use it for multi-node computation.

Note that Theano master doesn't support that version of pygpu yet. We need to merge a PR first, and we are waiting for 0.7.1 to be out for that.

Nqabz commented 7 years ago

I recently tried NCCLv2 but discovered that pygpu had not been updated to match the API changes. NVIDIA then recently fixed some GPUArray issues for compatibility with NCCLv2, and they are waiting to commit sometime soon. I'm not sure whether that will be internal or open source. Will post back once I have more information.

nouiz commented 7 years ago

Note: if you install libgpuarray and pygpu yourself from the master branch of libgpuarray, it will work.

But you will also need to install this branch of Theano, and rebase it:

https://github.com/Theano/Theano/pull/6317

This should be merged next week. (I already wrote that before, but it should be real this time :)

So maybe wait until next week; it would be simpler.

Nqabz commented 7 years ago

@nouiz Sounds great. I will await your merge next week.

Thanks!