facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0

Communication nodes: overlapping #388

Open mshiryaev opened 7 years ago

mshiryaev commented 7 years ago

Hi,

Is it possible to achieve good computation/communication overlap with the current implementation of communication nodes? Currently I see that each communication is expressed as a single node in the graph. How can completion of a communication be postponed until the moment its result is actually required?

slayton58 commented 7 years ago

Yes, it is possible - caffe2 graphs are generally traversed in parallel using multiple threads. This means that communication can potentially start as soon as the gradients have been calculated on all devices, and when the communication has finished, any operators that rely on the result can start.
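
Roughly, the idea looks like the following minimal sketch (illustrative only, not Caffe2's actual executor code): each op tracks how many of its inputs are still pending, and a worker thread runs it as soon as that count hits zero. A communication op is just another node, so it can start the moment its gradient is produced, and only its consumers wait for it.

```cpp
// Minimal sketch of dependency-driven graph execution (illustrative only).
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

struct OpNode {
  std::function<void()> run;            // compute or communicate
  std::atomic<int> pending_inputs{0};   // producers that haven't finished yet
  std::vector<OpNode*> consumers;       // ops that need this op's output
};

void Execute(OpNode* op) {
  op->run();
  for (OpNode* next : op->consumers) {
    // Last producer just finished: the consumer is now runnable.
    if (next->pending_inputs.fetch_sub(1) == 1) {
      std::thread(Execute, next).detach();  // hand off to another worker
    }
  }
}
```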

mshiryaev commented 7 years ago

Simon Layton, thanks for the reply.

Does this mean that Caffe2 builds many sub-graphs for operations that can be executed in parallel with communications, or is it still one big graph with dependencies?

And how will Caffe2 handle the case where there are multiple communications (for example, one per layer during back-propagation)? Will multiple threads be used for these communications (i.e. will multiple cores be busy waiting on communications)?

slayton58 commented 7 years ago

It's just one big graph with dependencies.

Multiple threads can be used for multiple communications in back-prop (although remember that in general one hopes that communication time < computation time, so that we don't have many communications outstanding at any given time). NCCL calls on multiple GPUs are explicitly serialised, though, as each call is designed to achieve maximum bandwidth, and forcing calls to share precious resources, whether PCIe or NVLink, can slow things down overall.
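
To make the serialisation point concrete, a hypothetical guard like the following (not Caffe2's actual mechanism) captures the idea: only one collective owns the interconnect at a time.

```cpp
// Hypothetical sketch: a global mutex serialises NCCL launches across worker
// threads so concurrent collectives don't compete for PCIe/NVLink bandwidth.
#include <mutex>
#include <cuda_runtime.h>
#include <nccl.h>

std::mutex nccl_mutex;  // illustrative global guard

void AllreduceGradients(float* buf, size_t count,
                        ncclComm_t comm, cudaStream_t stream) {
  std::lock_guard<std::mutex> guard(nccl_mutex);
  // In-place sum across all GPUs in the communicator (enqueued on `stream`).
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
  // Hold the lock for the whole transfer so no other collective overlaps it.
  cudaStreamSynchronize(stream);
}
```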

mshiryaev commented 7 years ago

Do I understand correctly that, to get good computation/communication overlap with NCCL, the user should queue the communication operation and the computations that depend on it on one CUDA stream, and all remaining operations (which don't depend on the communication) on another CUDA stream?

I am trying to figure out why Caffe2 uses only one graph node to express a communication (i.e. only blocking MPI collectives). For GPUs, where computations and communications are offloaded as kernels and can be overlapped using multiple CUDA streams, this is enough. But what about the CPU? A blocking collective will nullify all potential overlap.
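
To illustrate what I mean by GPU-side overlap, here is a hypothetical two-stream sketch (not Caffe2 code): the collective and the ops that consume its result sit on one stream, while independent compute stays on another.

```cpp
// Hypothetical two-stream sketch of GPU-side overlap (not Caffe2 code).
#include <cuda_runtime.h>
#include <nccl.h>

void OverlappedStep(float* grads, size_t n, ncclComm_t comm,
                    cudaStream_t comm_stream, cudaStream_t compute_stream) {
  // Enqueue the gradient allreduce; the call returns immediately and the
  // transfer runs asynchronously on comm_stream.
  ncclAllReduce(grads, grads, n, ncclFloat, ncclSum, comm, comm_stream);

  // Independent work (e.g. back-prop of earlier layers) keeps running on
  // compute_stream and overlaps with the transfer:
  // independent_kernel<<<blocks, threads, 0, compute_stream>>>(...);

  // Only work that consumes the reduced gradients waits for comm_stream.
  cudaStreamSynchronize(comm_stream);
  // weight_update_kernel<<<blocks, threads, 0, compute_stream>>>(grads, ...);
}
```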

slayton58 commented 7 years ago

That happens automatically - each thread consuming the computational graph ("worker threads" from now on) runs asynchronously, and each worker thread has its own cuDNN handle, cuBLAS handle, CUDA stream, etc. in order to ensure that GPU work can also run in parallel with work dispatched from other worker threads.

So in the case of the blocking MPI comms, they will block only the thread that calls them. All other threads traversing the graph will be unaffected.
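
Concretely, each worker owns something like the following context (a rough sketch, not Caffe2's actual classes): GPU work it enqueues is ordered only against itself, and a blocking call stalls only that worker.

```cpp
// Rough per-worker GPU context sketch (illustrative, not Caffe2's classes).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cudnn.h>

struct WorkerContext {
  cudaStream_t stream;
  cublasHandle_t cublas;
  cudnnHandle_t cudnn;

  WorkerContext() {
    cudaStreamCreate(&stream);
    cublasCreate(&cublas);
    cudnnCreate(&cudnn);
    // Bind both library handles to this worker's stream so its GPU work is
    // ordered only with respect to itself, not with other workers.
    cublasSetStream(cublas, stream);
    cudnnSetStream(cudnn, stream);
  }

  ~WorkerContext() {
    cudnnDestroy(cudnn);
    cublasDestroy(cublas);
    cudaStreamDestroy(stream);
  }
};

// Hypothetical usage: one context per worker thread.
thread_local WorkerContext worker_ctx;
```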

mshiryaev commented 7 years ago

I understand that approach. I just want to note that current MPI implementations have limited multi-threading support (maybe you use one that has good MT support). In a typical implementation a blocking collective takes a global lock and holds it until the communication completes, which is why another thread will not be able to start its own communication. Of course, it is possible that the first thread releases the lock after some spinning to let the second thread start its communication, but that logic adds extra partial serialization.
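
A quick way to see what a given MPI library actually provides is a minimal check like this (not Caffe2 code):

```cpp
// Minimal check of the MPI threading level actually provided at runtime.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
    // Concurrent blocking collectives from different threads are not allowed;
    // even with MPI_THREAD_MULTIPLE many implementations serialise them
    // internally on a global lock.
    std::printf("MPI provides thread level %d, below MPI_THREAD_MULTIPLE\n",
                provided);
  }
  MPI_Finalize();
  return 0;
}
```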

As far as I understand, for the GPU case this is not a problem, because operations are queued by different threads to different streams and all scheduling happens on the GPU side (and the invocation of a NCCL collective is non-blocking; it just enqueues work on a stream). But for the CPU case this approach limits performance: non-blocking MPI collectives can give a performance gain, and to use them a communication should be expressed as two nodes.
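
What I mean by two nodes, sketched with MPI_Iallreduce (MPI 3.0+) and hypothetical helper names: a "start" node initiates the allreduce and returns, intermediate compute runs, and only a "wait" node placed right before the first consumer blocks on completion.

```cpp
// Sketch of a communication split into two graph nodes (hypothetical helpers).
#include <mpi.h>

// Node 1: start the allreduce on this layer's gradients and return at once.
MPI_Request StartAllreduce(float* grads, int count) {
  MPI_Request req;
  MPI_Iallreduce(MPI_IN_PLACE, grads, count, MPI_FLOAT, MPI_SUM,
                 MPI_COMM_WORLD, &req);
  return req;
}

// Node 2: scheduled just before the weight update that consumes the result.
void WaitAllreduce(MPI_Request req) {
  MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```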

I raised this topic because Caffe1 (I mean Intel's fork of Caffe) has multi-node support over non-blocking communications, and I expect Caffe2 to be the same or better from a multi-node performance point of view. But the current Caffe2 design with blocking collectives looks like a step back.

slayton58 commented 7 years ago

Ok, I think I understand your issue now. It would be easy to add support for the non-blocking collectives (MPI_Iallreduce presumably), but I would suspect that having multiple collectives operating over the same network at the same time would either not give a benefit or give an overall negative effect on scaling due to adding contention to network links.

My experience here is with NCCL, which is designed to max out communication links available to it, so trying to run multiple NCCL calls in parallel was pointless, as they just competed for the shared resources.

Also worth noting that C2 uses the gloo collectives library for multi-node, which I'm not too familiar with: /cc @pietern