baidu-research / baidu-allreduce


Asynchronous allreduce? #3

Closed hiyijian closed 7 years ago

hiyijian commented 7 years ago

Hi Baidu research team, is it possible to build an asynchronous allreduce on top of this project? I think it is quite important when we integrate allreduce into a deep learning framework such as Caffe. Could you shed some light on it?

Thanks

shubho commented 7 years ago

What exactly is an asynchronous allreduce?

hiyijian commented 7 years ago

Sorry, let me make it clearer.

A synchronous allreduce is like MPI_Allreduce(...): it does not return until the communication has actually finished.

An asynchronous allreduce is like MPI_Iallreduce(..., MPI_Request *request): it returns immediately, and we can wait for it to finish wherever we want. In a deep learning framework this gives an opportunity to overlap computation and communication, as the Baidu SVAIL blog has already mentioned.
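For example, a minimal sketch of the two calls (assuming MPI is already initialized and the buffers live on the host):

```cpp
#include <mpi.h>

void sum_gradients(float* sendbuf, float* recvbuf, int count) {
    // Synchronous: blocks until the reduction is complete on this rank.
    MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // Asynchronous: returns immediately; the reduction proceeds in the background.
    MPI_Request request;
    MPI_Iallreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM,
                   MPI_COMM_WORLD, &request);

    // ... overlap computation (e.g. backprop of earlier layers) here ...

    // Block only at the point where the reduced values are actually needed.
    MPI_Wait(&request, MPI_STATUS_IGNORE);
}
```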

Thanks

shubho commented 7 years ago

The best way to do it is at the framework level - so allreduce is done on a separate thread and the thread that requires the output of allreduce can wait on the allreduce thread while all other threads can continue executing. This is how we do it in our internal framework.

This mechanism varies from framework to framework, so it should be done by the framework authors.

Alternatively you can wrap this up in a std::async and then wait on the std::future to get the same functionality.
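For example, a minimal sketch of the std::async approach (ring_allreduce is a hypothetical stand-in for whatever allreduce call you use from this library):

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical wrapper around the library's allreduce; the name and signature
// are placeholders, not part of baidu-allreduce.
void ring_allreduce(float* data, std::size_t count);

void train_step(std::vector<float>& gradients) {
    // Launch the allreduce on its own thread; std::async returns immediately.
    std::future<void> pending = std::async(std::launch::async, [&gradients] {
        ring_allreduce(gradients.data(), gradients.size());
    });

    // ... continue computing while the reduction runs ...

    // Wait on the future only when the reduced gradients are needed.
    pending.get();
}
```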

hiyijian commented 7 years ago

Thank you for the advice. I will give it a try.

hiyijian commented 7 years ago

Hi, I ran into some trouble when doing the allreduce in a separate thread in GPU mode.

We do the computation in the main thread and the allreduce in another thread during the backward pass. It sounds perfect. However, in practice all device memory is allocated in the CUDA context of the main thread, and that context is bound to the main thread only. Of course we can easily pass the data pointer to the separate thread, but the pointer makes no sense in that thread, since that thread's CUDA context is DIFFERENT from the main thread's. It causes the following error:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x130a620608
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

Do you have any idea how to avoid this?

Thanks a lot

gibiansky commented 7 years ago

In our tensorflow-allreduce patch, we end up having a single background MPI thread. That MPI thread establishes the CUDA context and then runs a loop that checks a queue for allreduce requests. In order to do an allreduce, a thread can put something into the "allreduce queue" and wait for the background MPI thread to get to it. You then need some mechanism to continue -- in our patch, we have each "allreduce request" come with the thing to reduce and also with a callback to call after the reduction is complete.

Our logic is coupled with TensorFlow a fair amount, but you can look at the MPI Background Thread loop here.

Is there a similar approach that could work for your application? The key is to make sure that a single thread establishes the GPU context, calls MPI_Init, and does all future MPI communication going forward.
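A condensed sketch of that pattern (this is not the actual tensorflow-allreduce code; the types and names below are illustrative, and the MPI/CUDA calls are left as comments):

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>

// One pending reduction plus the callback to run when it finishes.
struct AllreduceRequest {
    float* data;                    // buffer to reduce in place
    std::size_t count;              // number of elements
    std::function<void()> on_done;  // invoked once the reduction completes
};

std::queue<AllreduceRequest> requests;
std::mutex mu;
std::condition_variable cv;
bool shutting_down = false;

// The ONLY thread that initializes MPI / the CUDA context and issues
// communication calls.
void mpi_background_loop() {
    // MPI_Init and CUDA context setup would happen here, exactly once.
    while (true) {
        AllreduceRequest req;
        {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, [] { return !requests.empty() || shutting_down; });
            if (shutting_down && requests.empty()) return;
            req = std::move(requests.front());
            requests.pop();
        }
        // Perform the allreduce on req.data / req.count here,
        // then signal the requester.
        req.on_done();
    }
}

// Called from any compute thread: enqueue a request and return immediately.
void enqueue_allreduce(float* data, std::size_t count,
                       std::function<void()> on_done) {
    {
        std::lock_guard<std::mutex> lock(mu);
        requests.push({data, count, std::move(on_done)});
    }
    cv.notify_one();
}
```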

hiyijian commented 7 years ago

Thanks @gibiansky. We did the job in a very similar way to yours. I finally solved the CUDA context problem via the CUDA v8.0 cuCtxGetCurrent/cuCtxSetCurrent calls. Specifically, I call cuCtxGetCurrent to get the context handle in the main compute thread, and then call cuCtxSetCurrent to set the same context handle in the communication thread.
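Roughly, the fix looks like this (a sketch using the CUDA driver API, with error handling omitted):

```cpp
#include <cuda.h>

CUcontext shared_ctx = nullptr;

// Main compute thread: capture the context that owns the device allocations.
void capture_context() {
    cuCtxGetCurrent(&shared_ctx);
}

// Communication thread: adopt that same context before touching the device
// pointers.
void adopt_context() {
    cuCtxSetCurrent(shared_ctx);
}
```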

It works like a charm ~

Thanks a lot