StreamHPC / gromacs

OpenCL porting of the GROMACS molecular simulation toolkit
http://www.gromacs.org

Add support for sharing an OpenCL device between two MPI ranks. #91

Open ancahamuraru opened 9 years ago

ancahamuraru commented 9 years ago

Sharing an OpenCL device between two MPI ranks fails. The launch below will result in a crash: gmx mdrun -ntmpi 2 -gpu_id 00

The main change required for this configuration to work is moving context and program fields outside of gmx_device_info_t. They could be stored instead in gmx_nbnxn_ocl_t.
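A minimal sketch of the proposed split, assuming standard OpenCL host-API types; all field names other than context and program are illustrative, not the actual GROMACS members:

```c
#include <CL/cl.h>

/* Sketch only: gmx_device_info_t keeps read-only, shareable device data,
 * while the context/program pair moves into the per-rank gmx_nbnxn_ocl_t. */
typedef struct gmx_device_info_t
{
    cl_platform_id ocl_platform_id;
    cl_device_id   ocl_device_id;
    /* cl_context context;   <- moved out: one context per rank  */
    /* cl_program program;   <- moved out: built once per context */
} gmx_device_info_t;

typedef struct gmx_nbnxn_ocl_t
{
    gmx_device_info_t *dev_info; /* shared device description        */
    cl_context         context;  /* owned by this rank               */
    cl_program         program;  /* compiled for this rank's context */
    /* ... existing per-rank queues, buffers, kernels ... */
} gmx_nbnxn_ocl_t;
```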

pszi1ard commented 9 years ago

I am not entirely familiar with OpenCL contexts, so sorry for the noob questions - I have two of them.

This is important to know because with CUDA, context handling is automatic and multiple threads implicitly share a context. As a result, all tasks submitted from any of the thread-MPI ranks sharing a GPU (or from MPI ranks with CUDA MPS) can overlap; most importantly, this allows hiding much of the PCI-E transfer cost and mitigates the tail effects of kernels.

sharpneli commented 9 years ago

Memory objects are shared only within a context, and events used to wait for work to finish are likewise usable only within a single context. Otherwise, there is no downside to creating multiple contexts.

i) Implicitly dependent, so effectively serialized (assuming the basic default in-order command queue, with no out-of-order queues or event trickery).

ii) Everything can overlap, to the extent the hardware allows; in general it is memory transfers and computation that will overlap. One can restrict this overlap using events (see the sketch after this list).

iii) Same as above, except that no restriction using events can be made.

iv) Correct; equivalent to the above.
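To make (ii) concrete, here is a hedged sketch, not GROMACS code: two in-order queues on one device in the same context, where a transfer and an independent kernel may overlap while an event serializes only the dependent kernel. All kernel and buffer names are placeholders, and error checking is omitted:

```c
#include <CL/cl.h>

static void overlap_example(cl_context ctx, cl_device_id dev,
                            cl_kernel independent_kernel,
                            cl_kernel dependent_kernel,
                            cl_mem d_in, const void *h_in,
                            size_t nbytes, size_t gsize)
{
    cl_int err;
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Asynchronous host->device copy on q0, signalling an event when done. */
    cl_event xfer_done;
    clEnqueueWriteBuffer(q0, d_in, CL_FALSE, 0, nbytes, h_in,
                         0, NULL, &xfer_done);

    /* This kernel does not touch d_in, so it may overlap with the copy. */
    clEnqueueNDRangeKernel(q1, independent_kernel, 1, NULL, &gsize, NULL,
                           0, NULL, NULL);

    /* This kernel consumes d_in, so it waits on the event; the event can
     * be used here only because both queues share the same context. */
    clEnqueueNDRangeKernel(q1, dependent_kernel, 1, NULL, &gsize, NULL,
                           1, &xfer_done, NULL);

    clFinish(q0);
    clFinish(q1);
    clReleaseEvent(xfer_done);
    clReleaseCommandQueue(q0);
    clReleaseCommandQueue(q1);
}
```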

There is no need to create multiple contexts within a single process, assuming a single implementation is used. The only non-thread-safe API call is clSetKernelArg, and that is solved by having a separate kernel object per thread. A context is only valid within a single process, though; it cannot be shared between processes.
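A hedged sketch of that pattern, not the actual GROMACS code: the program is shared across threads, and each thread creates its own cl_kernel so the non-thread-safe clSetKernelArg only ever touches thread-local state (the kernel name and argument are placeholders):

```c
#include <CL/cl.h>

/* Shared and safe to use concurrently: the context and the program built
 * in it. Per thread: a private kernel object created on demand. */
static void enqueue_from_thread(cl_program prog, cl_command_queue queue,
                                cl_mem buf, size_t gsize)
{
    cl_int err;
    cl_kernel k = clCreateKernel(prog, "nbnxn_kernel" /* placeholder */, &err);

    /* Safe: no other thread holds a reference to this kernel object. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clReleaseKernel(k);
}
```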

If one wants to use devices from multiple platforms, then a context per platform is required. As an example, one cannot put an Intel GPU and an AMD GPU into the same context.
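A hedged sketch of what that implies for setup: one context per platform, each pinned to its platform via the CL_CONTEXT_PLATFORM property (takes the first GPU of the platform; error checking omitted):

```c
#include <CL/cl.h>

/* One context per platform: a context can only contain devices from the
 * platform it was created for. */
static cl_context context_for_platform(cl_platform_id platform)
{
    cl_device_id dev;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
    };
    cl_int err;
    return clCreateContext(props, 1, &dev, NULL, NULL, &err);
}
```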

pszi1ard commented 9 years ago

ii) Everything can overlap, to the extent the hardware allows.

What can the hardware do? Can kernels overlap (e.g. when a kernel does not fill the device) on AMD GPUs? How about Intel?

The only non-thread-safe API call is clSetKernelArg

So I assume that means there are no ugly global objects like texture references in CUDA, right?

sharpneli commented 9 years ago

AMD GPUs can theoretically overlap kernels, but I haven't checked whether they actually do so. The same goes for Intel. In graphics workloads, both of them do overlap.

There is no global state in OpenCL. Every call takes whatever it needs as a parameter.

pszi1ard commented 9 years ago

Thanks for the information. In light of all this, we should be able to achieve the same behavior and performance characteristics as with CUDA, except for the functionality that the MPS server provides.