simonbyrne opened 3 years ago
`CUDA.synchronize()` synchronizes the context, so that should do the job. `CUDA.synchronize(CuStreamPerThread())` only synchronizes the per-thread one, and I assume MPI doesn't use that stream. Can't you configure MPI to use the `CuStreamPerThread` stream?
Yeah, we need to synchronize the CUDA stream on which the async HtoD memcpy is launched. So I think the right thing to do is to call `CUDA.synchronize(CuStreamPerThread())` before MPI operations. After staring at the Open MPI code, I think they are doing it right, since they block CPU progress on the Gather until their CUDA events have finished.
> `CUDA.synchronize(CuStreamPerThread())`

Don't use the per-thread stream object specifically; just use the global one. It automatically 'translates' to the per-thread one, because we use the per-thread-specific API calls. You only want to use the per-thread stream object when interfacing with binary libraries, like CUBLAS, that do not use the per-thread-specific API versions.
Would that still work if we were calling MPI on multiple threads?
> Would that still work if we were calling MPI on multiple threads?
Depends on which operations you want to make sure have happened. If it's only about operations on that thread, then yes. If not, then you need a global synchronization: either use `synchronize()`, which synchronizes the entire context, or (IIUC) call `synchronize(CuStreamLegacy())`, which performs old-style synchronization of the global stream and will also synchronize all other streams (those not created with an explicit non-blocking flag).
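A minimal sketch of those two options, assuming CUDA.jl and MPI.jl are loaded; the buffer size, destination rank, and tag here are hypothetical:

```julia
using CUDA, MPI

MPI.Init()
comm = MPI.COMM_WORLD
data = CUDA.rand(Float32, 1024)   # hypothetical device buffer

# Option 1: synchronize the entire context, waiting for work on all streams:
CUDA.synchronize()

# Option 2 (IIUC, per the comment above): legacy-stream synchronization,
# which also waits on other streams not created as non-blocking:
# synchronize(CuStreamLegacy())

# Only after one of the above is it safe to hand the buffer to MPI:
req = MPI.Isend(data, 0, 0, comm)
```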
What I was thinking was something where each thread was creating its own `CuArray`, executing a kernel, and then passing it to MPI:

```julia
Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    synchronize(CuDefaultStream()) # what should this be?
    MPI.Isend(data, dest, tag, comm)
end
```
`CuDefaultStream()`. All other operations use it too, and in the presence of threads it will, behind the scenes, map to a per-thread stream.
Thanks. So I guess the question then is: should we call this by default whenever a `CuArray` is passed as a buffer, or leave it to the user?
Related comments in: https://github.com/JuliaParallel/MPI.jl/pull/266
I considered this when I contributed code for CUDA-aware MPI, not knowing about the default global synchronization in CUDA.jl. One of the core contributors for this package was against making stream synchronization a default.
I think the decision was that we should document it, since there wasn't a clear way to make this work consistently with minimal overhead.
Just as a quick update: the right thing to do now is `CUDA.synchronize()`, or, to be explicit, `CUDA.synchronize(CUDA.stream())`. I've fixed the tests; we probably still need a more prominent documentation example.
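For reference, a minimal sketch of that updated pattern; `kernel`, `N`, `dest`, and `tag` follow the earlier example and are hypothetical:

```julia
using CUDA, MPI

MPI.Init()
comm = MPI.COMM_WORLD

data = CuArray{Float32}(undef, N)
@cuda blocks=1 threads=64 kernel(data, N)

# Wait for the kernel (and any async copies) on the task-local stream
# before handing the buffer to MPI:
CUDA.synchronize()   # or explicitly: CUDA.synchronize(CUDA.stream())

MPI.Isend(data, dest, tag, comm)
```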
The failure on #439 is due to CUDA synchronization issues. From what I understand, the changes in https://github.com/JuliaGPU/CUDA.jl/pull/395 mean that streams are no longer globally synchronized. Since Open MPI operates its own streams, operations can potentially overlap (https://github.com/open-mpi/ompi/issues/7733).

It seems like we need to call `CUDA.synchronize()` (or perhaps `CUDA.synchronize(CuStreamPerThread())`?) before calling MPI? @vchuravy @maleadt do you have any suggestions/thoughts?