JuliaParallel / MPI.jl

MPI wrappers for Julia
https://juliaparallel.org/MPI.jl/
The Unlicense

CUDA synchronization #443

Open simonbyrne opened 3 years ago

simonbyrne commented 3 years ago

The failure on #439 is due to CUDA synchronization issues. From what I understand, the changes in https://github.com/JuliaGPU/CUDA.jl/pull/395 mean that streams are no longer globally synchronized. Since Open MPI operates its own streams, this means that operations can potentially overlap (https://github.com/open-mpi/ompi/issues/7733).

It seems like we need to call CUDA.synchronize() (or perhaps CUDA.synchronize(CuStreamPerThread())?) before calling MPI?

@vchuravy @maleadt do you have any suggestions/thoughts?

maleadt commented 3 years ago

CUDA.synchronize() synchronizes the context, so that should do the job. CUDA.synchronize(CuStreamPerThread()) only synchronizes the per-thread one, and I assume MPI doesn't use that stream. Can't you configure MPI to use the CuStreamPerThread stream?

vchuravy commented 3 years ago

Yeah, we need to synchronize the CUDA stream on which the async memcpy HtoD is launched. So I think the right thing to do is to call CUDA.synchronize(CuStreamPerThread()) before MPI ops. After staring at the Open MPI code, I think they are doing it right on their side, since they block CPU progress on the Gather until their CUDA events have finished.

maleadt commented 3 years ago

CUDA.synchronize(CuStreamPerThread())

Don't use the per-thread stream specifically, just use the global one. It automatically 'translates' to the per-thread one, because we use per-thread-specific API calls. You only want to use the per-thread stream object when interfacing with binary libraries, like CUBLAS, that do not use the per-thread-specific API versions.

simonbyrne commented 3 years ago

Would that still work if we were calling MPI on multiple threads?

maleadt commented 3 years ago

Would that still work if we were calling MPI on multiple threads?

Depends on which operations you want to make sure have happened. If it's only about operations on that thread, then yes. If not, then you need a global synchronization. Either use synchronize(), which synchronizes the entire context, or (IIUC) call synchronize(CuStreamLegacy()) which performs old-style stream synchronization of the global stream, which will synchronize all other streams (that aren't created with an explicit NONBLOCK flag).
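The distinction might be sketched as follows (a sketch only, assuming the CUDA.jl API of the time, where CuStreamLegacy() was the exported handle for the legacy global stream):

```julia
using CUDA

# Per-thread / context synchronization: waits for work queued by this
# thread's stream (and, via per-thread API calls, the whole context).
synchronize()

# Old-style global synchronization: the legacy default stream also waits
# for all other blocking streams (those not created as non-blocking).
synchronize(CuStreamLegacy())
```

Which one is appropriate depends, as noted above, on whether the device work you need to wait for was launched by the current thread or by others.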

simonbyrne commented 3 years ago

What I was thinking was something where each thread was creating its own CuArray, executing a kernel and then passing it to MPI:

Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    synchronize(CuDefaultStream()) # what should this be?
    MPI.Isend(data, dest, tag, comm)
end 
maleadt commented 3 years ago

Use CuDefaultStream(). All other operations use it too, and in the presence of threads it will, behind the scenes, map to a per-thread stream.
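Applied to the loop from the previous comment, that would look like this (a sketch with the same assumptions: `kernel`, `N`, `dest`, `tag`, and `comm` are defined as in that snippet, and the MPI build is CUDA-aware):

```julia
using CUDA, MPI

Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    synchronize(CuDefaultStream())   # maps to this thread's stream behind the scenes
    MPI.Isend(data, dest, tag, comm) # device work on `data` is now complete
end
```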

simonbyrne commented 3 years ago

Thanks. So I guess the question then is should we call this by default whenever a CuArray is passed as a buffer? Or leave it to the user?

kose-y commented 3 years ago

Related comments in: https://github.com/JuliaParallel/MPI.jl/pull/266

I considered this when I contributed code for CUDA-aware MPI, not knowing about the default global synchronization in CUDA.jl. One of the core contributors for this package was against making stream synchronization a default.

simonbyrne commented 3 years ago

I think the decision was that we should document it, since there wasn't a clear way to make this work consistently with minimal overhead.

vchuravy commented 2 years ago

Just as a quick update: the right thing to do now is CUDA.synchronize(), or, to be explicit, CUDA.synchronize(CUDA.stream()).
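As a sketch of that recommendation in context (assuming a CUDA-aware MPI build and current CUDA.jl, where CUDA.stream() returns the task-local stream; the buffer and root rank here are illustrative):

```julia
using CUDA, MPI

MPI.Init()
comm = MPI.COMM_WORLD

data = CUDA.rand(Float32, 1024)   # device buffer; kernels writing to it
                                  # run on the task-local stream

CUDA.synchronize()                # or CUDA.synchronize(CUDA.stream())
MPI.Bcast!(data, 0, comm)         # safe: pending device work has completed
```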

simonbyrne commented 2 years ago

I've fixed the tests; we probably still need a more prominent documentation example.