Open mrakgr opened 7 months ago
I got it to work by putting the semaphore in global scope manually. It wouldn't be too problematic to use them like this, but if this is the way to make them work, just what is the point of the `thread_scope` arguments at all? They don't seem to be affecting anything.
Hey @mrakgr, I see where you're getting confused. Let me try to help 🙂.

Ignore anything to do with the `cuda::binary_semaphore` type for a moment.
In the CUDA programming model, you have two main ways to share data between threads:
**Global memory** — visible to every thread in every block:

```cuda
__global__ void kernel(int *data) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    data[index] = index; // All threads can access or modify data in global memory
}
```

**Shared memory** — visible only to threads within the same block:

```cuda
__global__ void kernel() {
    __shared__ int sharedData[256]; // Declare shared memory
    int index = threadIdx.x;
    sharedData[index] = index; // Only threads in the same block can access or modify data in shared memory
    __syncthreads();
}
```
Now, the trick with interthread communication is "How do you guarantee that when one thread writes something, it's visible to another thread?"
This opens several cans of worms about memory models, memory consistency, memory fences, etc., that would take a few books to fully explain 🙂.
The quick and dirty version is that when one thread writes to memory, and another thread reads from that same memory location, you need a memory fence to guarantee all the previous side effects from the first thread are visible to the second.
Memory fences come in different "strengths": weaker ones are less expensive, but make fewer guarantees about which threads can see the side effects. Using memory fences directly is extremely complicated and error-prone, which is why there are a whole bunch of abstractions that try to hide that complexity.
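To make the fence discussion concrete, here is a minimal producer/consumer sketch (not from the original thread) using libcu++'s `cuda::atomic`, whose release/acquire operations supply exactly the kind of fence described above. The names `result` and `ready` are illustrative:

```cuda
#include <cuda/atomic>

__device__ int result;
__device__ cuda::atomic<int, cuda::thread_scope_device> ready{0};

__global__ void producer_consumer() {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        result = 42;                                 // side effect to publish
        ready.store(1, cuda::memory_order_release);  // fence: makes `result` visible to acquirers
    } else if (threadIdx.x == 0) {
        // Spin until the flag is set; the acquire fence guarantees that once
        // we observe ready == 1, the write to `result` is also visible.
        while (ready.load(cuda::memory_order_acquire) == 0) { }
        // safe to read `result` here
    }
}
```

Note that spinning across blocks like this assumes all blocks are co-resident on the GPU; it is only a sketch of the fence semantics, not a robust synchronization pattern.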
This brings us back to your question about `cuda::binary_semaphore`. It is one such abstraction for simplifying inter-thread communication and the necessary memory fences.

There are two independent things at play that I believe you are conflating:

1. Where the `cuda::binary_semaphore` lives, aka, what memory it resides in and therefore what threads can share access to it.
2. How strong the memory fence it uses for synchronization is.

The `cuda::thread_scope` parameter is only relevant for (2).
It controls how wide the fence is. For example, `thread_scope_block` only guarantees visibility of side-effects to threads within the same block; `thread_scope_device`, to all threads in the same kernel; `thread_scope_system`, to all threads in the system (host and device).
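The same three widths appear directly in libcu++'s extended API `cuda::atomic_thread_fence`, which takes the scope as an explicit argument, so you can see them side by side:

```cuda
#include <cuda/atomic>

__device__ void fence_examples() {
    // Orders memory operations with respect to threads in the same block.
    cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_block);
    // ...with respect to all threads in the same kernel launch.
    cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_device);
    // ...with respect to all threads in the system, including the host.
    cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_system);
}
```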
So you could have a `cuda::binary_semaphore<thread_scope_block>` that lives in global memory, and while all threads could technically access it, you'd only be guaranteed visibility of side-effects among threads within the same block. Likewise, you could (but shouldn't) have a `cuda::binary_semaphore<thread_scope_system>` in shared memory: only threads within the same block can access it, but it would use a much stronger fence than necessary (and impact performance).
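The two combinations described above can be sketched as follows. This is illustrative only; the variable names are made up, and the placement-new initialization of the shared-memory semaphore is one common way to construct an object in `__shared__` storage:

```cuda
#include <new>
#include <cuda/semaphore>

// Lives in global memory: every thread in the grid can reach it, but the
// block scope only guarantees visibility of side effects within one block.
__device__ cuda::binary_semaphore<cuda::thread_scope_block> block_scoped_global{1};

__global__ void kernel() {
    // Lives in shared memory: only threads in this block can reach it, yet
    // the system scope pays for a far stronger fence than is ever needed.
    __shared__ cuda::binary_semaphore<cuda::thread_scope_system> overly_strong;
    if (threadIdx.x == 0)
        new (&overly_strong) cuda::binary_semaphore<cuda::thread_scope_system>(1);
    __syncthreads();
    // ... use overly_strong.acquire() / overly_strong.release() ...
}
```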
Thank you for the explanation. I saw `thread_scope_system` variables being declared locally in the doc example, so I was confused when I tried using them and they turned out to actually be local, unlike, say, `grid_group` cooperative groups. I'll leave this issue open since the docs need improvement.
Also, for anybody stumbling upon this, instead of using global variables to share semaphores, another way of doing that would be to use `static` variables.
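For reference, the global-variable approach mentioned in this thread might look like the sketch below (identifiers are illustrative). Because the semaphore is a single `__device__` object in global memory, every thread operates on the same instance, and `thread_scope_device` makes its side effects visible across the whole kernel:

```cuda
#include <cstdio>
#include <cuda/semaphore>

// One semaphore instance shared by the entire grid.
__device__ cuda::binary_semaphore<cuda::thread_scope_device> print_lock{1};

__global__ void serialized_print() {
    print_lock.acquire();   // only one thread in the grid holds it at a time
    printf("thread %d in block %d\n", threadIdx.x, blockIdx.x);
    print_lock.release();
}
```

As with any cross-block blocking, this assumes the blocks holding and waiting on the semaphore can make progress concurrently.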
Is this a duplicate?
Is this for new documentation, or an update to existing docs?
Update
Describe the incorrect/future/missing documentation
I want to use semaphores to block all the threads in the system until the current one is done printing, but it appears that the objects aren't being shared amongst threads even if they have the system scope.
Here is an example in Spiral that I am trying to get to work.
Only the last CUDA thread executes the write statements, and the rest deadlock. This tells me that the rest of the threads do not have access to the same `binary_semaphore` object. Here is the complete compiled output.
The documentation examples are very sparse regarding this. Just how are these objects supposed to be shared amongst the threads? I thought the `cuda::thread_scope_system` being passed as a template argument would be responsible for that, but it seems I was wrong.

If this is a correction, please provide a link to the incorrect documentation. If this is a new documentation request, please link to where you have looked.
https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/binary_semaphore.html