NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[DOC]: How do I share the semaphores amongst threads? #1464

Open mrakgr opened 7 months ago

mrakgr commented 7 months ago

Is this a duplicate?

Is this for new documentation, or an update to existing docs?

Update

Describe the incorrect/future/missing documentation

I want to use semaphores to block all the threads in the system until the current one is done printing, but it appears that the objects aren't being shared amongst threads even if they have the system scope.

Here is an example in Spiral that I am trying to get to work.

        inl lock : ref (_ _ thread_scope_system) = create_binary_semaphore 0
        if grid_group_thread_rank()+1 = grid_group_num_threads() then release lock
        acquire lock
        console.write_ln "hello"
        console.write_ln {id=grid_group_thread_rank()}
        release lock

Only the last CUDA thread executes the write statements, and the rest deadlock. This tells me that the rest of the threads do not have access to the same binary_semaphore object.

Here is the complete compiled output.

The documentation examples are very sparse regarding this. Just how are these objects supposed to be shared amongst the threads? I thought the cuda::thread_scope_system being passed as a template argument would be responsible for that, but it seems I was wrong.
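In CUDA C++ terms, the generated code amounts to something like this hand-written sketch (my reconstruction, not the actual compiler output): the semaphore is a local variable inside the kernel, so every thread constructs its own private copy, and one thread's release can never satisfy another thread's acquire.

#include <cuda/semaphore>
#include <cstdio>

__global__ void kernel() {
    // Local variable: every thread constructs its OWN semaphore,
    // so it can never synchronize threads with one another.
    cuda::binary_semaphore<cuda::thread_scope_system> lock(0);
    int rank = threadIdx.x + blockIdx.x * blockDim.x;
    if (rank + 1 == gridDim.x * blockDim.x)
        lock.release();  // only the last thread releases (its own private copy)
    lock.acquire();      // all other threads wait forever on a never-released copy
    printf("hello\n");
    printf("id=%d\n", rank);
    lock.release();
}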

If this is a correction, please provide a link to the incorrect documentation. If this is a new documentation request, please link to where you have looked.

https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/binary_semaphore.html

mrakgr commented 7 months ago

I got it to work by putting the semaphore in global scope manually. It wouldn't be too problematic to use them like this, but if this is the way to make them work, just what is the point of the thread_scope argument at all? It doesn't seem to affect anything.

jrhemstad commented 7 months ago

Hey @mrakgr, I see where you're getting confused. Let me try to help 🙂.

Ignore anything to do with the cuda::binary_semaphore type for a moment.

In the CUDA programming model, you have two main ways to share data between threads:

Global Memory

__global__ void kernel(int *data) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    data[index] = index;  // All threads can access or modify data in global memory
}

Shared Memory

__global__ void kernel() {
    __shared__ int sharedData[256];  // Declare shared memory
    int index = threadIdx.x;
    sharedData[index] = index;  // Only threads in the same block can access or modify data in shared memory
    __syncthreads();  // Wait until every thread in the block has written its element
}

Now, the trick with interthread communication is "How do you guarantee that when one thread writes something, it's visible to another thread?"

This opens several cans of worms about memory models, memory consistency, memory fences, etc. that would take a few books to fully explain 🙂.

The quick and dirty version is that when one thread writes to memory, and another thread reads from that same memory location, you need a memory fence to guarantee all the previous side effects from the first thread are visible to the second.

Memory fences come in different "strengths": weaker ones are less expensive, but offer fewer guarantees about which threads can see the side effects. Using memory fences directly is extremely complicated and error-prone, which is why there are a whole bunch of abstractions to try and hide that complexity.
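For a taste of what those abstractions look like, here's a small sketch I'll write using cuda::atomic (a real libcudacxx type; the kernel itself is made up): a release store publishes a plain write, a paired acquire load makes it visible, and the thread_scope parameter picks the fence strength.

#include <cuda/atomic>

__device__ int data;
__device__ cuda::atomic<int, cuda::thread_scope_device> flag(0);

__global__ void handoff() {
    int rank = threadIdx.x + blockIdx.x * blockDim.x;
    if (rank == 0) {
        data = 42;                                       // plain write
        flag.store(1, cuda::std::memory_order_release);  // publish it
    } else if (rank == 1) {
        // Spin until the release store is observed (assumes both threads are resident).
        while (flag.load(cuda::std::memory_order_acquire) == 0) { }
        // The acquire load synchronizes with the release store,
        // so data == 42 is guaranteed to be visible here.
    }
}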

This brings us back to your question about cuda::binary_semaphore. It is one such abstraction for simplifying inter-thread communication and the necessary memory fences.

There are two independent things at play that I believe you are conflating:

  1. Where the cuda::binary_semaphore lives, i.e., what memory it resides in and, therefore, which threads can share access to it
  2. How strong the memory fence is

The cuda::thread_scope parameter is only relevant for (2).

It controls how wide the fence is. For example, thread_scope_block only guarantees visibility of side-effects to threads within the same block, thread_scope_device to all threads in the same kernel, and thread_scope_system to all threads in the system (host and device).

So you could have a cuda::binary_semaphore<thread_scope_block> that lives in global memory, and while all threads could technically access it, you'd only be guaranteed visibility of side-effects among threads within the same block. Likewise, you could (but shouldn't) have a cuda::binary_semaphore<thread_scope_system> in shared memory. Only threads within the same block can access it, but it would use a much stronger fence than necessary (and impact performance).
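To make the first case concrete, here's a minimal sketch (mine, not from the docs) of a semaphore in global memory shared by the whole grid, where thread_scope_device only determines the fence strength:

#include <cuda/semaphore>
#include <cstdio>

// One semaphore object in global memory, shared by the entire grid.
// binary_semaphore has a constexpr constructor, so a __device__
// variable can be constant-initialized with it.
__device__ cuda::binary_semaphore<cuda::thread_scope_device> lock(1);

__global__ void kernel() {
    lock.acquire();   // one thread at a time past this point
    printf("hello from thread %d\n", threadIdx.x + blockIdx.x * blockDim.x);
    lock.release();
}

int main() {
    kernel<<<2, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}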

mrakgr commented 7 months ago

Thank you for the explanation. The doc example declares thread_scope_system variables as locals, so I was confused when I tried using them that way and they turned out to actually be local to each thread, unlike, say, grid_group cooperative groups. I'll leave this issue open since the docs need improvement.

Also, for anybody stumbling upon this: instead of using global variables to share semaphores, another way of doing that would be to use static variables, as in the sketch below.
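Something like this (an untested sketch; I'm assuming a function-local static in device code gets static storage duration like a __device__ variable, which requires constant initialization):

#include <cuda/semaphore>
#include <cstdio>

__global__ void kernel() {
    // Static storage duration: one instance for the whole grid,
    // not one per thread.
    static cuda::binary_semaphore<cuda::thread_scope_device> lock(1);
    lock.acquire();
    printf("hello from thread %d\n", threadIdx.x + blockIdx.x * blockDim.x);
    lock.release();
}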