EPCCed / gpu-directives

Contains material for a course using GPU directives

asynchronous offload #7

Open lucaparisi91 opened 2 months ago

lucaparisi91 commented 2 months ago
| Compiler | Support | Notes |
| --- | --- | --- |
| nvhpc 24.5 | Yes | No overlapping with data transfers |
| clang 18.1.8 | Yes | |
| cce 16.0.1 | Yes | Requires multiple CPU threads for concurrent launches |
lucaparisi91 commented 2 months ago

Resources:

lucaparisi91 commented 1 month ago

Async Map

#pragma omp target enter data map(to: a[:N]) nowait
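
As a minimal sketch of how an asynchronous map can be chained with a kernel and a copy back, the example below orders three deferred target tasks through `depend` clauses. The function name `scale_async` and the choice of `a[0]` as the dependence object are assumptions made for illustration; they do not come from the course material. Without OpenMP or without an available device the pragmas fall back to the host and the result is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: multiplies every element of v by `factor`,
// using nowait target tasks ordered by dependences on a[0].
std::vector<double> scale_async(std::vector<double> v, double factor) {
    assert(!v.empty());
    double *a = v.data();
    const std::size_t n = v.size();

    // Host-to-device copy, issued as a deferred target task.
    #pragma omp target enter data map(to: a[:n]) nowait depend(out: a[0])

    // Kernel: starts once the copy above has completed.
    #pragma omp target teams distribute parallel for nowait depend(inout: a[0])
    for (std::size_t i = 0; i < n; ++i) {
        a[i] *= factor;
    }

    // Device-to-host copy: starts once the kernel has completed.
    #pragma omp target exit data map(from: a[:n]) nowait depend(in: a[0])

    // The host thread is free to do other work here; wait before using v.
    #pragma omp taskwait
    return v;
}
```

Calling `scale_async({1.0, 2.0, 3.0}, 2.0)` should return a vector holding 2, 4 and 6. Whether the transfers actually run asynchronously depends on the compiler and runtime.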

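In OpenMP 5.0 a task can be detached with the detach clause: the task does not complete when its body finishes, but only once the associated event has also been fulfilled. Completion must be signalled through the API by calling omp_fulfill_event.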

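#include <cstdint>
#include <hip/hip_runtime.h>
#include <omp.h>

// Stream callback: HIP runs it once all preceding work in the stream is done.
void callback(hipStream_t stream, hipError_t status, void *cb_data) {
    // Fulfilling the event allows the detached task B to complete.
    omp_fulfill_event((omp_event_handle_t)(uintptr_t)cb_data);
}

omp_event_handle_t event;

#pragma omp task // task A
{
}
#pragma omp task detach(event) // task B
{
    do_stuff();
    // event is firstprivate inside the task, so pass the handle by value.
    hipStreamAddCallback(stream, callback, (void *)(uintptr_t)event, 0);
}
#pragma omp taskwait // returns only after the event has been fulfilled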
lucaparisi91 commented 1 month ago

The main_single_transfer.cpp file demonstrates combining tasking with OpenMP offload. It runs in parallel with both nvhpc 24.5 and clang 18.1.8. Below is a screenshot for NVIDIA nvhpc 24.5.
image

lucaparisi91 commented 1 month ago

The main_multiple_transfer.cpp file demonstrates overlapping computation and memory transfer. This partially works with clang 18.1.8. See the image below. Each task creates a kernel in a new GPU thread.
image

The NVIDIA compiler also creates different streams. However, there does not seem to be any overlap between the running kernels, or between the kernels and the memory transfer, as shown in the screenshot below.
image
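
For reference, one common way to give the runtime a chance to overlap transfers with computation is to split the array into independent chunks, each handled by its own `nowait` target task. The sketch below illustrates that general pattern only; it is not the code in main_multiple_transfer.cpp, and the name `add_one_chunked` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds 1 to every element of v, one target task per chunk.
// Each task maps only its own chunk, so the copy of one chunk may overlap
// with the kernel of another if the runtime supports it.
std::vector<float> add_one_chunked(std::vector<float> v, std::size_t nchunks) {
    assert(nchunks > 0);
    float *a = v.data();
    const std::size_t n = v.size();
    const std::size_t chunk = (n + nchunks - 1) / nchunks;  // ceiling division

    for (std::size_t start = 0; start < n; start += chunk) {
        const std::size_t len = std::min(chunk, n - start);

        // Copy in, compute, and copy out for this chunk as one deferred task.
        #pragma omp target teams distribute parallel for nowait \
            map(tofrom: a[start:len])
        for (std::size_t i = start; i < start + len; ++i) {
            a[i] += 1.0f;
        }
    }

    // Wait for all chunk tasks before returning the data.
    #pragma omp taskwait
    return v;
}
```

The chunks are independent, so no `depend` clauses are needed. Whether any overlap is actually observed depends on the compiler, as the screenshots in this thread show.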

lucaparisi91 commented 4 weeks ago

On A2, with cce 16.0.1, the kernels in main_single_transfer.cpp run in serial. However, they do run concurrently when launched from multiple CPU threads. Below is a screenshot for 4 threads with main_single_transfer_multiple_threads.cpp. The number of threads is equal to the number of compilers.

image
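
A minimal sketch of launching kernels from several CPU threads is given below, assuming one host thread per kernel. It is not the content of main_single_transfer_multiple_threads.cpp; the function `fill_rows` and its arguments are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: launches `nkernels` target regions, one per host thread.
// Kernel k writes the value k + 1 into its own row of length `row_len`.
std::vector<int> fill_rows(int nkernels, std::size_t row_len) {
    assert(nkernels > 0);
    std::vector<int> data(nkernels * row_len, 0);
    int *d = data.data();

    // One loop iteration per host thread, so the launches can be concurrent.
    #pragma omp parallel for num_threads(nkernels) schedule(static, 1)
    for (int k = 0; k < nkernels; ++k) {
        int *row = d + k * row_len;

        // Synchronous target region: it blocks only the launching thread.
        #pragma omp target teams distribute parallel for map(tofrom: row[:row_len])
        for (std::size_t i = 0; i < row_len; ++i) {
            row[i] = k + 1;
        }
    }
    return data;
}
```

For example, `fill_rows(4, 8)` uses 4 host threads and returns 32 values, with row k holding k + 1.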