Open lucaparisi91 opened 2 months ago
Compiler | Support | Notes |
---|---|---|
nvhpc 24.5 | Yes | No overlapping with data transfers |
clang 18.1.8 | Yes | |
cce 16.0.1 | Yes | Requires multiple CPU threads for concurrent launches |
Async Map
#pragma omp target enter data map(to: a[:N]) nowait
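A minimal sketch of how an asynchronous map might be used, assuming the transfer is issued with `target enter data ... nowait` and synchronised with `taskwait`. The function name `scale_async` and its arguments are hypothetical and are not taken from the repository files.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: doubles `a` on the device and returns the sum of `b`,
// which the host computes while `a` is (possibly) still being transferred.
double scale_async(std::vector<double>& a_vec, const std::vector<double>& b_vec) {
    double* a = a_vec.data();
    const int n = static_cast<int>(a_vec.size());
    assert(n > 0);

    // Deferred target task: the host thread does not wait for the copy.
    #pragma omp target enter data map(to: a[:n]) nowait

    // Independent host work that can proceed while the copy is in flight.
    double host_sum = 0.0;
    for (double x : b_vec) host_sum += x;

    // Wait for the transfer before the data is used on the device.
    #pragma omp taskwait

    // a[:n] is already present on the device, so this map triggers no further copy.
    #pragma omp target teams distribute parallel for map(tofrom: a[:n])
    for (int i = 0; i < n; ++i) a[i] *= 2.0;

    // Copy the result back and release the device buffer.
    #pragma omp target exit data map(from: a[:n])
    return host_sum;
}
```

Whether the copy actually overlaps with the host loop depends on the compiler and runtime; the code only permits the overlap.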
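Since OpenMP 5.0 a task can be detached with the detach clause. A detached task completes only once its body has finished and the associated event has been fulfilled, so the completion must be signalled through the API by calling omp_fulfill_event.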
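// Called by the HIP runtime once all work queued on the stream so far has finished.
void callback(hipStream_t stream, hipError_t status, void *cb_data) {
// Fulfilling the event allows the detached task B to complete.
omp_fulfill_event(*(omp_event_handle_t *) cb_data);
}
omp_event_handle_t event;
#pragma omp task // task A
{
}
#pragma omp task detach(event) // task B
{
do_stuff();
// The handle passed as user data must stay valid until the callback runs.
hipStreamAddCallback(stream, callback, &event, 0);
}
// Returns only after task B's body has finished and its event has been fulfilled.
#pragma omp taskwait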
The main_single_transfer.cpp file demonstrates tasking combined with OpenMP offload.
Its kernels run concurrently with both nvhpc 24.5 and clang 18.1.8. Below is a screenshot for nvhpc 24.5.
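For reference, the pattern of one transfer followed by several independently launched kernels can be sketched as below. This is only an illustration of the idea, not the contents of main_single_transfer.cpp; the function `add_one_in_slices` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: one transfer, then `num_kernels` independent kernels,
// each launched as a deferred target task on its own slice of the array.
void add_one_in_slices(std::vector<double>& data, int num_kernels) {
    double* a = data.data();
    const int n = static_cast<int>(data.size());
    assert(num_kernels > 0 && n % num_kernels == 0);
    const int slice = n / num_kernels;

    // Single host-to-device transfer of the whole array.
    #pragma omp target enter data map(to: a[:n])

    for (int k = 0; k < num_kernels; ++k) {
        const int begin = k * slice;
        // nowait lets the host thread move on and launch the next kernel.
        // a[:n] is already present, so the map clause causes no extra copy.
        #pragma omp target teams distribute parallel for nowait map(tofrom: a[:n])
        for (int i = begin; i < begin + slice; ++i) a[i] += 1.0;
    }

    // Wait for all deferred kernels, then copy the result back.
    #pragma omp taskwait
    #pragma omp target exit data map(from: a[:n])
}
```

The slices are disjoint, so the kernels have no dependences between them and the runtime is free to run them concurrently.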
The main_multiple_transfer.cpp file demonstrates overlapping computation and memory transfers.
This partly works with clang 18.1.8; see the image below. Each task launches its kernel in a separate GPU stream.
The NVIDIA compiler also creates separate streams; however, there does not seem to be any overlap between the running kernels, or between the kernels and the memory transfers, as shown in the screenshot below.
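A rough sketch of the chunked approach, assuming each chunk is handled by its own deferred target task that copies the chunk in, computes on it, and copies it back. It is not the code from main_multiple_transfer.cpp; `square_in_chunks` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: squares every element, processing the array in
// `num_chunks` independent chunks so that transfers and kernels of
// different chunks may overlap.
void square_in_chunks(std::vector<double>& data, int num_chunks) {
    double* a = data.data();
    const int n = static_cast<int>(data.size());
    assert(num_chunks > 0 && n % num_chunks == 0);
    const int chunk = n / num_chunks;

    for (int c = 0; c < num_chunks; ++c) {
        const int begin = c * chunk;
        // One deferred target task per chunk: copy in, compute, copy out.
        #pragma omp target teams distribute parallel for nowait map(tofrom: a[begin:chunk])
        for (int i = begin; i < begin + chunk; ++i) a[i] *= a[i];
    }

    // All chunks have been processed and copied back after this point.
    #pragma omp taskwait
}
```

With this structure the transfer of one chunk can, in principle, overlap with the kernel of another; whether it does is up to the runtime, which matches the different behaviour seen across compilers.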
On A2, with cce 16.0.1, main_single_transfer.cpp runs the kernels serially. However, they do run concurrently when launched from multiple CPU threads.
Below is a screenshot for 4 threads with main_single_transfer_multiple_threads.cpp
. The number of threads is equal to the number of kernels.
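Launching the kernels from several CPU threads could look roughly like the sketch below, assuming one kernel per thread. This is not the code from main_single_transfer_multiple_threads.cpp; `add_one_from_threads` and `thread_count` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: each CPU thread launches one kernel on its own slice.
void add_one_from_threads(std::vector<double>& data, int thread_count) {
    double* a = data.data();
    const int n = static_cast<int>(data.size());
    assert(thread_count > 0 && n % thread_count == 0);
    const int slice = n / thread_count;

    // Single host-to-device transfer of the whole array.
    #pragma omp target enter data map(to: a[:n])

    // schedule(static, 1) gives each thread one iteration, i.e. one kernel
    // launch, provided the runtime grants the requested number of threads.
    #pragma omp parallel for num_threads(thread_count) schedule(static, 1)
    for (int t = 0; t < thread_count; ++t) {
        const int begin = t * slice;
        // a[:n] is already present, so the map clause causes no extra copy.
        #pragma omp target teams distribute parallel for map(tofrom: a[:n])
        for (int i = begin; i < begin + slice; ++i) a[i] += 1.0;
    }

    // The parallel region has ended, so all kernels are done; copy back.
    #pragma omp target exit data map(from: a[:n])
}
```

Here each target region is synchronous for the thread that launches it, so concurrency comes from the host threads rather than from `nowait`.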