e-kayrakli opened 2 weeks ago
Doing a chpl_gpu_task_fence() instead of the kernel launch doesn't help.
This over-synchronization is something we noted elsewhere, where some arguments are passed to kernels by offload. That results in page-locked allocations on the host, which are overly synchronizing and disrupt other GPUs' execution. The following runtime patch will probably address the issue for that other case, but it shouldn't impact this code:
```diff
diff --git a/runtime/src/chpl-gpu.c b/runtime/src/chpl-gpu.c
index b4b007fa88..cb65ea2063 100644
--- a/runtime/src/chpl-gpu.c
+++ b/runtime/src/chpl-gpu.c
@@ -330,8 +330,8 @@ static void cfg_add_offload_param(kernel_cfg* cfg, void* arg, size_t size) {
// TODO this doesn't work on EX, why?
// *kernel_params[i] = chpl_gpu_impl_mem_array_alloc(cur_arg_size, stream);
- *(cfg->kernel_params[i]) = chpl_gpu_mem_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
- cfg->ln, cfg->fn);
+ *(cfg->kernel_params[i]) = chpl_gpu_mem_array_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
+ cfg->ln, cfg->fn);
chpl_gpu_impl_copy_host_to_device(*(cfg->kernel_params[i]), arg, size,
                                    cfg->stream);
```
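For context, this code path is presumably exercised whenever a GPU kernel captures a scalar from its enclosing scope and the runtime stages that scalar as an offload parameter. A minimal Chapel sketch of the kind of code that would hit it (the array, scalar, and sizes are illustrative, not from the original report):

```chapel
// Illustrative only: `alpha` is a scalar captured by the kernel body, so the
// runtime stages it as an offload parameter (today via chpl_gpu_mem_alloc,
// which can produce a page-locked host allocation).
on here.gpus[0] {
  var A: [1..1024] real;
  const alpha = 2.0;
  foreach i in A.domain do     // compiles to a kernel launch passing alpha
    A[i] = alpha * i;
}
```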
Internal note for this is in https://github.com/Cray/chapel-private/issues/6167.
I looked a bit into whether we can allocate class instances in device memory. The motivation for keeping them on the host is that we want to initialize class instances on the host even if they are allocated on a GPU sublocale. My thinking was: after making that decision, we added the ability to do gets/puts to/from GPU memory, so even if a class instance is allocated in device memory, the CPU can still initialize it using gets/puts. I couldn't get past some codegen issues quickly. My branch is at https://github.com/chapel-lang/chapel/compare/main...e-kayrakli:chapel:gpu-class-on-device
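As a hypothetical illustration of what that branch is after (the class and values here are made up, not taken from the branch):

```chapel
class C {
  var x: int;
}

on here.gpus[0] {
  // Today: the instance lives in host memory even though we are on a GPU
  // sublocale, so the CPU can run the initializer directly. With the branch
  // above, the allocation would instead live in device memory and the CPU
  // would initialize the fields via puts.
  var c = new unmanaged C(42);
  writeln(c.x);
  delete c;
}
```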
Summary of Problem
Page-locked host allocations in multi-GPU-per-node setups can cause unnecessary synchronization. This shows up as some GPUs taking much longer than others within a single node, even though the workload is uniform.
Reported by @Guillaume-Helbecque on Gitter.
Description:
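The original reproducer is not preserved here; a minimal sketch of the kind of per-GPU timing test being described (the t1/t2 names come from the report; everything else is an assumption):

```chapel
use Time;

config const n = 1_000_000;

coforall gpu in here.gpus {
  on gpu {
    var t1, t2: stopwatch;

    t1.start();
    var A: [1..n] int;              // allocation on the GPU sublocale
    t1.stop();

    t2.start();
    foreach i in 1..n do A[i] = i;  // kernel launch
    t2.stop();

    writeln(here, ": t1=", t1.elapsed(), " t2=", t2.elapsed());
  }
}
```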
Running it results in timings where the behavior for t1 is hard to explain: some GPUs report much larger t1s than others, even though the workload is uniform.
Is this a blocking issue with no known work-arounds? I don't know.
A strange (and partial) workaround is to add a kernel launch:
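The exact snippet isn't preserved here; presumably it is something like launching a trivial kernel inside the `on gpu` block of the sketch above, before the timed regions (a hypothetical fragment):

```chapel
// Hypothetical no-op kernel launched before the timed t1/t2 regions:
var dummy: [1..1] int;
foreach i in dummy.domain do dummy[i] = 0;
```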
This kernel must be changing the scheduling behavior: it results in much more uniform t1s, while also hurting t2. I can't tell whether this is acceptable. Regardless, it is a data point for further investigation.
Configuration Information
GPU config with NVIDIA. I suspect the same behavior would show up with AMD.