Variable performance with multiple GPUs per node (probably because of unnecessary synchronization) #24936

e-kayrakli commented 2 weeks ago

Summary of Problem

Page-locked host allocations in multi-GPU-per-node setups can cause unnecessary synchronization. This shows up as some GPUs taking much longer than others within a single node, even though the workload is uniform.

Reported by @Guillaume-Helbecque on Gitter.


use Time;

config const nGpus = 1;
config const N = 40000;

proc main() {
  coforall gpuID in 0..#nGpus {
    var t1, t2: stopwatch;

    for i in 1..(N/nGpus) {
      var A: [0..#10000] int = 3;

      on here.gpus[gpuID] {
        const A_d = A;
      }
    }

    // the start()/stop() calls for t1 and t2 are missing from this excerpt
    writeln("t1 = ", t1.elapsed(), ", t2 = ", t2.elapsed(), " on task ", gpuID);
  }
}

results in

>>> ./sandbox.o --nGpus 1 
t1 = 0.602692, t2 = 9.92918 on task 0

>>> ./sandbox.o --nGpus 2 
t1 = 0.396585, t2 = 4.46918 on task 0
t1 = 5.24443, t2 = 4.93758 on task 1

>>> ./sandbox.o --nGpus 4 
t1 = 0.212392, t2 = 3.2442 on task 3
t1 = 3.70708, t2 = 3.22561 on task 0
t1 = 3.71488, t2 = 3.22474 on task 2
t1 = 3.7028, t2 = 3.30863 on task 1

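where the behavior of t1 is hard to explain: with more than one GPU, exactly one task reports a small t1 while every other task reports a t1 more than ten times larger.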

Is this a blocking issue with no known work-arounds? I don't know.

A strange (and partial) workaround is to add a kernel launch:

      on here.gpus[gpuID] {
        var A_d = A;
        if doKernel then foreach a in 0..0 { A_d[0] = 1; }

This kernel must be changing the scheduling behavior, resulting in much more uniform t1s, while also hurting t2. I can't tell whether this is acceptable. Regardless, it is a data point in further investigations.
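
For anyone who wants to toggle the workaround from the command line, here is a minimal sketch of how the reproducer and the extra kernel launch could fit together. The placement of the timers is an assumption (t1 around the host array creation, t2 around the body of the `on` block), since the snippet above does not show where they start and stop, and the `doKernel` declaration as a `config const` defaulting to `false` is likewise assumed.

```chapel
use Time;

config const nGpus = 1;
config const N = 40000;
// Assumed declaration: enables the single-iteration kernel workaround.
config const doKernel = false;

proc main() {
  coforall gpuID in 0..#nGpus {
    var t1, t2: stopwatch;

    for i in 1..(N/nGpus) {
      // Assumption: t1 times the creation of the host array.
      t1.start();
      var A: [0..#10000] int = 3;
      t1.stop();

      on here.gpus[gpuID] {
        // Assumption: t2 times the host-to-device copy (and the
        // optional kernel launch when doKernel is set).
        t2.start();
        var A_d = A;
        if doKernel then foreach a in 0..0 { A_d[0] = 1; }
        t2.stop();
      }
    }

    writeln("t1 = ", t1.elapsed(), ", t2 = ", t2.elapsed(), " on task ", gpuID);
  }
}
```

Running `./sandbox.o --nGpus 4` and then `./sandbox.o --nGpus 4 --doKernel true` should allow a side-by-side comparison of the per-task t1 and t2 values with and without the extra launch.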

Configuration Information

GPU configuration with NVIDIA. I expect to see the same behavior with AMD.

e-kayrakli commented 2 weeks ago

Doing a chpl_gpu_task_fence() instead of the kernel launch doesn't help.

We have noted this over-synchronization elsewhere, in cases where some arguments are passed to kernels by offload. That results in a page-locked allocation on the host, which is overly synchronized and disrupts other GPUs' execution. The following runtime patch will probably address the issue for that other case, but it shouldn't impact this code.

diff --git a/runtime/src/chpl-gpu.c b/runtime/src/chpl-gpu.c
index b4b007fa88..cb65ea2063 100644
--- a/runtime/src/chpl-gpu.c
+++ b/runtime/src/chpl-gpu.c
@@ -330,8 +330,8 @@ static void cfg_add_offload_param(kernel_cfg* cfg, void* arg, size_t size) {

   // TODO this doesn't work on EX, why?
   // *kernel_params[i] = chpl_gpu_impl_mem_array_alloc(cur_arg_size, stream);
-  *(cfg->kernel_params[i]) = chpl_gpu_mem_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
-                                                cfg->ln, cfg->fn);
+  *(cfg->kernel_params[i]) = chpl_gpu_mem_array_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
+                                                      cfg->ln, cfg->fn);

   chpl_gpu_impl_copy_host_to_device(*(cfg->kernel_params[i]), arg, size,

