
Variable performance with multiple GPUs per node (probably because of unnecessary synchronization) #24936

Open e-kayrakli opened 2 weeks ago

e-kayrakli commented 2 weeks ago

Summary of Problem

Page-locked host allocations in multiple-GPU-per-node setups can cause unnecessary synchronization. This manifests as some GPUs taking much longer than others on a single node, even though the workload is uniform.

Reported by @Guillaume-Helbecque on Gitter.

Description:

use Time;

config const nGpus = 1;
config const N = 40000;

proc main() {
  coforall gpuID in 0..#nGpus {
    var t1, t2: stopwatch;

    for i in 1..(N/nGpus) {
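      // t1 times the host-side creation and initialization of A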
      t1.start();
      var A: [0..#10000] int = 3;
      t1.stop();

      t2.start();
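      // t2 times the on-statement below, which copies A into the GPU's memory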
      on here.gpus[gpuID] {
        const A_d = A;
      }
      t2.stop();
    }

    writeln("t1 = ", t1.elapsed(), ", t2 = ", t2.elapsed(), " on task ", gpuID);
  }
}

results in

>>> ./sandbox.o --nGpus 1 
t1 = 0.602692, t2 = 9.92918 on task 0

>>> ./sandbox.o --nGpus 2 
t1 = 0.396585, t2 = 4.46918 on task 0
t1 = 5.24443, t2 = 4.93758 on task 1

>>> ./sandbox.o --nGpus 4 
t1 = 0.212392, t2 = 3.2442 on task 3
t1 = 3.70708, t2 = 3.22561 on task 0
t1 = 3.71488, t2 = 3.22474 on task 2
t1 = 3.7028, t2 = 3.30863 on task 1

The behavior of t1 is hard to explain: it only times host-side array creation, yet it jumps from well under a second to several seconds for all but one task as soon as multiple GPUs are used.

Is this a blocking issue with no known work-arounds? I don't know.

A strange (and partial) workaround is to add a kernel launch:

      on here.gpus[gpuID] {
        var A_d = A;
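        // doKernel is presumably an added config const; the trivial foreach below compiles to a GPU kernel launch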
        if doKernel then foreach a in 0..0 { A_d[0] = 1; }
      }

This kernel launch must be changing the scheduling behavior, resulting in much more uniform t1 values, while also hurting t2. I can't tell whether that tradeoff is acceptable. Regardless, it is a data point for further investigation.

Configuration Information

A GPU configuration with NVIDIA GPUs. I suspect the same behavior would show up with AMD.

e-kayrakli commented 2 weeks ago

Calling chpl_gpu_task_fence() instead of launching the kernel doesn't help.
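For reference, here is a minimal sketch of what that variant could look like. It assumes chpl_gpu_task_fence takes no arguments and can be exposed to Chapel code with an extern declaration; the actual experiment may have been wired up differently (e.g., directly in the runtime).

extern proc chpl_gpu_task_fence();  // assumed no-argument runtime fence

      on here.gpus[gpuID] {
        var A_d = A;
        chpl_gpu_task_fence();  // fence instead of the dummy kernel; no improvement
      }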


We have noted this over-synchronization elsewhere, in cases where some arguments are passed to offloaded kernels. That path results in a page-locked allocation on the host, which is overly synchronized and disrupts the other GPUs' execution. The following runtime patch will probably address the issue for that other case, but it shouldn't impact this code.

diff --git a/runtime/src/chpl-gpu.c b/runtime/src/chpl-gpu.c
index b4b007fa88..cb65ea2063 100644
--- a/runtime/src/chpl-gpu.c
+++ b/runtime/src/chpl-gpu.c
@@ -330,8 +330,8 @@ static void cfg_add_offload_param(kernel_cfg* cfg, void* arg, size_t size) {

   // TODO this doesn't work on EX, why?
   // *kernel_params[i] = chpl_gpu_impl_mem_array_alloc(cur_arg_size, stream);
-  *(cfg->kernel_params[i]) = chpl_gpu_mem_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
-                                                cfg->ln, cfg->fn);
+  *(cfg->kernel_params[i]) = chpl_gpu_mem_array_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
+                                                      cfg->ln, cfg->fn);

   chpl_gpu_impl_copy_host_to_device(*(cfg->kernel_params[i]), arg, size,
                                     cfg->stream);

The internal note for this is in https://github.com/Cray/chapel-private/issues/6167.


I looked a bit into whether we can allocate class instances in device memory. The motivation for keeping them on the host is that we want to initialize class instances on the host even if they are allocated on a GPU sublocale. My thinking was: after making that decision, we added the ability to do gets/puts to/from GPU memory, so even if a class instance is allocated in device memory, the CPU can still initialize it using gets/puts. I couldn't move past some codegen issues quickly. My branch is at https://github.com/chapel-lang/chapel/compare/main...e-kayrakli:chapel:gpu-class-on-device
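For context, here is a minimal sketch of the scenario that paragraph is about, with a hypothetical class C and assuming a GPU-enabled compilation:

class C {
  var x: int;
}

on here.gpus[0] {
  // The instance is created while executing on a GPU sublocale. Today its
  // storage is kept on the host so that initialization can run on the CPU;
  // the branch above explores allocating it in device memory instead and
  // letting the CPU initialize it via gets/puts.
  var c = new C(42);
}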