
Variable performance with multiple GPUs per node (probably because of unnecessary synchronization) #24936

Open e-kayrakli opened 2 weeks ago

e-kayrakli commented 2 weeks ago

Summary of Problem

Page-locked host allocations in multiple-GPU-per-node setups can cause unnecessary synchronization. This manifests as some GPUs taking much longer than others on a single node, even though the workload is uniform.

Reported by @Guillaume-Helbecque on Gitter.

Description:

use Time;

config const nGpus = 1;
config const N = 40000;

proc main() {
  coforall gpuID in 0..#nGpus {
    var t1, t2: stopwatch;

    for i in 1..(N/nGpus) {
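      // t1 times the host-side creation and initialization of A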
      t1.start();
      var A: [0..#10000] int = 3;
      t1.stop();

      t2.start();
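      // t2 times the on-statement below, which copies A into the GPU's memory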
      on here.gpus[gpuID] {
        const A_d = A;
      }
      t2.stop();
    }

    writeln("t1 = ", t1.elapsed(), ", t2 = ", t2.elapsed(), " on task ", gpuID);
  }
}

results in

>>> ./sandbox.o --nGpus 1 
t1 = 0.602692, t2 = 9.92918 on task 0

>>> ./sandbox.o --nGpus 2 
t1 = 0.396585, t2 = 4.46918 on task 0
t1 = 5.24443, t2 = 4.93758 on task 1

>>> ./sandbox.o --nGpus 4 
t1 = 0.212392, t2 = 3.2442 on task 3
t1 = 3.70708, t2 = 3.22561 on task 0
t1 = 3.71488, t2 = 3.22474 on task 2
t1 = 3.7028, t2 = 3.30863 on task 1

The behavior of t1 is hard to explain: it only times host-side array creation, yet it jumps from well under a second to several seconds for all but one task as soon as multiple GPUs are used.

Is this a blocking issue with no known work-arounds? I don't know.

A strange (and partial) workaround is to add a kernel launch:

      on here.gpus[gpuID] {
        var A_d = A;
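        // doKernel is presumably an added config const; the trivial foreach below compiles to a GPU kernel launch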
        if doKernel then foreach a in 0..0 { A_d[0] = 1; }
      }

This kernel launch must be changing the scheduling behavior, resulting in much more uniform t1 values, while also hurting t2. I can't tell whether that tradeoff is acceptable. Regardless, it is a data point for further investigation.

Configuration Information

A GPU configuration with NVIDIA GPUs. I suspect the same behavior would show up with AMD.

e-kayrakli commented 2 weeks ago

Calling chpl_gpu_task_fence() instead of launching the kernel doesn't help.
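For reference, here is a minimal sketch of what that variant could look like. It assumes chpl_gpu_task_fence takes no arguments and can be exposed to Chapel code with an extern declaration; the actual experiment may have been wired up differently (e.g., directly in the runtime).

extern proc chpl_gpu_task_fence();  // assumed no-argument runtime fence

      on here.gpus[gpuID] {
        var A_d = A;
        chpl_gpu_task_fence();  // fence instead of the dummy kernel; no improvement
      }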


We have noted this over-synchronization elsewhere, in cases where some arguments are passed to offloaded kernels. That path results in a page-locked allocation on the host, which is overly synchronized and disrupts the other GPUs' execution. The following runtime patch will probably address the issue for that other case, but it shouldn't impact this code.

diff --git a/runtime/src/chpl-gpu.c b/runtime/src/chpl-gpu.c
index b4b007fa88..cb65ea2063 100644
--- a/runtime/src/chpl-gpu.c
+++ b/runtime/src/chpl-gpu.c
@@ -330,8 +330,8 @@ static void cfg_add_offload_param(kernel_cfg* cfg, void* arg, size_t size) {

   // TODO this doesn't work on EX, why?
   // *kernel_params[i] = chpl_gpu_impl_mem_array_alloc(cur_arg_size, stream);
-  *(cfg->kernel_params[i]) = chpl_gpu_mem_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
-                                                cfg->ln, cfg->fn);
+  *(cfg->kernel_params[i]) = chpl_gpu_mem_array_alloc(size, CHPL_RT_MD_GPU_KERNEL_ARG,
+                                                      cfg->ln, cfg->fn);

   chpl_gpu_impl_copy_host_to_device(*(cfg->kernel_params[i]), arg, size,
                                     cfg->stream);

The internal note for this is in https://github.com/Cray/chapel-private/issues/6167.


I looked a bit into whether we can allocate class instances in device memory. The motivation for keeping them on the host is that we want to initialize class instances on the host even if they are allocated on a GPU sublocale. My thinking was: after making that decision, we added the ability to do gets/puts to/from GPU memory, so even if a class instance is allocated in device memory, the CPU can still initialize it using gets/puts. I couldn't move past some codegen issues quickly. My branch is at https://github.com/chapel-lang/chapel/compare/main...e-kayrakli:chapel:gpu-class-on-device
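For context, here is a minimal sketch of the scenario that paragraph is about, with a hypothetical class C and assuming a GPU-enabled compilation:

class C {
  var x: int;
}

on here.gpus[0] {
  // The instance is created while executing on a GPU sublocale. Today its
  // storage is kept on the host so that initialization can run on the CPU;
  // the branch above explores allocating it in device memory instead and
  // letting the CPU initialize it via gets/puts.
  var c = new C(42);
}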