chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Low CUDA API call performance when called from Chapel through interoperability #25311

Open e-kayrakli opened 2 weeks ago

e-kayrakli commented 2 weeks ago

I believe this has to do with how we handle CUDA contexts, but I couldn't pinpoint it on a quick look. Consider the following set of files:

runner.h:

#include <inttypes.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct my_struct_s {
  int64_t x,y;
} my_struct_t;

my_struct_t* give_me_an_array(int64_t num_elems);

#ifdef __cplusplus
}
#endif

runner.c:

#include <cuda_runtime.h>
#include <stdio.h>
#include "runner.h"

my_struct_t* give_me_an_array(int64_t num_elems) {
  void* arr = NULL;
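  // PRId64 (from inttypes.h, included via runner.h) is the portable format for int64_t
  printf("Calling cudaMalloc %" PRId64 "\n", num_elems);
  cudaMalloc(&arr, num_elems*sizeof(my_struct_t));
  printf("cudaMalloc returned %" PRId64 "\n", num_elems);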

  return (my_struct_t*)arr;
}

int main() {
  my_struct_t* p = give_me_an_array(10);

  printf("%p\n", p);
}

app.chpl:

require "runner.h";

use CTypes;

extern record my_struct_t {
  var x: int;
  var y: int;
}

extern proc give_me_an_array(numElems): c_ptr(my_struct_t);

config const numElems = 10;

on here.gpus[0] {
  var arrayFromC = give_me_an_array(numElems);

  @assertOnGpu
  foreach i in 0..#numElems {  // this should be a kernel in Chapel
    arrayFromC[i].x = i;
    arrayFromC[i].y = i;
  }

  for i in 0..#numElems {
    writeln(arrayFromC[i]);
  }
}

When compiled with

> nvcc -c runner.c
> chpl app.chpl runner.o

this application takes a very long time in the cudaMalloc call (long enough that I was certain it had frozen, but it hadn't). Subsequent runs are considerably faster, but still slower than native CUDA calls.
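
To put numbers on the slowdown, the allocation call could be timed in isolation. The following is only a minimal sketch, not something taken from the reproducer: `time_alloc_call`, `alloc_fn_t`, and `host_stand_in` are hypothetical names, and `host_stand_in` uses plain `malloc` so the sketch builds without CUDA. In the real setup, `give_me_an_array` would be passed in its place.

```c
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime on strict C builds */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Mirrors the struct in runner.h; in the real reproducer, include runner.h instead. */
typedef struct my_struct_s {
  int64_t x, y;
} my_struct_t;

/* Signature shared by give_me_an_array and the stand-in below. */
typedef my_struct_t* (*alloc_fn_t)(int64_t num_elems);

/* Hypothetical host-only stand-in (assumption) so the sketch builds without CUDA. */
static my_struct_t* host_stand_in(int64_t num_elems) {
  return (my_struct_t*)malloc((size_t)num_elems * sizeof(my_struct_t));
}

/* Calls fn once, stores its result in *out (if out is non-NULL),
   and returns the elapsed wall-clock time in seconds. */
static double time_alloc_call(alloc_fn_t fn, int64_t num_elems,
                              my_struct_t** out) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  my_struct_t* p = fn(num_elems);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  if (out != NULL) *out = p;
  return (double)(t1.tv_sec - t0.tv_sec) +
         (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling `time_alloc_call(give_me_an_array, 10, &p)` from the native `main` in runner.c, and exposing the same helper to Chapel as an extern proc, would give directly comparable timings for the two paths.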

So far we have only done some rudimentary exploration of interop with GPU support enabled. This was my first time trying it out, so I am glad that it works, but we may need to take a closer look at performance.

e-kayrakli commented 2 weeks ago

It'd be good to test this with a kernel launch, as well. I would expect a hit during kernel launch, but hopefully good performance during the kernel execution.