Low CUDA API call performance when called from Chapel through interoperability

I believe this has to do with how we handle CUDA contexts, but I couldn't pinpoint the cause on a quick look. Consider the following files:

runner.h:

#include <inttypes.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct my_struct_s {
  int64_t x,y;
} my_struct_t;

my_struct_t* give_me_an_array(int64_t num_elems);

#ifdef __cplusplus
}
#endif

runner.c:

#include <cuda_runtime.h>
#include <stdio.h>
#include "runner.h"

my_struct_t* give_me_an_array(int64_t num_elems) {
  void* arr = NULL;
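  printf("Calling cudaMalloc %" PRId64 "\n", num_elems);
  // Check the result so a failed allocation is not mistaken for a slow one.
  cudaError_t err = cudaMalloc(&arr, (size_t)num_elems * sizeof(my_struct_t));
  printf("cudaMalloc returned %" PRId64 "\n", num_elems);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return NULL;
  }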

  return (my_struct_t*)arr;
}

int main() {
  my_struct_t* p = give_me_an_array(10);

  printf("%p\n", p);
}

app.chpl:

require "runner.h";

use CTypes;

extern record my_struct_t {
  var x: int;
  var y: int;
}

extern proc give_me_an_array(numElems): c_ptr(my_struct_t);

config const numElems = 10;

on here.gpus[0] {
  var arrayFromC = give_me_an_array(numElems);

  @assertOnGpu
  foreach i in 0..#numElems {  // this should be a kernel in Chapel
    arrayFromC[i].x = i;
    arrayFromC[i].y = i;
  }

  for i in 0..#numElems {
    writeln(arrayFromC[i]);
  }
}

When compiled with

> nvcc -c runner.c
> chpl app.chpl runner.o

this application takes a very long time to execute cudaMalloc on the first run (I was certain that it had frozen, but it had not). Subsequent runs are considerably faster, but still slower than the same call made from native CUDA code.
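
To put numbers on the gap described above, one option is to bracket the call with a monotonic clock, both in the standalone C build and when called from Chapel. The following is a minimal sketch and not part of the original reproducer: `now_ms`, `time_call_ms`, and `stand_in_work` are hypothetical names, `stand_in_work` is only a placeholder so the snippet builds without CUDA, and a POSIX system with `clock_gettime` is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime on strict C builds */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: current time in milliseconds from a monotonic clock. */
static double now_ms(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Hypothetical helper: run fn(ctx) once and return the elapsed wall time
 * in milliseconds. */
static double time_call_ms(void (*fn)(void *), void *ctx) {
  assert(fn != NULL);
  double start = now_ms();
  fn(ctx);
  return now_ms() - start;
}

/* Placeholder workload (an assumption, not the real call): it only counts
 * invocations. In the reproducer, the callback would instead wrap the
 * cudaMalloc call inside give_me_an_array. */
static void stand_in_work(void *ctx) {
  int *calls = (int *)ctx;
  *calls += 1;
}
```

Timing the callback twice in the same process would separate the cost of the first call from later ones, and running the same measurement from the standalone C `main` and from the Chapel program would show how much of the delay is specific to the interoperability path.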

So far, we have done only rudimentary exploration of interoperability with GPU support enabled, and this was my first time trying it out. I am glad that it works, but we may need to take a closer look at performance.

chapel-lang/chapel issue #25311