require "runner.h";
use CTypes;
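(A minimal sketch, not the original header; it is inferred from the `extern` declarations in app.chpl below, and the `int64_t` fields assume Chapel's default 64-bit `int`.)

```c
// runner.h -- sketch inferred from app.chpl's extern declarations.
#include <stdint.h>

typedef struct {
  int64_t x;  // matches Chapel's default 64-bit `int`
  int64_t y;
} my_struct_t;

// Allocates numElems structs on the GPU and returns the pointer.
my_struct_t* give_me_an_array(int64_t numElems);
```
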
runner.c:
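(Also a sketch, assuming `cudaMallocManaged`: the host-side `writeln` loop below dereferences the same pointer the kernel writes to, and plain `cudaMalloc` memory wouldn't be directly readable from the host.)

```c
// runner.c -- sketch, assuming managed (unified) memory so the array is
// reachable from both the Chapel host code and the GPU kernel.
#include <cuda_runtime.h>
#include "runner.h"

my_struct_t* give_me_an_array(int64_t numElems) {
  my_struct_t* ptr = NULL;
  // The first CUDA runtime call lazily initializes the context; this
  // allocation is the suspected source of the slow first run described below.
  cudaMallocManaged((void**)&ptr, (size_t)numElems * sizeof(my_struct_t),
                    cudaMemAttachGlobal);
  return ptr;
}
```
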
app.chpl:
```chapel
require "runner.h";
use CTypes;

extern record my_struct_t {
  var x: int;
  var y: int;
}

extern proc give_me_an_array(numElems): c_ptr(my_struct_t);

config const numElems = 10;

on here.gpus[0] {
  var arrayFromC = give_me_an_array(numElems);

  @assertOnGpu
  foreach i in 0..#numElems { // this should be a kernel in Chapel
    arrayFromC[i].x = i;
    arrayFromC[i].y = i;
  }

  for i in 0..#numElems {
    writeln(arrayFromC[i]);
  }
}
```
when compiled with

```console
> nvcc -c runner.c
> chpl app.chpl runner.o
```
this application takes too long to execute `cudaMalloc` (I was certain that it had frozen, but it hadn't). Subsequent runs are considerably faster, but still slower than native CUDA calls.
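(For reference, one way to quantify the native side of that comparison is to time context creation and the first allocation separately; the file name and timing scaffolding below are assumptions, not part of the original experiment.)

```cuda
// baseline.cu -- hypothetical native baseline for comparison.
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
  double t0 = now_sec();
  cudaFree(0);  // common idiom to force lazy context creation
  double t1 = now_sec();

  void* p = NULL;
  cudaMalloc(&p, 10 * 2 * sizeof(long long));
  double t2 = now_sec();

  printf("context init: %.3fs, first cudaMalloc: %.3fs\n", t1 - t0, t2 - t1);
  cudaFree(p);
  return 0;
}
```
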
So far we have only done some rudimentary exploration of interop with GPU support enabled. This was my first time trying it out, so I am glad that it works, but we may need to take a closer look at performance.
It'd be good to test this with a kernel launch as well. I would expect a hit during kernel launch, but hopefully good performance during kernel execution.
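(A sketch of what the C side of that kernel-launch test could look like; the file name, kernel, and `fill_from_c` wrapper are hypothetical additions, not part of the code above.)

```cuda
// runner_kernel.cu -- hypothetical kernel-launch test.
#include <stdint.h>
#include <cuda_runtime.h>
#include "runner.h"

__global__ void fill(my_struct_t* arr, int64_t n) {
  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) {
    arr[i].x = i;
    arr[i].y = i;
  }
}

// C-linkage wrapper so Chapel can call it via an extern proc; it
// synchronizes before returning, so a timer around the call from Chapel
// captures launch overhead plus execution.
extern "C" void fill_from_c(my_struct_t* arr, int64_t n) {
  const int threadsPerBlock = 256;
  const int blocks = (int)((n + threadsPerBlock - 1) / threadsPerBlock);
  fill<<<blocks, threadsPerBlock>>>(arr, n);
  cudaDeviceSynchronize();
}
```

On the Chapel side this could be declared as `extern proc fill_from_c(arr: c_ptr(my_struct_t), n: int);` and timed around the call.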