chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

How can we handle systems with integrated and discrete GPUs? #22782

Open e-kayrakli opened 1 year ago

e-kayrakli commented 1 year ago

https://github.com/chapel-lang/chapel/issues/22754 uncovered several issues with such systems, which I presume are predominantly non-HPC, consumer-grade systems. Chapel wants to be portable across a wide spectrum of systems, and developing code on a personal system and then running it on a supercomputer is one of Chapel's appeals, so I think we should support such systems.

However, we typically assume homogeneity among locales (not within). I don't know how hard-coded that expectation is, but I don't think we have much experience running on, for example, 2 locales whose core counts differ, as that's an uncommon setup. By analogy, so is having two GPU sublocales with different characteristics, such as the number of cores in them. So, before we jump into trying to run across both the integrated and the discrete GPU on a system (assuming that we can even run on the former), I think we should add the ability to blissfully ignore the integrated GPU.

Under the original issue, a combination of a very hacky runtime patch and using on here.gpus[1] made sure that the user always uses the discrete GPU. The runtime hack:

diff --git a/runtime/src/gpu/amd/gpu-amd.c b/runtime/src/gpu/amd/gpu-amd.c
index 6f6afe5403..515a9ba8d4 100644
--- a/runtime/src/gpu/amd/gpu-amd.c
+++ b/runtime/src/gpu/amd/gpu-amd.c
@@ -138,7 +138,7 @@ void chpl_gpu_impl_init(int* num_devices) {
   deviceClockRates = chpl_malloc(sizeof(int)*loc_num_devices);

   int i;
-  for (i=0 ; i<loc_num_devices ; i++) {
+  for (i=1 ; i<loc_num_devices ; i++) {
     hipDevice_t device;
     hipCtx_t context;

I also have a more proper patch that is untested on a real system with integrated GPUs. The patch is here: https://github.com/chapel-lang/chapel/files/12083995/ignoregpus.patch. It requires CHPL_RT_NUM_IGNORED_GPUS=N to be set, which causes the runtime to skip the first N GPUs while initializing. This also shrinks the here.gpus array properly while making gpus[0] the first discrete device. The big question with that patch is whether integrated GPUs always have lower ids. I have no idea whether there can be more than one integrated GPU, or whether a system with an integrated GPU can have more than one discrete GPU.
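To make the intended behavior concrete, here is a minimal sketch in plain C of the index arithmetic such a patch implies. It is not the contents of the linked patch: the helper names (parse_num_ignored, num_ignored_gpus, logical_to_physical, print_visible_devices) are hypothetical, and the handling of malformed or out-of-range values is an assumption. Only the CHPL_RT_NUM_IGNORED_GPUS variable name comes from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from the linked patch): interpret the value of
 * CHPL_RT_NUM_IGNORED_GPUS. Unset, malformed, or negative values mean
 * "ignore nothing"; values above the device count are clamped to it.
 * This error handling is an assumption made for the sketch. */
int parse_num_ignored(const char* value, int num_physical) {
  assert(num_physical >= 0);
  if (value == NULL) return 0;
  char* end = NULL;
  long n = strtol(value, &end, 10);
  if (end == value || *end != '\0' || n < 0) return 0;
  return (n > num_physical) ? num_physical : (int)n;
}

/* Hypothetical wrapper that reads the environment variable named above. */
int num_ignored_gpus(int num_physical) {
  return parse_num_ignored(getenv("CHPL_RT_NUM_IGNORED_GPUS"), num_physical);
}

/* Map an index into the shrunken device list (what here.gpus would expose)
 * back to the physical device id reported by the driver. */
int logical_to_physical(int logical_idx, int num_ignored) {
  return logical_idx + num_ignored;
}

/* Print the mapping for a machine that reports num_physical devices. */
void print_visible_devices(int num_physical) {
  int ignored = num_ignored_gpus(num_physical);
  int visible = num_physical - ignored;
  for (int i = 0; i < visible; i++) {
    printf("gpus[%d] -> physical device %d\n",
           i, logical_to_physical(i, ignored));
  }
}
```

With CHPL_RT_NUM_IGNORED_GPUS=1 on a machine that reports two devices, this mapping leaves a single visible device, gpus[0], backed by physical device 1. Like the patch, it is only correct if the integrated GPUs really do occupy the lowest ids.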

In writing all of this, I am also curious whether Grace Hopper falls into the "integrated" category here.

e-kayrakli commented 1 year ago

It looks like the "device properties" types in both CUDA and HIP have a field named "integrated". Can we use that field to ignore some devices?

CUDA: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0
HIP: https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/structhip_device_prop__t.html
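A rough sketch of how such a filter could look, written in plain C so it runs without a GPU. The device_info_t struct and select_discrete_devices function are hypothetical stand-ins, not Chapel runtime code; in a real runtime the flag would presumably be copied from the "integrated" field of the properties struct filled in by cudaGetDeviceProperties or hipGetDeviceProperties.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of the vendor properties struct that
 * matters here. Only the "integrated" flag mirrors a real field. */
typedef struct {
  int id;          /* physical device id as reported by the driver */
  int integrated;  /* nonzero if the device is an integrated GPU */
} device_info_t;

/* Copy the ids of all non-integrated devices into out_ids, preserving
 * driver order, and return how many were kept. out_ids must have room
 * for n entries. */
int select_discrete_devices(const device_info_t* devs, int n, int* out_ids) {
  assert(n >= 0);
  assert(n == 0 || (devs != NULL && out_ids != NULL));
  int kept = 0;
  for (int i = 0; i < n; i++) {
    if (!devs[i].integrated) {
      out_ids[kept++] = devs[i].id;
    }
  }
  return kept;
}
```

Unlike skipping the first N devices, this approach does not depend on integrated GPUs having the lowest ids, and it handles any number of integrated or discrete devices. Whether the flag is reported reliably on every system, including Grace Hopper, would still need to be checked on real hardware.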