Open e-kayrakli opened 1 year ago
Looks like the "device properties" types on both CUDA and HIP have a field "integrated". Can we use that field to ignore some devices?
CUDA: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0 HIP: https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/structhip_device_prop__t.html
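To make the idea concrete, here is a minimal sketch of filtering on that flag. It is not Chapel runtime code: `DeviceInfo` and `discreteDeviceIds` are hypothetical names, and the struct is a simplified stand-in for what a runtime would fill in from the `integrated` field of `cudaDeviceProp` or `hipDeviceProp_t` (obtained via `cudaGetDeviceProperties` or `hipGetDeviceProperties`). Keeping the selection logic free of vendor headers lets it be exercised without a GPU.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for the per-device information a runtime
// would read from the CUDA or HIP device-properties struct.
struct DeviceInfo {
  int id;           // device ordinal as reported by the driver
  bool integrated;  // mirrors the `integrated` field of the properties struct
};

// Returns the ordinals of all devices that are not flagged as integrated,
// preserving the driver's ordering.
std::vector<int> discreteDeviceIds(const std::vector<DeviceInfo>& devices) {
  std::vector<int> ids;
  for (const DeviceInfo& d : devices) {
    if (!d.integrated) {
      ids.push_back(d.id);
    }
  }
  return ids;
}
```

Filtering on the flag, rather than on position, does not depend on where the integrated device appears in the driver's ordering.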
https://github.com/chapel-lang/chapel/issues/22754 uncovered several issues with such systems, which are predominantly non-HPC, consumer-grade systems, I would presume. As Chapel wants to be portable across a wide spectrum of systems, and developing code on a personal system and then running it on a supercomputer is one of Chapel's appeals, I think we should support such systems.
However, we typically assume homogeneity among locales (not within). I don't know how hard-coded that expectation is. But I don't think we have much experience running on 2 locales where core counts differ, for example, as that's an uncommon setup. By analogy, so is having two GPU sublocales with different characteristics, such as the number of cores in them. So, before we jump into trying to run across both the integrated and the discrete GPU on a system (assuming that we can even run on the former), I think we should add the ability to blissfully ignore the integrated GPU.
Under the original issue, a combination of a very hacky runtime patch and using `on here.gpus[1]` made sure that the user always uses the discrete GPU. The runtime hack:

I also have a more proper patch that is untested on a real system with integrated GPUs. The patch is here: https://github.com/chapel-lang/chapel/files/12083995/ignoregpus.patch. It requires `CHPL_RT_NUM_IGNORED_GPUS=N` to be set, which causes the runtime to skip the first `N` GPUs while initializing. This also shrinks `here.gpus` arrays properly while making `gpus[0]` the first discrete device. The big question with that patch is whether integrated GPUs always have lower ids. I have no idea whether there can be more than one integrated GPU, or whether a system with an integrated GPU can have more than one discrete GPU.

In writing all of this, I am also curious whether Grace Hopper falls into the "integrated" category here.
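For reference, the skip-first-N behavior described for `CHPL_RT_NUM_IGNORED_GPUS` could look roughly like the sketch below. This is not the code from the linked patch: `parseIgnoredCount` and `skipFirstDevices` are hypothetical helpers, and treating a missing, malformed, or negative value as 0 is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Parses a count such as the value of CHPL_RT_NUM_IGNORED_GPUS.
// Assumption: null, empty, malformed, or negative values mean "ignore none".
std::size_t parseIgnoredCount(const char* value) {
  if (value == nullptr || *value == '\0') return 0;
  char* end = nullptr;
  long n = std::strtol(value, &end, 10);
  if (*end != '\0' || n < 0) return 0;
  return static_cast<std::size_t>(n);
}

// Drops the first `numIgnored` device ordinals, so that the first remaining
// device ends up at index 0 of the returned list.
std::vector<int> skipFirstDevices(const std::vector<int>& ids,
                                  std::size_t numIgnored) {
  if (numIgnored >= ids.size()) return {};
  return std::vector<int>(
      ids.begin() + static_cast<std::ptrdiff_t>(numIgnored), ids.end());
}
```

A caller would combine the two as `skipFirstDevices(ids, parseIgnoredCount(std::getenv("CHPL_RT_NUM_IGNORED_GPUS")))`. Skipping by count only yields the discrete devices if the integrated ones occupy the lowest ordinals, which is exactly the open question above.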