e-kayrakli closed this 1 month ago
@jhh67 I added you as an assignee because I believe this is fallout from https://github.com/chapel-lang/chapel/pull/25734. I think chpl_topo_selectMyDevices sees 0 devices with the problematic setting, which causes the issue down the line. Could you take a look when you get the chance?
If anyone else bumps into this, I am working with the following hack to get this mode into a relatively more workable state:
diff --git a/runtime/src/gpu/nvidia/gpu-nvidia.c b/runtime/src/gpu/nvidia/gpu-nvidia.c
index d7e93173f3..0069b87002 100644
--- a/runtime/src/gpu/nvidia/gpu-nvidia.c
+++ b/runtime/src/gpu/nvidia/gpu-nvidia.c
@@ -169,7 +169,7 @@ void chpl_gpu_impl_init(int* num_devices) {
   chpl_topo_pci_addr_t *addrs = chpl_malloc(sizeof(*addrs) * numAddrs);
   int rc = chpl_topo_selectMyDevices(allAddrs, addrs, &numAddrs);
-  if (rc) {
+  if (true) {
     chpl_warning("unable to select GPUs for this locale, using them all",
                  0, 0);
     for (int i = 0; i < numAllDevices; i++) {
How do I replicate this problem?
Probably the key configs are:
CHPL_LLVM: system # for GPU support
CHPL_LOCALE_MODEL: gpu
CHPL_COMM: gasnet
CHPL_COMM_SUBSTRATE: udp
CHPL_GASNET_SEGMENT: everything
GASNET_SPAWNFN: L
Compiling and running the code in the OP with -nl2 should generate the incorrect result.
Resolved by PR https://github.com/chapel-lang/chapel/pull/26059.
The runtime doesn't seem to report the correct number of devices in this config.
reports 0 GPUs for each locale when run with more than 1 locale. If you run this with -nl1 in the given config, we get the correct number of GPUs. Things must be fine with an actual multilocale config, as we have a ton of nightly testing for that, but not really for the oversubscribed config with GPUs.

How to share multiple GPUs in an oversubscribed setting is not something we have completely answered. However, we have been giving all locales all GPUs and letting the GPU driver figure things out, which I believe just serializes requests from different processes. I think we should fix this and go back to that world.
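For anyone following along, here is a minimal, standalone sketch of the "give every locale all GPUs" fallback described above. It is not the runtime code and not necessarily what the eventual fix does. The names select_my_devices, pick_devices, and MAX_DEVICES are hypothetical, and the stub simply simulates the suspected failure mode (selection "succeeds" but yields 0 devices). The assumption illustrated is that an empty selection gets treated the same as a selection error.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEVICES 16  /* hypothetical upper bound, for this sketch only */

/* Hypothetical stand-in for the runtime's device selection. Returns nonzero
 * on failure; on success it writes the chosen device ids to `out` and their
 * count to `*count`. This stub simulates the suspected bug: it reports
 * success but selects 0 devices. */
static int select_my_devices(const int *all, int numAll, int *out, int *count) {
  (void)all; (void)numAll; (void)out;
  *count = 0;
  return 0;
}

/* Fills `out` with the devices this locale should use and returns how many.
 * Assumption: an empty selection is handled like an error, so the locale
 * falls back to using every device and lets the driver arbitrate. */
static int pick_devices(const int *all, int numAll, int *out) {
  assert(numAll >= 0 && numAll <= MAX_DEVICES);

  int count = 0;
  int rc = select_my_devices(all, numAll, out, &count);

  if (rc != 0 || count == 0) {
    fprintf(stderr,
            "warning: unable to select GPUs for this locale, using them all\n");
    memcpy(out, all, sizeof(*all) * (size_t)numAll);
    count = numAll;
  }
  return count;
}
```

Calling pick_devices with four device ids and the stub above returns 4 and copies all four ids into the output array, which is the oversubscribed behavior we had before: every co-located locale sees every GPU.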