chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Oversubscribed gasnet with GPU support is broken #25989

Closed e-kayrakli closed 1 month ago

e-kayrakli commented 1 month ago

The runtime doesn't seem to report the correct number of devices in this config.

for loc in Locales do on loc {
  writeln(here, " ", here.gpus.size);
}

This reports 0 GPUs for each locale when run with more than one locale. Running it with -nl1 in the same config reports the correct number of GPUs. True multilocale configs are presumably fine, since we have a ton of nightly testing for them, but we have very little testing for the oversubscribed config with GPUs.

How to share multiple GPUs in an oversubscribed setting is a question we have not fully answered. Until now, however, we have been giving all locales all GPUs and letting the GPU driver figure things out, which I believe just serializes requests from different processes. I think we should fix this and go back to that behavior.

e-kayrakli commented 1 month ago

@jhh67 I added you as an assignee as I believe this is fallout from https://github.com/chapel-lang/chapel/pull/25734. I think chpl_topo_selectMyDevices sees 0 devices with the problematic setting, which causes the issue down the line. Could you take a look when you get the chance?
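To illustrate the suspected failure mode, here is a minimal, self-contained C sketch. It is not the runtime code and not the fix that was eventually merged. `select_fn`, `select_none`, `select_first`, and `pick_devices` are hypothetical names invented for this example, and the assumption that selection can report success (return 0) while choosing zero devices is only the hypothesis stated above. Under that assumption, a check on the return code alone never triggers the "use them all" fallback, so the sketch also treats an empty selection as a reason to fall back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-selection routine such as
 * chpl_topo_selectMyDevices: returns 0 on success and stores the number
 * of devices chosen for this locale in *count. */
typedef int (*select_fn)(int numAll, int *selected, int *count);

/* Simulates the suspected bug: reports success but selects nothing. */
int select_none(int numAll, int *selected, int *count) {
  (void)numAll;
  (void)selected;
  *count = 0;
  return 0;
}

/* Simulates a working partition that gives this locale device 0 only. */
int select_first(int numAll, int *selected, int *count) {
  if (numAll < 1) return 1;  /* nothing to select from: report failure */
  selected[0] = 0;
  *count = 1;
  return 0;
}

/* Fills devices[] with the device indices this locale should use and
 * returns how many there are. Falls back to all devices when selection
 * fails OR succeeds with an empty result (the assumed failure mode). */
int pick_devices(select_fn select, int numAll, int *devices) {
  assert(select != NULL && devices != NULL && numAll >= 0);
  int count = 0;
  int rc = select(numAll, devices, &count);
  if (rc != 0 || count == 0) {
    for (int i = 0; i < numAll; i++) devices[i] = i;
    count = numAll;
  }
  return count;
}
```

With four devices, `pick_devices(select_none, 4, devices)` returns 4 (the fallback), while `pick_devices(select_first, 4, devices)` returns 1 and leaves the normal selection untouched.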

e-kayrakli commented 1 month ago

If anyone else bumps into this, I am working with the following hack to get this mode into a somewhat more workable state:

diff --git a/runtime/src/gpu/nvidia/gpu-nvidia.c b/runtime/src/gpu/nvidia/gpu-nvidia.c
index d7e93173f3..0069b87002 100644
--- a/runtime/src/gpu/nvidia/gpu-nvidia.c
+++ b/runtime/src/gpu/nvidia/gpu-nvidia.c
@@ -169,7 +169,7 @@ void chpl_gpu_impl_init(int* num_devices) {
   chpl_topo_pci_addr_t *addrs = chpl_malloc(sizeof(*addrs) * numAddrs);

   int rc = chpl_topo_selectMyDevices(allAddrs, addrs, &numAddrs);
-  if (rc) {
+  if (true) {
     chpl_warning("unable to select GPUs for this locale, using them all",
                  0, 0);
     for (int i = 0; i < numAllDevices; i++) {

jhh67 commented 1 month ago

How do I replicate this problem?

e-kayrakli commented 1 month ago

Probably the key configs are:

CHPL_LLVM: system  # for GPU support
CHPL_LOCALE_MODEL: gpu
CHPL_COMM: gasnet
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: everything
GASNET_SPAWNFN: L

Compiling the code from the original post and running it with -nl2 should reproduce the incorrect result.

jhh67 commented 1 month ago

Resolved by PR https://github.com/chapel-lang/chapel/pull/26059.