flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

flux-core filters out an allocated GPU #3375

Open dongahn opened 3 years ago

dongahn commented 3 years ago

Top level

rzansel16{dahn}28: env PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 4 --bind=none --smpiargs="-disable_gpu_hooks" ./blueos_3_ppc64le_ib_p9/bin/flux start

From a proxy

rzansel61{dahn}27: flux mini alloc -n4 -N4 -c20 -g3
2020-11-26T05:05:44.912932Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu2
2020-11-26T05:05:46.531492Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu2
2020-11-26T05:05:46.532046Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu2
2020-11-26T05:05:47.439508Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu2
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       80        8
 allocated      0        0        0
      down      0        0        0
rzansel16{dahn}23: echo $CUDA_VISIBLE_DEVICES
0,1,2

From the top level

flux job info fBAY1gobV R
{"version": 1, "execution": {"R_lite": [{"rank": "0-3", "children": {"core": "0-19", "gpu": "0-2"}}], "nodelist": ["rzansel[16,18,47,49]"], "starttime": 1606367143, "expiration": 1606971943}}

Apparently, the top-level scheduler creates the correct resource set, but the nested instances couldn't discover one of the GPUs.

dongahn commented 3 years ago

Interestingly enough, the nested allocation seems to miss a GPU when it is allocated 20 or fewer cores, which is the number of cores on a socket.

rzansel61{dahn}32: flux mini alloc -n4 -N4 -c21 -g3
2020-11-26T05:22:49.937768Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu3
2020-11-26T05:22:51.539299Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu3
2020-11-26T05:22:51.541808Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu3
2020-11-26T05:22:52.452309Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu3
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       84       12
 allocated      0        0        0
      down      0        0        0
exit
rzansel61{dahn}34: flux mini alloc -n4 -N4 -c18 -g3
2020-11-26T05:23:35.357347Z resource.err[0]: verify: rank 0 (rzansel16) missing resources: gpu[2-3]
2020-11-26T05:23:36.964271Z resource.err[2]: verify: rank 2 (rzansel47) missing resources: gpu[2-3]
2020-11-26T05:23:36.970715Z resource.err[1]: verify: rank 1 (rzansel18) missing resources: gpu[2-3]
2020-11-26T05:23:37.891596Z resource.err[3]: verify: rank 3 (rzansel49) missing resources: gpu[2-3]
rzansel16{dahn}21: flux resource list
     STATE NNODES   NCORES    NGPUS
      free      4       72        8
 allocated      0        0        0
      down      0        0        0
dongahn commented 3 years ago

I think I found the problem. With the process binding done at the top level, it appears one socket is filtered out for the nested instance in such a way that a GPU on that socket is also filtered out.

@grondo or @SteVwonder: do you think it is possible to not filter out a socket when a GPU on it is allocated (that is, even if no core has been allocated from that socket)?

rzansel61{dahn}25: flux mini alloc -n1 -c20 -g3
2020-11-26T06:09:56.587278Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu[2-3]
node visited
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
rzansel49{dahn}21: exit
exit
[detached: session exiting]
rzansel61{dahn}26: flux mini alloc -n1 -c21 -g3
2020-11-26T06:10:14.683571Z resource.err[0]: verify: rank 0 (rzansel49) missing resources: gpu3
node visited
numanode visited
socket visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
grondo commented 3 years ago

I fear I am not an hwloc expert. Currently we call hwloc_topology_restrict() with the current allowed cpuset and pass no flags. The docs state:

Topology topology is modified so as to remove all objects that are not included (or partially included) in the CPU set cpuset. All objects CPU and node sets are restricted accordingly.

Perhaps we should be using at least one of the ADAPT flags, so that objects are moved to ancestors during hwloc_topology_restrict()?

dongahn commented 3 years ago

I just confirmed that in this case, flux-core doesn't export the missing GPU in hwloc mode.

rzansel61{dahn}106: cat nest.form.xml | grep -i coproc
            <info name="CoProcType" value="CUDA"/>
            <info name="CoProcType" value="CUDA"/>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
  <object type="Machine" os_index="0" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
    <page_type size="65536" count="0"/>
    <page_type size="2097152" count="0"/>
    <page_type size="1073741824" count="0"/>
    <info name="PlatformName" value="PowerNV"/>
    <info name="PlatformModel" value="PowerNV 8335-GTW"/>
    <info name="Backend" value="Linux"/>
    <info name="LinuxCgroup" value="/allocation_599328"/>
    <info name="OSName" value="Linux"/>
    <info name="OSRelease" value="4.14.0-115.21.2.1chaos.ch6a.ppc64le"/>
    <info name="OSVersion" value="#1 SMP Fri May 22 11:01:06 PDT 2020"/>
    <info name="HostName" value="rzansel18"/>
    <info name="Architecture" value="ppc64le"/>
    <info name="hwlocVersion" value="1.11.10"/>
    <info name="ProcessName" value="broker"/>
    <object type="NUMANode" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100" local_memory="137166913536">
      <page_type size="65536" count="2093001"/>
      <page_type size="2097152" count="0"/>
      <page_type size="1073741824" count="0"/>
      <object type="Package" os_index="8" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
        <info name="CPUModel" value="POWER9, altivec supported"/>
        <info name="CPURevision" value="2.1 (pvr 004e 1201)"/>
        <object type="Core" os_index="2140" cpuset="0x0000f000,,,,,0x0" complete_cpuset="0x0000f000,,,,,0x0" online_cpuset="0x0000f000,,,,,0x0" allowed_cpuset="0x0000f000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100">
          <object type="PU" os_index="172" cpuset="0x00001000,,,,,0x0" complete_cpuset="0x00001000,,,,,0x0" online_cpuset="0x00001000,,,,,0x0" allowed_cpuset="0x00001000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="173" cpuset="0x00002000,,,,,0x0" complete_cpuset="0x00002000,,,,,0x0" online_cpuset="0x00002000,,,,,0x0" allowed_cpuset="0x00002000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="174" cpuset="0x00004000,,,,,0x0" complete_cpuset="0x00004000,,,,,0x0" online_cpuset="0x00004000,,,,,0x0" allowed_cpuset="0x00004000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
          <object type="PU" os_index="175" cpuset="0x00008000,,,,,0x0" complete_cpuset="0x00008000,,,,,0x0" online_cpuset="0x00008000,,,,,0x0" allowed_cpuset="0x00008000,,,,,0x0" nodeset="0x00000100" complete_nodeset="0x00000100" allowed_nodeset="0x00000100"/>
        </object>
      </object>
      <object type="Bridge" os_index="9" bridge_type="0-1" depth="0" bridge_pci="0033:[00-01]">
        <object type="PCIDev" os_index="53481472" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.0" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
          <info name="PCIVendor" value="Mellanox Technologies"/>
          <info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
          <object type="OSDev" name="hsi2" osdev_type="2">
            <info name="Address" value="20:00:15:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d2"/>
            <info name="Port" value="1"/>
          </object>
          <object type="OSDev" name="mlx5_2" osdev_type="3">
            <info name="NodeGUID" value="ec0d:9a03:00ca:bbd2"/>
            <info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
            <info name="Port1State" value="4"/>
            <info name="Port1LID" value="0xc1"/>
            <info name="Port1LMC" value="0"/>
            <info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd2"/>
          </object>
        </object>
        <object type="PCIDev" os_index="53481473" name="Mellanox Technologies MT28800 Family [ConnectX-5 Ex]" pci_busid="0033:01:00.1" pci_type="0207 [15b3:1019] [1014:0617] 00" pci_link_speed="0.000000">
          <info name="PCIVendor" value="Mellanox Technologies"/>
          <info name="PCIDevice" value="MT28800 Family [ConnectX-5 Ex]"/>
          <object type="OSDev" name="hsi3" osdev_type="2">
            <info name="Address" value="20:00:1d:08:fe:80:00:00:00:00:00:00:ec:0d:9a:03:00:ca:bb:d3"/>
            <info name="Port" value="1"/>
          </object>
          <object type="OSDev" name="mlx5_3" osdev_type="3">
            <info name="NodeGUID" value="ec0d:9a03:00ca:bbd3"/>
            <info name="SysImageGUID" value="ec0d:9a03:00ca:bbd0"/>
            <info name="Port1State" value="4"/>
            <info name="Port1LID" value="0xeb"/>
            <info name="Port1LMC" value="0"/>
            <info name="Port1GID0" value="fe80:0000:0000:0000:ec0d:9a03:00ca:bbd3"/>
          </object>
        </object>
      </object>
      <object type="Bridge" os_index="11" bridge_type="0-1" depth="0" bridge_pci="0035:[00-09]">
        <object type="PCIDev" os_index="55586816" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:03:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="card3" osdev_type="1"/>
          <object type="OSDev" name="renderD130" osdev_type="1"/>
          <object type="OSDev" name="cuda1" osdev_type="5">
            <info name="CoProcType" value="CUDA"/>
            <info name="Backend" value="CUDA"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="CUDAGlobalMemorySize" value="16515072"/>
            <info name="CUDAL2CacheSize" value="6144"/>
            <info name="CUDAMultiProcessors" value="80"/>
            <info name="CUDACoresPerMP" value="64"/>
            <info name="CUDASharedMemorySizePerMP" value="48"/>
          </object>
          <object type="OSDev" name="nvml2" osdev_type="1">
            <info name="Backend" value="NVML"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="NVIDIASerial" value="0320618037406"/>
            <info name="NVIDIAUUID" value="GPU-20e492d3-d7e0-c6a3-08c7-edbd8ca6065e"/>
          </object>
        </object>
        <object type="PCIDev" os_index="55590912" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0035:04:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="15.753846">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="renderD131" osdev_type="1"/>
          <object type="OSDev" name="card4" osdev_type="1"/>
          <object type="OSDev" name="cuda2" osdev_type="5">
            <info name="CoProcType" value="CUDA"/>
            <info name="Backend" value="CUDA"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="CUDAGlobalMemorySize" value="16515072"/>
            <info name="CUDAL2CacheSize" value="6144"/>
            <info name="CUDAMultiProcessors" value="80"/>
            <info name="CUDACoresPerMP" value="64"/>
            <info name="CUDASharedMemorySizePerMP" value="48"/>
          </object>
          <object type="OSDev" name="nvml3" osdev_type="1">
            <info name="Backend" value="NVML"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="NVIDIASerial" value="0320618038035"/>
            <info name="NVIDIAUUID" value="GPU-0ec95f76-d8a3-8fc9-866f-c3bd783e1484"/>
          </object>
        </object>
      </object>
    </object>
  </object>
</topology>
dongahn commented 3 years ago

Perhaps we should be using at least one of the ADAPT flags, so that objects are moved to ancestors during hwloc_topology_restrict()?

Neither flag seems to work on its own.

dongahn commented 3 years ago

Making the flag change at both restrict call sites seems to include the GPU, but with an hwloc warning:

rzansel61{dahn}37: git diff
diff --git a/src/common/librlist/rhwloc.c b/src/common/librlist/rhwloc.c
index da5c92278..da1d54848 100644
--- a/src/common/librlist/rhwloc.c
+++ b/src/common/librlist/rhwloc.c
@@ -81,7 +81,7 @@ hwloc_topology_t rhwloc_local_topology_load (void)
     if (!(rset = hwloc_bitmap_alloc ())
         || (hwloc_get_cpubind (topo, rset, HWLOC_CPUBIND_PROCESS) < 0))
         goto err;
-    if (hwloc_topology_restrict (topo, rset, 0) < 0)
+    if (hwloc_topology_restrict (topo, rset, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         goto err;
     hwloc_bitmap_free (rset);
     return (topo);
diff --git a/src/shell/affinity.c b/src/shell/affinity.c
index 4537dcdac..01621d1e3 100644
--- a/src/shell/affinity.c
+++ b/src/shell/affinity.c
@@ -32,7 +32,8 @@ struct shell_affinity {
  */
 static int topology_restrict (hwloc_topology_t topo, hwloc_cpuset_t set)
 {
-    if (hwloc_topology_restrict (topo, set, 0) < 0)
+    if (hwloc_topology_restrict (topo, set, HWLOC_RESTRICT_FLAG_ADAPT_IO) < 0)
         return (-1);
     return (0);
 }
rzansel32{dahn}25: env PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 1 --bind=none --smpiargs="-disable_gpu_hooks" bin/flux start
node visited
numanode visited
socket visited
gpu visited: 0
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
flux mini alloc -N 1 -n1 -c 1 -g3 flux resource list
0.092s: flux-shell[0]: Jobspec does not contain data-staging attributes. No staging necessary.
****************************************************************************
* hwloc has encountered an out-of-order XML topology load.
* Object NUMANode cpuset 0x0000f000,,,,,0x0 complete 0x0000f000,,,,,0x0
* was inserted after object HostBridge with none and none.
* The error occured in hwloc 1.11.10 inside process `broker', while
* the input XML was generated by hwloc 1.11.10 inside process `broker'.
* Please check that your input topology XML file is valid.
****************************************************************************
2020-11-28T20:10:18.367473Z resource.err[0]: verify: rank 0 (rzansel32) missing resources: core0,gpu[1-3]
node visited
gpu visited: 1
numanode visited
socket visited
gpu visited: 2
gpu visited: 3
     STATE NNODES   NCORES    NGPUS
      free      1        1        3
 allocated      0        0        0
      down      0        0        0
dongahn commented 3 years ago

  hwloc has encountered an out-of-order XML topology load.
  Object NUMANode cpuset 0x0000f000,,,,,0x0 complete 0x0000f000,,,,,0x0
  was inserted after object HostBridge with none and none.
  The error occured in hwloc 1.11.10 inside process `broker', while
  the input XML was generated by hwloc 1.11.10 inside process `broker'.
  Please check that your input topology XML file is valid.

It turns out hwloc prints this message when it loads a restricted hwloc XML that was created with HWLOC_RESTRICT_FLAG_ADAPT_IO. While this seems to solve the problem of missing GPUs, it didn't feel right to trade it for this error message. I did confirm that the change didn't cause any test failures, though.

We probably want to bounce this off the hwloc team first before doing anything here.

SteVwonder commented 3 years ago

@dongahn: after re-reading flux-framework/flux-sched#658, it looks like this is only a limitation of hwloc 1.x (at least for the case documented in that issue). I wonder if your use case would also be handled correctly by hwloc 2.x+.