flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Nested Instances not seeing all GPU resources #3193

Open FrankD412 opened 3 years ago

FrankD412 commented 3 years ago

When testing the nested instances we're starting with GPUs allocated, the nested broker doesn't seem to see all of the GPU resources. I was able to confirm this by getting the local URI and logging in. In each of our flux mini run calls we ask for -g 1; with that request, only the first one starts and the rest sit in the pending (PD) state waiting on resources (if I remove the -g in our workflow, they all run). However, the broker that was started should have all 4 GPUs on the node, so I'm confused why it thinks there are fewer. I've confirmed via the jobspec that the highest-level Flux instance is allocating 4 GPUs, and that request is satisfied when the job starts running.

In discussion with @dongahn -- I have the following information.

Flux version

flux version
commands:           0.17.0
libflux-core:       0.17.0
broker:         0.17.0
FLUX_URI:       local:///var/tmp/flux-SjQDZA/0/local
build-options:      +hwloc==1.11.6

Module listing

flux module list
Module                   Size Digest  Idle  S Service
job-exec              1465960 772EA46   22  S
job-manager           1530544 3AD17B6   22  S
connector-local       1240536 CAD956C    0  R 59021-shell-647365656576,59021-shell-283669168128,59021-shell-463286042624,59021-shell-99706994688
kvs-watch             1490496 57FED1B   22  S
resource              1396392 E0D79A6   11  S
barrier               1249464 D57CC8D   12  S
cron                  1391416 B0B6100    0  S
job-ingest            1410344 FB08FCB   20  S
kvs                   1802256 8D3FDB4    0  S
job-info              1624112 D5D9876   12  S
aggregator            1261616 CEEF1E2   12  S
content-sqlite        1253800 BFB45C1   22  S content-backing,kvs-checkpoint
sched-fluxion-qmanag  8837544 450DF83   22  S sched
sched-fluxion-resour 24476904 E90F9BF   22  S

The resources that the job sees from the master Flux instance (using flux job info <JOBID> R):

{"version": 1, "execution": {"R_lite": [{"rank": "3", "node": "lassen8", "children": {"core": "0-19", "gpu": "0-3"}}], "starttime": 1599092109, "expiration": 1599696909}}

It was confirmed that the nested broker only sees a single GPU:

<object type="PCIDev" os_index="4210688" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0004:04:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="1.969231">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="card1" osdev_type="1"/>
          <object type="OSDev" name="renderD128" osdev_type="1"/>
          <object type="OSDev" name="cuda0" osdev_type="5">
            <info name="CoProcType" value="CUDA"/>
            <info name="Backend" value="CUDA"/>
            <info name="GPUVendor" value="NVIDIA Corporation"/>
            <info name="GPUModel" value="Tesla V100-SXM2-16GB"/>
            <info name="CUDAGlobalMemorySize" value="16515072"/>
            <info name="CUDAL2CacheSize" value="6144"/>
            <info name="CUDAMultiProcessors" value="80"/>
            <info name="CUDACoresPerMP" value="64"/>
            <info name="CUDASharedMemorySizePerMP" value="48"/>
</object>
<object type="PCIDev" os_index="4214784" name="NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB]" pci_busid="0004:05:00.0" pci_type="0302 [10de:1db1] [10de:1212] a1" pci_link_speed="1.969231">
          <info name="PCIVendor" value="NVIDIA Corporation"/>
          <info name="PCIDevice" value="GV100GL [Tesla V100 SXM2 16GB]"/>
          <object type="OSDev" name="card2" osdev_type="1"/>
          <object type="OSDev" name="renderD129" osdev_type="1"/>
</object>
flux hwloc topology | grep "CoProcType"
            <info name="CoProcType" value="CUDA"/>
grondo commented 3 years ago

Are you able to try reproducing with flux-core v0.18.0?

This sounds like it could be the issue @SteVwonder fixed in 9464337e764d268a1525cc0bba828f7ed13b5932.

FrankD412 commented 3 years ago

@grondo -- An important consideration for us is whether or not the Python bindings have changed between flux-core@0.18.0 and flux-core@0.17.0. Will the JobspecV1 class still work in the newer flux-core? I'll go check out the bindings on my own in a second, but figured it was worth asking.

grondo commented 3 years ago

I don't believe there were any changes in the JobspecV1 class since v0.17, but let me check.

FrankD412 commented 3 years ago

Appreciated -- so far, it looks like it might be a drop-in replacement from what I can tell. I know that the Maestro backend passes in attributes and other things directly, so as long as that's still the case I think it might work.

grondo commented 3 years ago
$ git log  --pretty='%h %s' v0.17.0..v0.18.0 -- src/bindings/python/flux/job.py
9af79437a python: add JobID class
aafa076ea bindings/python: Fix pylint ungrouped-imports
4f23f4450 bindings/python: Fix pylint import issue
8ff300357 bindings/python: Fix pylint invalid-name
f2cf610f2 bindings/python: Fix pylint unidiomatic-typecheck
cbabe5d89 bindings/python: Fix pylint no-else-raise
59ef0e694 bindings/python: Fix pylint no-else-return
2a3cffabd python: add Jobspec interface to 'mini batch'
f57ed089d python: add stdio properties to jobspec

I think we are trying not to break existing interfaces in the Python API (as much as possible with a quickly moving target)

None of the above changes look like they would break existing use of JobspecV1. The main thing is the addition of the stdin, stdout and stderr properties.

FrankD412 commented 3 years ago

Got it -- I notice the addition of the JobID class. Prior to that, 0.17.0 just passed around integers. I'll look into that and see if I need to retrofit my existing solution.

grondo commented 3 years ago

Jobids are still, in essence, integers. The JobID class is a convenience subclass of int that allows encoding and decoding Flux jobids to and from other formats (e.g. "F58", a base58 encoding; hexadecimal; KVS path; etc.). Therefore, I don't think you have to retrofit, though it may be convenient at some point.

Also take a look at our most recent tag v0.19.0. In this version we've abstracted the JobInfo class used by flux jobs for easier access to job properties, and a JobList class for easier job listing.
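
For example, a minimal sketch of using JobID alongside existing integer jobids (the f58 and kvs property names are taken from later flux-core documentation, so treat them as assumptions for the v0.18-era class):

import flux.job

# JobID is an int subclass, so code that passed plain integer jobids around
# (as in 0.17.0) keeps working unchanged.
jobid = flux.job.JobID(1610847617024)
assert jobid == 1610847617024

print(jobid.f58)   # compact F58 encoding, e.g. for display
print(jobid.kvs)   # KVS path for the job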

dongahn commented 3 years ago

@grondo and @FrankD412: I had the same trouble with our July release, but I couldn't narrow it down exactly due to other things on my plate yesterday (more testing needed). If this can wait a few days, I can take a look, if that's okay with @FrankD412.

grondo commented 3 years ago

Yeah, if this is still reproducible in v0.18.0 then we'll need help debugging. My intuition is that it is related to what HW objects hwloc is ignoring by default when using hwloc_topology_restrict(3).

dongahn commented 3 years ago

Yeah, I have a similar thought. But the fact that this is only happening with the nested instance also tells me it might be related to how our exec system binds the flux broker process, since this could affect how the hwloc XML is generated for the nested instance on Lassen.

I need some time to do more testing.

grondo commented 3 years ago

Yes, the exec system will bind the flux-broker process to the core ids in R_lite. A simple interactive test would be something like taskset -c 0-19 flux start flux hwloc info on one of the GPU lassen nodes.

dongahn commented 3 years ago

A simple interactive test would be something like taskset -c 0-19 flux start flux hwloc info on one of the GPU lassen nodes.

Exactly!

dongahn commented 3 years ago

BTW, I assumed the presence of CUDA_VISIBLE_DEVICES won't affect the nested hwloc.

FrankD412 commented 3 years ago

@grondo @dongahn Just a quick question -- is this an hwloc bug? Or how flux-core is interpreting what it gets. Is there a short-term fix that can unblock us? We currently are scheduled to have a DAT tomorrow and are trying to figure out if we can still make use of it.

dongahn commented 3 years ago

@grondo @dongahn Just a quick question -- is this an hwloc bug? Or how flux-core is interpreting what it gets. Is there a short-term fix that can unblock us? We currently are scheduled to have a DAT tomorrow and are trying to figure out if we can still make use of it.

I didn't know this was tomorrow. I will try to get to it this evening then.

grondo commented 3 years ago

One workaround might be to set -o cpu-affinity=off or the equivalent in the JobspecV1 class (can't remember how to set shell options off the top of my head). I'm assuming this would make all GPUs visible to the job. But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.

FrankD412 commented 3 years ago

@grondo @dongahn Just a quick question -- is this an hwloc bug? Or how flux-core is interpreting what it gets. Is there a short-term fix that can unblock us? We currently are scheduled to have a DAT tomorrow and are trying to figure out if we can still make use of it.

I didn't know this was tomorrow. I will try to get to it this evening then.

It's one of our weekly test DATs, so it would be nice to have something by then -- however, we will have one next week. Yes, this is critical for us -- but I'm happy if there's a workaround that doesn't derail you.

One workaround might be to set -o cpu-affinity=off or the equivalent in the JobspecV1 class (can't remember how to set shell options off the top of my head). I'm assuming this would make all GPUs visible to the job. But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.

I can give this a shot. I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.

grondo commented 3 years ago

is this an hwloc bug? Or how flux-core is interpreting what it gets.

I'm not really sure at this time. When a flux instance starts up it gathers hwloc topology information and then calls hwloc_topology_restrict(3) so that the topology is pruned of resources that are not currently accessible due to cpu affinity or other binding. This could be pruning GPUs that are somehow children of cores that are not in the job's assigned resource set, and therefore not in the current affinity of the flux-broker process.

Perhaps there is some flag we should be passing to ensure GPU or Coproc devices aren't dropped.

BTW, I assumed the presence of CUDA_VISIBLE_DEVICES won't affect the nested hwloc.

I haven't tried this, but according to @eleon CUDA_VISIBLE_DEVICES has no effect on libhwloc.

grondo commented 3 years ago

I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.

Yes, it would only be a required workaround for the nested instance that needs access to all GPUs.

BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet. 🤞

dongahn commented 3 years ago

But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.

This may negatively affect performance too much? He will have a long-running job occupying some 20 cores alongside the 20-CPU + 4-GPU flux instance. This may make all 44 CPUs + 4 GPUs visible to Fluxion, and ddcmd's can be over-scheduled to the cores where the long-running job is running.

dongahn commented 3 years ago

BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet.

I was testing v0.18 and saw similar issues. Still more testing is needed though.

FrankD412 commented 3 years ago

But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.

This may negatively affect performance too much? He will have a long-running job occupying some 20 cores alongside the 20-CPU + 4-GPU flux instance. This may make all 44 CPUs + 4 GPUs visible to Fluxion, and ddcmd's can be over-scheduled to the cores where the long-running job is running.

At this point I'm not worried about performance. We're currently just trying to make sure our workflow is operational. If it's slow, that's fine -- we just need things to run.

eleon commented 3 years ago

Shoot:

leon@pascal4:~$ lstopo-no-graphics | grep CoProc
              CoProc(OpenCL) "opencl1d0"
              CoProc(CUDA) "cuda0"
              CoProc(OpenCL) "opencl1d1"
              CoProc(CUDA) "cuda1"

leon@pascal4:~$ CUDA_VISIBLE_DEVICES=0 lstopo-no-graphics | grep CoProc
              CoProc(OpenCL) "opencl1d0"
              CoProc(CUDA) "cuda0"

leon@pascal4:~$ CUDA_VISIBLE_DEVICES=1 lstopo-no-graphics | grep CoProc
              CoProc(OpenCL) "opencl1d0"
              CoProc(CUDA) "cuda0"

leon@pascal4:~$ CUDA_VISIBLE_DEVICES=0,1 lstopo-no-graphics | grep CoProc
              CoProc(OpenCL) "opencl1d0"
              CoProc(CUDA) "cuda0"
              CoProc(OpenCL) "opencl1d1"
              CoProc(CUDA) "cuda1"

Still something weird as I cannot select the cuda1 device.

grondo commented 3 years ago

Very interesting @eleon!

dongahn commented 3 years ago

Ugh... this may be part of the problem. Interesting.

eleon commented 3 years ago

Similar issue on the ROCm side:

leon@corona107:~$ lstopo-no-graphics | grep CoProc
                CoProc(OpenCL) "opencl0d0"
                CoProc(OpenCL) "opencl0d1"
                CoProc(OpenCL) "opencl0d2"
                CoProc(OpenCL) "opencl0d3"

leon@corona107:~$ ROCR_VISIBLE_DEVICES=0 lstopo-no-graphics | grep CoProc
                CoProc(OpenCL) "opencl0d0"

leon@corona107:~$ ROCR_VISIBLE_DEVICES=1 lstopo-no-graphics | grep CoProc
                CoProc(OpenCL) "opencl0d0"

leon@corona107:~$ ROCR_VISIBLE_DEVICES=2 lstopo-no-graphics | grep CoProc
                CoProc(OpenCL) "opencl0d0"

leon@corona107:~$ ROCR_VISIBLE_DEVICES=3 lstopo-no-graphics | grep CoProc
                CoProc(OpenCL) "opencl0d0"

leon@corona107:~$ ROCR_VISIBLE_DEVICES=0,1  lstopo-no-graphics | grep CoProc
                CoProc(OpenCL) "opencl0d0"
                CoProc(OpenCL) "opencl0d1"

leon@corona107:~$ ROCR_VISIBLE_DEVICES=2,3  lstopo-no-graphics | grep CoProc
                CoProc(OpenCL) "opencl0d0"
                CoProc(OpenCL) "opencl0d1"

At this point, I wouldn't use this method to restrict the hwloc GPUs. I wish that hwloc either fully followed the environment variable guidance or ignored it entirely.

grondo commented 3 years ago

At this point, I wouldn't use this method to restrict the hwloc GPUs.

Unfortunately, flux doesn't really have much control over that (at least for now).

@eleon, do you know if there is a way to remove specific objects from an hwloc topology?

Long term, this is why we will move away from dynamic resource discovery and instead use R from the parent.

dongahn commented 3 years ago

Long term, this is why we will move away from dynamic resource discovery and instead use R from the parent.

+1

eleon commented 3 years ago

@grondo , unfortunately, not that I am aware of. I thought about this and looked through the hwloc API, but did not find a reasonable way of removing vertices from the tree. The only alternative I can think of is pruning the XML topology file, but it may be an involved operation since other vertices may need to be updated in addition to removing the GPU vertex.
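
For illustration, here is a rough sketch of what pruning the XML could look like using only Python's standard library: drop PCIDev subtrees whose CUDA OSDev child is not in a keep-list, then point hwloc at the edited file (e.g. via the HWLOC_XMLFILE environment variable). This only illustrates the idea; as noted above, a real solution would also have to update cpusets, indexes, and other bookkeeping, which this sketch does not do. The function and file names are hypothetical.

import xml.etree.ElementTree as ET

def prune_cuda_devices(xml_path, keep, out_path):
    # Remove PCIDev subtrees whose CUDA OSDev children are not in `keep`.
    tree = ET.parse(xml_path)
    for parent in list(tree.getroot().iter()):
        for obj in list(parent.findall("object")):
            if obj.get("type") != "PCIDev":
                continue
            cuda = [o.get("name") for o in obj.iter("object")
                    if o.get("type") == "OSDev"
                    and o.get("name", "").startswith("cuda")]
            if cuda and not any(name in keep for name in cuda):
                parent.remove(obj)
    tree.write(out_path)

# e.g. keep only cuda0 (file names are placeholders):
# prune_cuda_devices("node-topo.xml", {"cuda0"}, "node-topo-pruned.xml")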

SteVwonder commented 3 years ago

Still something weird as I cannot select the cuda1 device.

I wonder if you are getting the "cuda1" device, but that lstopo is printing the logical index rather than the physical one (and the GPUs are always indexed starting at 0). Does lstopo --physical behave the way you expect?

The logical indexing of course causes all sorts of issues when you start nesting. We should look into using the full UUID of the GPUs at some point.

eleon commented 3 years ago

Good thought, @SteVwonder . Here's what we have so far:

leon@corona107:~$ ROCR_VISIBLE_DEVICES=2,3  lstopo-no-graphics -p | grep CoProc
                CoProc(OpenCL) "opencl0d0"
                CoProc(OpenCL) "opencl0d1"

leon@pascal4:~$ CUDA_VISIBLE_DEVICES=1 lstopo-no-graphics -p | grep CoProc
              CoProc(OpenCL) "opencl1d0"
              CoProc(CUDA) "cuda0"

More testing needed...

FrankD412 commented 3 years ago

I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.

Yes, it would only be a required workaround for the nested instance that needs access to all GPUs.

BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet. 🤞

I added the ability to pass options to Flux through Maestro and ran into the following error. I feel like I'm not constructing something correctly, unless the whole -o option needs to be in quotations?

flux job attach 1610847617024
flux-job: task(s) exited with exit code 1
2020-09-04T02:18:40.753604Z broker.err[0]: rc2.0: cpu-affinity=off /p/gpfs1/fdinatal/roots/mummi_root_20200902/workspace/cganalysis-pfpatch_000000000009_pfpatch_000000000025_pfpatch_000000000031_pfpatch_000000010858.flux.sh error starting command (rc=1) 0.0s

The jobspec looks like:

"resources": [{"type": "node", "count": 1, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 20}, {"type": "gpu", "count": 4}], "label": "task"}]}], "tasks": [{"command": ["flux", "start", "-o", "cpu-affinity=off", "/p/gpfs1/fdinatal/roots/mummi_root_20200902/workspace/cganalysis-pfpatch_000000000009_pfpatch_000000000025_pfpatch_000000000031_pfpatch_000000000040.flux.sh"], "slot": "task", "count": {"per_slot": 1}}
SteVwonder commented 3 years ago

The -o cpu-affinity=off gets passed to flux mini run. You can do the equivalent thing through the Python API with:

jobspec = JobspecV1(["hostname"])
jobspec.setattr_shell_option("cpu-affinity", "off")

EDIT: just confirmed that the above code snippet is equivalent to flux mini run -o cpu-affinity=off hostname (minus the environment variables)
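
For completeness, a fuller sketch in the same vein, submitting the nested instance with the shell option set (this assumes the JobspecV1.from_command and flux.job.submit interfaces of roughly this era; the script path is a placeholder standing in for the cganalysis script above):

import flux
from flux.job import JobspecV1, submit

# Build the jobspec for the nested instance (1 node, 20 cores + 4 GPUs, as in
# the jobspec above) and disable the job shell's cpu-affinity, which ends up
# in the jobspec JSON as attributes.system.shell.options["cpu-affinity"] = "off".
jobspec = JobspecV1.from_command(
    ["flux", "start", "/path/to/cganalysis.flux.sh"],  # placeholder script path
    num_nodes=1, num_tasks=1, cores_per_task=20, gpus_per_task=4)
jobspec.setattr_shell_option("cpu-affinity", "off")

print(submit(flux.Flux(), jobspec))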

grondo commented 3 years ago

Sorry, I should have clarified, cpu-affinity=off should be set as a "job shell" option in the jobspec, not as an option to flux-start.

I.e., in the JSON jobspec, attributes.system.shell.options["cpu-affinity"] should be set to "off".

There may be a convenience method in JobspecV1 to set job shell options. I can look for that tomorrow.


FrankD412 commented 3 years ago

I modified our flux mini run and the result was much the same as before. When requesting GPUs, only one would launch and even then the ddcMD job wouldn't run at all. It sits at 0% utilization and doesn't appear in nvidia-smi as utilizing a GPU.

grondo commented 3 years ago

I modified our flux mini run and the result was much the same as before. When requesting GPUs, only one would launch and even then the ddcMD job wouldn't run at all. It sits at 0% utilization and doesn't appear in nvidia-smi as utilizing a GPU.

You'll want to make sure you are disabling cpu-affinity only for the subinstance and not for any other jobs. As @dongahn mentioned, this will allow the nested instance to "discover" all the CPUs, so it may overschedule jobs. However, libhwloc, and thus the nested scheduler, should be able to see all GPUs.

FrankD412 commented 3 years ago

I modified our flux mini run and the result was much the same as before. When requesting GPUs, only one would launch and even then the ddcMD job wouldn't run at all. It sits at 0% utilization and doesn't appear in nvidia-smi as utilizing a GPU.

You'll want to make sure you are disabling cpu-affinity only for the subinstance and not for any other jobs. As @dongahn mentioned, this will allow the nested instance to "discover" all the CPUs, so it may overschedule jobs. However, libhwloc, and thus the nested scheduler, should be able to see all GPUs.

Right -- I actually realized I was still scheduling with GPUs in my mini run so I'll give that a shot here in a moment.

dongahn commented 3 years ago

FYI -- I can circle back to this early this afternoon.

FrankD412 commented 3 years ago

FYI -- I can circle back to this early this afternoon.

No worries -- I have four ddcMD processes going on the node through Python, but they register as zombie processes. Trying to see if that's related.

dongahn commented 3 years ago

Ok. Talked to @FrankD412 by phone. I believe we have come up with a reasonable workaround for today's DAT so that he can quickly test the rest of the workflow. @FrankD412 will update us on that. If that's working, I will defer my testing to either this weekend or early next week.

FrankD412 commented 3 years ago

Alright, the workaround did get us past the zombie process issue. I ran into a different problem where ddcMD aborts due to a PMIx error, but that's likely a separate issue.

Just for reference (this happens only when using a subprocess in Python):

2020-09-04 11:51:00,693 - mummi.online:run:118 - INFO - cmd = /usr/gapps/kras/sierra/ddcmd-gpu8/bin/ddcMD-sierra -o object.data molecule.data
2020-09-04 11:51:05,704 - mummi.online:run:122 - INFO - Process Running? 1
2020-09-04 11:51:05,705 - mummi.online:run:124 - INFO - CUDA_VISIBLE_DEVICES=0
2020-09-04 11:51:05,705 - mummi.online:run:128 - INFO - ---------------- ddcMD stdout --------------

2020-09-04 11:51:05,705 - mummi.online:run:129 - ERROR - ---------------- ddcMD stderr --------------
[lassen13:44229] mca_base_component_repository_open: unable to open mca_schizo_flux.so: File not found (ignored)
[lassen13:44229] mca_base_component_repository_open: unable to open mca_pmix_flux.so: File not found (ignored)
[lassen13:44229] PMI_Init [../../../../../../opensrc/ompi/opal/mca/pmix/flux/pmix_flux.c:386:flux_init]: Operation failed
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  pmix init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[lassen13:44229] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
dongahn commented 3 years ago

Alright, the workaround did get us past the zombie process issue. I ran into a different problem where ddcMD aborts due to a PMIx error, but that's likely a separate issue.

Just for reference (this happens only when using a subprocess in Python):

Did you add -o mpi=spectrum to your flux mini run?

FrankD412 commented 3 years ago

Alright, the workaround did get us past the zombie process issue. I ran into a different problem where ddcMD aborts due to a PMIx error, but that's likely a separate issue.

Just for reference (this happens only when using a subprocess in Python):

Did you add -o mpi=spectrum to your flux mini run?

Yeah -- here's what it looks like: flux mini run -N 1 -n 1 -c 4 -o "mpi=spectrum" sh -c "export CUDA_VISIBLE_DEVICES=$CDEV ; /usr/gapps/kras/install/bin/autobind-12 cganalysis --simname $sim ...
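
(For the Python/Maestro side, roughly the equivalent of that command line through the bindings would be something like the sketch below; $CDEV, $sim, and the trailing "..." are placeholders copied from the command above, and note that, unlike flux mini run, this does not forward the caller's environment.)

import flux
from flux.job import JobspecV1, submit

# Same resources and shell option as the flux mini run line above.
cmd = ["sh", "-c",
       "export CUDA_VISIBLE_DEVICES=$CDEV ; "
       "/usr/gapps/kras/install/bin/autobind-12 cganalysis --simname $sim ..."]
jobspec = JobspecV1.from_command(cmd, num_nodes=1, num_tasks=1, cores_per_task=4)
jobspec.setattr_shell_option("mpi", "spectrum")
print(submit(flux.Flux(), jobspec))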

eleon commented 3 years ago

@grondo, @SteVwonder, @dongahn: CUDA_VISIBLE_DEVICES for NVIDIA GPUs is playing nicely with hwloc. Completing the experiments started in this thread:

leon@lassen15:~$ lstopo-no-graphics | grep -B1 CoProc
            PCI 0004:04:00.0 (3D)
              CoProc(CUDA) "cuda0"
--
            PCI 0004:05:00.0 (3D)
              CoProc(CUDA) "cuda1"
--
            PCI 0035:03:00.0 (3D)
              CoProc(CUDA) "cuda2"
--
            PCI 0035:04:00.0 (3D)
              CoProc(CUDA) "cuda3"

leon@lassen15:~$ CUDA_VISIBLE_DEVICES=3 lstopo-no-graphics | grep -B1 CoProc
            PCI 0035:04:00.0 (3D)
              CoProc(CUDA) "cuda0"

leon@lassen15:~$ CUDA_VISIBLE_DEVICES=1,2 lstopo-no-graphics | grep -B1 CoProc
            PCI 0004:05:00.0 (3D)
              CoProc(CUDA) "cuda0"
--
            PCI 0035:03:00.0 (3D)
              CoProc(CUDA) "cuda1"
eleon commented 3 years ago

Similarly good behavior on the AMD side:

leon@corona8:~$ lstopo-no-graphics | grep -B1 CoProc
              PCI 13:00.0 (Display)
                CoProc(OpenCL) "opencl0d0"
--
              PCI 23:00.0 (Display)
                CoProc(OpenCL) "opencl0d1"
--
              PCI 53:00.0 (Display)
                CoProc(OpenCL) "opencl0d2"
--
              PCI 73:00.0 (Display)
                CoProc(OpenCL) "opencl0d3"

leon@corona8:~$ ROCR_VISIBLE_DEVICES=2 lstopo-no-graphics | grep -B1 CoProc
              PCI 53:00.0 (Display)
                CoProc(OpenCL) "opencl0d0"

leon@corona8:~$ ROCR_VISIBLE_DEVICES=2,3 lstopo-no-graphics | grep -B1 CoProc
              PCI 53:00.0 (Display)
                CoProc(OpenCL) "opencl0d0"
--
              PCI 73:00.0 (Display)
                CoProc(OpenCL) "opencl0d1"
eleon commented 3 years ago

It's important to note that the CoProc hwloc identifiers (e.g., cudaX or openclXdY) are "relative" to the GPUs available to that process. To figure out which GPU(s) one is using on a multi-GPU node, use the PCI ID instead!

dongahn commented 3 years ago

Thanks @eleon.

dongahn commented 3 years ago

@eleon: Do you know if the logical IDs of other resources like cores and sockets show the same behavior with hwloc as well?

For example, if a process is pinned to two cores, say core1 in socket0 and core18 in socket1, when the process fetches its hwloc XML, would they be remapped to core0 in socket0 and core1 in socket1?

grondo commented 3 years ago

I think that after restricting the topology, hwloc remaps the logical IDs of all objects:

 grondo@fluke6:~$ lstopo-no-graphics --no-io --merge --restrict binding
Machine (15GB)
  Core L#0
    PU L#0 (P#0)
    PU L#1 (P#4)
  Core L#1
    PU L#2 (P#1)
    PU L#3 (P#5)
  Core L#2
    PU L#4 (P#2)
    PU L#5 (P#6)
  Core L#3
    PU L#6 (P#3)
    PU L#7 (P#7)
grondo@fluke6:~$ taskset -c 7 lstopo-no-graphics --no-io --merge --restrict binding
Machine (15GB) + Core L#0 + PU L#0 (P#7)
dongahn commented 3 years ago

@grondo:

Very interesting!

It seems we can use remapping logic similar to what we do for "rank" to remap the resource IDs as well, in support of a nested instance with the RV1/JGF reader (with an exception for excluded ranks, that is). Did I get this right?