Are you able to try reproducing with flux-core v0.18.0?
This sounds like it could be the issue @SteVwonder fixed in 9464337e764d268a1525cc0bba828f7ed13b5932.
@grondo -- An important consideration for us is whether or not the Python bindings have changed between `flux-core@0.18.0` and `flux-core@0.17.0`. Will the `JobspecV1` class work in the newer `flux-core`? I'll go check out the bindings on my own in a second, but figured it was worth an ask.
I don't believe there were any changes in the `JobspecV1` class since v0.17, but let me check.
Appreciated -- So far, it looks like it might be a drop-in replacement from what I can tell. I know that the Maestro backend passes in attributes and other things directly, so as long as that's the case I think it might work.
```
$ git log --pretty='%h %s' v0.17.0..v0.18.0 -- src/bindings/python/flux/job.py
9af79437a python: add JobID class
aafa076ea bindings/python: Fix pylint ungrouped-imports
4f23f4450 bindings/python: Fix pylint import issue
8ff300357 bindings/python: Fix pylint invalid-name
f2cf610f2 bindings/python: Fix pylint unidiomatic-typecheck
cbabe5d89 bindings/python: Fix pylint no-else-raise
59ef0e694 bindings/python: Fix pylint no-else-return
2a3cffabd python: add Jobspec interface to 'mini batch'
f57ed089d python: add stdio properties to jobspec
```
I think we are trying not to break existing interfaces in the Python API (as much as possible with a quickly moving target). None of the above changes look like they would break existing use of `JobspecV1`. The main thing is the addition of the `stdin`, `stdout`, and `stderr` properties.
Got it -- I notice the addition of the `JobID` class. Prior to that, 0.17.0 just passed around integers. I'll look into that and see if I need to retrofit my existing solution.
jobids are still in essence integers. The `JobID` class is a convenience subclass of `int` that allows encoding and decoding Flux jobids to and from other formats (e.g. "F58", a base58 encoding; hexadecimal; KVS path; etc.). Therefore, I don't think you have to retrofit, though it may be convenient at some point.
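For illustration (not from the original thread), a minimal sketch of how the `JobID` class might be used, assuming the `f58` and `kvs` encoding properties are available in the newer bindings:

```python
from flux.job import JobID

jid = JobID(1610847617024)      # constructed from an ordinary integer jobid
assert jid == 1610847617024     # still usable anywhere a plain int is expected
print(jid.f58)                  # compact F58 rendering (assumed property)
print(jid.kvs)                  # KVS path rendering used inside Flux (assumed property)
```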
Also take a look at our most recent tag, v0.19.0. In this version we've abstracted the `JobInfo` class used by `flux jobs` for easier access to job properties, and added a `JobList` class for easier job listing.
@grondo and @FrankD412: I had the same trouble with our July release, but I couldn't narrow this down exactly due to other things on my plate yesterday. (I need to do more testing.) If this can wait a few days, I can take a look, if that's okay with @FrankD412.
Yeah, if this is still reproducible in v0.18.0 then we'll need help debugging. My intuition is that it is related to what HW objects hwloc is ignoring by default when using `hwloc_topology_restrict(3)`.
Yeah, I have a similar thought. But the fact that this is only happening with the nested instance also tells me there might be something related to how our exec system binds the flux broker process, as this might affect how the hwloc XML is generated for the nested instance on Lassen.
I need some time to do more testing.
Yes, the exec system will bind the flux-broker process to the core IDs in `R_lite`. A simple interactive test would be something like `taskset -c 0-19 flux start flux hwloc info` on one of the GPU Lassen nodes.
> A simple interactive test would be something like `taskset -c 0-19 flux start flux hwloc info` on one of the GPU Lassen nodes.
Exactly!
BTW, I assumed the presence of `CUDA_VISIBLE_DEVICES` won't affect the nested hwloc.
@grondo @dongahn Just a quick question -- is this an `hwloc` bug? Or how `flux-core` is interpreting what it gets? Is there a short-term fix that can unblock us? We currently are scheduled to have a DAT tomorrow and are trying to figure out if we can still make use of it.
> @grondo @dongahn Just a quick question -- is this an hwloc bug? Or how flux-core is interpreting what it gets? Is there a short-term fix that can unblock us? We currently are scheduled to have a DAT tomorrow and are trying to figure out if we can still make use of it.
I didn't know this was tomorrow. I will try to get to it this evening then.
One workaround might be to set `-o cpu-affinity=off` or the equivalent in the `JobspecV1` class (can't remember how to set shell options off the top of my head). I'm assuming this would make all GPUs visible to the job. But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.
> @grondo @dongahn Just a quick question -- is this an hwloc bug? Or how flux-core is interpreting what it gets? Is there a short-term fix that can unblock us? We currently are scheduled to have a DAT tomorrow and are trying to figure out if we can still make use of it.
>
> I didn't know this was tomorrow. I will try to get to it this evening then.
It's one of our weekly test DATs, so it would be nice to have something by then -- however, we will have one next week. Yes, this is critical for us -- but I'm happy if there's a workaround that doesn't derail you.
> One workaround might be to set `-o cpu-affinity=off` or the equivalent in the `JobspecV1` class (can't remember how to set shell options off the top of my head). I'm assuming this would make all GPUs visible to the job. But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.
I can give this a shot. I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.
> is this an hwloc bug? Or how flux-core is interpreting what it gets?
I'm not really sure at this time. When a Flux instance starts up, it gathers hwloc topology information and then calls `hwloc_topology_restrict(3)` so that the topology is pruned of resources that are not currently accessible due to CPU affinity or other binding. This could be pruning GPUs that are somehow children of cores that are not in the job's assigned resource set, and therefore not in the current affinity of the flux-broker process.
Perhaps there is some flag we should be passing to ensure GPU or Coproc devices aren't dropped.
> BTW, I assumed the presence of CUDA_VISIBLE_DEVICES won't affect the nested hwloc.
I haven't tried this, but according to @eleon CUDA_VISIBLE_DEVICES has no effect on libhwloc.
> I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.
Yes, it would only be a required workaround for the nested instance that needs access to all GPUs.
BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet. :crossed_fingers:
> But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.
This may negatively affect performance too much? He will have a long-running job occupying some 20 cores alongside the 20-CPU + 4-GPU flux instance. This may make all 44 CPUs + 4 GPUs visible to Fluxion, and ddcmd's could be over-scheduled onto the cores where the long-running job is running.
> BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet.
I was testing v0.18 and saw similar issues. Still more testing is needed though.
> But the downside is it will not be pinned to the cores it was allocated, so processes could be scheduled to any core by the OS.
>
> This may negatively affect performance too much? He will have a long-running job occupying some 20 cores alongside the 20-CPU + 4-GPU flux instance. This may make all 44 CPUs + 4 GPUs visible to Fluxion, and ddcmd's could be over-scheduled onto the cores where the long-running job is running.
At this point I'm not worried about performance. We're currently just trying to make sure our workflow is operational. If it's slow, that's fine -- we just need things to run.
Shoot:
```
leon@pascal4:~$ lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
CoProc(OpenCL) "opencl1d1"
CoProc(CUDA) "cuda1"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=0 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=0,1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
CoProc(OpenCL) "opencl1d1"
CoProc(CUDA) "cuda1"
```
Still something weird as I cannot select the cuda1 device.
Very interesting @eleon!
Ugh.... this may be a part of this. Interesting.
Similar issue on the ROCm side:
```
leon@corona107:~$ lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
CoProc(OpenCL) "opencl0d2"
CoProc(OpenCL) "opencl0d3"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=0 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=2 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=3 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=0,1 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
leon@corona107:~$ ROCR_VISIBLE_DEVICES=2,3 lstopo-no-graphics | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
```
At this point, I wouldn't use this method to restrict the hwloc GPUs. I wish that hwloc either fully followed the environment variable guidance or not at all.
> At this point, I wouldn't use this method to restrict the hwloc GPUs.
Unfortunately, flux doesn't really have much control over that (at least for now).
@eleon, do you know if there is a way to remove specific objects from an hwloc topology?
Long term, this is why we will move away from dynamic resource discovery and instead use R from the parent.
> Long term, this is why we will move away from dynamic resource discovery and instead use R from the parent.
+1
@grondo , unfortunately, not that I am aware of. I thought about this and looked through the hwloc API, but did not find a reasonable way of removing vertices from the tree. The only alternative I can think of is pruning the XML topology file, but it may be an involved operation since other vertices may need to be updated in addition to removing the GPU vertex.
> Still something weird as I cannot select the cuda1 device.
I wonder if you are getting the "cuda1" device, but that lstopo is printing the logical index rather than the physical one (and the GPUs are always indexed starting at 0). Does `lstopo --physical` behave the way you expect?
The logical indexing of course causes all sorts of issues when you start nesting. We should look into using the full UUID of the GPUs at some point.
Good thought, @SteVwonder . Here's what we have so far:
```
leon@corona107:~$ ROCR_VISIBLE_DEVICES=2,3 lstopo-no-graphics -p | grep CoProc
CoProc(OpenCL) "opencl0d0"
CoProc(OpenCL) "opencl0d1"
leon@pascal4:~$ CUDA_VISIBLE_DEVICES=1 lstopo-no-graphics -p | grep CoProc
CoProc(OpenCL) "opencl1d0"
CoProc(CUDA) "cuda0"
```
More testing needed...
> I'm assuming I'd pass this to the nested instance broker? If that's the case, then this isolates the fix just to our 20 core/4 GPU job with the rest of the workflow being none the wiser.
>
> Yes, it would only be a required workaround for the nested instance that needs access to all GPUs.
>
> BTW, I still held some hope that v0.18 would magically fix this issue if you haven't tried it yet. 🤞
I added the ability to pass options to Flux through Maestro and ran into the following error. I'm feeling like I'm not constructing something correctly, unless the whole of the `-o` option needs to be in quotations?
```
flux job attach 1610847617024
flux-job: task(s) exited with exit code 1
2020-09-04T02:18:40.753604Z broker.err[0]: rc2.0: cpu-affinity=off /p/gpfs1/fdinatal/roots/mummi_root_20200902/workspace/cganalysis-pfpatch_000000000009_pfpatch_000000000025_pfpatch_000000000031_pfpatch_000000010858.flux.sh error starting command (rc=1) 0.0s
```
The jobspec looks like:
"resources": [{"type": "node", "count": 1, "with": [{"type": "slot", "count": 1, "with": [{"type": "core", "count": 20}, {"type": "gpu", "count": 4}], "label": "task"}]}], "tasks": [{"command": ["flux", "start", "-o", "cpu-affinity=off", "/p/gpfs1/fdinatal/roots/mummi_root_20200902/workspace/cganalysis-pfpatch_000000000009_pfpatch_000000000025_pfpatch_000000000031_pfpatch_000000000040.flux.sh"], "slot": "task", "count": {"per_slot": 1}}
The `-o cpu-affinity=off` gets passed to `flux mini run`. You can do the equivalent thing through the Python API with:
```python
from flux.job import JobspecV1

jobspec = JobspecV1(["hostname"])
jobspec.setattr_shell_option("cpu-affinity", "off")
```
EDIT: just confirmed that the above code snippet is equivalent to `flux mini run -o cpu-affinity=off hostname` (minus the environment variables).
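As a non-authoritative sketch of how this might look for the 20-core/4-GPU nested-instance job discussed above, assuming the `JobspecV1.from_command` factory and its resource keyword arguments are available in the flux-core version in use (the script path below is hypothetical):

```python
from flux.job import JobspecV1

# Shape the jobspec like the nested-instance job above: 1 node, 1 task,
# 20 cores and 4 GPUs per task, running `flux start` on a workflow script.
jobspec = JobspecV1.from_command(
    command=["flux", "start", "/path/to/workflow.flux.sh"],  # hypothetical path
    num_nodes=1,
    num_tasks=1,
    cores_per_task=20,
    gpus_per_task=4,
)
# Disable the job shell's CPU affinity handling for this job only.
jobspec.setattr_shell_option("cpu-affinity", "off")
```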
Sorry, I should have clarified, cpu-affinity=off should be set as a "job shell" option in the jobspec, not as an option to flux-start.
I.e., in the JSON jobspec, `attributes.system.shell.options["cpu-affinity"]` should be set to `"off"`.
There may be a convenience method in JobspecV1 to set job shell options. I can look for that tomorrow.
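For clarity, a minimal sketch in plain Python (assuming `jobspec_json` holds the V1 jobspec JSON string shown earlier) of where the job-shell option lives:

```python
import json

# `jobspec_json` is assumed to be the V1 jobspec JSON string from above.
jobspec_dict = json.loads(jobspec_json)

# The job-shell option belongs at attributes.system.shell.options,
# not on the `flux start` command line inside tasks[].command.
options = (
    jobspec_dict.setdefault("attributes", {})
    .setdefault("system", {})
    .setdefault("shell", {})
    .setdefault("options", {})
)
options["cpu-affinity"] = "off"

print(json.dumps(jobspec_dict, indent=2))
```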
I modified our `flux mini run` and the result was much the same as before. When requesting GPUs, only one would launch, and even then the ddcMD job wouldn't run at all. It sits at 0% utilization and doesn't appear in `nvidia-smi` as utilizing a GPU.
> I modified our flux mini run and the result was much the same as before. When requesting GPUs, only one would launch and even then the ddcMD job wouldn't run at all. It sits at 0% utilization and doesn't appear in nvidia-smi as utilizing a GPU.
You'll want to make sure you are disabling cpu-affinity for the subinstance and not any other jobs. Like @dongahn mentioned, this will allow the nested instance to "discover" all the CPUs, so it may overschedule jobs. However, libhwloc, and thus the nested scheduler, should be able to see all GPUs.
> I modified our flux mini run and the result was much the same as before. When requesting GPUs, only one would launch and even then the ddcMD job wouldn't run at all. It sits at 0% utilization and doesn't appear in nvidia-smi as utilizing a GPU.
>
> You'll want to make sure you are disabling cpu-affinity for the subinstance and not any other jobs. However, like @dongahn mentioned, this will allow the nested instance to "discover" all the CPUs, therefore it may overschedule jobs. However, libhwloc and thus the nested scheduler should be able to see all GPUs.
Right -- I actually realized I was still scheduling with GPUs in my `mini run`, so I'll give that a shot here in a moment.
FYI -- I can circle back to this early this afternoon.
> FYI -- I can circle back to this early this afternoon.
No worries -- I have four `ddcmd` processes going on the node through Python, but they register as zombie processes. Trying to see if that's related.
Ok. Talked to @FrankD412 by phone. I believe we have come up with a reasonable workaround for today's DAT so that he can quickly test the rest of the workflow. @FrankD412 will update us on that. If that's working, I will defer my testing to either this weekend or early next week.
Alright, the workaround did get us past the zombie process issue. I ran into a different issue where `ddcmd` aborts due to a `pmix` error, but that's likely a different issue.
Just for reference in case (this happens when using a subprocess in Python only):
```
2020-09-04 11:51:00,693 - mummi.online:run:118 - INFO - cmd = /usr/gapps/kras/sierra/ddcmd-gpu8/bin/ddcMD-sierra -o object.data molecule.data
2020-09-04 11:51:05,704 - mummi.online:run:122 - INFO - Process Running? 1
2020-09-04 11:51:05,705 - mummi.online:run:124 - INFO - CUDA_VISIBLE_DEVICES=0
2020-09-04 11:51:05,705 - mummi.online:run:128 - INFO - ---------------- ddcMD stdout --------------
2020-09-04 11:51:05,705 - mummi.online:run:129 - ERROR - ---------------- ddcMD stderr --------------
[lassen13:44229] mca_base_component_repository_open: unable to open mca_schizo_flux.so: File not found (ignored)
[lassen13:44229] mca_base_component_repository_open: unable to open mca_pmix_flux.so: File not found (ignored)
[lassen13:44229] PMI_Init [../../../../../../opensrc/ompi/opal/mca/pmix/flux/pmix_flux.c:386:flux_init]: Operation failed
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  pmix init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[lassen13:44229] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
```
> Alright, the workaround did get us past the zombie process issue. I ran into a different issue where ddcmd aborts due to a pmix error, but that's likely a different issue.
>
> Just for reference in case (this happens when using a subprocess in Python only):
Did you add `-o mpi=spectrum` to your `flux mini run`?
> Alright, the workaround did get us past the zombie process issue. I ran into a different issue where ddcmd aborts due to a pmix error, but that's likely a different issue.
>
> Just for reference in case (this happens when using a subprocess in Python only):
>
> Did you add `-o mpi=spectrum` to your `flux mini run`?
Yeah -- here's what it looks like: `flux mini run -N 1 -n 1 -c 4 -o "mpi=spectrum" sh -c "export CUDA_VISIBLE_DEVICES=$CDEV ; /usr/gapps/kras/install/bin/autobind-12 cganalysis --simname $sim ...`
@grondo , @SteVwonder , @dongahn , CUDA_VISIBLE_DEVICES for NVIDIA GPUs is playing nicely with hwloc. Completing the experiments started in this thread:
```
leon@lassen15:~$ lstopo-no-graphics | grep -B1 CoProc
PCI 0004:04:00.0 (3D)
CoProc(CUDA) "cuda0"
--
PCI 0004:05:00.0 (3D)
CoProc(CUDA) "cuda1"
--
PCI 0035:03:00.0 (3D)
CoProc(CUDA) "cuda2"
--
PCI 0035:04:00.0 (3D)
CoProc(CUDA) "cuda3"
leon@lassen15:~$ CUDA_VISIBLE_DEVICES=3 lstopo-no-graphics | grep -B1 CoProc
PCI 0035:04:00.0 (3D)
CoProc(CUDA) "cuda0"
leon@lassen15:~$ CUDA_VISIBLE_DEVICES=1,2 lstopo-no-graphics | grep -B1 CoProc
PCI 0004:05:00.0 (3D)
CoProc(CUDA) "cuda0"
--
PCI 0035:03:00.0 (3D)
CoProc(CUDA) "cuda1"
```
Similar, good, behavior on the AMD side:
```
leon@corona8:~$ lstopo-no-graphics | grep -B1 CoProc
PCI 13:00.0 (Display)
CoProc(OpenCL) "opencl0d0"
--
PCI 23:00.0 (Display)
CoProc(OpenCL) "opencl0d1"
--
PCI 53:00.0 (Display)
CoProc(OpenCL) "opencl0d2"
--
PCI 73:00.0 (Display)
CoProc(OpenCL) "opencl0d3"
leon@corona8:~$ ROCR_VISIBLE_DEVICES=2 lstopo-no-graphics | grep -B1 CoProc
PCI 53:00.0 (Display)
CoProc(OpenCL) "opencl0d0"
leon@corona8:~$ ROCR_VISIBLE_DEVICES=2,3 lstopo-no-graphics | grep -B1 CoProc
PCI 53:00.0 (Display)
CoProc(OpenCL) "opencl0d0"
--
PCI 73:00.0 (Display)
CoProc(OpenCL) "opencl0d1"
```
Important to note that the CoProc hwloc identifiers (e.g., `cudaX` or `openclXdY`) are "relative" to the GPUs available to that process. To figure out which GPU(s) one is using in a multi-GPU node, use the PCI ID instead!
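For reference, a hedged sketch (assuming `nvidia-smi` is on `PATH`) of one way to recover stable identifiers from Python; `nvidia-smi` enumerates the physical GPUs rather than the CUDA logical ordering, so the PCI bus ID and UUID give an absolute mapping:

```python
import subprocess

# Query stable identifiers for every GPU on the node. Unlike hwloc's logical
# cudaN names, the PCI bus ID and UUID do not shift when the visible set changes.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,pci.bus_id,uuid", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    index, bus_id, uuid = (field.strip() for field in line.split(","))
    print(f"GPU {index}: PCI {bus_id}, UUID {uuid}")
```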
Thanks @eleon.
@eleon: Do you know if the logical IDs of other resources like `core` and `socket` show the same behavior with hwloc as well? For example, if a process is pinned to two cores, like core1 in socket0 and core18 in socket1, when the process fetches its hwloc XML, would they be remapped to core0 in socket0 and core1 in socket1?
I think after filtering hwloc the logical IDs of all objects are remapped:
```
grondo@fluke6:~$ lstopo-no-graphics --no-io --merge --restrict binding
Machine (15GB)
  Core L#0
    PU L#0 (P#0)
    PU L#1 (P#4)
  Core L#1
    PU L#2 (P#1)
    PU L#3 (P#5)
  Core L#2
    PU L#4 (P#2)
    PU L#5 (P#6)
  Core L#3
    PU L#6 (P#3)
    PU L#7 (P#7)
grondo@fluke6:~$ taskset -c 7 lstopo-no-graphics --no-io --merge --restrict binding
Machine (15GB) + Core L#0 + PU L#0 (P#7)
```
@grondo:
Very Interesting!
It seems we can use a remapping logic similar to "rank" to remap the resource IDs as well in supporting a nested instance with the RV1/JGF reader (with an exception for excluded ranks, that is). Did I get this right?
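A toy illustration (not Flux code) of the kind of remapping being discussed, assuming the nested instance knows which physical core IDs it was given:

```python
# After hwloc restricts the topology, logical core IDs are renumbered from 0,
# so e.g. physical core1/socket0 and core18/socket1 become logical core0 and
# core1 inside the nested instance.
physical_cores = [1, 18]   # hypothetical assigned cores

logical_to_physical = dict(enumerate(sorted(physical_cores)))
physical_to_logical = {phys: log for log, phys in logical_to_physical.items()}

print(logical_to_physical)   # {0: 1, 1: 18}
print(physical_to_logical)   # {1: 0, 18: 1}
```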
When testing the nested instances we're starting with GPUs allocated, the nested broker doesn't seem to see all of the GPU resources. I was able to confirm this by getting the local URI and logging in. In each of our `flux mini run` calls we're asking for `-g 1` — with that request the first one is the only one to start, but the rest are in pending (`PD`) state waiting on resources (but if I remove the `-g` in our workflow they all run). However, the broker that was started should have all 4 GPUs on the node, so I'm confused why it thinks there are fewer. I've confirmed that the highest-level Flux instance is allocating 4 GPUs via the jobspec and that it is satisfied as the job starts running.

In discussion with @dongahn -- I have the following information:
- Flux version
- Module listing
- The resources that the job sees from the master Flux instance (using `flux job info <JOBID> R`)

It was confirmed that the nested broker only sees a single GPU: