flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

MuMMI workflows cannot co-schedule jobs on a node #728

Closed: dongahn closed this issue 3 years ago

dongahn commented 3 years ago

In preparation for hero runs on ORNL Summit, the MuMMI team refreshed their workflows using jobspecs generated with the Python API from flux-core@0.17.0. They found that the jobs cannot be co-scheduled on a node.

Franc provided two example jobspecs:

jobspec_createsim.txt jobspec_ddcmd.txt

The resource requests of each spec:

createsim:

[
  {
    "type": "node",
    "count": 1,
    "with": [
      {
        "type": "slot",
        "count": 1,
        "with": [
          {
            "type": "core",
            "count": 24
          }
        ],
        "label": "task"
      }
    ]
  }
]

ddcmd:

[
  {
    "type": "node",
    "count": 1,
    "with": [
      {
        "type": "slot",
        "count": 1,
        "with": [
          {
            "type": "core",
            "count": 20
          }
        ],
        "label": "task"
      }
    ]
  }
]
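
For reference, rough CLI equivalents of these two requests (a sketch only; the actual specs were generated with the Python API, and the createsim/ddcmd commands here are placeholders):

# createsim: 1 node, 1 task, 24 cores per task
flux mini submit -N 1 -n 1 -c 24 ./createsim.sh

# ddcmd: 1 node, 1 task, 20 cores per task
flux mini submit -N 1 -n 1 -c 20 ./ddcmd.sh

Co-scheduling both on the same node requires 24 + 20 = 44 cores, i.e., every core of a Summit/Lassen compute node.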
dongahn commented 3 years ago

Tagging: @bhatiaharsh and @FrankD412.

SteVwonder commented 3 years ago

@FrankD412 and others, when you get the allocation on Summit, are any of the 44 cores reserved as system cores? If so, I believe a cgroup gets set up that limits Flux (and its jobs) to fewer than 44 cores, which would prevent co-scheduling in this scenario.

I think one way you could verify this is to run flux hwloc info in your allocation.

dongahn commented 3 years ago

@SteVwonder: that's exactly what I thought. I was doing some testing on Lassen to confirm. I also want to see if there is a way to get rid of core isolation to expose all 44.

dongahn commented 3 years ago

On Lassen, I just confirmed that only 40 cores per compute node are exposed because of core isolation, so the above jobspecs cannot be co-scheduled.

lassen708{dahn}21: flux hwloc info
2 Machines, 80 Cores, 320 PUs

Let me see if there are ways to get rid of core isolation next.

SteVwonder commented 3 years ago

The Summit docs make it appear that you cannot change it: https://docs.olcf.ornl.gov/systems/summit_user_guide.html#system-service-core-isolation

I know on Lassen you can control it via lrun and the other LLNL-specific wrappers (which probably means there is a hook into bsub/jsrun).

dongahn commented 3 years ago

Yeah, I think this has to be changed via bsub for us. If this cannot be changed on ORNL systems, the MuMMI workflow should just schedule the "user-visible" cores, or work with the facility to turn off the isolated cores for their bsub jobs.

dongahn commented 3 years ago

From an announcement early this year:

This morning, Tuesday 1/28/2020 ~10am, we switched the default core_isolation from 0 to 2 if a core isolation is not specified when using bsub on SIERRA. If you encounter issues with this default change, please let us know. The old behavior can be restored by either adding -core_isolation 0 to your bsub line or adding a line “#BSUB -core_isolation 0 “ to your bsub script.

dongahn commented 3 years ago

Ok. I confirmed this is due to the core isolation and this should be addressed with one of my recommendations above: https://github.com/flux-framework/flux-sched/issues/728#issuecomment-674139797

lassen709{dahn}37: bsub -Is -XF -nnodes 2 -core_isolation 0 -qpdebug /usr/bin/bash
Job <1346013> is submitted to queue <pdebug>.
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on lassen710>>
bash-4.2$ module use /usr/global/tools/flux/blueos_3_ppc64le_ib/modulefiles
bash-4.2$ module load pmi-shim
bash-4.2$ PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux start
2020-08-14T15:54:43.206740Z broker.err[0]: rc2.0: /bin/tcsh Interrupt (rc=130) 9.5s
WARNING: exiting due to 2 SIGINT's within 1 second.  Job step may still be running and must be managed manually with jskill
bash-4.2$ PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 /usr/global/tools/flux/bl3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux start ~/ip.sh
ssh://lassen3/var/tmp/flux-tbjlBX/0
lassen708{dahn}23: /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux proxy ssh://lassen3/var/tmp/flux-tbjlBX/0/local
lassen708{dahn}21: flux hwloc info
2 Machines, 88 Cores, 352 PU

Whereas without the -core_isolation 0 flag:

lassen708{dahn}22: /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux proxy ssh://lassen6/var/tmp/flux-aR3vLf/0/local
lassen708{dahn}21: flux hwloc info
2 Machines, 80 Cores, 320 PUs
dongahn commented 3 years ago

I will keep this ticket open a bit in case @bhatiaharsh or @FrankD412 have more questions.

FrankD412 commented 3 years ago

@dongahn and @SteVwonder -- Thanks for looking into this. You two flagged exactly the issue I thought it might be when I saw this go up. We'll definitely try to go back to the old isolation settings, as we need all the cores we can get. I also noticed something in the jobspec that I'd like to verify on our end as well. Thanks for the pointers! -- and I'll definitely post back here with any other questions we might have.

dongahn commented 3 years ago

Thanks. Onto your other issue now.

FrankD412 commented 3 years ago

Referencing this comment in #729

Alright -- sorry about the delay; power outages and other things managed to keep me from posting. It looks like we got this sorted out. I think in a previous instantiation we may have wrapped the ddcMD jobs in wrapper scripts in order to get the naming the way we wanted. Those may have included brokers, but I honestly couldn't tell you at this point; it was long enough ago. I had initially thought that the flux mini run calls would be caught by the master broker on the node, but that doesn't appear to be the case. Once we scheduled under a broker, ddcMD successfully ran and was named appropriately with my recent changes to Maestro's Flux adapter -- so that now gets rid of the middle-man wrapper script.

In the course of figuring this out, we did run into our Flux instance telling us that requests made with GPUs were unsatisfiable. We're currently unsure if this is an installation issue or a bigger issue. @SteVwonder recommended starting a new issue for that error.

The co-located job specifications worked with a sub-broker; it turns out that scheduling a script that isn't under a broker that calls flux mini run will result in the jobs being fed back up to the master broker. This was a misunderstanding in what layer would pick up the mini calls. That allowed us to run ddcMD and the new call will request the GPUs through the Flux backend API; however, when calling flux mini run from the command line in the course of trying to debug ddcMD calls and re-submit Maestro generated specifications, we get a resource unsatisfiable error which flags that the GPUs cannot be allocated. As soon as we remove the -g, the mini command goes through. I'll have to reproduce/dig up the error, but that's how we end up getting the error to show up. We were able to reproduce the error consistently, so we could try and reproduce it for you on one of our weekly DATs if you'd like.

dongahn commented 3 years ago

The co-located job specifications worked with a sub-broker;

Is a sub-broker a nested Flux instance?

it turns out that scheduling a script that isn't under a broker that calls flux mini run will result in the jobs being fed back up to the master broker.

Is the master broker the outer-most Flux instance? Could you elaborate what you mean by "scheduling a script that isn't under a broker"? I am not clear how that could get fed to the outer-most Flux instance.

however, when calling flux mini run from the command line in the course of trying to debug ddcMD calls and re-submit Maestro generated specifications, we get a resource unsatisfiable error which flags that the GPUs cannot be allocated.

Which instance does the flux mini run request go into? The outer-most Flux instance or the nested instance? My guess is that it is one of the nested instances that isn't allocated any GPUs.

FrankD412 commented 3 years ago

The co-located job specifications worked with a sub-broker;

Is a sub-broker a nested Flux instance?

Sorry, I'm using incorrect terminology. Yeah -- I'm referring to nested Flux instances.

it turns out that scheduling a script that isn't under a broker that calls flux mini run will result in the jobs being fed back up to the master broker.

Is the master broker the outer-most Flux instance? Could you elaborate what you mean by "scheduling a script that isn't under a broker"? I am not clear how that could get fed to the outer-most Flux instance.

So in our 10 node DATs, we would find that when the MuMMI workflow wasn't using a nested instance of Flux to schedule the bundles of ddcMD jobs, the flux mini run calls that the script made would show in the outer-most Flux instance. The script would simply loop over four systems to simulate and call flux mini run on each system. The individual runs would show up in the outer-most Flux instance. From checking the processes on each node and my own debugging, I know that a Flux broker (or process) is on each node, so I expected the mini calls to bubble up to the node instance and not the outer-most. I guess in my mind the node-level instances were a sub-level under the master instance.
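
For clarity, here is a minimal sketch of the pattern we ended up with (the script names are hypothetical, and the -c/-g values assume all 44 cores are exposed and a GPU-aware Flux build, both discussed elsewhere in this thread):

# From the outer instance: one nested Flux instance per node, wrapping the
# per-node driver script (bundle_ddcmd.sh is a hypothetical name).
flux mini submit -N 1 -n 1 -c 44 -g 4 flux start ./bundle_ddcmd.sh

# Inside bundle_ddcmd.sh, the per-system submissions are picked up by the
# nested per-node instance rather than bubbling up to the outer one:
#   for sys in 1 2 3 4; do
#       flux mini run -n 1 -c 10 -g 1 ./run_system_${sys}.sh &
#   done
#   wait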

however, when calling flux mini run from the command line in the course of trying to debug ddcMD calls and re-submit Maestro generated specifications, we get a resource unsatisfiable error which flags that the GPUs cannot be allocated.

Which instance does the flux mini run request go into? The outer-most Flux instance or the nested instance? My guess is that it is one of the nested instances that isn't allocated any GPUs.

Sorry, I misspoke -- I meant to say flux mini submit here. We would run a flux mini submit with the generated script from the outer-most instance. We would allocate it all the GPUs, but it would come back as unsatisfiable. We even cherry-picked single systems to run using a flux mini run and ran into the same unsatisfiable error.

FrankD412 commented 3 years ago

For reference from a quick allocation I spun up:

flux hwloc info
10 Machines, 440 Cores, 1760 PUs
dongahn commented 3 years ago

Are you able to schedule any GPU resources? I remember there is a problem with the default hwloc such that Flux cannot automatically discover GPU resources. Did you use the module we put together that allows us to use the right version of hwloc?

It is documented at https://flux-framework.readthedocs.io/en/latest/coral.html.

dongahn commented 3 years ago

Could you run a very small instance, say on 2 Lassen nodes, and run the following tests?

initial_program.sh:

#! /bin/bash

JOBID=$(flux mini submit -N 2 -n 2 -c 2 -g 2 sleep 60)
flux job info ${JOBID} R > JOBID.${JOBID}.R
flux queue drain
Then launch it with:

jsrun ... /your/path/flux start ./initial_program.sh

I am also curious how you launch node-level Flux instances to see if each of them gets GPU resources. Can you elaborate?

dongahn commented 3 years ago

The script would simply loop over four systems to simulate and call flux mini run on each system.

What are those "four" systems?

FrankD412 commented 3 years ago

Are you able to schedule any GPU resources? I remember there is a problem with the default hwloc such that Flux cannot automatically discover GPU resources. Did you use the module we put together that allows us to use the right version of hwloc?

It is documented at https://flux-framework.readthedocs.io/en/latest/coral.html.

We do not appear to be explicitly loading the hwloc module per the documentation. Below is what flux --version prints and what our module loading looks like. I'll try introducing the module load hwloc/1.11.10-cuda command to our environment loading. What's interesting here, though, is that we do have ddcMD simulations starting under a nested Flux instance, which implies that the inner instance is capable of seeing the GPUs. Is that just a side-effect of the inner instance having "ownership" of the GPUs via the allocation?

flux --version
commands:               0.17.0
libflux-core:           0.17.0
build-options:          +hwloc==1.11.6
if [[ $HOST == lassen* ]]; then

   source /etc/profile.d/z00_lmod.sh
   MODULE_FILE=/usr/global/tools/flux/blueos_3_ppc64le_ib/modulefiles
   SHIM_MODULE=pmi-shim
   MPI_MODULE=spectrum-mpi/2019.06.24-flux

   # FLUX=`which flux`
   # Seemed to work without the spectrum load, but will keep here for documentation
   # SHIM_MODULE="spectrum-mpi/2019.06.24-flux pmi-shim"

elif [[ $HOST == *summit* ]]; then

    # MODULE_FILE="/sw/summit/modulefiles/ums/gen007flux/Core"
    # SHIM_MODULE="pmi-shim"
   echo '> ERROR: need shim module for' $HOST

else
   echo '> ERROR: Unidentified host '$HOST
   return
fi

module use $MODULE_FILE
module load $SHIM_MODULE
module load $MPI_MODULE

Could you run a very small instance, say on 2 Lassen nodes, and run the following tests?

initial_program.sh:

#! /bin/bash

JOBID=$(flux mini submit -N 2 -n 2 -c 2 -g 2 sleep 60)
flux job info ${JOBID} R > JOBID.${JOBID}.R
flux queue drain
Then launch it with:

jsrun ... /your/path/flux start ./initial_program.sh

I am also curious how you launch node-level Flux instances to see if each of them gets GPU resources. Can you elaborate?

I'll add the fix I mentioned above and use this as a test, to see if my hunch is right and the module load above with the CUDA-enabled hwloc fixes the issue.

The script would simply loop over four systems to simulate and call flux mini run on each system.

What are those "four" systems?

The four systems are just individual ddcMD systems that we simulate, one per GPU.

dongahn commented 3 years ago

If you used a later release (core v0.18 and sched v0.10), you would have gotten a better listing of resources with flux resource list.

What's interesting here though, is that we do have ddcMD simulations starting under a nested Flux instance, which implies that the inner instance is capable of seeing the GPUs. Is that just a side-effect of the inner instance having "ownership" of the GPUs via the allocation?

I don't think so. It seems we may need more investigation.

FrankD412 commented 3 years ago

If you used a later release (core v0.18 and sched v0.10), you would have gotten a better listing of resources with flux resource list.

What's interesting here though, is that we do have ddcMD simulations starting under a nested Flux instance, which implies that the inner instance is capable of seeing the GPUs. Is that just a side-effect of the inner instance having "ownership" of the GPUs via the allocation?

I don't think so. It seems we may need more investigation.

Sounds good -- I'm loading up the hwloc specified in the documentation and will run a quick test here momentarily. I can then revert back to not loading it and see how it behaves.

SteVwonder commented 3 years ago

@FrankD412: typically, MuMMI uses a spack-installed flux-core/flux-sched, right? Were they built with the +cuda variant enabled? You can check by getting the hash of either the flux-core or flux-sched spack package that you are loading and running spack spec -ldv /hash-of-package.

FrankD412 commented 3 years ago

@dongahn -- I tried to launch and get the following error. It produces a single JOBID file that's empty.

----> Launching Flux
 > flux   : /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/bin/flux
 > version:
commands:               0.17.0
libflux-core:           0.17.0
build-options:          +hwloc==1.11.6
 > NUM_NODES      = 3
 > FLUX_ROOT      = /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux
 > FLUX_INFO      = /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.info
 > FLUX_BOOTSTRAP = /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
> Loading flux environment (lassen5.coral.llnl.gov)
 > Launching Flux using jsrun
flux-start: /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/libexec/flux/cmd/flux-broker -S log-filename=/p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.log /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
flux-start: /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/libexec/flux/cmd/flux-broker -S log-filename=/p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.log /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
flux-start: /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/libexec/flux/cmd/flux-broker -S log-filename=/p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.log /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
flux-job: job 69893881856 id or key not found
 > Flux launched using jsrun

The log produced by Flux has the following:

2020-08-19T21:04:35.807510Z job-manager.debug[0]: scheduler: ready single
2020-08-19T21:04:35.807768Z sched-simple.debug[0]: ready: 132 of 132 cores: rank[0-2]/core[0-43]
2020-08-19T21:04:35.903650Z broker.debug[1]: insmod job-ingest
2020-08-19T21:04:35.904963Z job-ingest.debug[1]: fluid ts=362ms
2020-08-19T21:04:35.913668Z broker.debug[2]: insmod job-ingest
2020-08-19T21:04:35.914898Z job-ingest.debug[2]: fluid ts=372ms
2020-08-19T21:04:35.920002Z broker.info[0]: rc1.0: running /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/etc/flux/rc1.d/01-enclosing-instance
2020-08-19T21:04:36.029929Z broker.info[0]: rc1.0: running /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-sched-0.9.0-ts7rizygn3sxu3a4wn4tnkb7x74ohe66/etc/flux/rc1.d/sched-fluxion-qmanager-start
2020-08-19T21:04:36.196230Z broker.debug[0]: rmmod sched-simple
2020-08-19T21:04:36.196466Z sched-simple.debug[0]: service_unregister
2020-08-19T21:04:36.196672Z broker.debug[0]: module sched-simple exited
2020-08-19T21:04:36.196797Z resource.debug[0]: acquire_disconnect: resource.acquire aborted
2020-08-19T21:04:36.560999Z broker.debug[0]: insmod sched-fluxion-qmanager
2020-08-19T21:04:36.561817Z sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs
2020-08-19T21:04:36.561841Z sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default
2020-08-19T21:04:36.561849Z sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default
2020-08-19T21:04:36.562204Z sched-fluxion-qmanager.debug[0]: service_register
2020-08-19T21:04:36.562379Z job-manager.debug[0]: scheduler: hello
2020-08-19T21:04:36.562759Z job-manager.debug[0]: scheduler: ready unlimited
2020-08-19T21:04:36.563957Z broker.info[0]: rc1.0: running /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-sched-0.9.0-ts7rizygn3sxu3a4wn4tnkb7x74ohe66/etc/flux/rc1.d/sched-fluxion-resource-start
2020-08-19T21:04:36.970007Z broker.debug[0]: insmod sched-fluxion-resource
2020-08-19T21:04:36.970803Z sched-fluxion-resource.debug[0]: mod_main: resource module starting
2020-08-19T21:04:37.018494Z sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from hwloc in the KVS
2020-08-19T21:04:37.020163Z sched-fluxion-resource.debug[0]: mod_main: resource graph database loaded
2020-08-19T21:04:38.021926Z broker.info[0]: rc1.0: /bin/zsh -c /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/etc/flux/rc1 Exited (rc=0) 4.5s
2020-08-19T21:04:38.022018Z broker.info[0]: rc1-success: init->run
2020-08-19T21:04:38.024795Z broker.err[0]: rc2.0: /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.info error starting command (rc=1) 0.0s
2020-08-19T21:04:38.024889Z broker.info[0]: rc2-fail: run->cleanup
2020-08-19T21:04:38.204138Z broker.info[0]: cleanup.0: /bin/zsh -c flux queue stop --quiet Exited (rc=0) 0.2s
2020-08-19T21:04:38.378076Z broker.info[0]: cleanup.1: /bin/zsh -c flux job cancelall --user=all --quiet -f --states RUN Exited (rc=0) 0.2s
2020-08-19T21:04:38.562958Z broker.info[0]: cleanup.2: /bin/zsh -c flux queue idle --quiet Exited (rc=0) 0.2s
2020-08-19T21:04:38.563017Z broker.info[0]: cleanup-success: cleanup->finalize

@SteVwonder -- The spack tree is as follows.

spack find -dvl /ts7rizy
==> 1 installed package
-- linux-rhel7-power9le / gcc@7.3.1 -----------------------------
ts7rizy flux-sched@0.9.0~cuda
vvov2d3     boost@1.72.0+atomic+chrono~clanglibcpp~context~coroutine cxxstd=98 +date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout visibility=hidden +wave
kredonm         bzip2@1.0.8+shared
wup2hw6         zlib@1.2.11+optimize+pic+shared
nq5is7d     flux-core@0.17.0~cuda~docs
d75mvn5         czmq@4.1.1
frwhvqe             libuuid@1.0.3
5crsyl3             libzmq@4.3.2+libsodium
6g3m3f5                 libsodium@1.0.17
6sbfjil         hwloc@1.11.11~cairo~cuda~gl+libxml2~nvml+pci+shared
wtudf3r             libpciaccess@0.13.5
g3dcod5             libxml2@2.9.9~python
4b7rg3o                 libiconv@1.16
xcurmxx                 xz@5.2.4
w6ncfte             numactl@2.0.12
7rh2ree         jansson@2.9 build_type=RelWithDebInfo +shared
aaoiw74         lua@5.2.4
aduzbso             ncurses@6.1~symlinks~termlib
2ujhpjz             readline@8.0
656zrir             unzip@6.0
gbzgv7h         lua-luaposix@33.4.0
vj67x5s         lz4@1.9.2
zcwgmes         pkgconf@1.6.3
4hpdg4j         py-cffi@1.13.0
4vehugn             libffi@3.2.1
63ld7ym             py-pycparser@2.19
4plz5ft                 python@3.7.3+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4~uuid+zlib
enrjzup                     expat@2.2.9+libbsd
6huorkj                         libbsd@0.10.0
6i4b2ms                     gdbm@1.18.1
6rekqlv                     gettext@0.20.1+bzip2+curses+git~libunistring+libxml2+tar+xz
fodcl7k                         tar@1.32
72igzz4                     openssl@1.1.1d+systemcerts
7p7c7xr                     sqlite@3.30.1~column_metadata+fts~functions~rtree
5mnm3ui         py-jsonschema@3.2.0
joaaqkp             py-attrs@19.2.0
72mwgbb             py-pyrsistent@0.16.0
67vupme                 py-hypothesis@4.41.2
ywgown5                 py-memory-profiler@0.47
ygudpfs                     py-psutil@2.1.1
vrcy3c2                 py-pytest@5.2.1
zrptpo4                     py-atomicwrites@1.3.0
q6fruqd                     py-importlib-metadata@1.2.0
kxb54oc                         py-zipp@0.6.0
lhckoew                             py-more-itertools@7.2.0
tplfrvu                     py-packaging@19.2
o3pp35u                         py-pyparsing@2.4.2
ech4awl                         py-six@1.12.0
7axqoif                     py-pluggy@0.13.0
2nz7d4u                     py-py@1.8.0
hxsc2qt                     py-setuptools@41.4.0
wfeguvt                     py-wcwidth@0.1.7
eejdro5                 py-sphinx@2.2.0
o3rqwf3                     py-alabaster@0.7.12
pfddoj6                     py-babel@2.7.0
at75jbz                         py-pytz@2019.3
ep5si3c                     py-docutils@0.15.2
5a7yu4r                     py-imagesize@1.1.0
jgytqop                     py-jinja2@2.10.3
r7pwgkj                         py-markupsafe@1.1.1
4mniign                     py-pygments@2.4.2
h5c54kf                     py-requests@2.22.0
nqey7gd                         py-certifi@2019.9.11
mxweks6                         py-chardet@3.0.4
umhq3zi                         py-idna@2.8
vjrkeur                         py-urllib3@1.25.6
gx6p5yp                     py-snowballstemmer@2.0.0
v3hjmuw                     py-sphinxcontrib-applehelp@1.0.1
hexucsh                     py-sphinxcontrib-devhelp@1.0.1
ahhpqpr                     py-sphinxcontrib-htmlhelp@1.0.2
v4ava3e                     py-sphinxcontrib-jsmath@1.0.1
xbit3qv                     py-sphinxcontrib-qthelp@1.0.2
ytuv3hh                     py-sphinxcontrib-serializinghtml@1.1.3
cmkz6tc                 py-sphinx-rtd-theme@0.1.5
dvtwsme                 py-tox@3.14.2
q2yf6rf                     py-filelock@3.0.4
inh52uq                     py-toml@0.10.0
zd2itpq                     py-virtualenv@16.7.6
y5exivb             py-vcversioner@2.16.0.0
6wul6c3         py-pyyaml@5.1.2
w7skmmz             libyaml@0.2.2
lo2ptf5     yaml-cpp@0.6.3 build_type=RelWithDebInfo +pic+shared~tests

Sorry about the delay -- still working on figuring out if I'm doing something wrong. I may spin up an instance and submit the script manually to all ranks, but I don't know if that'll have the desired effect you're after.

@SteVwonder -- I notice that we use ~cuda which is specifying an explicit dependency. That may not be triggering the variant clause of +cuda?

FrankD412 commented 3 years ago

Also, for reference, here is how we're standing up Flux:

source $FLUX_ENV

echo " > Launching Flux using jsrun"

mkdir -p $FLUX_ROOT
cd $FLUX_ROOT

unset OMP_NUM_THREADS
#PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n $NNODES flux start -o,-S,log-filename=$FLUX_LOG -v $FLUX_BOOTSTRAP $FLUX_INFO
PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n $NNODES flux start -o,-S,log-filename=$FLUX_LOG -v $FLUX_BOOTSTRAP
#PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n $1 echo $(which flux)

echo " > Flux launched using jsrun"

The $FLUX_INFO variable is normally included, but for the sake of the test it was disabled and doesn't change anything.

SteVwonder commented 3 years ago

I notice that we use ~cuda which is specifying an explicit dependency. That may not be triggering the variant clause of +cuda?

Yeah, ~cuda means that the cuda variant is not enabled. If you launch that Spack-installed Flux, it will not be able to detect GPUs. I don't think module load hwloc/1.11.10-cuda will have any effect for a Spack-installed Flux since Spack rpaths everything. If you do spack install flux-sched +cuda that will install a CUDA-enabled Flux via spack: https://flux-framework.readthedocs.io/en/latest/quickstart.html#spack-recommended-for-curious-users

You can compose the variant with any other constraints too. For example: spack install flux-sched@0.9.0 +cuda ^flux-core@0.17.0 ^lua@5.2.4
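
For what it's worth (and this is an assumption on my part, not something verified here): since the spack output above shows ~cuda on flux-core and its hwloc as well, you may want the variant on both packages so the rpath'd hwloc is CUDA-enabled; exact propagation depends on the package recipes. Something like:

# Sketch: request the cuda variant on flux-sched and flux-core (variant
# names taken from the spack output above).
spack install flux-sched@0.9.0 +cuda ^flux-core@0.17.0 +cuda ^lua@5.2.4

# Then re-check the variants on the resulting install:
spack find -dvl flux-sched@0.9.0 +cuda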

FrankD412 commented 3 years ago

Got it -- will give that a shot and get back to you. Thanks for all the information!

FrankD412 commented 3 years ago

Oh -- the install solves being able to schedule GPUs from the command line, but it still doesn't entirely answer being able to run a GPU enabled ddcMD in nested instances. Those were submitted via the Maestro API found here.

FrankD412 commented 3 years ago

Some progress -- we got the +cuda variant installed, and even though the script crashed, it did get one file out. It's no longer empty:

cat flux/JOBID.117608284160.R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "node": "lassen3", "children": {"core": "42-43", "gpu": "2-3"}}, {"rank": "1", "node": "lassen4", "children": {"core": "42-43", "gpu": "2-3"}}], "starttime": 1597888776, "expiration": 1598493576}}

I'm now running our workflow to make sure things still work, but Flux does appear to be finding the GPUs now.

SteVwonder commented 3 years ago

but it still doesn't entirely answer being able to run a GPU enabled ddcMD in nested instances. Those were submitted via the Maestro API found here.

Speculative guess: when the jobs were submitted in a nested instance, was the number of GPUs specified? If the API equivalent of flux mini run -n1 -c2 ddcMD was used (i.e., no -g 1), then I could see how that might work. I think we only set CUDA_VISIBLE_DEVICES when -g is specified (i.e., a gpu resource appears in the jobspec). So it may be the case that when -g is not specified, jobs running can see all of the GPUs (and of course the scheduler won't worry about finding them if they aren't asked for).

SteVwonder commented 3 years ago

I think we only set CUDA_VISIBLE_DEVICES when -g is specified (i.e., a gpu resource appears in the jobspec).

Confirmed that is indeed the case:

ƒ(s=1,d=0) fluxuser@23763266b859:/src$ flux mini run -n1 -c1 printenv | grep CUDA
ƒ(s=1,d=0) fluxuser@23763266b859:/src$ flux mini run -n1 -c1 -g1 printenv | grep CUDA
CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VISIBLE_DEVICES=0
FrankD412 commented 3 years ago

but it still doesn't entirely answer being able to run a GPU enabled ddcMD in nested instances. Those were submitted via the Maestro API found here.

Speculative guess: when the jobs were submitted in a nested instance, was the number of GPUs specified? If the API equivalent of flux mini run -n1 -c2 ddcMD was used (i.e., no -g 1), then I could see how that might work. I think we only set CUDA_VISIBLE_DEVICES when -g is specified (i.e., a gpu resource appears in the jobspec). So it may be the case that when -g is not specified, jobs running can see all of the GPUs (and of course the scheduler won't worry about finding them if they aren't asked for).

So, I confirmed we are setting GPUs in the MuMMI workflow code and Maestro does pass it through here. That does get passed to the API via the call here.

The jobspecs above don't include GPUs even though we are passing them. I can check the new ones I just generated though.

EDIT: Though I'll have to wait for ddcMD to be spun up in about 40 minutes.

dongahn commented 3 years ago

Confirmed that is indeed the case:

If the job is using a GPU even though it is not scheduled one, that's wrong :-) At some point, we will need cgroup support for better resource containment.

FrankD412 commented 3 years ago

Alright, I found a small but very easy to miss bug in Maestro which forced GPUs to 0. However, this makes it even more interesting that ddcMD was able to access GPU resources (albeit, it seems to me that they all may have been on the same GPU). I just made the fix in Maestro and will test again, but there's definitely something interesting happening with jobs being able to get on even a single GPU.

FrankD412 commented 3 years ago

Tracked down the new jobspec. The jobspec below was created by Maestro using the JobSpecV1 class from flux-core@0.17.0+cuda. GPUs now show up since they aren't being forced to zero. I'm not sure if Flux without CUDA would have thrown an exception (it's looking like it wouldn't as @SteVwonder pointed out he was able to schedule, but would simply have CUDA_VISIBLE_DEVICES set to 0).

I suspect that some of our issues previously were from things being launched to the same GPU, so I'm curious to see if this fix alleviates that problem.

"resources":[
    {
        "type":"node",
        "count":1,
        "with":[
            {
                "type":"slot",
                "count":1,
                "with":[
                    {
                        "type":"core",
                        "count":20
                    },
                    {
                        "type":"gpu",
                        "count":4
                    }
                ],
                "label":"task"
            }
        ]
    }
],
"tasks":[
    {
        "command":["flux", "start", "/p/gpfs1/fdinatal/roots/mummi_root_20200819/workspace/run_ddcmd_analysis-pfpatch_000000000001_pfpatch_000000000101_pfpatch_000000000298_pfpatch_000000000299.flux.sh"],
        "slot":"task",
        "count":{
            "per_slot":1
        }
    }
]
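
For reference, a rough CLI equivalent of the spec above (a sketch; the real spec is generated through Maestro's Flux adapter):

# 1 node, 1 task, 20 cores and 4 GPUs per task, with the task itself being
# a nested Flux instance running the analysis script.
flux mini submit -N 1 -n 1 -c 20 -g 4 \
    flux start /p/gpfs1/fdinatal/roots/mummi_root_20200819/workspace/run_ddcmd_analysis-pfpatch_000000000001_pfpatch_000000000101_pfpatch_000000000298_pfpatch_000000000299.flux.sh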
SteVwonder commented 3 years ago

If the job is using a GPU even though it is not scheduled one, that's wrong :-) At some point, we will need cgroup support for better resource containment.

Opened an issue over in flux-core to track better soft-containment of GPUs using CUDA_VISIBLE_DEVICES when no GPUs are requested: https://github.com/flux-framework/flux-core/issues/3154

SteVwonder commented 3 years ago

Tracked down the new jobspec. The jobspec below was created by Maestro using the JobSpecV1 class from flux-core@0.17.0+cuda. GPUs now show up since they aren't being forced to zero.

Awesome! LGTM!

I'm not sure if Flux without CUDA would have thrown an exception (it's looking like it wouldn't as @SteVwonder pointed out he was able to schedule, but would simply have CUDA_VISIBLE_DEVICES set to 0).

Yep! As you suggest, despite Flux not knowing about GPUs on the system, it will happily run your GPU jobs if their jobspec does not request any GPUs. If the number of GPUs passed to the jobspec constructor is set to 0, then the jobspec will not insert a GPU resource and CUDA_VISIBLE_DEVICES won't be set at all. Counter-intuitively, this means your jobs will see all of the GPUs.

I suspect that some of our issues previously were from things being launched to the same GPU, so I'm curious to see if this fix alleviates that problem.

I think you are probably right that this was resulting in all of the jobs using the same GPU, since they will presumably all use GPU 0 automatically.

Let us know if you run into any other issues with the GPU jobs. You are able to co-schedule the GPU and CPU jobs on the same node now after removing the system-reserved cores, correct? If so, are there any remaining "gotchas" mentioned in this issue that haven't been addressed?
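
If it helps, one quick way to spot-check the co-scheduling, reusing the flux job info pattern from earlier in this thread (the createsim/ddcmd commands are placeholders):

# Submit the two co-located jobs and compare their resource sets.
CREATESIM=$(flux mini submit -N 1 -n 1 -c 24 ./createsim.sh)
DDCMD=$(flux mini submit -N 1 -n 1 -c 20 -g 4 ./ddcmd.sh)

# R lists the rank/node plus the core and gpu ranges assigned to each job;
# both should report the same node if co-scheduling is working.
flux job info ${CREATESIM} R
flux job info ${DDCMD} R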

FrankD412 commented 3 years ago

I suspect that some of our issues previously were from things being launched to the same GPU, so I'm curious to see if this fix alleviates that problem.

I think you are probably right that this was resulting in all of the jobs using the same GPU, since they will presumably all use GPU 0 automatically.

Let us know if you run into any other issues with the GPU jobs. You are able to co-schedule the GPU and CPU jobs on the same node now after removing the system-reserved cores, correct? If so, are there any remaining "gotchas" mentioned in this issue that haven't been addressed?

Yeah, core_isolation fixed our issue with the extra cores being shaved off, so we now have ddcMD and createsim jobs co-locating happily (need to verify pinning, but at least for now they're being put on the same node). I'm running a test on Lassen to see if the ddcMD runs work correctly -- there was an exception that came up related to a CUDA free that might be due to a GPU being over-scheduled. So far, I've managed to confirm that the new install of Flux hasn't broken anything functionality-wise. Do you happen to know how to interrogate what GPU a process is running on?

SteVwonder commented 3 years ago

Do you happen to know how to interrogate what GPU a process is running on?

Not sure the "best way", but two thoughts are:

FrankD412 commented 3 years ago

Alright... so far it looks like one set of simulations started and two crashed. However, the two remaining simulations are in fact running on two GPUs. So this looks like it's solved -- the previously mentioned error came up again, so it looks like it's not related to overscheduling onto one GPU. I think that's the last outstanding question in this thread since @SteVwonder already started an issue about jobs using GPUs that aren't theirs.

nvidia-smi
Thu Aug 20 13:23:56 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00   Driver Version: 418.116.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   28C    P0    36W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   27C    P0    35W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   41C    P0   163W / 300W |    595MiB / 16130MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   40C    P0   165W / 300W |    595MiB / 16130MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    2    109206      C   ...kras/sierra/ddcmd-gpu7/bin/ddcMD-sierra   583MiB |
|    3    109372      C   ...kras/sierra/ddcmd-gpu7/bin/ddcMD-sierra   583MiB |
+-----------------------------------------------------------------------------+
SteVwonder commented 3 years ago

The previously mentioned error came up again, so it looks like it's not related to overscheduling onto one GPU. I think that's the last outstanding question in this thread since @SteVwonder already started an issue about jobs using GPUs that aren't theirs.

What was the error that it produced?

FrankD412 commented 3 years ago

The error that we're seeing is: Cuda failure gpuMemUtils.cu:212: 'invalid argument'

Xiaohua tracked that down to an error related to a CUDA memory free operation, but we're currently unsure why it fails. The way it was described (and the only reason I know of) for the free to fail is that the memory pointer is for some reason invalid. I initially thought that maybe the multiple processes were causing the code to "misbehave" by not being in a state it expected. That was disproved by the recent test today, which showed them running on independent GPUs.

dongahn commented 3 years ago

Maybe the recent support for TotalView can come in handy for this.

You can run one of the failed jobs under the control of TotalView with "CUDA memcheck" and see if the tool detects the source of the error automatically.

A basic TotalView how-to is described here: https://flux-framework.readthedocs.io/en/latest/debugging.html
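
As a lighter-weight first pass (purely a suggestion, and the ddcMD arguments below are placeholders), the standalone cuda-memcheck tool can also be run under Flux on a single failing case:

# Sketch: rerun one failing ddcMD system under cuda-memcheck to get a report
# on the invalid-argument free.
flux mini run -n 1 -c 4 -g 1 cuda-memcheck ddcMD-sierra ${DDCMD_ARGS}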

dongahn commented 3 years ago

Most of what's described here should have been fixed with better nesting support via resource readers in PR #787.