Tagging: @bhatiaharsh and @FrankD412.
@FrankD412 and others, when you get the allocation on Summit, are any of the 44 cores reserved as system cores? If so, I believe a cgroup gets setup that limits Flux (and its jobs) to less than 44 cores, which would prevent co-scheduling in this scenario.
I think one way you could verify this is to run flux hwloc info
in your allocation.
@SteVwonder: that's exactly what I thought. I was doing some testing on Lassen to confirm. I also want to see if there is a way to get rid of core isolation to expose all 44.
On Lassen, I just confirmed that only 40 cores per compute node are exposed because of core isolation, so the above jobspecs cannot be scheduled.
lassen708{dahn}21: flux hwloc info
2 Machines, 80 Cores, 320 PUs
Let me see if there are ways to get rid of core isolation next.
The Summit docs make it appear that you cannot change it: https://docs.olcf.ornl.gov/systems/summit_user_guide.html#system-service-core-isolation
I know on Lassen you can control it via lrun and the other LLNL-specific wrappers (which probably means there is a hook into bsub/jsrun).
Yeah, I think this has to be changed via bsub for us. If this cannot be changed on ORNL systems, the MuMMI workflow should just schedule the "user-visible" cores or work with the facility to turn off the isolated cores for their bsub jobs.
From an announcement early this year:
This morning, Tuesday 1/28/2020 ~10am, we switched the default core_isolation from 0 to 2 if a core isolation is not specified when using bsub on SIERRA. If you encounter issues with this default change, please let us know. The old behavior can be restored by either adding -core_isolation 0 to your bsub line or adding a line “#BSUB -core_isolation 0 “ to your bsub script.
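For reference, the batch-script form of that second option would look roughly like this (a sketch; the node count and queue here are placeholders):
#!/bin/bash
#BSUB -nnodes 2
#BSUB -q pdebug
#BSUB -core_isolation 0
# launch Flux with jsrun as shown later in this thread; flux hwloc info
# should then report all 44 cores per node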
Ok. I confirmed this is due to core isolation, and it should be addressed with one of my recommendations above: https://github.com/flux-framework/flux-sched/issues/728#issuecomment-674139797
lassen709{dahn}37: bsub -Is -XF -nnodes 2 -core_isolation 0 -qpdebug /usr/bin/bash
Job <1346013> is submitted to queue <pdebug>.
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on lassen710>>
bash-4.2$ module use /usr/global/tools/flux/blueos_3_ppc64le_ib/modulefiles
bash-4.2$ module load pmi-shim
bash-4.2$ PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux start
2020-08-14T15:54:43.206740Z broker.err[0]: rc2.0: /bin/tcsh Interrupt (rc=130) 9.5s
WARNING: exiting due to 2 SIGINT's within 1 second. Job step may still be running and must be managed manually with jskill
bash-4.2$ PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 /usr/global/tools/flux/bl3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux start ~/ip.sh
ssh://lassen3/var/tmp/flux-tbjlBX/0
lassen708{dahn}23: /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux proxy ssh://lassen3/var/tmp/flux-tbjlBX/0/local
lassen708{dahn}21: flux hwloc info
2 Machines, 88 Cores, 352 PU
Whereas without the -core_isolation 0 flag:
lassen708{dahn}22: /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux proxy ssh://lassen6/var/tmp/flux-aR3vLf/0/local
lassen708{dahn}21: flux hwloc info
2 Machines, 80 Cores, 320 PUs
I will keep this ticket open a bit in case @bhatiaharsh or @FrankD412 have more questions.
@dongahn and @SteVwonder -- Thanks for looking into this. You two flagged exactly the same issue that I thought it might be when I saw this issue go up. We'll definitely try and go back to the old isolation settings, as we need all the cores we can get. I also noticed something in the jobspec that I'd like to verify on our end, as well. Thanks for the pointers! -- and I'll definitely post back here with any other questions we might have.
Thanks. Onto your other issue now.
Referring to this comment in #729
Alright -- sorry about the delay, power outages and other things managed to keep me from posting; it looks like we got this sorted out. I think in a previous instantiation we may have wrapped the ddcMD jobs in wrapper scripts in order to get the naming the way we wanted. Those may have included brokers, but I honestly couldn't tell you at this point; it was long enough ago. I had initially thought that the flux mini run calls would be caught by the master broker on the node, but that doesn't appear to be the case. Once we scheduled under a broker, ddcMD successfully ran and was named appropriately with my recent changes to Maestro's Flux adapter -- so that now gets rid of the middle-man wrapper script.
We did, in the course of figuring this out, run into our Flux instance telling us that requests made with GPUs were unsatisfiable. We're currently unsure if this is an installation issue or a bigger issue. @SteVwonder recommended starting a new issue for that error.
The co-located job specifications worked with a sub-broker; it turns out that scheduling a script that isn't under a broker that calls flux mini run will result in the jobs being fed back up to the master broker. This was a misunderstanding of which layer would pick up the mini calls. That allowed us to run ddcMD, and the new call will request the GPUs through the Flux backend API; however, when calling flux mini run from the command line in the course of trying to debug ddcMD calls and re-submit Maestro-generated specifications, we get a resource-unsatisfiable error which flags that the GPUs cannot be allocated. As soon as we remove the -g, the mini command goes through. I'll have to reproduce/dig up the error, but that's how we end up getting the error to show up. We were able to reproduce the error consistently, so we could try and reproduce it for you on one of our weekly DATs if you'd like.
The co-located job specifications worked with a sub-broker;
Is a sub-broker a nested Flux instance?
it turns out that scheduling a script that isn't under a broker that calls flux mini run will result in the jobs being fed back up to the master broker.
Is the master broker the outer-most Flux instance? Could you elaborate what you mean by "scheduling a script that isn't under a broker"? I am not clear how that could get fed to the outer-most Flux instance.
however, when calling flux mini run from the command line in the course of trying to debug ddcMD calls and re-submit Maestro generated specifications, we get a resource unsatisfiable error which flags that the GPUs cannot be allocated.
Which instance does the flux mini run request go to? The outer-most Flux instance or the nested instance? My guess is that it is one of the nested instances, which isn't allocated any GPUs.
The co-located job specifications worked with a sub-broker;
Is a sub-broker a nested Flux instance?
Sorry, I'm using incorrect terminology. Yeah -- I'm referring to nested Flux instances.
it turns out that scheduling a script that isn't under a broker that calls flux mini run will result in the jobs being fed back up to the master broker.
Is the master broker the outer-most Flux instance? Could you elaborate what you mean by "scheduling a script that isn't under a broker"? I am not clear how that could get fed to the outer-most Flux instance.
So in our 10 node DATs, we would find that when the MuMMI workflow wasn't using a nested instance of Flux to schedule the bundles of ddcMD jobs, the flux mini run calls that the script made would show in the outer-most Flux instance. The script would simply loop over four systems to simulate and call flux mini run on each system. The individual runs would show up in the outer-most Flux instance. From checking the processes on each node and my own debugging, I know that a Flux broker (or process) is on each node, so I expected the mini calls to bubble up to the node instance and not the outer-most. I guess in my mind the node-level instances were a sub-level under the master instance.
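For illustration, the nested pattern that eventually worked (the working jobspec later in this thread wraps the script in flux start) looks roughly like the following; the core/GPU counts and script name are placeholders:
# from the outer-most instance, hand a whole node to a nested broker
flux mini submit -N1 -n1 -c40 -g4 flux start ./run_bundle.sh
# run_bundle.sh executes inside the nested instance, so per-system calls like
#   flux mini run -n1 -c10 -g1 ddcMD ...
# stay on that node instead of bubbling up to the outer-most instance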
however, when calling flux mini run from the command line in the course of trying to debug ddcMD calls and re-submit Maestro generated specifications, we get a resource unsatisfiable error which flags that the GPUs cannot be allocated.
Which instance does the flux mini run request go to? The outer-most Flux instance or the nested instance? My guess is that it is one of the nested instances, which isn't allocated any GPUs.
Sorry, I misspoke -- I meant to say flux mini submit here. We would run a flux mini submit with the generated script from the outer-most instance. We would allocate it all the GPUs, but it would come back as unsatisfiable. We even cherry-picked single systems to run using a flux mini run and ran into the same unsatisfiable error.
For reference from a quick allocation I spun up:
flux hwloc info
10 Machines, 440 Cores, 1760 PUs
Are you able to schedule any GPU resources? I remember there is a problem with the default hwloc such that Flux cannot automatically discover GPU resources. Did you use the module we put together that allows us to use the right version of hwloc?
It is documented at https://flux-framework.readthedocs.io/en/latest/coral.html.
Could you run a very small instance like at 2 Lassen nodes and run the following tests?
initial_program.sh:
#! /bin/bash
JOBID=$(flux mini submit -N 2 -n 2 -c 2 -g 2 sleep 60)
flux job info ${JOBID} R > JOBID.${JOBID}.R
flux queue drain
jsrun ... /your/path/flux start ./initial_program.sh
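For concreteness, the jsrun line can mirror the invocation used earlier in this thread, substituting whatever Flux install you are using:
PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 /usr/global/tools/flux/blueos_3_ppc64le_ib/flux-c0.18.0-s0.10.0/bin/flux start ./initial_program.sh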
I am also curious how you launch node-level Flux instances to see if each of them gets GPU resources. Can you elaborate?
The script would simply loop over four systems to simulate and call flux mini run on each system.
What are those "four" systems?
Are you able to schedule any GPU resources? I remember there is a problem with the default hwloc such that Flux cannot automatically discover GPU resources. Did you use the module we put together that allows us to use the right version of hwloc?
It is documented at https://flux-framework.readthedocs.io/en/latest/coral.html.
We do not appear to be explicitly loading the hwloc as per the documentation. Below is what flux --version prints and what our module loading looks like. I'll try introducing the module load hwloc/1.11.10-cuda command to our environment loading. What's interesting here though, is that we do have ddcMD simulations starting under a nested Flux instance, which implies that the inner instance is capable of seeing the GPUs. Is that just a side-effect of the inner instance having "ownership" of the GPUs via the allocation?
flux --version
commands: 0.17.0
libflux-core: 0.17.0
build-options: +hwloc==1.11.6
if [[ $HOST == lassen* ]]; then
source /etc/profile.d/z00_lmod.sh
MODULE_FILE=/usr/global/tools/flux/blueos_3_ppc64le_ib/modulefiles
SHIM_MODULE=pmi-shim
MPI_MODULE=spectrum-mpi/2019.06.24-flux
# FLUX=`which flux`
# Seemed to work without the spectrum load, but will keep here for documentation
# SHIM_MODULE="spectrum-mpi/2019.06.24-flux pmi-shim"
elif [[ $HOST == *summit* ]]; then
# MODULE_FILE="/sw/summit/modulefiles/ums/gen007flux/Core"
# SHIM_MODULE="pmi-shim"
echo '> ERROR: need shim module for' $HOST
else
echo '> ERROR: Unidentified host '$HOST
return
fi
module use $MODULE_FILE
module load $SHIM_MODULE
module load $MPI_MODULE
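The change I'm planning is just appending the CUDA-enabled hwloc module after those loads, roughly:
# per https://flux-framework.readthedocs.io/en/latest/coral.html
module load hwloc/1.11.10-cuda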
Could you run a very small instance like at 2 Lassen nodes and run the following tests?
initial_program.sh:
#! /bin/bash
JOBID=$(flux mini submit -N 2 -n 2 -c 2 -g 2 sleep 60)
flux job info ${JOBID} R > JOBID.${JOBID}.R
flux queue drain
jsrun ... /your/path/flux start ./initial_program.sh
I am also curious how you launch node-level Flux instances to see if each of them gets GPU resources. Can you elaborate?
I'll add the fix I mentioned above and use this as a test to see if my hunch is right and the module load above with the CUDA-enabled hwloc fixes the issue.
The script would simply loop over four systems to simulate and call flux mini run on each system.
What are those "four" systems?
The four systems are just individual ddcMD systems that we simulate, one per GPU.
If you used a later release (core v0.18 and sched v0.10), you would have gotten a better listing of resources with flux resource list.
What's interesting here though, is that we do have ddcMD simulations starting under a nested Flux instance, which implies that the inner instance is capable of seeing the GPUs. Is that just a side-effect of the inner instance having "ownership" of the GPUs via the allocation?
I don't think so. It seems we may need more investigation.
If you used a later release (core v0.18 and sched v0.10), you would have gotten a better listing of resources with flux resource list.
What's interesting here though, is that we do have ddcMD simulations starting under a nested Flux instance, which implies that the inner instance is capable of seeing the GPUs. Is that just a side-effect of the inner instance having "ownership" of the GPUs via the allocation?
I don't think so. It seems we may need more investigation.
Sounds good -- I'm loading up the hwloc specified in the documentation and will run a quick test here momentarily. I can then revert back to not loading it and see how it behaves.
@FrankD412: typically, MuMMI uses a spack-installed flux-core/flux-sched, right? Were they built with the +cuda variant enabled? You can check by getting the hash of either the flux-core/flux-sched spack package that you are loading and running spack spec -ldv /hash-of-package.
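Something along these lines, for example (the hash argument is whatever spack find reports for your package):
# list installed flux-sched packages with hashes and variants
spack find -lv flux-sched
# then inspect the full spec of the one you load; look for +cuda vs ~cuda
spack spec -ldv /hash-of-package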
@dongahn -- I tried to launch and get the following error. It produces a single JOBID file that's empty.
----> Launching Flux
> flux : /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/bin/flux
> version:
commands: 0.17.0
libflux-core: 0.17.0
build-options: +hwloc==1.11.6
> NUM_NODES = 3
> FLUX_ROOT = /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux
> FLUX_INFO = /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.info
> FLUX_BOOTSTRAP = /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
> Loading flux environment (lassen5.coral.llnl.gov)
> Launching Flux using jsrun
flux-start: /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/libexec/flux/cmd/flux-broker -S log-filename=/p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.log /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
flux-start: /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/libexec/flux/cmd/flux-broker -S log-filename=/p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.log /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
flux-start: /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/libexec/flux/cmd/flux-broker -S log-filename=/p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.log /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh
flux-job: job 69893881856 id or key not found
> Flux launched using jsrun
The log produced by Flux has the following:
2020-08-19T21:04:35.807510Z job-manager.debug[0]: scheduler: ready single
2020-08-19T21:04:35.807768Z sched-simple.debug[0]: ready: 132 of 132 cores: rank[0-2]/core[0-43]
2020-08-19T21:04:35.903650Z broker.debug[1]: insmod job-ingest
2020-08-19T21:04:35.904963Z job-ingest.debug[1]: fluid ts=362ms
2020-08-19T21:04:35.913668Z broker.debug[2]: insmod job-ingest
2020-08-19T21:04:35.914898Z job-ingest.debug[2]: fluid ts=372ms
2020-08-19T21:04:35.920002Z broker.info[0]: rc1.0: running /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/etc/flux/rc1.d/01-enclosing-instance
2020-08-19T21:04:36.029929Z broker.info[0]: rc1.0: running /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-sched-0.9.0-ts7rizygn3sxu3a4wn4tnkb7x74ohe66/etc/flux/rc1.d/sched-fluxion-qmanager-start
2020-08-19T21:04:36.196230Z broker.debug[0]: rmmod sched-simple
2020-08-19T21:04:36.196466Z sched-simple.debug[0]: service_unregister
2020-08-19T21:04:36.196672Z broker.debug[0]: module sched-simple exited
2020-08-19T21:04:36.196797Z resource.debug[0]: acquire_disconnect: resource.acquire aborted
2020-08-19T21:04:36.560999Z broker.debug[0]: insmod sched-fluxion-qmanager
2020-08-19T21:04:36.561817Z sched-fluxion-qmanager.debug[0]: enforced policy (queue=default): fcfs
2020-08-19T21:04:36.561841Z sched-fluxion-qmanager.debug[0]: effective queue params (queue=default): default
2020-08-19T21:04:36.561849Z sched-fluxion-qmanager.debug[0]: effective policy params (queue=default): default
2020-08-19T21:04:36.562204Z sched-fluxion-qmanager.debug[0]: service_register
2020-08-19T21:04:36.562379Z job-manager.debug[0]: scheduler: hello
2020-08-19T21:04:36.562759Z job-manager.debug[0]: scheduler: ready unlimited
2020-08-19T21:04:36.563957Z broker.info[0]: rc1.0: running /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-sched-0.9.0-ts7rizygn3sxu3a4wn4tnkb7x74ohe66/etc/flux/rc1.d/sched-fluxion-resource-start
2020-08-19T21:04:36.970007Z broker.debug[0]: insmod sched-fluxion-resource
2020-08-19T21:04:36.970803Z sched-fluxion-resource.debug[0]: mod_main: resource module starting
2020-08-19T21:04:37.018494Z sched-fluxion-resource.info[0]: populate_resource_db: loaded resources from hwloc in the KVS
2020-08-19T21:04:37.020163Z sched-fluxion-resource.debug[0]: mod_main: resource graph database loaded
2020-08-19T21:04:38.021926Z broker.info[0]: rc1.0: /bin/zsh -c /usr/gapps/kras/spack6/opt/spack/linux-rhel7-power9le/gcc-7.3.1/flux-core-0.17.0-nq5is7d55stj3sjjk3hceqk2ql63fgcb/etc/flux/rc1 Exited (rc=0) 4.5s
2020-08-19T21:04:38.022018Z broker.info[0]: rc1-success: init->run
2020-08-19T21:04:38.024795Z broker.err[0]: rc2.0: /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/initial_program.sh /p/gpfs1/fdinatal/roots/flux_testing/hwloc/flux/flux.info error starting command (rc=1) 0.0s
2020-08-19T21:04:38.024889Z broker.info[0]: rc2-fail: run->cleanup
2020-08-19T21:04:38.204138Z broker.info[0]: cleanup.0: /bin/zsh -c flux queue stop --quiet Exited (rc=0) 0.2s
2020-08-19T21:04:38.378076Z broker.info[0]: cleanup.1: /bin/zsh -c flux job cancelall --user=all --quiet -f --states RUN Exited (rc=0) 0.2s
2020-08-19T21:04:38.562958Z broker.info[0]: cleanup.2: /bin/zsh -c flux queue idle --quiet Exited (rc=0) 0.2s
2020-08-19T21:04:38.563017Z broker.info[0]: cleanup-success: cleanup->finalize
@SteVwonder -- The spack tree is as follows.
spack find -dvl /ts7rizy
==> 1 installed package
-- linux-rhel7-power9le / gcc@7.3.1 -----------------------------
ts7rizy flux-sched@0.9.0~cuda
vvov2d3 boost@1.72.0+atomic+chrono~clanglibcpp~context~coroutine cxxstd=98 +date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout visibility=hidden +wave
kredonm bzip2@1.0.8+shared
wup2hw6 zlib@1.2.11+optimize+pic+shared
nq5is7d flux-core@0.17.0~cuda~docs
d75mvn5 czmq@4.1.1
frwhvqe libuuid@1.0.3
5crsyl3 libzmq@4.3.2+libsodium
6g3m3f5 libsodium@1.0.17
6sbfjil hwloc@1.11.11~cairo~cuda~gl+libxml2~nvml+pci+shared
wtudf3r libpciaccess@0.13.5
g3dcod5 libxml2@2.9.9~python
4b7rg3o libiconv@1.16
xcurmxx xz@5.2.4
w6ncfte numactl@2.0.12
7rh2ree jansson@2.9 build_type=RelWithDebInfo +shared
aaoiw74 lua@5.2.4
aduzbso ncurses@6.1~symlinks~termlib
2ujhpjz readline@8.0
656zrir unzip@6.0
gbzgv7h lua-luaposix@33.4.0
vj67x5s lz4@1.9.2
zcwgmes pkgconf@1.6.3
4hpdg4j py-cffi@1.13.0
4vehugn libffi@3.2.1
63ld7ym py-pycparser@2.19
4plz5ft python@3.7.3+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4~uuid+zlib
enrjzup expat@2.2.9+libbsd
6huorkj libbsd@0.10.0
6i4b2ms gdbm@1.18.1
6rekqlv gettext@0.20.1+bzip2+curses+git~libunistring+libxml2+tar+xz
fodcl7k tar@1.32
72igzz4 openssl@1.1.1d+systemcerts
7p7c7xr sqlite@3.30.1~column_metadata+fts~functions~rtree
5mnm3ui py-jsonschema@3.2.0
joaaqkp py-attrs@19.2.0
72mwgbb py-pyrsistent@0.16.0
67vupme py-hypothesis@4.41.2
ywgown5 py-memory-profiler@0.47
ygudpfs py-psutil@2.1.1
vrcy3c2 py-pytest@5.2.1
zrptpo4 py-atomicwrites@1.3.0
q6fruqd py-importlib-metadata@1.2.0
kxb54oc py-zipp@0.6.0
lhckoew py-more-itertools@7.2.0
tplfrvu py-packaging@19.2
o3pp35u py-pyparsing@2.4.2
ech4awl py-six@1.12.0
7axqoif py-pluggy@0.13.0
2nz7d4u py-py@1.8.0
hxsc2qt py-setuptools@41.4.0
wfeguvt py-wcwidth@0.1.7
eejdro5 py-sphinx@2.2.0
o3rqwf3 py-alabaster@0.7.12
pfddoj6 py-babel@2.7.0
at75jbz py-pytz@2019.3
ep5si3c py-docutils@0.15.2
5a7yu4r py-imagesize@1.1.0
jgytqop py-jinja2@2.10.3
r7pwgkj py-markupsafe@1.1.1
4mniign py-pygments@2.4.2
h5c54kf py-requests@2.22.0
nqey7gd py-certifi@2019.9.11
mxweks6 py-chardet@3.0.4
umhq3zi py-idna@2.8
vjrkeur py-urllib3@1.25.6
gx6p5yp py-snowballstemmer@2.0.0
v3hjmuw py-sphinxcontrib-applehelp@1.0.1
hexucsh py-sphinxcontrib-devhelp@1.0.1
ahhpqpr py-sphinxcontrib-htmlhelp@1.0.2
v4ava3e py-sphinxcontrib-jsmath@1.0.1
xbit3qv py-sphinxcontrib-qthelp@1.0.2
ytuv3hh py-sphinxcontrib-serializinghtml@1.1.3
cmkz6tc py-sphinx-rtd-theme@0.1.5
dvtwsme py-tox@3.14.2
q2yf6rf py-filelock@3.0.4
inh52uq py-toml@0.10.0
zd2itpq py-virtualenv@16.7.6
y5exivb py-vcversioner@2.16.0.0
6wul6c3 py-pyyaml@5.1.2
w7skmmz libyaml@0.2.2
lo2ptf5 yaml-cpp@0.6.3 build_type=RelWithDebInfo +pic+shared~tests
Sorry about the delay -- still working on figuring out if I'm doing something wrong. I may spin up an instance and submit the script manually to all ranks, but I don't know if that'll have the desired effect you're after.
@SteVwonder -- I notice that we use ~cuda, which is specifying an explicit dependency. That may not be triggering the variant clause of +cuda?
Also, for more reference information, here is how we're standing up Flux:
source $FLUX_ENV
echo " > Launching Flux using jsrun"
mkdir -p $FLUX_ROOT
cd $FLUX_ROOT
unset OMP_NUM_THREADS
#PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n $NNODES flux start -o,-S,log-filename=$FLUX_LOG -v $FLUX_BOOTSTRAP $FLUX_INFO
PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n $NNODES flux start -o,-S,log-filename=$FLUX_LOG -v $FLUX_BOOTSTRAP
#PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n $1 echo $(which flux)
echo " > Flux launched using jsrun"
The $FLUX_INFO variable is normally included, but for the sake of the test it was disabled and doesn't change anything.
I notice that we use ~cuda which is specifying an explicit dependency. That may not be triggering the variant clause of +cuda?
Yeah, ~cuda means that the cuda variant is not enabled. If you launch that Spack-installed Flux, it will not be able to detect GPUs. I don't think module load hwloc/1.11.10-cuda will have any effect for a Spack-installed Flux since Spack rpath's everything. If you do spack install flux-sched +cuda, that will install a CUDA-enabled Flux via spack: https://flux-framework.readthedocs.io/en/latest/quickstart.html#spack-recommended-for-curious-users
You can compose the variant with any other constraints too. For example: spack install flux-sched@0.9.0 +cuda ^flux-core@0.17.0 ^lua@5.2.4
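If the CUDA support needs to reach flux-core as well (flux-core is the package that carries the hwloc dependency in the spec above), the variant can be pinned there explicitly too; a sketch composed with the same versions:
spack install flux-sched@0.9.0+cuda ^flux-core@0.17.0+cuda ^lua@5.2.4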
Got it -- will give that a shot and get back to you. Thanks for all the information!
Oh -- the install solves being able to schedule GPUs from the command line, but it still doesn't entirely answer being able to run a GPU-enabled ddcMD in nested instances. Those were submitted via the Maestro API found here.
Some progress -- we got the +cuda variant installed, and even though the script crashed, it did get one file out. It's no longer empty:
cat flux/JOBID.117608284160.R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "node": "lassen3", "children": {"core": "42-43", "gpu": "2-3"}}, {"rank": "1", "node": "lassen4", "children": {"core": "42-43", "gpu": "2-3"}}], "starttime": 1597888776, "expiration": 1598493576}}
I'm now running our workflow to make sure things still work, but Flux does appear to be finding the GPUs now.
but it still doesn't entirely answer being able to run a GPU enabled ddcMD in nested instances. Those were submitted via the Maestro API found here.
Speculative guess: when the jobs were submitted in a nested instance, was the number of GPUs specified? If the API equivalent of flux mini run -n1 -c2 ddcMD was used (i.e., no -g 1), then I could see how that might work. I think we only set CUDA_VISIBLE_DEVICES when -g is specified (i.e., a gpu resource appears in the jobspec). So it may be the case that when -g is not specified, jobs running can see all of the GPUs (and of course the scheduler won't worry about finding them if they aren't asked for).
I think we only set CUDA_VISIBLE_DEVICES when -g is specified (i.e., a gpu resource appears in the jobspec).
Confirmed that is indeed the case:
ƒ(s=1,d=0) fluxuser@23763266b859:/src$ flux mini run -n1 -c1 printenv | grep CUDA
ƒ(s=1,d=0) fluxuser@23763266b859:/src$ flux mini run -n1 -c1 -g1 printenv | grep CUDA
CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VISIBLE_DEVICES=0
but it still doesn't entirely answer being able to run a GPU-enabled ddcMD in nested instances. Those were submitted via the Maestro API found here.
Speculative guess: when the jobs were submitted in a nested instance, was the number of GPUs specified? If the API equivalent of flux mini run -n1 -c2 ddcMD was used (i.e., no -g 1), then I could see how that might work. I think we only set CUDA_VISIBLE_DEVICES when -g is specified (i.e., a gpu resource appears in the jobspec). So it may be the case that when -g is not specified, jobs running can see all of the GPUs (and of course the scheduler won't worry about finding them if they aren't asked for).
So, I confirmed we are setting GPUs in the MuMMI workflow code and Maestro does pass it through here. That does get passed to the API via the call here.
The jobspecs above don't include GPUs even though we are passing them. I can check the new ones I just generated though.
EDIT: Though I'll have to wait for ddcMD to be spun up in about 40 minutes.
Confirmed that is indeed the case:
If the job is using a GPU even though none was scheduled to it, that's wrong :-) At some point, we will need cgroup support for better resource containment.
Alright, I found a small but very easy-to-miss bug in Maestro which forced GPUs to 0. However, this makes it even more interesting that ddcMD was able to access GPU resources (albeit, it seems to me that they all may have been on the same GPU). I just made the fix in Maestro and will test again, but there's definitely something interesting happening with jobs being able to get on even a single GPU.
Tracked down the new job spec. The job spec below was created by Maestro using the JobSpecV1 class from flux-core@0.17.0+cuda. GPUs now show up since they aren't being forced to zero. I'm not sure if Flux without CUDA would have thrown an exception (it's looking like it wouldn't as @SteVwonder pointed out he was able to schedule, but would simply have CUDA_VISIBLE_DEVICES set to 0).
I suspect that some of our issues previously were from things being launched to the same GPU, so I'm curious to see if this fix alleviates that problem.
"resources":[
{
"type":"node",
"count":1,
"with":[
{
"type":"slot",
"count":1,
"with":[
{
"type":"core",
"count":20
},
{
"type":"gpu",
"count":4
}
],
"label":"task"
}
]
}
],
"tasks":[
{
"command":["flux", "start", "/p/gpfs1/fdinatal/roots/mummi_root_20200819/workspace/run_ddcmd_analysis-pfpatch_000000000001_pfpatch_000000000101_pfpatch_000000000298_pfpatch_000000000299.flux.sh"],
"slot":"task",
"count":{
"per_slot":1
}
}
]
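For comparison, the jobspec that Flux's own CLI generates for an equivalent request can be dumped without submitting anything, assuming the --dry-run option is available in this release (the wrapped script path is a placeholder):
flux mini submit --dry-run -N1 -n1 -c20 -g4 flux start ./run_ddcmd_analysis.flux.sh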
If the job is using a GPU even though none was scheduled to it, that's wrong :-) At some point, we will need cgroup support for better resource containment.
Opened an issue over in flux-core to track better soft-containment of GPUs using CUDA_VISIBLE_DEVICES when no GPUs are requested: https://github.com/flux-framework/flux-core/issues/3154
Tracked down the new job spec. The job spec below was created by Maestro using the JobSpecV1 class from flux-core@0.17.0+cuda. GPUs now show up since they aren't being forced to zero.
Awesome! LGTM!
I'm not sure if Flux without CUDA would have thrown an exception (it's looking like it wouldn't as @SteVwonder pointed out he was able to schedule, but would simply have CUDA_VISIBLE_DEVICES set to 0).
Yep! As you suggest, despite Flux not knowing about GPUs on the system, it will happily run your GPU jobs if their jobspec does not request any GPUs. If the number of GPUs passed to the jobspec constructor is set to 0, then the jobspec will not insert a gpu resource and CUDA_VISIBLE_DEVICES won't be set at all. Counter-intuitively, this means your jobs will see all of the GPUs.
I suspect that some of our issues previously were from things being launched to the same GPU, so I'm curious to see if this fix alleviates that problem.
I think you are probably right that this was resulting in all of the jobs using the same GPU, since they will presumably all use GPU 0 automatically.
Let us know if you run into any other issues with the GPU jobs. You are able to co-schedule the GPU and CPU jobs on the same node now after removing the system-reserved cores, correct? If so, are there any remaining "gotchas" mentioned in this issue that haven't been addressed?
I suspect that some of our issues previously were from things being launched to the same GPU, so I'm curious to see if this fix alleviates that problem.
I think you are probably right that this was resulting in all of the jobs using the same GPU, since they will presumably all use GPU 0 automatically.
Let us know if you run into any other issues with the GPU jobs. You are able to co-schedule the GPU and CPU jobs on the same node now after removing the system-reserved cores, correct? If so, are there any remaining "gotchas" mentioned in this issue that haven't been addressed?
Yeah, core_isolation fixed our issue with the extra cores being shaved off, so we now have ddcMD and createsim jobs co-locating happily (need to verify pinning, but at least for now they're being put on the same node). I'm running a test on Lassen to see if the ddcMD runs work correctly -- there was an exception that came up related to a CUDA free that might be due to a GPU being over-scheduled. So far, I've managed to confirm that the new install of Flux hasn't broken anything functionality-wise. Do you happen to know how to interrogate what GPU a process is running on?
Do you happen to know how to interrogate what GPU a process is running on?
Not sure of the "best way", but two thoughts are:
- nvidia-smi to look at GPU utilization is one way
- adding printenv CUDA_VISIBLE_DEVICES to your job script, so that it runs before each ddcMD invocation, to see what "soft restrictions" are in place
Alright... so far it looks like one set of simulations started and two crashed. However, the two remaining simulations are in fact running on two GPUs. So this looks like it's solved -- the previously mentioned error came up again, so it looks like it's not related to overscheduling onto one GPU. I think that's the last outstanding question in this thread since @SteVwonder already started an issue about jobs using GPUs that aren't theirs.
nvidia-smi
Thu Aug 20 13:23:56 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00 Driver Version: 418.116.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 28C P0 36W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 27C P0 35W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 41C P0 163W / 300W | 595MiB / 16130MiB | 93% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 40C P0 165W / 300W | 595MiB / 16130MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 109206 C ...kras/sierra/ddcmd-gpu7/bin/ddcMD-sierra 583MiB |
| 3 109372 C ...kras/sierra/ddcmd-gpu7/bin/ddcMD-sierra 583MiB |
+-----------------------------------------------------------------------------+
The previously mentioned error came up again, so it looks like it's not related to overscheduling onto one GPU. I think that's the last outstanding question in this thread since @SteVwonder already started an issue about jobs using GPUs that aren't theirs.
What was the error that it produced?
The error that we're seeing is: Cuda failure gpuMemUtils.cu:212: 'invalid argument'
Xiaohua tracked that down to an error related to a CUDA memory free operation, but we're unsure why it fails currently. The way it was described (and the only reason I know of) for the free to fail is that the memory pointer is for some reason invalid. I initially thought that maybe the multiple processes were causing the code to "misbehave" from not being in a state which it expected. That was disproved by the recent test today that showed them running on independent GPUs.
Maybe the recent support for totalview can come in handy for this.
You can run one of the failed jobs under the control of totalview with "CUDA memcheck" and see if the tool detects the source of the error automatically.
Basic totalview howto is described here https://flux-framework.readthedocs.io/en/latest/debugging.html
Most of what's described here should have been fixed w/ better nesting support with resource readers in PR #787
In preparation for hero runs on ORNL Summit, the MuMMI team refreshed their workflows using jobspecs generated with the Python API from flux-core@0.17.0. They found the jobs cannot be scheduled.
Frank provided two example jobspecs he created.
jobspec_createsim.txt jobspec_ddcmd.txt
The resource requests of each spec:
createsim:
ddcmd: