Open lpottier opened 1 year ago
I don't remember the details here, but have you tried not loading the pmi-shim and instead using `module load flux`, as noted in this FAQ entry? The problem you describe means that the flux brokers are not properly bootstrapping with the launcher PMI and are just acting as singletons.
I know that several users have had success with recent flux-core versions using the instructions in the FAQ.
I am experiencing a strange bug with Flux when bootstrapping it on an IBM LSF machine (Lassen, TOSS3). I have a script which bootstraps a Flux instance with N=3 nodes. That script works perfectly fine on Slurm-based machines with any flux-core (tested with 0.52) and works on Lassen with `flux-core=0.45`, but fails with any `flux-core>=0.46`.

When it “fails”, it actually creates `N` brokers on N nodes, but Flux only sees one node no matter what: when running `flux resource list` we see only one node, and `flux overlay status` returns:

It looks like only one broker is being recognized but 3 are running.
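For anyone who wants to catch this symptom from inside a script, here is a minimal sketch. It assumes it runs inside the Flux instance (for example from the wrapper script passed to `flux start`) and relies on `flux getattr size`, which prints the number of brokers in the instance. The function name `check_flux_size` is hypothetical and is not part of the gist's test script.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): compare the number of
# brokers Flux reports with the number of nodes requested from the launcher.
check_flux_size() {
    expected="$1"
    # `flux getattr size` prints the instance size (number of brokers).
    actual="$(flux getattr size)" || return 2
    if [ "$actual" -eq "$expected" ]; then
        echo "ok: $actual brokers"
    else
        # A size of 1 when several nodes were requested suggests that each
        # broker started as a singleton instead of joining one instance.
        echo "mismatch: expected $expected brokers, flux reports $actual" >&2
        return 1
    fi
}
```

Calling `check_flux_size 3` in the failing case described above would be expected to return non-zero, since the instance only sees one broker.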
Note that only one instance of the following command is running, on one node:

`jsrun -X 1 -a 1 -c ALL_CPUS -g ALL_GPUS -n 3 --bind=none --smpiargs=-disable_gpu_hooks flux start -o,-S,log-filename=ams-flux.log -v ./flux-wrapper.bYYUfq.sh ams-uri.log`

When the bootstrapping is correct (with `flux-core<=0.45`), I have noticed that I have 2 `jsrun` processes running on two different nodes (in the case `N=3`).

To reproduce the issue (on Lassen with `flux-core>=0.46`, `flux-sched==0.28`):

I have created a GitHub Gist with a log from one faulty run on Lassen and the testing script `test-flux.sh`: https://gist.github.com/lpottier/d5bfa347b958b39aaa48f415c3668cbf