flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Flux bootstrapping error on IBM LSF machines (Lassen) with flux-core>=0.46 #5412

Open lpottier opened 1 year ago

lpottier commented 1 year ago

I am experiencing a strange bug with Flux when bootstrapping it on an IBM LSF machine (Lassen, TOSS3). I have a script which bootstraps a Flux instance with N=3 nodes. That script works perfectly fine on Slurm-based machines with any flux-core (tested with 0.52) and works on Lassen with flux-core=0.45, but fails with any flux-core>=0.46.

When it “fails”, it actually creates N brokers on N nodes, but Flux only sees one node no matter what: flux resource list shows only one node, and flux overlay status returns

0 lassen29: full

It looks like only one broker is being recognized but 3 are running.

Note that only one jsrun process (jsrun -X 1 -a 1 -c ALL_CPUS -g ALL_GPUS -n 3 --bind=none --smpiargs=-disable_gpu_hooks flux start -o,-S,log-filename=ams-flux.log -v ./flux-wrapper.bYYUfq.sh ams-uri.log) is running, on a single node. When the bootstrapping is correct (with flux-core<=0.45), I have noticed that 2 jsrun processes are running on two different nodes (in the case N=3).

To reproduce the issue (on Lassen with flux-core>=0.46, flux-sched==0.28):

bsub -nnodes 3 -W 15 -q pdebug ./test-flux.sh 3

I have created a GitHub Gist with a log from one faulty run on Lassen and the testing script test-flux.sh: https://gist.github.com/lpottier/d5bfa347b958b39aaa48f415c3668cbf

grondo commented 1 year ago

I don't remember the details here, but have you tried not loading the pmi-shim and instead using module load flux, as noted in this FAQ entry? The problem you describe means that the flux brokers are not properly bootstrapping with the launcher PMI and are just acting as singletons.

I know that several users have had success with recent flux-core versions using the instructions in the FAQ.
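
One way to catch the singleton case early is to compare the instance size reported by the broker with the number of nodes requested from the launcher. The following is only a minimal sketch: check_instance_size is a hypothetical helper, not part of flux-core, and it assumes the wrapper script passed to flux start knows the expected node count (3 in this report).

```shell
#!/bin/sh
# Hypothetical helper (not part of flux-core): compare the broker count the
# instance reports against the node count requested from the launcher.
check_instance_size() {
    expected="$1"   # number of nodes requested (e.g. 3)
    actual="$2"     # value of the broker "size" attribute
    if [ "$actual" -eq "$expected" ]; then
        echo "ok: instance has $actual brokers"
        return 0
    fi
    echo "error: expected $expected brokers, instance reports $actual" >&2
    if [ "$actual" -eq 1 ]; then
        # A size of 1 suggests each broker started as a singleton,
        # i.e. the PMI bootstrap with the launcher did not happen.
        echo "hint: size 1 suggests the brokers are running as singletons" >&2
    fi
    return 1
}

# Intended use inside the script passed to `flux start` (assumption:
# 3 nodes were requested, as in the report above):
#   check_instance_size 3 "$(flux getattr size)" || exit 1
```

With a check like this at the top of the wrapper script, a run in which every broker comes up as a singleton would stop immediately with an explicit message instead of continuing with a one-node instance.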