Open garlick opened 2 months ago
When this came up before I was playing with a broker --progress
option (on my broker_progress
branch which has one commit, 36ff1e2a4285636641294b85d46c6b1091d4ba7c).
It prints:
flux-broker: waiting for remaining brokers to join: 63 of 64
with the numbers rewritten in place. That works in addition to the "quorum delayed" message described above which call out the missing hostnames every 5s.
I stalled out on this before because I wasn't really sure how to integrate this into the overall system.
One idea is to expose the quorum progress via an RPC, then responsibility for indicating progress can be handled by flux job attach
(or any other tool that is interested). The tool can open a handle to the instance as soon as the uri
attribute is posted to the eventlog and monitor progress. This is currently how flux alloc --bg
works, but it just monitors state-machine.wait
Problem: need better feedback to users when brokers are slow to start up in a big instance (like a large
flux alloc
).If not all the brokers enter the PMI barrier, there is no feedback. To reproduce, run
If a node completes PMI bootstrap but then fails to wire up, messages like this appear every 5s
To reproduce that I added the following patch