garlick closed this 3 weeks ago
Mark noted offline that the proposed message obscures which node was the root cause. I agree, although it's a bit tricky to highlight the failed node in a single log message given how the code is currently structured. Would two log messages be acceptable? E.g., how's this?
Jun 04 12:00:09.522326 PDT broker.err[1]: lost contact with system76-pc (rank 3) and 1 other nodes it was connected to
Jun 04 12:00:09.723159 PDT broker.err[0]: dead to Flux: system76-pc,system76-pc (rank 3,7)
I think that's probably reasonable. Though I'm sure sysadmins and users would prefer something like "system76-pc (rank 3) failed." to make it explicit that Flux considers the node to have "failed", since they called out "lost contact" as confusing. However, I think this is fine, and maybe people will just get it after a while?
We could also ask some sysadmins for suggestions.
I had forgotten about that earlier comment. How about this then?
Jun 04 13:02:56.388276 PDT broker.err[1]: system76-pc (rank 3) failed and severed contact with 1 other nodes
Jun 04 13:02:56.589051 PDT broker.err[0]: dead to Flux: system76-pc,system76-pc (rank 3,7)
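For illustration, here is a minimal Python sketch of how the two proposed log lines could be composed: compressing the lost ranks into idset notation (e.g. "3,7", "2,6-7", "3-5") and handling the singular/plural of "other node(s)". The helper names are hypothetical; Flux's actual implementation is in C and uses its own hostlist/idset libraries, so this is just a sketch of the formatting logic.

```python
def idset_str(ranks):
    """Compress a sorted list of ranks into idset notation,
    e.g. [3, 7] -> "3,7" and [3, 4, 5] -> "3-5"."""
    out = []
    i = 0
    while i < len(ranks):
        j = i
        # extend j to the end of a consecutive run starting at i
        while j + 1 < len(ranks) and ranks[j + 1] == ranks[j] + 1:
            j += 1
        out.append(str(ranks[i]) if i == j else f"{ranks[i]}-{ranks[j]}")
        i = j + 1
    return ",".join(out)

def failure_messages(host, rank, lost):
    """Build the two proposed log lines for a failed subtree root.
    `lost` maps each lost rank (including the failed root) to its hostname.
    (Hypothetical helper, not Flux's actual API.)"""
    others = len(lost) - 1
    msg1 = (f"{host} (rank {rank}) failed and severed contact with "
            f"{others} other node{'s' if others != 1 else ''}")
    ranks = sorted(lost)
    hosts = ",".join(lost[r] for r in ranks)
    msg2 = f"dead to Flux: {hosts} (rank {idset_str(ranks)})"
    return msg1, msg2
```

Note that pluralizing this way would also fix the "1 other nodes" grammar in the quoted example.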
Thanks! Setting MWP
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 83.26%. Comparing base (96e416d) to head (99b98d6).
Problem: as noted in #6021, the log messages like this one are confusing to users:
Moreover, when the TBON is not flat, only the subtree root is called out. The other nodes in the subtree are also lost but not called out.
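To make concrete why a non-flat TBON loses a whole subtree when one node fails, here is a small sketch that enumerates the ranks below a given root. It assumes the common k-ary numbering where the children of rank r are k*r+1 .. k*r+k; whether this matches Flux's internal kary numbering is an assumption.

```python
def kary_subtree(root, k, size):
    """Return the ranks in the subtree rooted at `root` in a complete
    k-ary TBON of `size` ranks (0 .. size-1), assuming children of
    rank r are k*r+1 .. k*r+k (assumed numbering, not verified
    against Flux's implementation)."""
    subtree, stack = [], [root]
    while stack:
        r = stack.pop()
        if r < size:
            subtree.append(r)
            stack.extend(k * r + j for j in range(1, k + 1))
    return sorted(subtree)
```

Under this numbering, killing rank 1 of a kary:2 8-node instance would take ranks 3, 4, and 7 down with it, so a message naming only rank 1 understates the damage.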
For your consideration: this PR logs messages like these when brokers are removed from the broker.online group during broker RUN and CLEANUP states. For example, if I kill -9 rank 1 of a kary:2 8-node instance, I get the following on stderr:
In a system instance, if I power off rank 2 which leads the subtree consisting of 2,6-7 I get the following in the logs.
A downside of this approach for the system instance is that it's a bit chatty when you shut down a non-leaf node other than the root. Here I stop flux on picl1, which causes its TBON children (3-5) to shut down nicely, triggering multiple log entries:
(That won't happen when rank 0 is shut down, because that shuts down the whole instance and we don't log these messages in SHUTDOWN state.)
Anyway, I thought I'd try this and see what it looks like. Thoughts?