improve broker messages when nodes die

garlick commented 3 weeks ago

Problem: as noted in #6021, the log messages like this one are confusing to users:

broker[0]: picl2 (rank 2) has disconnected unexpectedly. Marking it LOST.

Moreover, when the TBON is not flat, only the subtree root is called out. The other nodes in the subtree are also lost but not called out.

For your consideration: this PR

demotes the LOST messages to LOG_DEBUG
adds a new LOG_ERR message that tracks brokers leaving the broker.online group during broker RUN and CLEANUP states.

Example, if I kill -9 rank 1 of a kary:2 8 node instance, I get the following on stderr:

 $ flux start -s8 -o,-Stbon.topo=kary:2
Jun 04 09:21:45.016921 PDT broker.err[0]: dead to Flux: system76-pc,system76-pc,system76-pc,system76-pc (rank 1,3-4,7)

In a system instance, if I power off rank 2 which leads the subtree consisting of 2,6-7 I get the following in the logs.

[ +21.998384] broker[0]: picl2 (rank 2) has disconnected unexpectedly. Marking it LOST.
[ +22.100710] broker[0]: dead to Flux: picl[2,6-7] (rank 2,6-7)

A downside of this approach for the system instance is it's a bit chatty if you shut down a non-leaf node that is not the root. Here I stop flux on picl1 which causes its TBON children (3-5) to shut down nicely, triggering mutiple log entries:

[Jun04 09:36] broker[0]: dead to Flux: picl[3-4] (rank 3-4)
[  +0.127260] broker[0]: dead to Flux: picl5 (rank 5)
[  +0.285764] broker[0]: dead to Flux: picl1 (rank 1)

(That won't happen when rank 0 is shut down because that's shutting down the whole instance and we don't log these messages in SHUTDOWN state)

Anyway thought I'd try this and see what it looks like. Thoughts?

garlick commented 3 weeks ago

Mark noted offline that the proposed message obscures which node was the root cause. I agree with this although it's a bit tricky to highlight the failed node in one log message due to the way the code is currently structured. Would two log messages be acceptable? E.g. how's this?

Jun 04 12:00:09.522326 PDT broker.err[1]: lost contact with system76-pc (rank 3) and 1 other nodes it was connected to
Jun 04 12:00:09.723159 PDT broker.err[0]: dead to Flux: system76-pc,system76-pc (rank 3,7)

grondo commented 3 weeks ago

I think that's probably reasonable. Though I'm sure sysadmins and users would prefer something like "system76-pc (rank 3) failed." to make it explicit that Flux considers that the node has "failed" since they called out "lost contact" as confusing. However, I think this is fine and maybe people will just get it after awhile?

We could also ask some sysadmins for suggestions.

garlick commented 3 weeks ago

I had forgotten about that earlier comment. How about this then?

Jun 04 13:02:56.388276 PDT broker.err[1]: system76-pc (rank 3) failed and severed contact with 1 other nodes
Jun 04 13:02:56.589051 PDT broker.err[0]: dead to Flux: system76-pc,system76-pc (rank 3,7)

garlick commented 3 weeks ago

Thanks! Setting MWP

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.26%. Comparing base (96e416d) to head (99b98d6).

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #6032 +/- ## ========================================== - Coverage 83.28% 83.26% -0.03% ========================================== Files 519 519 Lines 83658 83674 +16 ========================================== - Hits 69678 69672 -6 - Misses 13980 14002 +22 ``` | [Files](https://app.codecov.io/gh/flux-framework/flux-core/pull/6032?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | | |---|---|---| | [src/broker/overlay.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6032?src=pr&el=tree&filepath=src%2Fbroker%2Foverlay.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2Jyb2tlci9vdmVybGF5LmM=) | `83.61% <100.00%> (+0.07%)` | :arrow_up: | | [src/broker/state\_machine.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6032?src=pr&el=tree&filepath=src%2Fbroker%2Fstate_machine.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2Jyb2tlci9zdGF0ZV9tYWNoaW5lLmM=) | `81.64% <100.00%> (+0.35%)` | :arrow_up: | | [src/modules/job-exec/exec.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6032?src=pr&el=tree&filepath=src%2Fmodules%2Fjob-exec%2Fexec.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL21vZHVsZXMvam9iLWV4ZWMvZXhlYy5j) | `80.81% <ø> (ø)` | | ... and [11 files with indirect coverage changes](https://app.codecov.io/gh/flux-framework/flux-core/pull/6032/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework)

flux-framework / flux-core

improve broker messages when nodes die #6032

Codecov Report