flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Suggestion: improve error messaging with more context #6392

Open kkier opened 1 month ago

kkier commented 1 month ago

Reference https://github.com/flux-framework/flux-core/issues/6389 and https://github.com/flux-framework/flux-core/issues/6391:

In this case, the error on the node side was:

Oct 24 19:44:54 tuolumne17 flux[120979]: broker.crit[1]: tuolumne1 (rank 0) sent disconnect control message
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.info[1]: shutdown: run->cleanup 37.6744s
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.info[1]: cleanup-none: cleanup->shutdown 0.042097ms
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.err[1]: state-machine.monitor: No route to host

So of course I ended up down a rabbit hole checking the switch configs, firewalls, interface configs, etc.

My impression is that some of our error messages report symptoms ("No route to host" in this case) without getting into causes or giving as much context as I'd like. Another example I just noticed is from imp kill:

flux-imp: Fatal: kill: failed to initialize pid info: No such file or directory

That doesn't tell me which file or directory it was looking for, or why. In the example at the top of this post, an error message saying that rank 0 saw two nodes trying to be rank 1, and that this node was one of them, would have been a huge help.
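To illustrate the kind of context I mean for the "No such file or directory" case, here is a minimal sketch in Python. It is not how flux-imp is implemented; `describe_os_error` and `read_pid_info` are hypothetical names, and the path passed in is whatever the caller was trying to open. The only point is that the message carries the action and the path along with the errno text.

```python
import os


def describe_os_error(action: str, path: str, err: int) -> str:
    """Build a message naming the action, the path, and the errno text."""
    return f"{action}: {path}: {os.strerror(err)}"


def read_pid_info(path: str) -> str:
    """Hypothetical helper: read a file and report which path failed."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # Re-raise with the path included so the reader knows what was missing.
        raise RuntimeError(
            describe_os_error("failed to initialize pid info", path, e.errno)
        ) from e
```

With something like this, the failure names the file that could not be opened instead of only the errno string.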

Even if those errors only show up on rank 0 (something like "rejecting connection from cluster1234 (rank 1): rank 1 already references cluster1235"), they'd still help with troubleshooting, since we'd know which nodes aren't connecting.
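As a rough sketch of the rank 0 side, assuming a simple in-memory table mapping rank to hostname (the `RankTable` class is hypothetical and is not the broker's actual data structure), a duplicate-rank rejection could name both hosts like this:

```python
class RankTable:
    """Hypothetical rank-to-host table, used only to illustrate the message."""

    def __init__(self):
        self._host_by_rank = {}

    def register(self, rank: int, host: str) -> None:
        """Record host for rank, rejecting a second host claiming the same rank."""
        current = self._host_by_rank.get(rank)
        if current is not None and current != host:
            # Name both the rejected host and the host already holding the rank.
            raise ValueError(
                f"rejecting connection from {host} (rank {rank}): "
                f"rank {rank} already references {current}"
            )
        self._host_by_rank[rank] = host
```

The message identifies both nodes involved, so whoever reads the rank 0 log knows which two hosts to compare.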