Reference https://github.com/flux-framework/flux-core/issues/6389 and https://github.com/flux-framework/flux-core/issues/6391:

In this case, the error on the node side was:

```
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.crit[1]: tuolumne1 (rank 0) sent disconnect control message
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.info[1]: shutdown: run->cleanup 37.6744s
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.info[1]: cleanup-none: cleanup->shutdown 0.042097ms
Oct 24 19:44:54 tuolumne17 flux[120979]: broker.err[1]: state-machine.monitor: No route to host
```

So of course I ended up down a rabbit hole checking the switch configs, firewalls, interface configs, etc.

My impression is that we have some error messages that reference symptoms (`No route to host` in this case) but don't really get into causes or give as much context as I'd wish for. Another example I just noticed is `imp kill: flux-imp: Fatal: kill: failed to initialize pid info: No such file or directory`, which doesn't tell me what file or directory it was looking for, or why. In the example at the top of this post, an error message indicating that rank 0 saw two nodes trying to be rank 1, and that this node was one of them, would have been huge.

Even if those errors only show up on rank 0 (`rejecting connection from cluster1234 (rank 1): rank 1 already references cluster1235` or similar), they'll still help with the troubleshooting, since we'd know which nodes aren't connecting.
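For illustration only, here is a minimal sketch of the kind of duplicate-rank check and message being described. This is plain C, not flux-core's actual overlay code; `register_peer`, `peer_host`, and the fixed-size table are all made-up stand-ins.

```c
/* Hypothetical sketch, not flux-core code: show the context-rich
 * message wanted when two hosts claim the same broker rank. */
#include <stdio.h>
#include <string.h>

#define MAX_RANKS 16

/* hypothetical table of which host currently holds each child rank */
static char peer_host[MAX_RANKS][64];

/* Hypothetically called when a child broker announces its rank.
 * Returns 0 on success, -1 if another host already holds that rank. */
static int register_peer(int rank, const char *host)
{
    if (rank < 0 || rank >= MAX_RANKS)
        return -1;
    if (peer_host[rank][0] != '\0' && strcmp(peer_host[rank], host) != 0) {
        /* Name both hosts instead of leaving the losing node with
         * only a generic "No route to host" style symptom. */
        fprintf(stderr,
                "rejecting connection from %s (rank %d): "
                "rank %d already references %s\n",
                host, rank, rank, peer_host[rank]);
        return -1;
    }
    snprintf(peer_host[rank], sizeof(peer_host[rank]), "%s", host);
    return 0;
}

int main(void)
{
    register_peer(1, "cluster1235"); /* first claimant is recorded */
    register_peer(1, "cluster1234"); /* duplicate -> rejected with both hostnames */
    return 0;
}
```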