flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

user feedback on error messages #6021

Open garlick opened 1 month ago

garlick commented 1 month ago

Cray is reporting some issues with Flux error messages. Specifically, they tend to interpret some messages as flux failures rather than job or system failures.

Some comments:

When Flux loses contact with a broker, it generates a message that says, in effect, "I lost contact with a broker." When Slurm loses contact with a slurmd, it says "node fail." Flux is more technically correct, but Slurm's wording leads the user more directly to the correct interpretation.

Clarification: they mean the "hostname (rank N) has disconnected unexpectedly, marking it LOST" message

FWIW, the "lost contact with job shell" message is the one that seems to trip people up constantly, along with messages like "job.exception type=exec severity=0 rank 2 exited and exit-timeout=30s has expired flux-job: task(s) Segmentation fault"

People seem to read that second one as "the flux-job task segfaulted" rather than "flux is reporting that the underlying task segfaulted"
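
To illustrate the attribution problem, here is a rough sketch of wording that names the failing party (the user's task) first and the reporter (flux-job) last. This is not Flux code: `describe_task_failure` and the message format are hypothetical, made up for discussion only.

```python
import signal


def describe_task_failure(rank: int, signum: int) -> str:
    """Hypothetical wording: say who failed first, who is reporting last."""
    num = int(signum)
    try:
        # strsignal() returns None for unknown signals and raises
        # ValueError for out-of-range ones; fall back to the bare number.
        reason = signal.strsignal(num) or f"signal {num}"
    except ValueError:
        reason = f"signal {num}"
    return (
        f"job failed: task rank {rank} was killed by signal {num} "
        f"({reason}); reported by flux-job"
    )


if __name__ == "__main__":
    print(describe_task_failure(2, signal.SIGSEGV))
```

The point is that the subject of the sentence is the job's task, so a reader is less likely to conclude that flux-job itself segfaulted.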