flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

more detailed task exit status reporting #6054

Open grondo opened 3 months ago

grondo commented 3 months ago

Users have reported that flux job attach does not give enough detail when a job fails. Currently, we report only:

flux-job: task(s) exited with exit code 1

However, this message does not say which tasks failed, even when only one task exited with a nonzero status. Likewise, if tasks dump core or segfault, the affected tasks are not identified. In both cases, it would be useful to also include the affected hostnames if possible, so that users and admins can quickly identify bad hosts.

Supporting this may require stashing a compact aggregate representation of the exit status of all tasks in the KVS or the eventlog (though it may be too large for the eventlog). Combined with the job taskmap and assigned hostlist, this would allow users and the flux job attach command to produce a detailed summary of how every task in a job exited.
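To make the idea concrete, here is a rough Python sketch of what such an aggregate could look like. It is only an illustration, not a proposed implementation: the helper names (`compress_ranks`, `describe_status`, `summarize`), the input shapes, and the output wording are all hypothetical, and it does not use any Flux API. It assumes each task's result is available as a raw wait(2)-style status, that the taskmap is a simple block distribution given as per-node task counts, and that hostnames are listed in the same node order. Tasks are grouped by status and their ranks are compressed into a range string similar in spirit to an idset.

```python
import os
from collections import defaultdict


def compress_ranks(ranks):
    """Encode integer ranks as a compact range string such as "0-2,5"."""
    ranks = sorted(set(ranks))
    parts = []
    i = 0
    while i < len(ranks):
        j = i
        # Extend j to the end of the current run of consecutive ranks.
        while j + 1 < len(ranks) and ranks[j + 1] == ranks[j] + 1:
            j += 1
        parts.append(str(ranks[i]) if i == j else f"{ranks[i]}-{ranks[j]}")
        i = j + 1
    return ",".join(parts)


def describe_status(status):
    """Turn a raw wait(2)-style status into a short phrase (Unix only)."""
    if os.WIFSIGNALED(status):
        text = f"killed by signal {os.WTERMSIG(status)}"
        if os.WCOREDUMP(status):
            text += " (core dumped)"
        return text
    if os.WIFEXITED(status):
        return f"exit code {os.WEXITSTATUS(status)}"
    return f"unknown status {status}"


def summarize(statuses, tasks_per_node, hostnames):
    """Return one summary line per distinct nonzero status.

    statuses:       dict of task rank -> raw wait status (assumed input)
    tasks_per_node: task count per node, in node order (assumed block taskmap)
    hostnames:      hostnames in the same node order
    """
    # Build task rank -> hostname from the block distribution.
    host_of = {}
    rank = 0
    for host, count in zip(hostnames, tasks_per_node):
        for _ in range(count):
            host_of[rank] = host
            rank += 1

    # Group failed task ranks by their status.
    groups = defaultdict(list)
    for rank, status in statuses.items():
        if status != 0:
            groups[status].append(rank)

    lines = []
    for status, ranks in sorted(groups.items()):
        hosts = sorted({host_of[r] for r in ranks})
        lines.append(
            f"task(s) {compress_ranks(ranks)} on {','.join(hosts)}: "
            f"{describe_status(status)}"
        )
    return lines


if __name__ == "__main__":
    # Hypothetical 4-task job on two nodes: rank 1 exits with code 1 and
    # rank 3 segfaults with a core dump (conventional Linux status encoding).
    example = {0: 0, 1: 1 << 8, 2: 0, 3: 11 | 0x80}
    for line in summarize(example, [2, 2], ["node0", "node1"]):
        print(line)
```

Grouping by status keeps the stored object proportional to the number of distinct outcomes rather than the number of tasks, which is what would matter for the KVS or eventlog size concern above.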