flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

flux-job: attach status line is incorrect on exception with severity > 0 #6314

Open wihobbs opened 1 month ago

wihobbs commented 1 month ago

On Slack @grondo seemed to think this might be a bug in flux-job attach's status line:

(s=2,d=1)  tioga39 ~/ext4-blocks-in-lustre/flux (flux-work)$ flux run -n1 -c1 --urgency=0 sleep inf
18.197s: job.exception type=cancel severity=2 hi hobbs       00:00:18
## ... here it posted a "cancelling" message
^Cflux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach   00:10:45
^C646.329s: job.exception type=cancel severity=0 interrupted by ctrl-C    00:10:46

(s=2,d=1)  tioga39 ~/ext4-blocks-in-lustre/flux (flux-work)$ flux run -n1 -c1 --urgency=0 sleep inf
11.393s: job.exception type=cancel severity=7 hi hobbs   00:00:11
20.914s: job.exception type=cancel severity=1 hi hobbs   00:00:20
27.250s: job.exception type=cancel severity=0 hi hobbs   00:00:27

Posting an exception with severity>0 should notify the user, but not cancel the job or hang.

wihobbs commented 1 month ago

Actually...this may just be a misnamed warning message. For all exceptions, even ones that don't cancel, the warning that posts says:

    { "exception",
      "canceling due to exception",
      0,
    },

I never actually released the jobs above that were submitted with --urgency=0, but if I release one, the flux-job started message appears. I'll need to go back to the prolog (my original use case for this) and see if it behaves as expected there, but we may just want to think about renaming the message...

wihobbs commented 1 month ago

I think the ideal case would be to have a separate message for fatal vs. non-fatal exceptions, but that might be tricky given the way the "switch" case works in src/cmd/job/attach.c

wihobbs commented 1 month ago

It's Friday and clearly my brain is turning to mush. @grondo suggested earlier today that for non-fatal exceptions, flux-job attach should continue to print the exception but not the "canceling" message, and proceed to "waiting for prolog." That's cleaner.
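
A rough sketch of that behavior, to make the idea concrete. This is not the code in src/cmd/job/attach.c: the function name `status_after_event` and its parameters are hypothetical, and it assumes (as the thread implies) that only severity 0 exceptions are fatal. The only strings taken from this issue are "exception", "canceling due to exception", and "waiting for prolog".

```c
#include <string.h>

/*
 * Hypothetical helper: pick the status-line text to display after an
 * eventlog event. `current` is whatever the status line shows now.
 *
 * Assumption: an exception is fatal only when severity == 0.
 * Non-fatal exceptions are still printed elsewhere, but they do not
 * replace the status line with the "canceling" message.
 */
const char *status_after_event (const char *current,
                                const char *event,
                                int severity)
{
    if (strcmp (event, "exception") == 0) {
        if (severity == 0)
            return "canceling due to exception";
        /* Non-fatal: keep showing the previous status. */
        return current;
    }
    /* Events other than exceptions are out of scope for this sketch. */
    return current;
}
```

With this shape, `status_after_event ("waiting for prolog", "exception", 2)` leaves the status line at "waiting for prolog", while the same call with severity 0 switches it to the canceling message.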