Open · grondo opened 2 months ago
> When a node is lost, the job shell is lost too so the exit timer is started.
I had to go look at the code to find out why this is true, because the primary mechanism in the doom plugin is an RPC that rank > 0 shells send to the rank 0 shell when a task exits, and if the node dies that RPC would never be sent.
But the doom plugin also registers a `shell.lost` handler which is driven by `lost-shell` job exceptions posted by the job-exec module when it loses contact with a shell. When that handler is called, it also starts the doom timer.
If I am reading the code correctly, this exception is only raised as a severity=0 (fatal) exception on critical nodes, but the severity is ignored in the shell exception handler.
Would it make sense to pass the severity along as an argument to the `shell.lost` handler, and change the doom plugin to only take action on `severity=0`?
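For concreteness, a minimal sketch of what that might look like, assuming job-exec passed the exception severity through to the `shell.lost` plugin arguments. The `"severity"` key, the plugin name, and the `start_exit_timer()` helper are hypothetical, for illustration only:

```c
/* Hypothetical doom-plugin fragment: only arm the exit timer when the
 * lost shell was reported with a fatal (severity=0) exception.
 */
#define FLUX_SHELL_PLUGIN_NAME "doom-sketch"  /* name used by shell log macros */
#include <flux/core.h>
#include <flux/shell.h>

static void start_exit_timer (flux_shell_t *shell)
{
    (void)shell;
    /* ... existing doom plugin exit-timer logic elided ... */
}

static int doom_shell_lost (flux_plugin_t *p,
                            const char *topic,
                            flux_plugin_arg_t *args,
                            void *arg)
{
    flux_shell_t *shell = flux_plugin_get_shell (p);
    int severity;

    /* "severity" is the proposed new argument passed along from job-exec */
    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:i}",
                                "severity", &severity) < 0)
        return -1;
    if (severity == 0)              /* fatal: a critical rank was lost */
        start_exit_timer (shell);
    return 0;                       /* non-fatal: leave the timer alone */
}

/* Entry point if built as a standalone shell plugin
 * (the real doom plugin is built into the shell).
 */
int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "shell.lost", doom_shell_lost, NULL);
}
```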
As I reread my statement above, maybe it is correct the way it is. Tasks did "exit" after all. So disabling the doom plugin is probably the correct approach.
> But the doom plugin also registers a `shell.lost` handler which is driven by `lost-shell` job exceptions posted by the job-exec module when it loses contact with a shell. When that handler is called, it also starts the doom timer.
Yes, otherwise normal jobs that lose a shell for some reason (unexpected shell exit for example) would not be able to take advantage of the exit timer.
> Would it make sense to pass the severity along as an argument to the `shell.lost` handler, and change the doom plugin to only take action on `severity=0`?
This is an interesting idea. Maybe this would handle the general case of jobs that modify their critical-ranks set rather than only batch/alloc jobs? (For instance `flux run OPTIONS... flux start ...`.)
> As I reread my statement above, maybe it is correct the way it is. Tasks did "exit" after all. So disabling the doom plugin is probably the correct approach.
Ah, sorry, didn't see this comment until after I posted.
I'd actually like to explore your idea because it may be able to handle other methods of Flux instance launch, and also other jobs that may modify their critical-ranks set. I guess what it would mean is that tasks lost on non-critical ranks in general would not start the exit timer... does that still make sense?
For reference, this is the commit that added the `shell.lost` handler to the doom plugin:

commit 59c8c3a08104bf713cb990bc9aa74d3000083dbf
Author: Mark A. Grondona <mark.grondona@gmail.com>
Date:   Fri Mar 8 19:45:06 2024 +0000
shell: doom: support exit-on-error and timeout for lost shells
Problem: The doom plugin supports raising a job exception when one
or more tasks exit early or with an error, but the plugin does not
track lost shells. This may cause jobs that unexpectedly lose a job
shell to hang indefinitely.
Add a `shell.lost` callback to the doom plugin and treat this the same
as an abnormal task exit. Modify the exception note in this case to
indicate that a shell and not an individual task exited.
Pondering :-)
I guess two concerns, not show stoppers at all:
Good points :thinking:
> This is an interesting idea. Maybe this would handle the general case of jobs that modify their critical-ranks set rather than only batch/alloc jobs?
Oh, I see we have a `job-exec.critical-ranks` RPC that allows the critical ranks to be reset at runtime. Should we just have the doom shell plugin fetch that set and unconditionally ignore non-critical tasks that exit? That would address my minor consistency concerns above.

Edit: er, I guess it would have to watch that value since it could be set/reset at any time.
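A rough sketch of that check, assuming the doom plugin had the current critical-ranks set available as an idset string (RFC 22). How it obtains and refreshes that string is the open question here, and `rank_is_critical()` is just an illustrative helper:

```c
/* Sketch: test whether a lost rank is in the job's critical-ranks set,
 * given the set as an idset string (e.g. "0-3,7").
 */
#include <stdbool.h>
#include <flux/idset.h>

static bool rank_is_critical (const char *critical_ranks, unsigned int rank)
{
    struct idset *ids;
    bool result = true;  /* default: treat all ranks as critical (current behavior) */

    if (critical_ranks && (ids = idset_decode (critical_ranks))) {
        result = idset_test (ids, rank);
        idset_destroy (ids);
    }
    return result;
}

/* In the shell.lost handler, the exit timer would then be armed conditionally:
 *   if (rank_is_critical (critical_ranks, lost_rank))
 *       start_exit_timer (shell);
 */
```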
The job-exec module can already send notifications to the rank 0 shell, so maybe it could send an RPC when the critical-ranks set is modified? The job shell could add a new callback so multiple plugins can subscribe to the updates. (Though I can't think of other use cases right now, this would avoid the need to add a new service in the job-exec module, and possibly multiple watch RPCs from different plugins in the shell for every job.)
That sounds good. It would need to get a starting value though.
Unless I'm missing something, the starting value for critical-ranks is the set of all ranks. This is only updated if the job manually issues the critical-ranks RPC to the job-exec module, which would then notify the job shell. Though this seems slightly convoluted, it is nice in that the shell doesn't have to do anything and there are no extra RPCs for the common case (no critical ranks update).
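A rough sketch of that flow on the shell side, assuming the rank 0 shell registers a service method that job-exec calls when the set changes and then fans the update out to the plugin stack. The `critical-ranks` method name and `shell.critical-ranks` topic are made up for illustration, and whether `flux_shell_plugstack_call()` is the right dispatch mechanism here is part of what would need to be worked out:

```c
/* Sketch: rank 0 shell receives a critical-ranks update from job-exec and
 * re-dispatches it to any plugin registering a "shell.critical-ranks" handler.
 */
#define FLUX_SHELL_PLUGIN_NAME "critical-ranks-sketch"
#include <errno.h>
#include <flux/core.h>
#include <flux/shell.h>

static void critical_ranks_cb (flux_t *h,
                               flux_msg_handler_t *mh,
                               const flux_msg_t *msg,
                               void *arg)
{
    flux_shell_t *shell = arg;
    flux_plugin_arg_t *args = NULL;
    const char *ranks;

    if (flux_msg_unpack (msg, "{s:s}", "ranks", &ranks) < 0)
        goto error;
    if (!(args = flux_plugin_arg_create ())
        || flux_plugin_arg_pack (args,
                                 FLUX_PLUGIN_ARG_IN,
                                 "{s:s}",
                                 "ranks", ranks) < 0)
        goto error;
    /* let any subscribed plugin (doom included) see the new set */
    (void)flux_shell_plugstack_call (shell, "shell.critical-ranks", args);
    flux_plugin_arg_destroy (args);
    (void)flux_respond (h, msg, NULL);
    return;
error:
    if (args)
        flux_plugin_arg_destroy (args);
    (void)flux_respond_error (h, msg, errno, NULL);
}

/* Registered once at shell startup, e.g.:
 *   flux_shell_service_register (shell, "critical-ranks",
 *                                critical_ranks_cb, shell);
 */
```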
Hm, I thought of a counter-example here that suggests ignoring exit-timeout for non-critical ranks should be opt-in only.
If a batch script runs a series of full-size jobs, and a node goes down during one of them, the next job would get stuck in the SCHED state until timeout if the instance persisted by default. So, perhaps instance persistence should actually require a flag and not be automatic :thinking: There may be other solutions I haven't thought of as well (for example, if there was a way to tell the scheduler that a down resource was unlikely to come back, or some other way to reject jobs in this case...)
Flux instances as jobs have a certain level of resilience -- they can lose compute nodes that are leaves in the TBON and will not be terminated. The idea here is that the node failure within the batch/alloc job will terminate any job using that node. If the instance was running just one full-size job, then termination of that job will cause the batch script to exit and the instance will terminate. If the instance is running SCR or has many small jobs, though, it can continue to get work done.
However, it seems like `exit-timeout=30s` is getting in the way of this. When a node is lost, the job shell is lost too, so the exit timer is started. In the first case, 30s may not be enough time for the parallel job within the instance to terminate, the batch script to finish, and the instance to exit normally, so users see a "doom: first task exited 30s ago" message that is pretty confusing after a node failure. In the second case, users of SCR or users who want their job to continue have to remember to set `-o exit-timeout=none` to get the benefits.

Perhaps it would be best to just set `exit-timeout=none` in `flux batch` and `flux alloc`.