flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

set `exit-timeout=none` for `flux batch` and `flux alloc` jobs #6304

Open grondo opened 2 months ago

grondo commented 2 months ago

Flux instances as jobs have a certain level of resilience -- they can lose compute nodes that are leaves in the TBON and will not be terminated. The idea here is that a node failure within the batch/alloc job will terminate any job using that node. If the instance was running just one full-size job, termination of that job will cause the batch script to exit and the instance to terminate. If the instance is running SCR or has many small jobs, though, it can continue to get work done.

However, it seems like `exit-timeout=30s` is getting in the way of this. When a node is lost, the job shell is lost too, so the exit timer is started. In the first case, 30s may not be enough time for the parallel job within the instance to terminate, the batch script to finish, and the instance to exit normally. So users see a "doom: first task exited 30s ago" message that is pretty confusing after a node failure. In the second case, users of SCR, or anyone who wants their job to continue, have to remember to set `-o exit-timeout=none` to get the benefits.

Perhaps it would be best to just set `exit-timeout=none` in `flux batch` and `flux alloc`.
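
(For reference, the current workaround is to remember to add the shell option by hand, e.g. something like `flux batch -N4 -o exit-timeout=none batch.sh` or `flux alloc -N4 -o exit-timeout=none`. The suggestion here is just to have `flux batch` and `flux alloc` pass `exit-timeout=none` by default, with an explicit `-o exit-timeout=VALUE` presumably still able to override it.)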

garlick commented 2 months ago

> When a node is lost, the job shell is lost too, so the exit timer is started.

I had to go look at the code to find out why this is true, because the primary mechanism in the doom plugin is an RPC that rank > 0 shells send to the rank 0 shell when a task exits, and if the node dies that RPC would never be sent.

But the doom plugin also registers a `shell.lost` handler, which is driven by lost-shell job exceptions posted by the `job-exec` module when it loses contact with a shell. When that handler is called, it also starts the doom timer.

If I am reading the code correctly, this exception is only raised as a severity=0 (fatal) exception on critical nodes, but the severity is ignored in the shell exception handler.

Would it make sense to pass the severity along as an argument to the `shell.lost` handler, and change the doom plugin to only take action on severity=0?
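
To make that concrete, here is roughly what I'm picturing on the doom side. This is just a sketch, not the actual doom.c code: the `severity` argument is the proposed addition and does not exist in the `shell.lost` plugin args today, and `start_exit_timer()` stands in for the plugin's existing timer logic.

```c
/* Sketch only, in the style of an out-of-tree shell plugin.  The
 * "severity" key in the shell.lost args is the proposed addition and
 * does not exist today; start_exit_timer() is a stand-in for doom's
 * existing timer logic.
 */
#include <flux/core.h>

/* Stand-in for the doom plugin's existing "arm the exit timer" logic */
static void start_exit_timer (flux_plugin_t *p)
{
    (void)p;
}

static int lost_cb (flux_plugin_t *p,
                    const char *topic,
                    flux_plugin_arg_t *args,
                    void *arg)
{
    int severity;
    (void)topic;
    (void)arg;

    /* Proposed: unpack the severity of the lost-shell exception */
    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:i}",
                                "severity", &severity) < 0)
        return -1;

    /* Only a fatal (severity=0) loss, i.e. a critical rank, would start
     * the exit timer; shells lost on non-critical ranks are ignored.
     */
    if (severity == 0)
        start_exit_timer (p);
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "shell.lost", lost_cb, NULL);
}
```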

garlick commented 2 months ago

As I reread my statement above, maybe it is correct the way it is. Tasks did "exit" after all. So disabling the doom plugin is probably the correct approach.

grondo commented 2 months ago

> But the doom plugin also registers a `shell.lost` handler, which is driven by lost-shell job exceptions posted by the `job-exec` module when it loses contact with a shell. When that handler is called, it also starts the doom timer.

Yes, otherwise normal jobs that lose a shell for some reason (unexpected shell exit for example) would not be able to take advantage of the exit timer.

> Would it make sense to pass the severity along as an argument to the `shell.lost` handler, and change the doom plugin to only take action on severity=0?

This is an interesting idea. Maybe this would handle the general case of jobs that modify their critical-ranks set rather than only batch/alloc jobs? (For instance `flux run OPTIONS... flux start ...`.)

grondo commented 2 months ago

> As I reread my statement above, maybe it is correct the way it is. Tasks did "exit" after all. So disabling the doom plugin is probably the correct approach.

Ah, sorry, didn't see this comment until after I posted.

I'd actually like to explore your idea because it may be able to handle other methods of Flux instance launch, and also other jobs that may modify their critical-ranks set. I guess what it would mean is that tasks lost on non-critical ranks in general would not start the exit timer... does that still make sense?

grondo commented 2 months ago

For reference, this is the commit that added the shell.lost handler to the doom plugin:

commit 59c8c3a08104bf713cb990bc9aa74d3000083dbf
Author: Mark A. Grondona <mark.grondona@gmail.com>
Date: Fri Mar 8 19:45:06 2024 +0000

shell: doom: support exit-on-error and timeout for lost shells

Problem: The doom plugin supports raising a job exception when one
or more tasks exit early or with an error, but the plugin does not
track lost shells. This may cause jobs that unexpectedly lose a job
shell to hang indefinitely.

Add a `shell.lost` callback to the doom plugin and treat this the same
as an abnormal task exit. Modify the exception note in this case to
indicate that a shell and not an individual task exited.
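
(For reference, this is the same machinery that users tune today with the shell options discussed above, e.g. `-o exit-timeout=30s`, `-o exit-timeout=none`, or `-o exit-on-error`, all of which can be passed to `flux run`, `flux batch`, and `flux alloc`.)
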
garlick commented 2 months ago

Pondering :-)

I guess two concerns, not show stoppers at all:

grondo commented 2 months ago

Good points :thinking:

garlick commented 2 months ago

> This is an interesting idea. Maybe this would handle the general case of jobs that modify their critical-ranks set rather than only batch/alloc jobs?

Oh I see we have a `job-exec.critical-ranks` RPC that allows the critical ranks to be reset at runtime. Should we just have the doom shell plugin fetch that set and unconditionally ignore non-critical tasks that exit? That would address my minor consistency concerns above.

Edit: er, I guess it would have to watch that value since it could be set/reset at any time.

grondo commented 2 months ago

The `job-exec` module can already send notifications to the rank 0 shell; maybe it could send an RPC when the critical-ranks set is modified? The job shell could then add a new callback so multiple plugins can subscribe to the updates. (I can't think of other use cases right now, but this would avoid the need to add a new service in the `job-exec` module, and would also avoid multiple watch RPCs from different shell plugins for every job.)
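
On the shell side, I'm imagining plugins would pick up the update through a new plugin callback, something like the sketch below. This is a rough sketch only: the `shell.critical-ranks` topic name and the `ranks` idset payload are made up purely for illustration, and nothing like this exists yet.

```c
/* Rough sketch only.  The "shell.critical-ranks" topic and its "ranks"
 * payload are invented here for illustration; no such callback exists yet.
 */
#include <flux/core.h>
#include <flux/idset.h>

/* NULL means the default: all ranks are critical */
static struct idset *critical_ranks;

static int critical_ranks_cb (flux_plugin_t *p,
                              const char *topic,
                              flux_plugin_arg_t *args,
                              void *arg)
{
    const char *ranks;
    (void)p;
    (void)topic;
    (void)arg;

    /* Hypothetical payload: an idset string such as "0-3,7" */
    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:s}",
                                "ranks", &ranks) < 0)
        return -1;

    idset_destroy (critical_ranks);
    if (!(critical_ranks = idset_decode (ranks)))
        return -1;
    return 0;
}

/* The doom plugin's lost-shell handling could then check membership:
 * if critical_ranks is NULL or idset_test (critical_ranks, shell_rank)
 * is true, start the exit timer; otherwise ignore the loss.
 */

int flux_plugin_init (flux_plugin_t *p)
{
    /* "shell.critical-ranks" is a hypothetical topic name */
    return flux_plugin_add_handler (p, "shell.critical-ranks",
                                    critical_ranks_cb, NULL);
}
```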

garlick commented 2 months ago

That sounds good. It would need to get a starting value though.

grondo commented 2 months ago

Unless I'm missing something, the starting value for critical-ranks is the set of all ranks. This is only updated if the job manually issues the critical-ranks RPC to the job-exec module, which would then notify the job shell. Though this seems slightly convoluted, it is nice in that the shell doesn't have to do anything and there are no extra RPCs for the common case (no critical ranks update).

grondo commented 2 months ago

Hm, I thought of a counter-example here that suggests ignoring exit-timeout for non-critical ranks should be opt-in only.

If a batch script runs a series of full-size jobs, and a node goes down during one of them, the next job would get stuck in the SCHED state until timeout if the instance persisted by default. So, perhaps instance persistence should actually require a flag and not be automatic :thinking: There may be other solutions I haven't thought of as well (for example, if there were a way to tell the scheduler that a down resource is unlikely to come back, or some other way to reject jobs in this case...)