grondo opened this issue 4 years ago
I wonder if it would be helpful for recovery if we split the job-manager<->exec `start` request into a `start` and a `wait` (or similar)? Then when rank 0 job-exec restarts, the job-manager could effectively tell it which jobs it still requires state updates on by waiting for them. Possibly that could percolate "downstream"?
Edit: I guess I'm just stating that restarting a "start" seems semantically challenged :-)
I still need to review the current protocol before I have any cogent thoughts ;-)
However, I wasn't considering that the job-manager `start` requests would be restarted, but rather that the protocol between job-manager<->exec could be restarted. So when the exec module sends its `hello` request, the job-manager could reply with the list of jobs that have outstanding `start` requests, instead of failing if there are any active jobs (as I think it does now).

The job-exec module could then easily restart the protocol by picking up with the next exec event that occurs, e.g. a `release` or `finish` event, with some additional protocol to notify the job-manager of "lost" jobs somehow.
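For concreteness, here's a minimal sketch of what that exchange might look like from the exec side, assuming flux-core's `flux_rpc_pack()`/`flux_rpc_get_unpack()` and jansson. The `job-manager.exec-hello` topic and the `jobs` array in the response are assumptions about the extended protocol, not existing code:

```c
#include <flux/core.h>
#include <jansson.h>

static int exec_hello (flux_t *h)
{
    flux_future_t *f;
    json_t *jobs;
    size_t index;
    json_t *entry;

    if (!(f = flux_rpc_pack (h, "job-manager.exec-hello",
                             FLUX_NODEID_ANY, 0,
                             "{s:s}", "service", "job-exec")))
        return -1;
    /* Proposed: instead of failing when active jobs exist, the
     * job-manager replies with the jobs that still have outstanding
     * start requests, so the protocol can be resumed per job.
     */
    if (flux_rpc_get_unpack (f, "{s:o}", "jobs", &jobs) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    json_array_foreach (jobs, index, entry) {
        json_int_t id;
        if (json_unpack (entry, "{s:I}", "id", &id) == 0) {
            /* Resume here: pick up with the next exec event
             * (e.g. release or finish), or notify the job-manager
             * that the job was lost.
             */
        }
    }
    flux_future_destroy (f);
    return 0;
}
```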
Alternatively, the job-exec module could implement its own `hello` protocol between rank 0 and the rest of the modules on startup. Interior ranks that are "managing" currently active jobs could note the current job state in the `hello` response, and rank 0 could collect these and send along its idea of currently active jobs to the job-manager in the job-manager `hello` RPC.
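On the interior-rank side, that might look something like the handler below, following flux-core's message-handler and `flux_respond_pack()` conventions. The `job-exec.hello` topic and the `active_jobs_to_json()` helper are hypothetical:

```c
#include <flux/core.h>
#include <jansson.h>

/* Handler an interior-rank job-exec module might register for a
 * hypothetical "job-exec.hello" request from a restarted rank 0,
 * replying with the jobs this rank is currently managing.
 */
static void hello_cb (flux_t *h, flux_msg_handler_t *mh,
                      const flux_msg_t *msg, void *arg)
{
    json_t *jobs = active_jobs_to_json (arg); /* hypothetical helper */

    if (!jobs || flux_respond_pack (h, msg, "{s:O}", "jobs", jobs) < 0)
        flux_log_error (h, "job-exec.hello: flux_respond_pack");
    json_decref (jobs);
}
```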
We should also consider job throughput and scalability when doing this redesign. E.g. perhaps any events posted to the exec eventlog should still be deferred to rank 0 so that events from multiple jobs coming in quickly can be batched together.
Deferring job start to interior nodes actually associated with the job may also help job launch throughput (a single module won't be managing a `flux_subprocess_t` for every job shell in the system, for example).
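As a sketch of the batching idea, rank 0 could fold queued events for multiple jobs into a single KVS commit, assuming current flux-core's KVS transaction API; the `struct event` queue layout here is invented for illustration:

```c
#include <flux/core.h>

/* One queued eventlog append (layout assumed for illustration) */
struct event {
    const char *eventlog_path;  /* KVS key of the job's exec eventlog */
    const char *entry;          /* RFC 18 event, one JSON line */
};

/* Batch several queued events (possibly for different jobs) into a
 * single KVS commit.
 */
static flux_future_t *commit_event_batch (flux_t *h,
                                          struct event *queue, int n)
{
    flux_kvs_txn_t *txn;
    flux_future_t *f = NULL;

    if (!(txn = flux_kvs_txn_create ()))
        return NULL;
    for (int i = 0; i < n; i++) {
        if (flux_kvs_txn_put (txn, FLUX_KVS_APPEND,
                              queue[i].eventlog_path,
                              queue[i].entry) < 0)
            goto done;
    }
    f = flux_kvs_commit (h, NULL, 0, txn);
done:
    flux_kvs_txn_destroy (txn);
    return f;
}
```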
> Then when rank 0 job-exec restarts, the job-manager could effectively tell it what it still requires state updates on by waiting for it? Possibly that could percolate "downstream"?
BTW, @garlick, I didn't mean to dismiss this idea, but was presenting where my head was at. Actually your idea above sounds like a more considered version of what I was thinking. However, I don't really follow how the `wait` RPC would function. (I was thinking the job-manager<->exec `hello` protocol would function similarly to the scheduler `hello` protocol, and the job-manager would reply to `hello` with an array of job objects that have been started.)
NP!
I was getting hung up on the idea that responses to a `start` request could resume without resending the request. There's a `start_pending` flag on each job in the job manager. If it worked like the scheduler, all the `<request>_pending` flags would be cleared upon receiving an ENOSYS for any request, and then the requests would be resent after `hello`. The exec module doesn't have to work like that though: `start_pending` could be state that is used during `hello`.
My thought on `wait` resulted from thinking about resending a `start`, and the exec module needing to tell whether it's really a request to start a new job or a request to resume an old one (effectively a wait for resources to be released). It was not a well considered thought :-)
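To make the `start_pending`-during-`hello` idea above concrete, here's a rough sketch of building the hello response payload; `struct job` is a stand-in for the job-manager's internal structure, not its actual definition:

```c
#include <flux/core.h>
#include <jansson.h>

/* Stand-in for the job-manager's internal job structure */
struct job {
    flux_jobid_t id;
    unsigned int start_pending:1;  /* start request outstanding */
};

/* Build the hello response payload: every job whose start request is
 * still outstanding, so the exec module can resume rather than fail.
 */
static json_t *hello_jobs (struct job *jobs, int njobs)
{
    json_t *a = json_array ();

    for (int i = 0; i < njobs; i++) {
        if (jobs[i].start_pending)
            json_array_append_new (a,
                json_pack ("{s:I}", "id", (json_int_t)jobs[i].id));
    }
    return a;
}
```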
Sorry, I forgot to consider how the job-manager worked internally.
I'm really open to anything at this point, as no work has been started in job-exec yet. I think things could work either way. The key design points seem to be:

- Keep job execution state (`flux_subprocess_t` handles) as distributed as possible, so that the maximum amount of the system stays up when a broker is lost.
- The `hello` protocol is extended to notify a starting job-exec module of jobids with outstanding `start` requests. Protocol extension tbd.
- `job-exec` may need to read the exec eventlog to determine if any resources have already been released in the event of partial release (see the sketch after this list).
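As a sketch of that last point, a restarted job-exec might scan a job's exec eventlog for prior release events, assuming the RFC 18 eventlog format (one JSON object per line) and flux-core's KVS lookup API; the KVS path argument is illustrative:

```c
#include <flux/core.h>
#include <jansson.h>
#include <string.h>

/* Count "release" events already posted to a job's exec eventlog,
 * e.g. to detect a partial release after restart.  'path' would be
 * something like the job's guest.exec.eventlog KVS key.
 */
static int count_release_events (flux_t *h, const char *path)
{
    flux_future_t *f;
    const char *log;
    int count = -1;

    if (!(f = flux_kvs_lookup (h, NULL, 0, path)))
        return -1;
    if (flux_kvs_lookup_get (f, &log) == 0) {
        count = 0;
        const char *p = log;
        while (*p) {
            const char *nl = strchr (p, '\n');
            size_t len = nl ? (size_t)(nl - p) : strlen (p);
            json_t *entry = json_loadb (p, len, 0, NULL);
            const char *name;

            if (entry
                && json_unpack (entry, "{s:s}", "name", &name) == 0
                && strcmp (name, "release") == 0)
                count++;
            json_decref (entry);
            p = nl ? nl + 1 : p + len;
        }
    }
    flux_future_destroy (f);
    return count;
}
```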
The job-exec module lives only on rank 0 and uses `flux_rexec()` to launch job-shells (or flux-imp). This presents a problem if the rank 0 broker needs to be restarted, since all running jobs will be lost.

The job-exec module should be rewritten such that rank 0 (assuming it is running no jobs) can be restarted. The job-exec module should be loaded on all ranks, with the rank 0 module implementing the exec<->job-manager protocol but forwarding requests to execute the job shell off to a subset of broker ranks. In order to support restartability, the protocol between the rank 0 job-exec module and interior rank modules could be modeled after the job-manager/sched/exec protocol. Maybe on restart the job-manager could notify job-exec of what jobs it thinks are running, and job-exec on rank 0 could in turn query the other job-exec modules to get the current status of currently executing jobs. A sketch of the forwarding half appears below.
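Here is a minimal sketch of that forwarding half, assuming flux-core's `flux_rpc_pack()`; the `job-exec.start` topic, payload shape, and rank selection are hypothetical, not an existing protocol:

```c
#include <flux/core.h>
#include <jansson.h>

/* Rank 0 forwards a shell launch to an interior broker rank rather
 * than calling flux_rexec() itself.  A streaming RPC lets the interior
 * rank send back start/finish/release responses as they occur.
 */
static flux_future_t *forward_start (flux_t *h, uint32_t rank,
                                     flux_jobid_t id, const char *R)
{
    return flux_rpc_pack (h, "job-exec.start", rank,
                          FLUX_RPC_STREAMING,
                          "{s:I s:s}",
                          "id", (json_int_t)id,
                          "R", R);
}
```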
At the same time, the current known issues with job-exec should also be addressed.
The design should also take into account future needs to shrink a job (forcibly terminate a subset of job-shells, clean up and release resources, but continue running) and to grow a job (launch new job-shells and "attach" them to an existing job). Grow won't work until the job shell supports dynamic resizing, but at least job-exec should allow launching new work dynamically.