grondo opened this issue 4 years ago
I wonder if it would be helpful for recovery if we split the job-manager<->exec `start` request into a `start` and a `wait` (or similar)? Then when rank 0 job-exec restarts, the job-manager could effectively tell it which jobs it still requires state updates on by waiting for them. Possibly that could percolate "downstream"?
Edit: I guess I'm just stating that restarting a "start" seems semantically challenged :-)
I still need to review the current protocol before I have any cogent thoughts ;-)
However, I wasn't considering that the job-manager `start` requests would be restarted, but rather that the protocol between job-manager<->exec could be restarted. So when the exec module sends its `hello` request, the job-manager could reply with the list of jobs that have outstanding `start` requests, instead of failing if there are any active jobs (as I think it does now).

The job-exec module could then easily restart the protocol by picking up with the next exec event that occurs, e.g. a `release` or `finish` event, with some additional protocol to notify the job-manager of "lost" jobs somehow.
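For concreteness, here's a minimal sketch of what that exchange might look like from the exec side, assuming flux-core's `flux_rpc_pack()`/`flux_rpc_get_unpack()` and jansson. The `job-manager.exec-hello` topic and the `jobs` array in the response are assumptions about the extended protocol, not existing code:

```c
#include <flux/core.h>
#include <jansson.h>

static int exec_hello (flux_t *h)
{
    flux_future_t *f;
    json_t *jobs;
    size_t index;
    json_t *entry;

    if (!(f = flux_rpc_pack (h, "job-manager.exec-hello",
                             FLUX_NODEID_ANY, 0,
                             "{s:s}", "service", "job-exec")))
        return -1;
    /* Proposed: instead of failing when active jobs exist, the
     * job-manager replies with the jobs that still have outstanding
     * start requests, so the protocol can be resumed per job.
     */
    if (flux_rpc_get_unpack (f, "{s:o}", "jobs", &jobs) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    json_array_foreach (jobs, index, entry) {
        json_int_t id;
        if (json_unpack (entry, "{s:I}", "id", &id) == 0) {
            /* Resume here: pick up with the next exec event
             * (e.g. release or finish), or notify the job-manager
             * that the job was lost.
             */
        }
    }
    flux_future_destroy (f);
    return 0;
}
```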
Alternatively, the job-exec module could implement its own `hello` protocol between rank 0 and the rest of the modules on startup. Interior ranks that are "managing" currently active jobs could note the current job state in the `hello` response, and rank 0 could collect these and send along its idea of currently active jobs to the job-manager in the job-manager `hello` RPC.
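On the interior-rank side, that might look something like the handler below, following flux-core's message-handler and `flux_respond_pack()` conventions. The `job-exec.hello` topic and the `active_jobs_to_json()` helper are hypothetical:

```c
#include <flux/core.h>
#include <jansson.h>

/* Handler an interior-rank job-exec module might register for a
 * hypothetical "job-exec.hello" request from a restarted rank 0,
 * replying with the jobs this rank is currently managing.
 */
static void hello_cb (flux_t *h, flux_msg_handler_t *mh,
                      const flux_msg_t *msg, void *arg)
{
    json_t *jobs = active_jobs_to_json (arg); /* hypothetical helper */

    if (!jobs || flux_respond_pack (h, msg, "{s:O}", "jobs", jobs) < 0)
        flux_log_error (h, "job-exec.hello: flux_respond_pack");
    json_decref (jobs);
}
```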
We should also consider job throughput and scalability when doing this redesign. E.g. perhaps any events posted to the exec eventlog should still be deferred to rank 0 so that events from multiple jobs coming in quickly can be batched together.
Deferring job start to interior nodes actually associated with the job may also help job launch throughput (a single module won't be managing a `flux_subprocess_t` for every job shell in the system, for example).
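As a sketch of the batching idea, rank 0 could fold queued events for multiple jobs into a single KVS commit, assuming current flux-core's KVS transaction API; the `struct event` queue layout here is invented for illustration:

```c
#include <flux/core.h>

/* One queued eventlog append (layout assumed for illustration) */
struct event {
    const char *eventlog_path;  /* KVS key of the job's exec eventlog */
    const char *entry;          /* RFC 18 event, one JSON line */
};

/* Batch several queued events (possibly for different jobs) into a
 * single KVS commit.
 */
static flux_future_t *commit_event_batch (flux_t *h,
                                          struct event *queue, int n)
{
    flux_kvs_txn_t *txn;
    flux_future_t *f = NULL;

    if (!(txn = flux_kvs_txn_create ()))
        return NULL;
    for (int i = 0; i < n; i++) {
        if (flux_kvs_txn_put (txn, FLUX_KVS_APPEND,
                              queue[i].eventlog_path,
                              queue[i].entry) < 0)
            goto done;
    }
    f = flux_kvs_commit (h, NULL, 0, txn);
done:
    flux_kvs_txn_destroy (txn);
    return f;
}
```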
> Then when rank 0 job-exec restarts, the job-manager could effectively tell it what it still requires state updates on by waiting for it? Possibly that could percolate "downstream"?
BTW, @garlick, I didn't mean to dismiss this idea, but was presenting where my head was at. Actually your idea above sounds like a more considered version of what I was thinking. However, I don't really follow how the `wait` RPC would function. (I was thinking the job-manager<->exec `hello` protocol would function similarly to the scheduler `hello` protocol, and the job-manager would reply to `hello` with an array of job objects that have been started.)
NP!
I was getting hung up on the idea that responses to a `start` request could resume without resending the request. There's a `start_pending` flag on each job in the job manager. If it worked like the scheduler, all the `<request>_pending` flags would be cleared upon receiving an ENOSYS for any request, and then the requests would be resent after `hello`. The exec module doesn't have to work like that though: `start_pending` could be state that is used during `hello`.
My thought on `wait` resulted from thinking about resending a `start`, and the exec module needing to tell whether it's really a request to start a new job or a request to resume an old one (effectively a wait for resources to be released). It was not a well considered thought :-)
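To make the `start_pending`-during-`hello` idea above concrete, here's a rough sketch of building the hello response payload; `struct job` is a stand-in for the job-manager's internal structure, not its actual definition:

```c
#include <flux/core.h>
#include <jansson.h>

/* Stand-in for the job-manager's internal job structure */
struct job {
    flux_jobid_t id;
    unsigned int start_pending:1;  /* start request outstanding */
};

/* Build the hello response payload: every job whose start request is
 * still outstanding, so the exec module can resume rather than fail.
 */
static json_t *hello_jobs (struct job *jobs, int njobs)
{
    json_t *a = json_array ();

    for (int i = 0; i < njobs; i++) {
        if (jobs[i].start_pending)
            json_array_append_new (a,
                json_pack ("{s:I}", "id", (json_int_t)jobs[i].id));
    }
    return a;
}
```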
Sorry, I forgot to consider how the job-manager worked internally.
I'm really open to anything at this point, as no work has been started in job-exec yet. I think things could work either way. The key design points seem to be:

- Keep job execution state (`flux_subprocess_t` handles) as distributed as possible, so that the maximum amount of the system stays up when a broker is lost.
- The `hello` protocol is extended to notify a starting job-exec module of jobids with outstanding `start` requests. Protocol extension tbd.
- `job-exec` may need to read the exec eventlog to determine if any resources have already been released in the event of partial release (see the sketch after this list).
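As a sketch of that last point, a restarted job-exec might scan a job's exec eventlog for prior release events, assuming the RFC 18 eventlog format (one JSON object per line) and flux-core's KVS lookup API; the KVS path argument is illustrative:

```c
#include <flux/core.h>
#include <jansson.h>
#include <string.h>

/* Count "release" events already posted to a job's exec eventlog,
 * e.g. to detect a partial release after restart.  'path' would be
 * something like the job's guest.exec.eventlog KVS key.
 */
static int count_release_events (flux_t *h, const char *path)
{
    flux_future_t *f;
    const char *log;
    int count = -1;

    if (!(f = flux_kvs_lookup (h, NULL, 0, path)))
        return -1;
    if (flux_kvs_lookup_get (f, &log) == 0) {
        count = 0;
        const char *p = log;
        while (*p) {
            const char *nl = strchr (p, '\n');
            size_t len = nl ? (size_t)(nl - p) : strlen (p);
            json_t *entry = json_loadb (p, len, 0, NULL);
            const char *name;

            if (entry
                && json_unpack (entry, "{s:s}", "name", &name) == 0
                && strcmp (name, "release") == 0)
                count++;
            json_decref (entry);
            p = nl ? nl + 1 : p + len;
        }
    }
    flux_future_destroy (f);
    return count;
}
```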
The job-exec module lives only on rank 0 and uses `flux_rexec()` to launch job-shells (or flux-imp). This presents a problem if the rank 0 broker needs to be restarted, since all running jobs will be lost.

The job-exec module should be rewritten such that rank 0 (assuming it is running no jobs) can be restarted. The job-exec module should be loaded on all ranks, with the rank 0 module implementing the exec<->job-manager protocol but forwarding requests to execute the job shell off to a subset of broker ranks. In order to support restartability, the protocol between the rank 0 job-exec module and interior rank modules could be modeled after the job-manager/sched/exec protocol. Maybe on restart the job-manager could notify job-exec of what jobs it thinks are running, and job-exec on rank 0 could in turn query the other job-exec modules to get the current status of currently executing jobs. A sketch of the forwarding half appears below.
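Here is a minimal sketch of that forwarding half, assuming flux-core's `flux_rpc_pack()`; the `job-exec.start` topic, payload shape, and rank selection are hypothetical, not an existing protocol:

```c
#include <flux/core.h>
#include <jansson.h>

/* Rank 0 forwards a shell launch to an interior broker rank rather
 * than calling flux_rexec() itself.  A streaming RPC lets the interior
 * rank send back start/finish/release responses as they occur.
 */
static flux_future_t *forward_start (flux_t *h, uint32_t rank,
                                     flux_jobid_t id, const char *R)
{
    return flux_rpc_pack (h, "job-exec.start", rank,
                          FLUX_RPC_STREAMING,
                          "{s:I s:s}",
                          "id", (json_int_t)id,
                          "R", R);
}
```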
At the same time, the current known issues with job-exec should also be addressed.
The design should also take into account future needs to shrink a job (forcibly terminate a subset of job-shells, clean up and release resources, but continue running) and to grow a job (launch new job-shells and "attach" them to an existing job). Grow won't work until the job shell supports dynamic resizing, but at least job-exec should allow launching new work dynamically.