create RFC for job manager states

garlick commented 6 years ago

States known by job manager, separate from per-scheduler or per-job shell states which go in their own state log are (proposed):

Submitted
Held
Allocated
Running
Completed
Failed
Canceled
Inactive

dongahn commented 6 years ago

Great start!

Do you plan to add what component (e.g., job-ingest, scheduler and job shell) the job manager will interact with to transition these job states?

Do you expect the state log will only be used and updated by the job manager?

SteVwonder commented 6 years ago

Do you plan to add what component (e.g., job-ingest, scheduler and job shell) the job manager will interact with to transition these job states?

That's a good idea. Maybe we can include the "job lifecycle" diagram in the RFC. Although maybe all the interactions is a separate RFC. Thoughts @garlick and @grondo?

Do you expect the state log will only be used and updated by the job manager?

For these specific states, the job manager will be the authority. It will be the only entity updating the "top-level state log". Other modules like the scheduler and job shell will have their own state logs that they can append to. For example, if a job is spawned using SCR, the job shell might record that the job cycles between checkpointing, running, failed, and restarting, as SCR helps the job prepare for and recover from failure.

dongahn commented 6 years ago

For these specific states, the job manager will be the authority. It will be the only entity updating the "top-level state log". Other modules like the scheduler and job shell will have their own state logs that they can append to. For example, if a job is spawned using SCR, the job shell might record that the job cycles between checkpointing, running, failed, and restarting, as SCR helps the job prepare for and recover from failure.

@SteVwonder: Thanks. It seems it will also be useful to map out the valid state transitions for the scheduler somewhere. (Not as part of this RFC though).

I do hope that the scheduler will only have to deal with one type of event set (which would be a superset of the job manager states). This will make writing of a finite state machine logic much easier -- Another reason to have a higher level job status and control API...

SteVwonder commented 6 years ago

I do hope that the scheduler will only have to deal with one type of event set (which would be a superset of the job manager states).

Yeah, that is the idea. The scheduler should only need to "subscribe" to events/state changes from the job manager. And if the scheduler needs to record finer granularity state changes besides Submitted -> Allocated, it can do so in its own state log.

For user-facing tools, they will start at the job manager state log level, and if more information is required as to what phase of "running" the job is in, the tool can query the job shell/exec module state log.

SteVwonder commented 6 years ago

It seems it will also be useful to map out the valid state transitions for the scheduler somewhere.

That does make a lot of sense to include in this RFC. A simple state machine diagram would be nice. We had a sketch of one on a whiteboard that I could replicate with tikz pretty easily.

garlick commented 5 years ago

I am planning on submitting a job state RFC soon but wanted to touch base here.

Is there a preferred way to include a state diagram image in an asciidoc RFC? I see RFC 4 has some examples. I was thinking of using graphviz to generate it, although tikz was mentioned above. Any preference?

I was going suggest the following properties in the RFC

events logged to event log (documented in RFC 16) trigger state transitions
current job state can be recreated by replaying the job eventlog (e.g. job manager restart)
but normally state would be obtained by job-manager query
job-ingest sets initial state, job-manager makes all subsequent state changes
job-manager supports state synchronization (notification for one job or for all jobs)
job state at this level does not reflect task detail known only to the job shell, except as reported back to the job-manager via the exec system

The states I had been pondering were

NEW initial state
DEPEND waiting on dependency
SCHED waiting for resources
RUN exec system start issued
CLEANUP job is terminating (through exceptional condition or naturally)
INACTIVE end state

Some obvious areas for feedback/discussion:

naming I deliberately didn't stick to the SLURM job states because I didn't want to imply anything that might not be true in the new system, but maybe it's useful to be more compatible (for principle of least astonishment?) @grondo had some concerns about that when we were chatting the other day. If a simple rename of any of these states smooths our transition, I'm for it.

synchronization I was thinking job state transitions would be the only "waitable" states. We could also propose that the events, some of which drive state transitions, could be optionally be available for synchronization.

exec system It may be useful for a debugger to know when user tasks are started, or or if any single task has exited. Since tasks are managed by the job shell, do we need to call out states or events that should be available/waitable at this higher level (e.g. by shell -> exec -> job-manager notification), or can we defer until we get into the exec design and maybe expect that a different mechansim could be used?

garlick commented 5 years ago

Incidentally, I was assuming that there would be a way to obtain the exit status (or other exceptional conditions) as a query on the inactive job, or as payload when when synchronizing on job completion. The fact that FAILED is missing shouldn't be a red flag (at least...I think!)

grondo commented 5 years ago

I was thinking of using graphviz to generate it, although tikz was mentioned above. Any preference?

IMO, TikZ is greatly preferable (when you care about aesthetics at all). Here is not only a description why, but also a pointer to dot2tex tool that could let you declare the graph in dot, but generate TeX output.

grondo commented 5 years ago

Since tasks are managed by the job shell, do we need to call out states or events that should be available/waitable at this higher level (e.g. by shell -> exec -> job-manager notification), or can we defer until we get into the exec design and maybe expect that a different mechansim could be used?

This is a good question. My opinion is that it is ok to wait for now. I kind of like the very simple set of job-manager states you have at this point. It would be interesting later to determine if and how other services could extend or enhance these states without creating an undue burden for tools that want to monitor these states.

Besides the obvious execution states between RUN and CLEANUP, future job "states" might include

A state for when job inputs or other dependencies are being staged to the resources of the job (this could be handled by the dependency system and a sub-state of DEPEND)
- Similar to above, but resources are being brought online for the job (booting powered down nodes)
- Job is in RUN state but is also waiting on additional resources (grow)
- Job is in RUN state but has released some resources (shrink)

I don't think these necessarily should be job-manager states, but I do think we would eventually want to log or be able to wait on these conditions (and I'm sure there are many other examples of interior states like these.)

I was thinking job state transitions would be the only "waitable" states. We could also propose that the events, some of which drive state transitions, could be optionally be available for synchronization.

It is certainly tempting to limit the scope of the wait() implementation. Maybe waiting on events uses a different mechanism (wasn't that the purpose of kvs eventlogs anyway?), and could be a bit more flexible, but the job-manager wait api is limited and necessarily only allows waiting on job-manager states.

A more complete synchronization api could always be built on top of the job-manager wait implementation, which seems like the right first step. (for example, a tool waiting on RUN state probably wants to wait until all tasks are running, a tool waiting for a job to complete probably wants to always get the "status" of the execution (and do they wait on CLEANUP or INACTIVE, it might not be obvious), etc.)

(Now that I write that, maybe the more important synchronization mechanism long-term is implemented in the exec service...)

grondo commented 5 years ago

@grondo had some concerns about that when we were chatting the other day. If a simple rename of any of these states smooths our transition, I'm for it.

I actually don't have any concerns after reading through your proposal above. I think porcelain tools will probably be able to display more meaningful sub-states if and when it makes sense, so the names of the internal job-manager states are less important than I was initially thinking. I really like the set of states you've come up with here.

garlick commented 5 years ago

Thank you! Good comments that I'll ponder on my long drive today.

SteVwonder commented 5 years ago

Is there a preferred way to include a state diagram image in an asciidoc RFC? I see RFC 4 has some examples. I was thinking of using graphviz to generate it, although tikz was mentioned above.

I don't know if they are helpful at this point, but ~6 months ago I made tikz diagrams based on some of our whiteboard diagrams: https://github.com/SteVwonder/planning/tree/master/diagrams

I was thinking job state transitions would be the only "waitable" states.

Does that mean a user waits on the transition into a particular state, or do they wait on the transition from State A to State B?

A state for when job inputs or other dependencies are being staged to the resources of the job (this could be handled by the dependency system and a sub-state of DEPEND)

That's a great point. Just to clarify your suggestion: the job would go from DEPEND to SCHED when the resources are allocated and then back to DEPEND while the inputs are being staged in? I guess the states don't have to be linear, but that is what I initially thought when I read the list. We should clarify that they aren't linear (the diagram will help with that).

garlick commented 5 years ago

Ah those pics are pretty. Definitely helpful. Thanks!

Does that mean a user waits on the transition into a particular state, or do they wait on the transition from State A to State B?

I was thinking one would wait on a mask of states, then get a callback when any state in the mask is entered. I thought a common one would be to wait on INACTIVE, and then retrieve a result of some kind. Is there a use case for only getting a callback for a particular transition (as opposed to state)?

the job would go from DEPEND to SCHED when the resources are allocated and then back to DEPEND while the inputs are being staged in?

Hmm, if it happens after SCHED, maybe it's part of RUN? (Or if staging occurs before resources are allocated then during SCHED?) . I was thinking that semantically DEPEND would indicate some user imposed constraint, and staging feels more like a detail of execution, though maybe not.

My initial (possibly naive) design was fairly linear:

NEW -> DEPEND or SCHED
DEPEND -> SCHED or CLEANUP
SCHED -> RUN or CLEANUP
RUN -> CLEANUP
CLEANUP -> INACTIVE

The non-linear portion driven by exceptional conditions or by the lack of dependencies, but it doesn't have to be that way.

SteVwonder commented 5 years ago

I was thinking one would wait on a mask of states, then get a callback when any state in the mask is entered. I thought a common one would be to wait on INACTIVE, and then retrieve a result of some kind.

:+1:

Is there a use case for only getting a callback for a particular transition (as opposed to state)?

No, I don't have a use-case. I actually prefer your suggestion/idea of waiting for a state (or mask of states) to be entered. I just wanted to clarify because when I first read job state transitions would be the only "waitable" states, I thought you were proposing to only support waiting on a particular transitions.

garlick commented 5 years ago

Oh indeed I was unclear there. Thanks for extracting a clarification.

grondo commented 5 years ago

I was thinking one would wait on a mask of states, then get a callback when any state in the mask is entered. I thought a common one would be to wait on INACTIVE, and then retrieve a result of some kind. Is there a use case for only getting a callback for a particular transition (as opposed to state)?

I wonder if the common case is going to be to wait for INACTIVE? The CLEANUP state could potentially last a long time, and I would assume in most cases you want to wait for when the global exit code of the job is available, i.e. all tasks have complete. (e.g. a dependency system would probably want to release a job for scheduling immediately when the all the dependencies have completed, not wait until resources are cleaned up?)

Hmm, if it happens after SCHED, maybe it's part of RUN? (Or if staging occurs before resources are allocated then during SCHED?) . I was thinking that semantically DEPEND would indicate some user imposed constraint, and staging feels more like a detail of execution, though maybe not.

(I think you're right, the sort of staging or setup activities belong under RUN not DEPEND, so I guess that means they are performed by the execution system)

This is why I brought up examples of other "states" that other resource managers represent, sorry I realize I kind of got us a bit off track, but the enum style job flags just (unnecessarily) raised a bit of red flag for me.

Probably it is good idea to support only these coarse-grained linear states at first, but there seem to exist use cases for representing other sub-states as well. It does seem reasonable to say each of these services the job-manager communicates with has its own state machine, and an API could be provided to tools that first checks the job-manager for its current state, then finds the sub-state from the associated service (though this could be racy, so a bit unsatisfying). It would be nice if we had a nice story for this up front, but I realize we just need to get something down and move forward at this point.

(Sorry for the somewhat unhelpful comments)

garlick commented 5 years ago

This was thought provoking, thanks @grondo.

What do you envision running between the user getting the global exit code of their parallel program and the INACTIVE state? Epilog? (I hadn't carefully considered how the epilog fits into the job life cycle, but should have!)

grondo commented 5 years ago

(I hadn't carefully considered how the epilog fits into the job life cycle, but should have!)

Yes, I had envisioned the exec service could optionally be configured to run something after the job shell exits. Only after this epilog completes would the exec service release that shard of R to the job-manager. Maybe since this occurs under the exec service the job-manager state is still RUN? (however, when I saw CLEANUP I assumed this would be the state after all job shells have exited). What actually drives the transition from RUN to CLEANUP at this point? (exec system via direct message, first shard of R released, all shards of R released?)

Also, I'm not saying we must handle the epilog functionality this way, and a job epilog only makes sense when we have mult-user support, so maybe this discussion is for another time?

garlick commented 5 years ago

What actually drives the transition from RUN to CLEANUP at this point? (exec system via direct message, first shard of R released, all shards of R released?)

I was thinking it would jump to CLEANUP either when exec system says normal termination has begun, or when there is some exceptional condition like cancellation. Then waiting for stragglers, freeing resources, etc. would occur in CLEANUP.

It seems like maybe I'm being too stingy with the states. Maybe CLEANUP should be reserved for post-execution stuff like freeing resources, and we should add a RESULT state or similar once we have a global exit code or exception information that the user could consume? And then enter CLEANUP. ?

grondo commented 5 years ago

I was thinking it would jump to CLEANUP either when exec system says normal termination has begun, or when there is some exceptional condition like cancellation.

This makes sense to me. Maybe we should not get too hung up on getting it perfect for now. What you have here seems easy to reason about which is good.

I'm not sure a RESULT state is needed, presence of a "result" that can be consumed is actually a property of a job even in INACTIVE right?

grondo commented 5 years ago

Another thing that just occurred to me: it may not always be clear to the exec system when "normal termination" of a job has begun. The job shell is the only thing that is cognizant of individual task exit, and (at least in our current sketch of the design) the exec system isn't aware of anything until the first job shell exits (which may be long after the first task exits). A job shell could be configured to keep running other tasks even after the first task exits, or even restart tasks on spare resources that were allocated to the job. Since the job shell is a user replaceable component, there is no way services in flux core can be reliably be made aware of these changes in the execution state of the job.

I'm not sure what to conclude from that as far as the RUN->CLEANUP transition goes, except to say that your choice to have a broad RUN state which implies "waiting on exec service" is pretty wise.

garlick commented 5 years ago

Thanks for the encouragement! OK well let's get something done and see how it turns out and we can adjust later.

garlick commented 5 years ago

As I'm thinking about RUN->CLEANUP a bit more, it may be clearest if CLEANUP is not entered until the global process exit is complete and available, unless an exception occurs. And if gathering the exit state during RUN takes too long, that could trigger an exception.

It would then be possible to wait on CLEANUP state (as you intuited above @grondo) for "normal" process exit, and be able to retrieve the exit status at that point, or find out that an exception has occurred. Either way you probably have all you care about as a workflow tool, unless you want to wait for resources to be reclaimed before launching something new, in which case you'd wait for INACTIVE.

Does that make sense?

garlick commented 5 years ago

Another thing that just occurred to me: it may not always be clear to the exec system when "normal termination" of a job has begun. The job shell is the only thing that is cognizant of individual task exit

Great point, "normal termination" probably should include most application failures where the job shell is able to clean up and capture the signal/exit code "normally". Possibly that's the wrong phrase to use when communicating to users.

SteVwonder commented 5 years ago

So can this be closed now that #157 is merged?

garlick commented 5 years ago

Right! Closing.

flux-framework / rfc

create RFC for job manager states #131