flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

add API for reducing eventlog events to job states #1666

Closed garlick closed 5 years ago

garlick commented 6 years ago

As mentioned in #1654, we need an API for interpreting well-defined event sequences in the main job eventlog for consumption by user-facing tools that list the queue or report job status.
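For concreteness, here is a minimal sketch in Python of the kind of reduction such an API might perform; the event and state names below are placeholders, not the actual flux-core definitions:

```python
# Illustrative sketch only: reduce an ordered list of main-eventlog event
# names to a coarse job state.  Event and state names are placeholders.
EVENT_TO_STATE = {
    "submit": "PENDING",
    "alloc": "RUNNING",
    "finish": "CLEANUP",
    "free": "INACTIVE",
}

def reduce_to_state(event_names):
    """Return the state implied by the last recognized event."""
    state = "NEW"
    for name in event_names:
        state = EVENT_TO_STATE.get(name, state)  # unknown events don't change state
    return state

print(reduce_to_state(["submit", "alloc"]))  # -> RUNNING
```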

garlick commented 6 years ago

Here is where I start to get lost: user-facing tools may also need to interpret events from the scheduler eventlog, the job shell eventlog, etc. But one goal of the multiple event logs was a separation of concerns, so that a job shell or scheduler could be free to define events that make sense for its particular implementation. I think we may have said that there should be some well-defined events for schedulers and job shells as well. But then why not just put those in the main job eventlog as they become well defined?

dongahn commented 6 years ago

 But then why not just put those in the main job eventlog as they become well defined?

My thoughts after talking about this with @SteVwonder the other day:

From the perspective of each user (e.g., scheduler), that would be what they want conceptually. They would need to be notified of a set of events correctly, as if they are coming from one event log. (In particular those events that form a total order; there may be some events that form a partial order, though.)

But from the mechanism's point of view, I thought it makes a lot of sense to separate out where some classes of events are kept. This can make it easy to hide certain events from some users without impacting performance.

Generally it will allow us to tune the scalability and performance of the underlying mechanisms.

Stated differently, I see an event log as a safer mechanism with respect to synchronization needs of producers and consumers. And I see domain specific event logs as performance optimization and usability enhancement.

But from the API point of view, it may be best to abstract these things out as much as possible.

At least for the main schedulers I will want to hide these mechanisms behind an API like JSC. If this new API can replace JSC, that is great but for me it will be best if the new interface has a similar look and feel.

If a certain user needs to get to the mechanism details, they can simply use the underlying low-level API.

Just my $.02

dongahn commented 6 years ago

@garlick: I saw your other posting. Thank you.

We had two focused discussion sessions on R over the last two weeks, and we have made really good headway.

This may be the next topic to have a focused discussion on. Maybe starting early next week?

garlick commented 6 years ago

I'll be away this week - maybe next week? It's OK with me if you talk about it in my absence also.

I hope @grondo or @SteVwonder will jump in here and correct me if I've forgotten where we ended up, but I think the idea was that major job life cycle events, the ones that might be interpreted by the API posited here, would all be in the main eventlog. The API would not be piecing this together from multiple eventlogs.

I think the scheduler would both consume the main eventlog (e.g. new job ingested), and contribute to it (e.g. resources assigned).

Independent of that, a particular scheduler could have its own eventlog for detail it might want to place there for tools to report out (job will run in 2h) or other uses not yet envisioned such as synchronizing distributed scheduler components.

However, possibly I'm getting confused!

SteVwonder commented 6 years ago

@garlick: that is my general understanding of what we converged to.

Here is where I start to get lost: user-facing tools may also need to interpret events from the scheduler eventlog, the job shell eventlog, etc. ... why not just put those in the main job eventlog as they become well defined?

At one point we threw out the idea of using the main eventlog as a record of the job transitioning between modules/entities (submitted moves the job from the user to the sched, allocated moves the job from the sched to the exec, cancelled moves it from the current owner to the job manager, etc). Having a 1:1 mapping between events in the main eventlog and transitions between modules would be "clean" IMO, and it would avoid a "chatty"/"noisy" main event log (which could have performance implications). And in the 1:1 scenario, you could first use information from the eventlog to figure out which module currently "possesses" the job, and then you could read that module's event log for more information. An end user tool wouldn't necessarily have to know how to interpret every event within that module's event log. It could just grab the last event from the log to display, or maybe it just grabs the last well-defined event that it does know how to interpret (and ignores all the undefined, custom events).
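A rough sketch of how an end-user tool might use that 1:1 model; all event names, module names, and data shapes here are hypothetical:

```python
# Hypothetical sketch of the 1:1 model: the main eventlog records only
# ownership transitions; detail lives in the owning module's own eventlog.
OWNER_BY_EVENT = {"submitted": "sched", "allocated": "exec", "cancelled": "job-manager"}

def current_owner(main_events):
    """Walk the main eventlog to find which module currently owns the job."""
    owner = "job-manager"
    for name in main_events:
        owner = OWNER_BY_EVENT.get(name, owner)
    return owner

def status_line(main_events, module_eventlogs, known_events):
    """Report the last event in the owner's log that this tool understands."""
    owner = current_owner(main_events)
    for name in reversed(module_eventlogs.get(owner, [])):
        if name in known_events:
            return f"{owner}: {name}"
    return f"{owner}: (no recognized events)"
```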

Ultimately though, I don't see a reason why additional events couldn't be added, especially if they are well defined. Only having to visit one log might make it easier when writing end-user tools (or when writing the API for said tools).

At least for the main schedulers I will want to hide these mechanisms behind an API like JSC. If this new API can replace JSC, that is great but for me it will be best if the new interface has a similar look and feel.

I think, from the perspective of the scheduler, a subscription to the events in the main event log would be very similar to the current jsc_notify_status. The scheduler should only care about the submitted, cancelled, and inactive job events, which will all appear in the main event log.

dongahn commented 6 years ago

I think, from the perspective of the scheduler, a subscription to the events in the main event log would be very similar to the current jsc_notify_status. The scheduler should only care about the submitted, cancelled, and inactive job events, which will all appear in the main event log.

Good to hear that the interface will be similar.

I will have to think about this some more, but it is somewhat unclear to me at this point if we can write clean, extensible, event-based scheduler code with only those events, though. Probably time to look at the finite state machine more closely.

If additional scheduler specific events are needed, it would be good to use the same event log mechanism.

dongahn commented 6 years ago

Independent of that, a particular scheduler could have its own eventlog for detail it might want to place there for tools to report out (job will run in 2h) or other uses not yet envisioned such as synchronizing distributed scheduler components.

Maybe we are talking about yet another interface at the scheduler level that multiplexes the main event log (through the proposed interface) and a domain-specific event log.

Again, I will want to look at the finite state machine code to see how many events are needed to write the current use case and then future use cases.

dongahn commented 6 years ago

The scheduler should only care about the submitted, cancelled, and inactive job events, which will all appear in the main event log.

How about pending, completing, complete? In the future, growing and shrinking?

I vaguely remember I needed selected and sched_request as internal scheduler states. Maybe those are okay not to go all the way up to the event log, though.

garlick commented 6 years ago

On completing, growing, and shrinking, would it make sense to think about those in terms of resource allocate and free requests, and put the requests and responses in the main eventlog?

We know we have to support the partial return of resources from a job to the scheduler when a node or job shell hangs. Instead of a completing or shrinking state, could we have a free event, with an argument that can indicate a subset of the assigned resources? (Not sure what that argument looks like - pointer to another R in KVS?) Then a freed response from the scheduler.

Then could we have an alloc event that would allow a job to request additional resources (pointing to a new J), and an allocated event responding to that request (pointing to a new R)?

A tool could look at the number of alloc and allocated events, and if they aren't balanced, the job is "growing". If the number of free and freed (or whatever) events is unbalanced, the job is "shrinking".
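A toy illustration of that counting rule, using the event names from this comment (not necessarily what would finally be adopted):

```python
def grow_shrink_status(event_names):
    """Infer growing/shrinking from unbalanced request/response event pairs."""
    count = {name: event_names.count(name) for name in ("alloc", "allocated", "free", "freed")}
    growing = count["alloc"] > count["allocated"]      # alloc requests still outstanding
    shrinking = count["free"] > count["freed"]         # free requests still outstanding
    return growing, shrinking

print(grow_shrink_status(["alloc", "allocated", "free"]))  # -> (False, True)
```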

I don't know if I'm muddling up the eventlog concept here and it would be better to handle these "resource allocation requests" out of band some way?

dongahn commented 6 years ago

it would be better to handle these "resource allocation requests" out of band some way?

Yes this makes a lot of sense. I think this will make the events certainly more normalized.

These are actually very similar to the set of operations that I identified very early on in my section of the vision paper.

As a side note, from the scheduler implementation point of view, we will need to include the jobid in some of the events (free) where it makes sense. We discussed this last Friday as part of R.

There are still discussions that need to happen about whether the scheduler loop module code can get by with only those normalized states.

It seems a reasonable way to push forward this dialogue is for @SteVwonder and me to create an FSM strawman: what are the key events that need to be seen; what are the actions that need to be invoked in the new sched loop module.

We will probably need two FSMs. 1. For our immediate target. 2. For future extension to support full elasticity.

BTW, I am mostly looking at this problem from the interface point of view: how cleanly we can write our planned scheduler loop service code, plus how extensible the scheme is for future use cases.

In a sense the main scheduler code doesn't care much about the specific mechanism by which an event is tracked.

dongahn commented 6 years ago

These are actually very similar to the set of operations that I identified very early on in my section of the vision paper.

On page 37. Now that I look at this ancient doc, there also seems to be an implication about partial process destruction and partial bootstrapping. I don't think it matters much for this part of the discussion, but I thought I'd at least mention it FWIW.

garlick commented 6 years ago

I think we will have to handle partial destruction from the early days - to reclaim partial resources when a small component of a job fails to exit. @grondo pointed out that this was the case with slurm.

A consequence is that a jobid state change to "completed" or whatever is insufficient information for the scheduler to free resources. It seems like it needs some way for the execution system to "free" pieces of the original R back to the scheduler. That being the case, would you add a "completing" state to an FSM? What purpose would that serve?

dongahn commented 6 years ago

That being the case, would you add a "completing" state to an FSM? What purpose would that serve?

Sched's current action on the running-to-completing transition is a NOOP. It does a bunch of cleanup work on the completing-to-completed transition. So unless I'm missing something, we should be okay without that.

The wreck is currently emitting this state to signal the epilogue phase. Will this serve any purpose for the new execution system, also?

In case it is not super clear, I really like the idea of "normalizing the states" to be resource-focused as much as possible, like using alloc and free as the primitives.

I just need to sit down and draw an FSM or two written on top of those events to be fully convinced.

BTW, I am not immediately clear if we can express all cases for grow and shrink with alloc and free alone. There may be a case in the future where grow entails migration, in which case alloc, realloc, and free might be a bit more extensible. realloc can either join a new resource to the existing resource allocation or create a bigger allocation and migrate the existing processes. At least, that's what I envisioned way back...

grondo commented 6 years ago

Just catching up here, but how would you differentiate a job that is in the process of terminating from a job that is "shrinking" but will continue to run with some amount of reduced resources?

Conceptually they are not different: the exiting job shrinks down to zero resources. However, administrators and tools may want to differentiate (admins will want to detect stuck resources; workflow tools may want to launch a replacement job while the predecessor is "shrinking" if it is considered complete). It would seem that the main difference is in the exec system, where its own internal state may transition by default to "exiting" when tasks/job shells begin exiting, but have some hook for shrink-capable job shells to allow the "job" to continue in the running state even after some job shells exit (I think this was discussed before).

On completing, growing, and shrinking, would it make sense to think about those in terms of resource allocate and free requests, and put the requests and responses in the main eventlog?

I like this idea, and was thinking along the same lines. This allows another event for the exec system to monitor, in case it has to execute new job shells on newly allocated resources.

I like the idea of taking a static R and allowing it to be a function of time by "committing" new Rs to the job's kvs directory. We should attempt to support this in our kvs schema from the beginning (maybe by allowing a series of values to be stored with timestamps in the KVS), so that administrative tools can analyze the patterns of how "completing" jobs behave over time.
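One way to picture the "R as a function of time" idea; the KVS key layout and helper names here are made up for illustration:

```python
import bisect, json, time

# Hypothetical: store R as a timestamped series under the job's KVS directory
# so a "completing" or shrinking job's allocation can be replayed over time.
def append_R(kvs_put, jobid, seq, R):
    kvs_put(f"job.{jobid}.R.{seq}", json.dumps({"t": time.time(), "R": R}))

def R_at(entries, t):
    """entries: list of (timestamp, R) pairs sorted by timestamp; return R at time t."""
    i = bisect.bisect_right([ts for ts, _ in entries], t)
    return entries[i - 1][1] if i else None
```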

dongahn commented 6 years ago

It would seem that the main difference is in the exec system, where its own internal state may transition by default to "exiting" when tasks/job shells begin exiting, but have some hook for shrink-capable job shells to allow the "job" to continue in the running state even after some job shells exit (I think this was discussed before)

@grondo: What is your plan to implement the internal state in your new execution system? Do you plan to use the domain-specific event log or something else? The scheduler also has a need to represent its own internal states. And if there is a common mechanism both can use, this might be an opportunity to abstract this out into a reusable component as well.

grondo commented 6 years ago

The wreck is currently emitting this state to signal the epilogue phase. Will this serve any purpose for the new execution system, also?

Since we're attempting to at least conceptually support grow/shrink, I'm not sure that "epilog" is really tied to any global job state for the new execution system. The epilog is usually a per-job/per-node "cleanup" script that administrators want to run before the resources are considered available for new jobs. Therefore, I think the execution system will simply allow configuration of a script that will be run (via the IMP) after any job shell exits, whether it be as part of a job in the process of completing, or one that is shrinking (and as we've seen those are sometimes the same thing).

So I think the execution system will run the epilog, and if it is successful (or not configured) will release the resources to the scheduler. On failure of the epilog, there will need to be a way to set the resources "down" or something with a text description of why.

Since the epilog could be running on resources for a job that isn't completing, it doesn't make sense to change the state of the job for this reason. However, I do realize administrators will want a way to be able to check for this state on resources, and tie a currently running epilog to the job it is running for -- I'm not sure exactly how to do that -- but since the resources are still part of the job until the epilog completes it at least seems feasible.

grondo commented 6 years ago

@grondo: What is your plan to implement the internal state in your new execution system?

I think we've been iterating on the design of the execution system as a team, so I don't think I'm the sole designer here.

However, I think our plan was that the exec system proper would emit and consume a couple of events from the main event log, but that the job shell, which (remember) potentially runs as a different user, would use its own event log with its own event namespace. First, this allows the job shell to directly write events to a KVS Event Log, since it may use the user namespace for the job, and also it would allow experimental or other job shells flexibility in their implementation.

So execution events like "starting", "running", "exiting" etc may not appear in the main KVS event log.

However, I would personally be open to adding these into the main event log if we decide that is a simpler approach.

grondo commented 6 years ago

Oh and to more directly answer your question, I think @garlick was working on a low level API for using KVS Event Logs, so I don't foresee us inventing anything new for the exec system.

dongahn commented 6 years ago

Oh and to more directly answer your question, I think @garlick was working on a low level API for using KVS Event Logs, so I don't foresee us inventing anything new for the exec system.

From talking to @SteVwonder, I was thinking the same thing -- making use of a scheduler-specific event log to track the scheduler's internal states. But then, the main scheduler code doesn't really want to know whether it deals with an event log; it just wants to deal with state changes. So my current understanding is my comment above:

https://github.com/flux-framework/flux-core/issues/1666#issuecomment-421709802

Maybe we are talking about yet another interface at the scheduler level that multiplexes the main event log (through the proposed interface) and a domain-specific event log.

dongahn commented 6 years ago

FWIW, these are very old FSM diagrams that I drew in preparation for summer interns. Things have changed quite a bit since then, but what I'd like to do is redraw these in terms of resource-centric events and other lessons learned in the past (use prepare, check, idle events as well).

BTW, is there any reusable finite state machine construction library?

  1. No dynamicity: state_machine.pdf

  2. Dynamicity: state_machine_dynamic.pdf

garlick commented 6 years ago

@dongahn - OK new idea:

Instead of trying to map sched's FSA on top of these event logs (which seems like it could be a bit fragile), maybe we could reduce the complexity of the scheduler interface and prep for grow/shrink by arranging for the job manager to issue resource allocate and free requests to the scheduler on behalf of jobs.

An immediate benefit of this is that sched could implement "partial resource release" where an allocated R is freed in pieces R1, R2,... to handle the use case of a job that partially hangs during completion. This has been identified as a corona requirement.

The scheduler wouldn't need to track job state at all. The job manager would be responsible for translating various "events" (such as the partial termination of a job) into resource alloc/free requests. The jobid, and therefore a reference to the job's extended info in the KVS, would be passed in with these requests.

The scheduler could avoid being inundated by requests or suffering from head of line blocking if it used a credit-based system to dynamically limit the number of outstanding allocate requests to the scheduler queue depth (but not limit free requests, to avoid deadlock).
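A minimal sketch of the credit idea on the job manager side; the queue depth handling and callback names are illustrative, not a real flux-core interface:

```python
class AllocCredits:
    """Cap outstanding alloc requests at the scheduler's queue depth.
    Free requests are never throttled (to avoid deadlock)."""

    def __init__(self, queue_depth):
        self.credits = queue_depth
        self.backlog = []                    # jobids waiting for a credit

    def submit(self, jobid, send_alloc):
        if self.credits > 0:
            self.credits -= 1
            send_alloc(jobid)
        else:
            self.backlog.append(jobid)

    def on_alloc_response(self, send_alloc):
        # A response returns one credit; hand it straight to the next waiter.
        if self.backlog:
            send_alloc(self.backlog.pop(0))
        else:
            self.credits += 1
```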

This came up in our brainstorming discussions yesterday and I just wanted to make sure it was out there for pondering by all the stakeholders. Any feedback welcome!

dongahn commented 6 years ago

@garlick:

Interesting. I may be wrong, but it sounds like you want the job manager module to be a scheduler loop module. One concern might be that we need relatively high complexity in our queue management scheme (e.g., depending on what scheduler specialization users want: FCFS, or backfill, or different parameters such as queue depth, reservation depth, etc.).

Do you want to manage this complexity in the job manager module?

Or can the new scheme still do this without having to manage this complexity?

garlick commented 6 years ago

That was not my intent, although maybe I've oversimplified sched in my mind, and that's a consequence of what I was proposing. I hope not. Rereading what I wrote, maybe this part was unclear?

The scheduler wouldn't need to track job state at all. The job manager would be responsible for translating various "events" (such as the partial termination of a job) into resource alloc/free requests.

I didn't mean to imply that the scheduler wouldn't need to track any state about jobs. It would still need to maintain its internal priority queue, reservations, etc. (depending on the selected algorithm). I just meant it wouldn't have to track external state transitions across the set of all possible jobs. Instead it would just respond to requests pertaining to specific jobs, and only need to internally keep track of jobs that are requesting resources (and then only to its configured queue depth).

For example, instead of watching for any job to transition to SUBMITTED, it handles allocate requests for a specific job; instead of watching for any job to transition to FAILED, COMPLETED, etc, it handles free requests for a specific job.

This proposal would further decouple the scheduler from the design of the rest of the system. Knowledge of all possible job states for implementation of the action() FSA is a pretty tight coupling. It also incorporates the idea that allocate and free may occur many times within the same job.
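A sketch of what such a passive, request-driven scheduler loop could look like; the resource pool interface and callback signatures are invented for illustration, not the flux-core API:

```python
# Hypothetical request-driven scheduler: no global job FSA, just per-job
# allocate/free requests arriving from the job manager.
class PassiveScheduler:
    def __init__(self, pool):
        self.pool = pool                     # resource pool with match()/release()
        self.pending = []                    # alloc requests not yet satisfied

    def handle_alloc(self, jobid, jobspec, respond):
        self.pending.append((jobid, jobspec, respond))
        self.schedule_loop()

    def handle_free(self, jobid, R, respond):
        self.pool.release(R)                 # R may be a subset of the job's allocation
        respond(jobid)
        self.schedule_loop()                 # freed resources may unblock pending jobs

    def schedule_loop(self):
        still_pending = []
        for jobid, jobspec, respond in self.pending:
            R = self.pool.match(jobspec)     # FCFS here; backfill etc. would plug in
            if R is not None:
                respond(jobid, R)
            else:
                still_pending.append((jobid, jobspec, respond))
        self.pending = still_pending
```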

dongahn commented 6 years ago

OK. Thank you for the clarification. I was thinking that could be what's being proposed, but I thought I should have the above question ironed out first.

I think cutting down on the number of states (and jobs) the scheduler has to deal with makes sense to me.

A couple of things to consider though just to push forward this discussion:

Instead it would just respond to requests pertaining to specific jobs, and only need to internally keep track of jobs that are requesting resources (and then only to its configured queue depth)

Does this imply that the job manager will send an RPC to the scheduler? Assuming it does:

At one point, the reason we chose eventing or KVS watch as a job state notification mechanism is that more than just our scheduler (sometimes unknown users) can program to such events simultaneously (e.g., a workflow manager like what Merine does today). We probably want to support this model even if this proposal is provided...

Maybe other users like workflow managers can use a different interface, and maybe that is okay. If so, what should this interface look like, and who should provide it? Will the job manager provide a different interface for those users? Can we somehow consolidate those two so that we have only one focal interface we need to harden?

Finally, if this is the RPC approach, the job manager has to know the name of a service that is not part of flux-core; does that concern you?

Generally I like the proposal in terms of the scheduler implementation simplification. But I realize there are other considerations...

grondo commented 6 years ago

The scheduler could avoid being inundated by requests or suffering from head of line blocking if it used a credit-based system to dynamically limit the number of outstanding allocate requests to the scheduler queue depth (but not limit free requests, to avoid deadlock).

Would the credit-based allocate imply the scheduler would only know about jobs up to its queue-depth? If I've understood correctly, then in that case, how would something like a forced administrative reprioritization (e.g. expedite) work on a job that has just been submitted and thus its allocate is blocked?

garlick commented 6 years ago

Would the credit-based allocate imply the scheduler would only know about jobs up to its queue-depth? If I've understood correctly, then in that case, how would something like a forced administrative reprioritization (e.g. expedite) work on a job that has just been submitted and thus its allocate is blocked?

@grondo, @SteVwonder and I chatted about this yesterday.

One idea proposed was to associate an external (to the scheduler) priority value with each job similar to the UNIX nice value, that guests could decrease for their jobs, and instance owners could increase or decrease for any job. Jobs could be ordered in (priority, submission) order for listing by tools and submission to the scheduler.
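For example, with a hypothetical job record where larger priority values win and ties fall back to submission time:

```python
# Hypothetical: order jobs by external priority, then submission order.
jobs = [
    {"id": 1, "priority": 16, "t_submit": 100.0},
    {"id": 2, "priority": 20, "t_submit": 105.0},   # expedited after submission
    {"id": 3, "priority": 16, "t_submit": 101.0},
]
ordered = sorted(jobs, key=lambda job: (-job["priority"], job["t_submit"]))
print([job["id"] for job in ordered])               # -> [2, 1, 3]
```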

Since priority could be raised administratively after a job has been submitted, there should also be a way for higher-than-default priority jobs to bypass any credit-based mechanism and become visible to the scheduler, which could then apply its own algorithms with this priority as one input, to create a schedule.

garlick commented 6 years ago

At one point, the reason we chose eventing or KVS watch as a job state notification mechanism is that more than just our scheduler (sometimes unknown users) can program to such events simultaneously (e.g., a workflow manager like what Merine does today). We probably want to support this model even if this proposal is provided...

This is a really good design point to keep in mind, thank you. In this proposal, we still want a full-featured, "open" job event tracking interface, and in the simulator sprint we will need it to implement the simulator portion (the plan was for it all to be contained in a Python initial program).

A big difference here though is that the job manager becomes the primary owner of the job life cycle. The exec system and the scheduler can be more or less "services" used by the job manager, with the eventlog tracking the events that occur while the job is active.

This is in contrast to the previous model, where job state transitions drive the various cooperating parts of the system to take action, and those cooperating parts could also update the state. That requires the state machine to be rigid, and modeling the sorts of things we want to do going forward (such as grow, shrink, and partial resource release at completion) with an FSA seems a bad fit (also evidence: slurm's complex notion of job state in slurm.h - simple job states 15 years later).

It is better IMHO if we can limit reduction of the eventlog to job state (the subject of this issue) to tools for human consumption, rather than as input to a distributed FSA.

dongahn commented 6 years ago

This is a really good design point to keep in mind, thank you. In this proposal, we still want a full-featured, "open" job event tracking interface, and in the simulator sprint we will need it to implement the simulator portion (the plan was for it all to be contained in a Python initial program).

A big difference here though is that the job manager becomes the primary owner of the job life cycle. The exec system and the scheduler can be more or less "services" used by the job manager, with the eventlog tracking the events that occur while the job is active.

Good. I think this is a much cleaner model with a better separation of concerns. Just to echo back my understanding of the new model: for the services that the job manager needs, it will recruit them using an RPC. For other users (e.g., workflows) whose services the job manager doesn't need, we will provide another interface so that they can keep track of the job states in the way they need.

Not sure yet, though, if the job event log is high level enough for them to be effective. (Wouldn't hurt to get early feedback from people like Joe Koning)

That requires the state machine to be rigid, and modeling the sorts of things we want to do going forward (such as grow, shrink, and partial resource release at completion) with an FSA seems a bad fit (also evidence: slurm's complex notion of job state in slurm.h - simple job states 15 years later).

True.

But one way I at least see the new model is that the job manager is now becoming an FSA responsible for calling an action on certain event transitions, the action being a request to the corresponding service it recruits.

It seems that the main difference is who owns the major FSA logic. With the new model, since the job manager becomes the main coordinator across multiple services, it makes it easy to manage the complexity and minimize duplicate FSAs across the different services.

Of course, a downside can be that if the main FSA is under-designed, other services' capabilities will be boxed in by it. But I don't see that as a problem.

dongahn commented 6 years ago

A big difference here though is that the job manager becomes the primary owner of the job life cycle.

Related. When the scheduler allocates a resource set, how should that effect execution?

grondo commented 6 years ago

Related. When the scheduler allocates a resource set, how should that effect execution?

As I understand it, the scheduler is no longer involved in execution. It responds to the alloc request once the resource set is allocated, and the job manager will take care of initiating execution.

dongahn commented 6 years ago

As I understand it, the scheduler is no longer involved in execution. It responds to the alloc request once the resource set is allocated, and the job manager will take care of initiating execution.

Hmmm. There can be many, many RPCs in flight, though.

grondo commented 6 years ago

Hmmm. There can be many, many RPCs in flight, though.

Sorry, I don't follow, what RPCs are you referencing? (and hope my statement above wasn't confusing)

dongahn commented 6 years ago

Sorry, I don't follow, what RPCs are you referencing? (and hope my statement above wasn't confusing)

My understanding so far was: the job manager will send an alloc request. Upon receiving it, the scheduler will enqueue that into the pending job queue and try to allocate the resources. But the normal case is that it won't be able to allocate right away and will wait until the next scheduler loop invocation.

Say about 2K jobs are ingested, and the job manager sends alloc requests for those 2K jobs; when those jobs cannot be allocated, aren't there 2K RPCs in flight?

Maybe I don't understand the proposed scheme after all.

garlick commented 6 years ago

Say about 2K jobs are ingested, and the job manager sends alloc requests for those 2K jobs; when those jobs cannot be allocated, aren't there 2K RPCs in flight?

2K shouldn't be a problem. Greater numbers might be a problem for future-based RPCs, but there are other ways to manage requests and responses that we can use if need be.

grondo commented 6 years ago

For large numbers of "requests", it may be more efficient to send noresponse requests and handle scheduler responses through a dedicated service on the job manager, however that is an implementation detail.

For limiting the number of outstanding alloc requests, see @garlick's proposed "credit based" scheme for flow control.

(edit: oops sorry, I talked over @garlick)

dongahn commented 6 years ago

Greater numbers might be a problem for future-based RPCs, but there are other ways to manage requests and responses that we can use if need be.

I guess my main question is why a full transactional RPC? Can we split this into two smaller transactions?

  1. alloc request -- just place the allocation request, and the scheduler acks right away once it puts it into the pending queue.

  2. when the scheduler schedules them, it puts the R into KVS and the job manager gets notified.

Free request doesn't have to be complicated though.
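A rough sketch of that two-step variant; the KVS key, queue handling, and callback names are made up:

```python
# Hypothetical two-step protocol:
#   1. the job manager sends an alloc request; the scheduler acks once enqueued
#   2. later, the scheduler writes R to the KVS and the job manager is
#      notified (e.g., via a KVS watch) instead of waiting on the original RPC
def sched_handle_alloc(jobid, jobspec, ack, pending_queue):
    pending_queue.append((jobid, jobspec))
    ack(jobid)                                # step 1: cheap, immediate ack

def sched_loop_iteration(pending_queue, match, kvs_put):
    for jobid, jobspec in list(pending_queue):
        R = match(jobspec)
        if R is not None:
            kvs_put(f"job.{jobid}.R", R)      # step 2: job manager watches this key
            pending_queue.remove((jobid, jobspec))
```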

dongahn commented 6 years ago

For large numbers of "requests", it may be more efficient to send noresponse requests and handle scheduler responses through a dedicated service on the job manager, however that is an implementation detail.

Yes, I think in general this or a KVS-based scheme would be more resource efficient and scalable.

garlick commented 6 years ago

I think the main point of the proposal is that the job manager does transact with the scheduler's "resource allocation service", with the job manager taking the active role, and the scheduler taking a passive role.

A starting point might be to use a standard request handler on the sched side for "allocate" and "free" methods, with request and response messages carrying pointers to R's and J's stored in the KVS.
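To make the "pointers into the KVS" part concrete, the request and response payloads might look roughly like this; the field names and key layout are invented for illustration:

```python
# Hypothetical payloads for sched-side "allocate" and "free" methods.
# Messages carry only lightweight KVS references, not R or J themselves.
alloc_request  = {"jobid": 1234, "J": "job.1234.J"}       # KVS key of the jobspec J
alloc_response = {"jobid": 1234, "R": "job.1234.R.0"}     # KVS key of the allocated R
free_request   = {"jobid": 1234, "R": "job.1234.R.1"}     # may reference a subset of R
free_response  = {"jobid": 1234}
```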

dongahn commented 6 years ago

I think the main point of the proposal is that the job manager does transact with the scheduler's "resource allocation service", with the job manager taking the active role, and the scheduler taking a passive role.

Agreed on the main point.

A starting point might be to use a standard request handler on the sched side for "allocate" and "free" methods, with request and response messages carrying pointers to R's and J's stored in the KVS.

I am not fully convinced that this will lead to the performance/scalability we need. But if this is a good first target to do a further evaluation, I won't object. I do strongly feel that storing R into the KVS is the right way to go. The actual service that will create an R is the resource matching service (not the scheduler loop service where such a request handler will be implemented). So unless we only pass lightweight KVS references, this can lead to excessive copying.

garlick commented 5 years ago

RFC 21/Job States requires:

Replaying the job eventlog SHALL accurately reproduce the current job state.

and goes on to define the events that drive state transitions. This property is relied upon in the job manager restart logic.

I am not sure at this point that we need the ability to derive state from the eventlog anywhere outside of the job manager, as one can simply query the job manager for this information. Let's tentatively close this and reopen if we find we need to move the internal job-manager eventlog replay code to the public API.