flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

need a way for job manager epilog to implement "partial release" #4312

Open · garlick opened 2 years ago

garlick commented 2 years ago

Problem: the current job manager epilog posts epilog-start and epilog-finish events, and no job resources can be freed until epilog-finish. If the job manager epilog does something that could take a long time on a subset of nodes, then there is no opportunity to release a partial set of resources back to the scheduler.

One idea floated by @grondo was to include an idset in the context of the epilog-finish, like the release event. Both epilog-finish and release would decrement a refcount on a set of execution targets, and the free to the scheduler would occur once a target's count reaches zero.
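A minimal sketch of that refcounting idea (hypothetical names; not the actual job manager implementation):

```python
# Hypothetical sketch: both the release event and an epilog-finish event
# carrying an idset decrement per-target counts; a target is freed back
# to the scheduler when its count reaches zero.

class TargetRefcount:
    def __init__(self, targets, nrefs=2):
        # e.g. one reference for the exec system's release event and one
        # for the job manager epilog's epilog-finish event
        self.counts = {rank: nrefs for rank in targets}

    def decrement(self, idset):
        """Return the targets whose count just reached zero."""
        freed = set()
        for rank in idset:
            self.counts[rank] -= 1
            if self.counts[rank] == 0:
                freed.add(rank)
        return freed

# release covers ranks 0-3; the epilog then finishes on ranks 0-1 first
rc = TargetRefcount(targets=range(4))
assert rc.decrement({0, 1, 2, 3}) == set()  # one reference still held each
assert rc.decrement({0, 1}) == {0, 1}       # free ranks 0-1 to the scheduler
```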

grondo commented 2 years ago

Just a note that we should update RFC 21 as well.

jameshcorbett commented 2 years ago

The rabbit setup which motivated this discussion goes something like this:

  1. User's executable finishes
  2. Flux tells DWS (via kubernetes API) to unmount rabbit FS's on compute nodes
  3. (a) If (2) succeeds, compute nodes can be returned to the scheduler. (b) If (2) doesn't succeed within time t (for some t we choose), ping kubernetes for a list of nodes that have succeeded unmounting and release those nodes. Continue pinging every t seconds (maybe with some exponential backoff or similar) until complete success is reached (sketched below).
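A rough sketch of step 3(b)'s polling loop (`get_unmounted_nodes` and `release_nodes` are hypothetical stand-ins for the kubernetes query and the scheduler release):

```python
import time

def poll_unmounts(all_nodes, get_unmounted_nodes, release_nodes,
                  t=10.0, backoff=2.0, max_interval=300.0):
    """Release nodes back to the scheduler as their unmounts complete."""
    released = set()
    interval = t
    while True:
        done = set(get_unmounted_nodes()) - released
        if done:
            release_nodes(done)           # partial release of finished nodes
            released |= done
        if released == set(all_nodes):    # complete success reached
            return
        time.sleep(interval)
        interval = min(interval * backoff, max_interval)  # exponential backoff
```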

The way I had planned to implement this was to add a job manager epilog which would send an RPC to a Python script, which would talk to kubernetes and then respond to the RPC either with a complete success message or some list of successful nodes, terminated (hopefully) by a final complete success message.

jameshcorbett commented 2 years ago

I am wondering if there might be some trickery arising from the fact that jobs will have resources (rabbits) that aren't associated with nodes? For instance, there will be cases where all the compute nodes are ready to be freed but the rabbits aren't.

garlick commented 2 years ago

Yeah, that's a bit tricky. We cut some corners in the current job manager / exec system / scheduler design, so we use "execution targets" (broker ranks) to refer to subsets of R. That is what the idset we discussed returning from the job manager epilog would represent. That does not quite work for resources that are not associated with an execution target, as is apparently the case for the rabbits.

Aside: just had a quick review of RFC 27/Resource Allocation Protocol and noted that we will need to change it to support partial release, since currently a free request just contains the job ID, which the scheduler can use to look up R. There's no way to refer to partial R with an idset.
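Schematically, the gap looks something like this (payloads shown as Python dicts; field names beyond the job ID are illustrative only, and RFC 27 is authoritative):

```python
# Today's free request: the scheduler looks up the job's full R from the
# job ID, so all of the job's resources are freed at once.
free_request = {"id": 123456}

# Partial release would need some way to name a subset, e.g. an idset of
# execution targets (illustrative only -- not part of any defined protocol):
partial_free_request = {"id": 123456, "ranks": "0-13,15"}
```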

I didn't have any great ideas offhand. This needs pondering.

jameshcorbett commented 2 years ago

@grondo and I talked about it on the coffee call, and he proposed putting the partial job R into the free request rather than an idset. He noted that this would work for sched-simple, because it frees resources based on the R passed to free_cb in src/common/libschedutil/ops.c, so it could handle partial release that way, but fluxion might be more complicated. I will talk to @dongahn about it later today.
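For illustration, a free request carrying partial R might look something like this (a sketch only, loosely following RFC 20 Rv1 conventions; not a defined protocol):

```python
# Hypothetical free payload: only ranks 0-1 of a four-node job are being
# returned to the scheduler; the rest remain held (e.g. stuck in epilog).
partial_free = {
    "id": 123456,
    "R": {
        "version": 1,
        "execution": {
            "R_lite": [{"rank": "0-1", "children": {"core": "0-7"}}],
        },
    },
}
```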

But yeah the trigger to free the rabbits is independent of the compute nodes. For the compute nodes, we can free them once the job shells have stopped and rabbit software tells us that the file systems have been unmounted. For the rabbits, we can free them once the compute nodes can be freed and rabbit software tells us that user data has been safely moved off of the rabbits and the rabbit file systems have been cleaned up.

jameshcorbett commented 2 years ago

There is an additional complication, which is that Flux can technically alert the user that their job has completed before the last condition has been reached (that the rabbit file systems have been cleaned up).

Since I'm guessing that would be very difficult to implement, I don't think it would be too big a deal to ignore that part and only mark the job as completed once all the conditions are met and all the resources have been freed. If the FS clean-up outright fails, that's fine as long as we can still mark the job as succeeding. If the FS clean-up hangs, there wouldn't be any data loss; the user just wouldn't know that.

jameshcorbett commented 2 years ago

If the FS clean-up hangs, there wouldn't be any data loss, the user just wouldn't know that.

Maybe a solution could be that once the rabbit software tells us that user's data is secure, we post an eventlog entry saying so.

grondo commented 2 years ago

There is an additional complication, which is that Flux can technically alert the user that their job has completed before the last condition has been reached (that the rabbit file systems have been cleaned up).

I might be misunderstanding, but the job manager should not issue the clean event, and the job should not go into the INACTIVE state, until all resources have been released, not just the compute nodes. Therefore, an entity that needs to wait until the user's data is secure could wait for the clean event for a job; however, entities that only need to wait until the job tasks or initial program are complete can just wait for the finish event. (Side note: flux job attach currently waits for the clean event, but should probably only wait for the finish event. I thought there was an open issue on this, but can't find it ATM.)

jameshcorbett commented 2 years ago

I might be misunderstanding, but the job manager should not issue the clean event, and the job should not go into the INACTIVE state, until all resources have been released, not just the compute nodes. Therefore, an entity that needs to wait until the user's data is secure could wait for the clean event for a job; however, entities that only need to wait until the job tasks or initial program are complete can just wait for the finish event.

What I was trying to get at is that with the rabbits there might be three events the user cares about, rather than just clean and finish:

  1. job tasks or initial program finishes (finish)
  2. Rabbit data is secure (no name for this one, but maybe call it data_out)
  3. All resources have been released (clean)

and 2 would always happen before 3.

So yeah the user who cares about their data could wait for 3, since 3 implies 2 in all the cases I can think of right now, but I was wondering whether it would be good to have a separate event, particularly for cases where 3 might not come for a long time after 2 for whatever reason.
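For illustration, a consumer of those three events might look roughly like this (a sketch using the flux Python bindings; data_out is the hypothetical event name from the list above, and the job ID is a placeholder):

```python
import flux
import flux.job

jobid = 123456  # placeholder job ID

# "finish" and "clean" are real RFC 21 events; "data_out" is the
# hypothetical rabbit-data-secure event floated above.
h = flux.Flux()
for event in flux.job.event_watch(h, jobid):
    if event.name == "finish":
        print("tasks are done; resources may still be held")
    elif event.name == "data_out":
        print("rabbit data is safely off the rabbits")
    elif event.name == "clean":
        print("all resources released; job inactive")
        break
```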

grondo commented 2 years ago

Ok, understood, and that makes sense. A node could be hung in the epilog for some reason (a somewhat common occurrence), so the clean event could be delayed, but the job's rabbit data could still be secure so a separate event here makes sense.

Edit: I wonder if we should favor a specific event name in this case though, or if the case is general enough that we should add a new event to RFC 21. Something to consider.

ryanday36 commented 4 months ago

I don't think that this issue (and https://github.com/flux-framework/flux-core/issues/2204) made it into my production RM features list, but I was thinking about it this week because a 16 node job on tioga wasn't released due to one node being stuck in the epilog. If it's something that would be reasonably easy to implement, it would be nice to have. If it's not easy, it should probably stay lower priority than other things. We need to fix the thing that's hanging in the epilog anyway.

garlick commented 4 months ago

Since more nodes potentially get idled when a large job is stuck in the epilog than a small one, and it currently only takes one straggler, it seems like this could get really annoying on El Cap as we scale up.

@grondo and I were chatting about things adjacent to this today, and one idea from that discussion (concerning rabbits and when to have the job tools declare a job complete) was that the system epilog script could be decoupled from the job and run after the job reaches the INACTIVE state. Then it might be quick and easy to implement partial release of resources to the scheduler as the epilog completes, since it wouldn't be dependent on the big exec system rewrite.

As an alternative to decoupling the system epilog script, we could add a new, decoupled system script. Maybe some things in the epilog really should run while the job is in CLEANUP state and be "billed" to the user and logged in their eventlog, as opposed to being treated as system overhead or whatever. Other things, like running ansible, seem like clear candidates for decoupling.

Anyway, the original Flux design (not yet fully realized, but to an extent planned for in the code) was that a job would free multiple R fragments back to the scheduler in batches as the epilog completed. Unfortunately, it looks like fluxion ignores the R fragment it receives in the free callback, and instead just uses the job ID (which is in the message for request/response matching purposes) to free all the resources associated with the job on the first free callback.

https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L266

So if we do this, some work will be needed in fluxion.

dongahn commented 4 months ago

I quickly reviewed the fluxion code, and the change seems manageable. Setting aside the work on qmanager, the fluxion-resource service takes care of deallocating resources by traversing the resource tree with the jobid and removing allocations tagged with that jobid. A viable strategy could involve taking the R fragment during the removal process and deallocating only the resources that the R fragment covers. I can provide more guidance on this task if needed. The ability to return resources partially in large-scale systems is crucial. I can look some more over the weekend and add more suggestions.

https://github.com/flux-framework/flux-sched/blob/master/resource/modules/resource_match.cpp#L1871
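A rough Python rendering of that strategy (illustrative only; the real implementation would be C++ in fluxion-resource, and these names and data shapes are made up):

```python
def ranks_in(r_fragment):
    """Collect the ranks named by an R fragment (simplified: assumes
    plain integer ranks rather than idset strings like "0-3")."""
    return {entry["rank"] for entry in r_fragment["execution"]["R_lite"]}

def partial_cancel(graph, jobid, r_fragment):
    """Remove jobid's allocations only from vertices the fragment covers.
    A full cancel is then the special case where the fragment equals R."""
    covered = ranks_in(r_fragment)
    for vertex in graph:  # stand-in for traversing the resource tree
        if vertex["rank"] in covered:
            vertex["allocations"].pop(jobid, None)

# Free rank 0 of a two-node allocation for job 42; rank 1 stays allocated.
graph = [{"rank": 0, "allocations": {42: "core[0-7]"}},
         {"rank": 1, "allocations": {42: "core[0-7]"}}]
partial_cancel(graph, 42, {"execution": {"R_lite": [{"rank": 0}]}})
assert 42 not in graph[0]["allocations"] and 42 in graph[1]["allocations"]
```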

garlick commented 4 months ago

Hi @dongahn, thanks for chiming in!

I opened flux-framework/flux-sched#1151 for the fluxion specific discussion.

trws commented 3 months ago

We have actually had user requests for the ability to release parts of their allocations early, too. A bit of a refactor is needed, but we need this for elasticity and for production, so I'll try to move this up the priority list. Might take a crack at it myself.

grondo commented 3 months ago

We should open a separate issue for voluntary early release. That's a pretty interesting idea, and the resiliency work done recently for Flux instances and the job shell should make it possible to terminate non-critical shell and broker ranks while the job keeps going.

For this specific issue, note that @garlick has a proof of concept proposed in #5818 which should address the major pain points. (I'm not sure if you meant you were going to take a crack at the job manager or Fluxion support needed for partial release, which is why I mention it)

trws commented 3 months ago

Thanks @grondo, agreed there would be more work to do for eager release. I meant to look at the fluxion side, though at first glance I need to work through where the free RPC actually gets handled right now; as it sits, I only see a handler for a cancel RPC, which does do this but has to come through a slightly different path, I guess?

grondo commented 3 months ago

The discussion in flux-framework/flux-sched#1151 may be helpful. The protocol is described in RFC 27, which as @garlick pointed out needs an update since it only currently describes a single free response.

garlick commented 3 months ago

The handlers are not message handlers because we abstracted the scheduler interface in "libschedutil" (for better or worse):

alloc: https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L190

free: https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L266

trws commented 3 months ago

Thanks both of you, that's very helpful. They're all registered in qmanager.cpp, it's a bit tangled but now that I know where the root is it's much easier to follow.

milroy commented 3 months ago

I've got some time to work on this in Fluxion and will try to get a WIP PR out with support in the traverser, module, planners/pruning filter, and reader soon. @trws let me know if you're making progress so we can avoid effort duplication.

A thought exercise to check my understanding and determine if it's ever useful to cancel a job in Fluxion based only on the jobid (which will likely be faster than processing a sequence of partial releases):

Will there be a way to distinguish, on a per-RPC basis, between the case where the union of all R fragments in a sequence of sched.free RPCs for a single jobid is equal to the full R for the jobid and the case where it is not? In other words, will there be a way to detect whether the first sched.free RPC indicates an eventual full cancellation of the job?

I doubt there's a valid use case for distinguishing between the two. If distinguishing were possible and Fluxion waited for the last sched.free RPC in the sequence to run a full cancellation based on the jobid, the resources corresponding to the earlier R fragments would remain blocked from allocation in the resource graph. The reverse (issuing a full cancellation upon reception of the first sched.free RPC in the sequence) could result in multiple bookings, as resources still stuck in epilog could be allocated.

Since Fluxion is doing something similar to the latter already, has anyone observed multiple bookings when job resource subsets are stuck in epilog?

garlick commented 3 months ago

I think your understanding is complete. We haven't observed double bookings with Fluxion (at least not since you fixed that other bug) because flux-core doesn't do partial release now and in fact it's not allowed by RFC 21. It was prototyped in flux-framework/flux-core#5818 but I only tested with sched-simple since I knew Fluxion would not handle it.

We could define a flag that is set on the last R fragment freed for a given job if that turned out to be useful, but it sounds like you are arguing that it would not be and it was just a thought experiment?

milroy commented 3 months ago

It was prototyped in https://github.com/flux-framework/flux-core/pull/5818 but I only tested with sched-simple since I knew Fluxion would not handle it.

Ok, I wasn't sure if multiple free responses with R were supported or used in any Flux deployment yet due to PR #5783.

We could define a flag that is set on the last R fragment freed for a given job if that turned out to be useful, but it sounds like you are arguing that it would not be and it was just a thought experiment?

Yeah, I started by thinking a flag or similar might be a good idea, but I ended up not being able to justify it. It was mainly a thought experiment that I posted so others could check the reasoning and add valid use cases.

grondo commented 3 weeks ago

Update: this issue will be partially solved by #5818, which pushes the partial release problem specific to differing runtimes of the administrative epilog into a new job manager housekeeping service. However, there may be more to do, as expressed by @jameshcorbett in #5818:

I think there's a chance this approach may rule out or make more difficult some of the ideal rabbit-y behavior.

The flux-coral2 software issues an epilog-start event at the moment, to hold jobs while 1) compute nodes unmount their file systems and then 2) data is transferred from the rabbit nodes to the backing file system. We don't want users to think their jobs are complete until that's all done. So making flux job attach wait for clean is desirable as far as that goes, and since I think this PR makes that possible, that's great.

However, 1) and 2) could potentially take a long time. And as soon as any individual compute node unmounts its file systems, it could potentially be released back into the resource pool. But it sounds like this PR is building up the assumption that partial release would only happen after housekeeping, which in turn only happens after epilog-finish.

Unless I have misunderstood and there might still be a way to release resources during an epilog? Or perhaps it's just worth forgetting about the potential to release nodes after they unmount their file systems and before the epilog completes.

In response, the following ideas were presented:

Not as part of this PR. However, the original idea proposed in https://github.com/flux-framework/flux-core/issues/4312:

include an idset in the context of the epilog-finish, like the release event. Both epilog-finish and release would decrement a refcount on a set of execution targets, and the free to the scheduler would occur once a target's count reaches zero.

or a variant thereof may actually be simpler after this PR lands, because we could more easily set up an epilog-finish event to hand back a subset of execution targets to the job-manager for potential housekeeping. (What I'd actually propose is that an epilog-start event would optionally include an execution target idset which would add an extra reference count on those ids (similar to housekeeping) which the corresponding epilog-finish would release)

In the case of data movement from the rabbit nodes to the backing file system, this may not need any compute resources to be held back (i.e. housekeeping could be started and the nodes released to the scheduler before the job becomes inactive, at least conceptually), so we'd probably need some way for an epilog-start event to say it doesn't hold back any resources, just keeps the job active.

To which @jameshcorbett replied:

An epilog-start event for data movement that doesn't hold back any resources sounds good. It would still need Fluxion to hold on to the rabbit resource vertices for the duration of the epilog but I think that's a Fluxion problem.

And I take it there could potentially then be a separate epilog-start event for rabbit-file-system unmounting that holds all the compute resources and releases them (or decrements the refcount) as the nodes complete unmounting, allowing the nodes to transition to housekeeping? I think that could work well.

milroy commented 3 weeks ago

I'm having a hard time following the implications of the discussion in #5818 with respect to Fluxion and the corresponding PR there: https://github.com/flux-framework/flux-sched/pull/1163.

Will the partial-release RPCs to Fluxion change, or just when they are sent?

grondo commented 3 weeks ago

I think only when they are sent. The only change being discussed I think is when the job releases resources to the job manager. The job manager will still employ the same protocol with the scheduler.