flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

feature tracking: Advanced Reservations (DATs) #5201

Open grondo opened 1 year ago

grondo commented 1 year ago

This is a tracking issue for an implementation of DATs. The requirements as I understand them include:

grondo commented 1 year ago

Linked flux-framework/flux-sched#1013 above. According to @trws and @milroy once that PR is merged, we will have much of the support needed in Fluxion to schedule a DAT.

garlick commented 9 months ago

Idea: add a new job state RESERVED between SCHED and RUN such that a job request with a special attribute could get its alloc response R from the scheduler early, in advance of the starttime field in R. The job manager and the rest of Flux could just treat that like any other allocation, except the job would remain in RESERVED state until starttime arrives. With R stored in the KVS, the sched.hello protocol could throw it back to the scheduler on a restart.

This would work for any job, including a sub-instance.

An advance static R would be more susceptible to having resources go bad before the job starts. With a flux instance, we could initially just set the quorum value to some fraction of the total and let the instance start with some non-critical nodes down.
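To make the proposal above concrete, here is a minimal Python sketch (illustrative only, not flux-core code) of where a hypothetical RESERVED state would sit in the job lifecycle, and when a job would leave it. The state names other than RESERVED follow the existing Flux job states; the transition function is an assumption based on the description above.

```python
# Illustrative sketch: a hypothetical RESERVED state between SCHED
# and RUN, entered when the scheduler grants R before its starttime.
import time
from enum import Enum, auto

class JobState(Enum):
    DEPEND = auto()
    PRIORITY = auto()
    SCHED = auto()
    RESERVED = auto()   # proposed: R granted, starttime not yet reached
    RUN = auto()
    CLEANUP = auto()
    INACTIVE = auto()

def next_state_after_alloc(starttime, now=None):
    """After the scheduler responds with R, stay in RESERVED until
    R's starttime arrives, then transition to RUN."""
    now = time.time() if now is None else now
    return JobState.RUN if now >= starttime else JobState.RESERVED

# A job whose R starts in the future stays RESERVED:
print(next_state_after_alloc(starttime=2e9, now=1e9))  # JobState.RESERVED
# Once starttime passes, it transitions to RUN:
print(next_state_after_alloc(starttime=1e9, now=2e9))  # JobState.RUN
```

In this framing a "regular" job is just the degenerate case where the time spent in RESERVED is effectively zero.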

grondo commented 9 months ago

For clarity, is the benefit of having the RESERVED state and advance R available so that a subinstance could be configured with the eventual resources assigned to the job? Would this also have some benefit for normal jobs? Adding a new state just for that purpose feels like it could be short-sighted (though I'm probably missing the other benefits!), especially if we plan one day to support instances that can grow onto unknown resources instead of just known resources.

Another thing I'll just throw out there: there is already a way to hold a job between SCHED and RUN, by issuing a prolog-start event. Perhaps requiring R to be known beforehand could be a special case of the reservation alloc request (e.g. a reservation could include a hostlist or a node count, as in Slurm) to satisfy the DAT-as-an-instance use case. A jobtap plugin could then prevent the transition to the RUN state, and could perhaps even handle startup of a single-broker instance with no resources, configured to use FLUB.

garlick commented 9 months ago

I guess the main appeal to me is that we wouldn't need a separate set of tools and rules for reservations like Slurm has. A job request would be sufficient to request a reservation, the existing scheduler interfaces would be sufficient to communicate the results, and existing tools could be used to view/update reservations (since they are just jobs). In a way, regular jobs are then just a degenerate case of a reservation where time in RESERVED is very short, so the plan doesn't introduce niche features that would get less testing than mainstream ones.

But yeah it builds upon the existing resource model which is fundamentally static. However, as we add dynamic resource capability to flux, this could grow too. For example, maybe a job could request to start as soon as an initial resource request can be fulfilled, and also hold a reservation that would be added to the job later? Maybe we could also add a way for the scheduler to modify an already allocated R, such as replacing nodes that are no longer available, and we could make that work the same for running and reserved jobs.

Anyway I'm not hard over on this idea - just throwing it out there to see if it sticks. Sounds like it's sliding down the wall a bit :-)

grondo commented 9 months ago

No, this is sounding appealing to me, but I'm afraid I still don't follow some points:

> I guess the main appeal to me is that we wouldn't need to have a separate set of tools and rules for reservations like Slurm.

I like this idea, but unfortunately don't have the mental capacity today to follow the reasoning. How would a reservation be requested? Would we just add a field to jobspec with an enforced start and end time, and only satisfy these requests from the instance owner? If a reservation is just a job that hasn't yet started, how would multiple jobs be submitted to a job in RESERVED state? It seems like these actions would require separate tools that we don't already have anyway.

> In a way regular jobs then are just a degenerate case of a reservation anyway, where time in RESERVED is very short, so the plan doesn't introduce niche features that would have less testing than mainstream ones.

Ah, this is a good point. I had missed that all jobs would go through RESERVED (I had envisioned it as a one-off state). I do like this idea.

> For example, maybe a job could request to start as soon as an initial resource request can be fulfilled, and also hold a reservation that would be added to the job later? Maybe we could also add a way for the scheduler to modify an already allocated R, such as replacing nodes that are no longer available, and we could make that work the same for running and reserved jobs.

I think this is the general case of grow/shrink we've discussed before, and it doesn't seem like a RESERVED state is necessary to make it happen (at least we've never discussed it in that way). It seems like we were headed towards using resource-update events to manage that (we can already update R using this approach).

garlick commented 9 months ago

I didn't really say this clearly, but yes: I was thinking some new jobspec attributes would be the way a job would request "reserved" resources. We already have a duration, so maybe add attributes for the start time, plus flags indicating whether the start time is absolute or best-effort, what to do if resources become unavailable before the start time, etc.
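As a sketch of what that might look like, here is a hypothetical jobspec fragment. Only `duration` under `attributes.system` exists today; the `begin-time` and `reservation` keys are invented purely for illustration of the idea above.

```yaml
# Hypothetical jobspec attributes for a reserved-start job.
# Only "duration" is an existing attribute; "begin-time" and the
# "reservation" table are invented for illustration.
attributes:
  system:
    duration: 3600              # existing: time limit in seconds
    begin-time: 1735689600      # invented: requested start (UNIX time)
    reservation:
      start: absolute           # invented: "absolute" or "best-effort"
      on-resource-loss: replace # invented: e.g. replace vs. fail
```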

> If a reservation is just a job that hasn't yet started, how would multiple jobs be submitted to a job in RESERVED state? It seems like these actions would require separate or missing tools we don't already have anyway.

I was thinking in that case the RESERVED job would be a subinstance, but would only accept jobs after it starts for now. Hmm, maybe that's a stronger requirement than I thought.

> I think this is the general case of grow/shrink we've discussed before, and it doesn't seem like a RESERVED state is necessary to make this happen (at least we've never discussed it in that way). It seems like we were headed towards using resource-update events to manage that (already we can update R using this approach).

I just meant that jobs with reserved resource allocations could benefit in a general way from grow, not necessarily help us get there.

grondo commented 9 months ago

> I was thinking in that case the RESERVED job would be a subinstance, but would only accept jobs after it starts for now. Hmm, maybe that's a stronger requirement than I thought.

Ah, I see. Forgive me, but do we need a separate state to handle this case then? For the purposes of all other tools, the job would effectively be pending. I guess Flux could start a single-rank instance (with the sole initially online rank excluded) to handle early job submission, but in principle that doesn't seem to require a new state. I worry that if R is constantly evolving for a reserved allocation, this would create a lot of traffic in the eventlog, whereas if we just keep the job in SCHED state until the allocation is granted, we can emit the actual R once.

I really apologize because I feel like I'm missing the piece of the design that requires a new state. I am sure it is my fault and not yours.

trws commented 9 months ago

If we had a "reserved" state, possibly with either soft or hard semantics, we might also be able to use it to show that the scheduler has given the job a prospective start time. This is a bit of an idle thought while I'm in an OpenMP meeting, so it might not fit super well, but if we could get both a nicer interface for DATs and a way to surface predicted start times for jobs other than the next one, that would make users happy.

grondo commented 9 months ago

Is this necessary if flux-framework/flux-sched#1015 is fixed? We already have ephemeral "annotations", which can communicate this kind of data (which could change with each scheduling update) without potentially filling the eventlog with events.

OTOH, with the estimated starttime and resources available for every job in the scheduler's plan, we could expose the plan via some kind of visualization (kind of like OAR's Gantt drawing tool). Does even this require a new job state, though? Could the planned resources for jobs be exposed in some other manner that doesn't require writing data to the KVS and an eventlog each time the plan changes? (Just throwing that question out there; I don't really know the answer.)

grondo commented 9 months ago

Also, would a RESERVED state also require a transition back to SCHED, e.g. if a new higher priority job is submitted, changing the schedule such that a RESERVED job no longer has any reserved resources in the current plan?

grondo commented 9 months ago

A note from the meeting: being able to submit to a DAT/reservation before its starttime is an optional requirement for a minimum viable solution. I take that to mean we fulfill this requirement by being able to submit a job request that is guaranteed to be fulfilled at some point in the future, with a way to launch a multi-user instance on those resources once allocated, including a way to restrict the set of users allowed to submit to that instance.

Assuming this is correct, I'll update the bullet list above with some missing items. I don't think this solution requires a new job state and all the changes that would come with it?

garlick commented 9 months ago

I'd say let's hit the reset button on this discussion and start from the requirements. IOW let's drop the idea of RESERVED state and also of "regular jobs" having reservations and see what else we can come up with. If we need those ideas we can come back to them.

garlick commented 9 months ago

On user restrictions: only the system instance currently loads the mf_priority plugin from flux-accounting, so we should think about how we would restrict users in a multi-user sub-instance.

A related question is whether we worry about proper accounting for users within that subinstance.

In RFC 33 we did define an access policy, so if we didn't want to load mf_priority in a subinstance, we could potentially generate a list of allowed users and pass it down in the subinstance policy config. (I think the access controls are not implemented yet but that would be trivial).
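A sketch of what such a generated subinstance policy config might look like, assuming the RFC 33 access policy schema (the `allow-user`/`allow-group` key names are written from memory here and should be treated as assumptions, especially since the access controls are noted above as not yet implemented):

```toml
# Hypothetical generated policy config for a DAT subinstance.
# Key names under [policy.access] are assumptions based on RFC 33.
[policy.access]
allow-user = [ "alice", "bob" ]
allow-group = [ "dat-users" ]
```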

grondo commented 9 months ago

> In RFC 33 we did define an access policy, so if we didn't want to load mf_priority in a subinstance, we could potentially generate a list of allowed users and pass it down in the subinstance policy config. (I think the access controls are not implemented yet but that would be trivial).

Loading mf_priority in a subinstance seems like it would be a challenge. It currently assumes the flux-accounting service is loaded on rank 0, that rank 0 is on the same node as the accounting database, and it would be trying to do system-wide fair share on a portion of granted resources rather than within the DAT job itself (if that is even a thing). Also, we probably want DATs/reservations to work without requiring flux-accounting, so I like the idea of access controls implemented by config stashed in the job's jobspec.

I'm also not sure how accounting for a subinstance would work. The subinstance jobs would not be going to the job archive or accounting archive, so we'd need some way to attribute usage, perhaps in an epilog or rc3 script when the DAT job is exiting? @ryanday36 - I assume we do currently account for jobs in DATs and reservations since Slurm only has one level of scheduling?

ryanday36 commented 9 months ago

That's correct. We do want to charge DAT usage to the users' bank(s).

trws commented 9 months ago

Is a DAT currently represented as a queue, such that normal user jobs in that queue are accounted individually, or as a single job running many job steps, where only that job is accounted?

grondo commented 9 months ago

For reference, here is a snippet of how Slurm accounts for reservations:

> Jobs executed within a reservation are accounted for using the appropriate user and bank account. If resources within a reservation are not used, those resources will be accounted for as being used by all users or bank accounts associated with the reservation on an equal basis (e.g. if two users are eligible to use a reservation and neither does, each user will be reported to have used half of the reserved resources).

https://slurm.schedmd.com/reservations.html#account
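As a concrete illustration of the Slurm rule quoted above, here is a small arithmetic sketch (purely illustrative; not Slurm or Flux code) of how charges for a reservation might be apportioned:

```python
# Illustrative arithmetic for the quoted Slurm accounting rule:
# usage inside a reservation is charged to the submitting user;
# unused reserved capacity is split equally among eligible users.

def charge_reservation(total_node_seconds, usage_by_user, eligible_users):
    """Return node-seconds charged per user for one reservation."""
    charges = {user: usage_by_user.get(user, 0.0) for user in eligible_users}
    unused = total_node_seconds - sum(usage_by_user.values())
    share = unused / len(eligible_users)
    for user in eligible_users:
        charges[user] += share
    return charges

# Two eligible users, neither runs anything: each is charged half.
print(charge_reservation(1000.0, {}, ["alice", "bob"]))
# {'alice': 500.0, 'bob': 500.0}
```

For Flux, something like this could conceivably run in an epilog or rc3 script as suggested above, but where the per-user usage data would come from in a subinstance remains the open question.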

trws commented 9 months ago

That raises a question for me: how often do we run into a DAT composed of multiple banks rather than a single bank? I admit I'd conceived of a DAT as being a charged entity in and of itself, charged at that level, rather than the usage cost falling directly on the users who submitted work to it.

grondo commented 9 months ago

Good question @trws. And if we need to use a bank/account to control access to a DAT job, then we would need some way to create the access control list from the bank when the job is started, or extend the mf_priority plugin to support running in a subinstance. (Note also that the mf_priority plugin would only restrict which users can submit jobs, not which users can use other instance services.)