flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

tracking issue: standby/preemptible jobs #5739

Open grondo opened 9 months ago

grondo commented 9 months ago

From @ryanday36's list in #5165:

Preemptible jobs: AKA 'standby' qos / queue. Allow users to submit jobs that can be killed automatically by the system instance if another job needs the resources.

In some offline discussion, it was proposed that we could add a preemptible (or similar) job submission flag for this purpose. Drawbacks to this approach:

Most of those can be easily overcome if a submission flag is the correct approach.

Alternate solutions include:

ryanday36 commented 9 months ago

I think that a submission flag would work as long as the drawbacks that you noted could be overcome. Generally we allow 'standby' jobs to be exempt from other queue limits and allow all users to access them. So we would also want the preemptible flag to be visible to the priority plugin, so that the plugin can avoid counting those jobs against queue limits. I think that would provide the same benefits as the queue implementation, at least for how we use standby / preemption.
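To make the limit exemption concrete, here is a minimal sketch of the counting rule described above. It is not the priority plugin's code: the `Job` class, the `preemptible` field, and the per-user limit are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a job record; not a flux-core type."""
    user: str
    preemptible: bool = False


def jobs_counted_against_limit(jobs, user):
    """Count a user's active jobs, skipping preemptible ones."""
    return sum(1 for j in jobs if j.user == user and not j.preemptible)


def may_start(jobs, new_job, max_jobs_per_user):
    """Preemptible jobs are exempt; other jobs must fit under the limit."""
    if new_job.preemptible:
        return True
    return jobs_counted_against_limit(jobs, new_job.user) < max_jobs_per_user
```

With this rule, a user who has one regular job and any number of standby jobs running is still treated as having a single job for limit purposes.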

That said, there are a number of use cases that can be solved by overlapping queues (exempt / expedite, whole cluster DATs), so that could be considered a benefit of that approach. Exempt / expedite could probably all be done through accounting / the priority plugin. We should probably talk more about DATs where we want to be able to let a user run on all nodes on a cluster that we've split into multiple queues.

grondo commented 3 months ago

This idea was discussed again in a meeting recently. The preemptible flag still seems to be the solution of choice, but this will require an update to the resource acquisition protocol. I've opened flux-framework/rfc#423.

ryanday36 commented 1 month ago

Over in the flux team on Teams, one of the users on Tuolumne had an interesting idea around standby / preemption, which would be to allow users to specify a minimum duration for their jobs:

However, I got to thinking that a minimum time, in addition to a maximum time, could create a more powerful mechanism than standby. If you wanted a Slurm-like standby, you would set your job's minimum time to 0. But if you wanted to actually get something done while still letting other jobs in after you'd made some progress, setting a minimum time of an hour or so might be a reasonable compromise.
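The minimum/maximum time idea can be sketched as a simple three-phase rule. This is only an illustration of the proposal as described above, not scheduler code: `job_phase`, its parameter names, and the phase labels are hypothetical, and all times are assumed to be in seconds.

```python
def job_phase(elapsed_s, min_time_s, max_time_s):
    """Classify a running job under a hypothetical min/max time policy.

    "protected":   has not yet run for its minimum time, so leave it alone
    "preemptible": past its minimum time, may be killed if resources are needed
    "expired":     has reached its maximum time
    """
    if min_time_s > max_time_s:
        raise ValueError("minimum time cannot exceed maximum time")
    if elapsed_s >= max_time_s:
        return "expired"
    if elapsed_s >= min_time_s:
        return "preemptible"
    return "protected"
```

A minimum time of 0 reproduces the Slurm-like standby behavior (the job is preemptible from the moment it starts), while a minimum time of 3600 guarantees an hour of progress first.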

garlick commented 5 days ago

Note that we added a preemptible-after attribute to RFC 14 after discussion in flux-framework/rfc#423

During development, this can be set on a job with e.g. flux run --setattr=preemptible-after=0.
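For readers unfamiliar with how such an attribute ends up in the jobspec, the following sketch mimics the effect on a plain dictionary. It assumes the key is stored under `attributes.system`, as with other system attributes; `set_system_attr` is a hypothetical helper, not part of the Flux Python bindings, and the meaning and units of the value are defined by RFC 14, not by this example.

```python
import json


def set_system_attr(jobspec, key, value):
    """Set attributes.system.<key> on a jobspec-like dict (illustration only)."""
    system = jobspec.setdefault("attributes", {}).setdefault("system", {})
    system[key] = value
    return jobspec


# A pared-down, hypothetical jobspec fragment; real jobspecs carry more fields.
jobspec = {"version": 1, "attributes": {"system": {"duration": 3600}}}

# Assumed equivalent of --setattr=preemptible-after=0
set_system_attr(jobspec, "preemptible-after", 0)

print(json.dumps(jobspec["attributes"], indent=2))
```

Existing system attributes such as `duration` are left untouched; only the new key is added.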

trws commented 4 days ago

Sounds good. I like having it as an attribute, and if we want it to be a flag, we could always offer a CLI flag that sets the attribute, or is there another meaning of flag I'm not processing?

grondo commented 4 days ago

> Sounds good. I like having it as an attribute, and if we want it to be a flag, we could always offer a CLI flag that sets the attribute, or is there another meaning of flag I'm not processing?

The meaning of flag here is a job submission flag as defined in the submit or set-flags events. Submit flags are set via the CLI submission --flags option and include debug, waitable, and novalidate. I think we were originally thinking of adding preemptible as one of these flags.

There are also a couple of other flags, unfortunately not documented anywhere, that may be set by the job manager or jobtap plugins. These include the alloc-bypass and immutable flags. (Just adding those for completeness' sake; we should get all of these added to an RFC.)

trws commented 4 days ago

That helps @grondo, thanks! From a logical perspective I can see it fitting in with those. That said, if we want to expose those to fluxion I would think attributes might be a good way to do it. Maybe worth thinking about as a general sub-object or something.
