flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

support partially allocated jobs across scheduler reload #6445

Open · garlick opened 1 week ago

garlick commented 1 week ago

This is a proof-of-concept implementation of the scheduler hello protocol changes proposed in flux-framework/rfc#433, which add support for reloading the scheduler while housekeeping is running and some nodes of one or more jobs have already been released.

The scheduler indicates that it supports this by calling schedutil_create() with the SCHEDUTIL_HELLO_PARTIAL_OK flag. By setting that flag, it agrees to parse an optional allocated key in each hello response. The allocated key is set to an idset representing the subset of ranks of R that are actually allocated. If the key is missing, all ranks are assumed to be allocated.
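For concreteness, here is a rough sketch of what the scheduler side of this original proposal could look like. This is an illustration only, not code from this PR: only SCHEDUTIL_HELLO_PARTIAL_OK and the allocated key come from the proposal, while the helper names (hello_cb, scheduler_init) and the callback wiring are assumptions based on the existing libschedutil and libidset interfaces.

```c
/* Sketch only: a scheduler opting in to partial-allocation hello and
 * decoding the optional "allocated" idset.  Error handling is trimmed. */
#include <flux/core.h>
#include <flux/idset.h>
#include <flux/schedutil.h>

static int hello_cb (flux_t *h, const flux_msg_t *msg, const char *R, void *arg)
{
    const char *allocated = NULL;       /* idset string, e.g. "0-3,7" */

    /* "allocated" is optional; if absent, every rank in R is still held */
    if (flux_msg_unpack (msg, "{s?s}", "allocated", &allocated) < 0)
        return -1;

    struct idset *ranks = allocated ? idset_decode (allocated) : NULL;
    /* ...recreate the job's allocation from R, restricted to 'ranks'
     * when it is non-NULL... */
    idset_destroy (ranks);
    return 0;
}

static const struct schedutil_ops ops = {
    .hello = hello_cb,
    /* .alloc, .free, .cancel omitted for brevity */
};

int scheduler_init (flux_t *h, void *ctx)
{
    /* Without SCHEDUTIL_HELLO_PARTIAL_OK, the current behavior is preserved. */
    schedutil_t *util = schedutil_create (h, SCHEDUTIL_HELLO_PARTIAL_OK,
                                          &ops, ctx);
    return util ? 0 : -1;
}
```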

We need to get some feedback from @milroy, @trws, et al to make sure this approach works for fluxion. I thought working through this with sched-simple would be helpful to illustrate the idea.

trws commented 6 days ago

I think this would work, but as I look it over I find myself wondering if we might be better off either sending a list of things that have been released, or leaving this out of hello entirely and using a partial release after the hello response from the scheduler. The main reason is that this approach sends what is allocated as R, and then we have to work out what to release by taking the difference between the original allocation and what is still allocated before we can act on it. If instead the released portion is sent, either in hello or as a partial release, we can reuse the code we already have for partial release without introducing a new code path into what has already proven somewhat complex and fragile.
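To make that extra translation step concrete, here is a minimal sketch (not code from this PR) of what a scheduler would have to do under the allocated-set approach, using flux-core's libidset; the specific rank values are made up for illustration.

```c
/* Given the ranks in the job's original R and the "allocated" subset from
 * the hello response, derive the set that must be released. */
#include <stdio.h>
#include <stdlib.h>
#include <flux/idset.h>

int main (void)
{
    struct idset *all = idset_decode ("0-7");        /* ranks in original R */
    struct idset *allocated = idset_decode ("0-3");  /* from hello response */

    /* released = all - allocated */
    struct idset *released = idset_difference (all, allocated);
    char *s = idset_encode (released, IDSET_FLAG_RANGE);
    printf ("release ranks %s\n", s);                /* prints "4-7" */

    free (s);
    idset_destroy (released);
    idset_destroy (allocated);
    idset_destroy (all);
    return 0;
}
```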

That said, working through this as I write, it looks like the job-manager has the allocated resources in this format as pending. That makes me wonder if we should send that pending R as the allocation R as part of the base protocol rather than sending the initial R. As it is in this PR, we're sending both, right? We certainly need the whole R in job tracking, but I'm not sure the scheduler actually needs it, and it might make the whole thing a bit easier on both ends.

garlick commented 6 days ago

> leaving this out of hello entirely and using a partial release after the hello response from the scheduler

Duh! Why didn't I think of that? Let me explore that one and see how it goes.

In general, I'm in favor of having the job manager offload work from the scheduler(s). It seems like your idea would make it "just work" without any changes.

garlick commented 6 days ago

> That makes me wonder if we should send that pending R as the allocation R as part of the base protocol rather than sending the initial R. As it is in this PR, we're sending both, right? We certainly need the whole R in job tracking, but I'm not sure the scheduler actually needs it, and it might make the whole thing a bit easier on both ends.

No, in the current proposal we are just sending the allocated idset (ranks) and some basic job metadata like the id in the RPC responses; libschedutil then looks up the job's original R (including JGF) from the KVS and passes it to the scheduler in its callback.
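For reference, the KVS side of that lookup is roughly the following. This is a simplified, synchronous sketch rather than the actual libschedutil code, and the helper name lookup_R is made up for the example.

```c
/* Fetch a job's original R (including any scheduling/JGF section) from
 * the KVS by job id.  Error handling is trimmed. */
#include <stdlib.h>
#include <string.h>
#include <flux/core.h>

char *lookup_R (flux_t *h, flux_jobid_t id)
{
    char key[64];
    const char *R;
    char *copy = NULL;
    flux_future_t *f;

    /* Build the job's KVS path for its "R" key */
    if (flux_job_kvs_key (key, sizeof (key), id, "R") < 0)
        return NULL;
    if (!(f = flux_kvs_lookup (h, NULL, 0, key)))
        return NULL;
    if (flux_kvs_lookup_get (f, &R) == 0)
        copy = strdup (R);
    flux_future_destroy (f);
    return copy;
}
```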

garlick commented 6 days ago

OK, actually, it's quite a pain to send a free after the hello is finished, because the scheduler has to send the ready request before we can do that, and that is handled in a different part of the job manager and is pretty inconvenient to synchronize with.

Would sending a free idset make things a little easier than the currently proposed allocated set? I'd be fine with that!

grondo commented 6 days ago

Could libschedutil do the work of creating the partial R from the original R and free idset, then directly call the scheduler's free callback if necessary after each hello callback?

Edit: (sorry if this is a naive suggestion; I haven't gone back to look at the actual implementation before suggesting it)

garlick commented 6 days ago

> Could libschedutil do the work of creating the partial R from the original R and free idset, then directly call the scheduler's free callback if necessary after each hello callback?

Oh good idea, that makes sense to me. That gets around the fact that the R fragment in the job manager is missing the JGF.

trws commented 5 days ago

Yeah, I think having the "free" idset, and possibly also doing as @grondo suggested and having schedutil do the translation into a free call, would work really well if it's not too difficult to factor that way.

garlick commented 5 days ago

> Yeah, I think having the "free" idset, and possibly also doing as @grondo suggested and having schedutil do the translation into a free call, would work really well if it's not too difficult to factor that way.

Great! I'll push a rework of this PR with those changes shortly.

garlick commented 5 days ago

OK, pushed those changes. This is currently built on top of #6450; I'll rebase when that gets merged.

The RFC PR will need a small rework for free instead of allocated.

codecov[bot] commented 4 days ago

Codecov Report

Attention: Patch coverage is 80.88235% with 13 lines in your changes missing coverage. Please review.

Project coverage is 83.63%. Comparing base (03b0860) to head (24436e3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/modules/job-manager/housekeeping.c | 72.72% | 9 Missing :warning: |
| src/common/libschedutil/hello.c | 86.95% | 3 Missing :warning: |
| src/modules/job-manager/alloc.c | 83.33% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #6445      +/-   ##
==========================================
+ Coverage   83.60%   83.63%   +0.03%
==========================================
  Files         523      523
  Lines       87505    87559      +54
==========================================
+ Hits        73156    73233      +77
+ Misses      14349    14326      -23
```

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/modules/sched-simple/sched.c | 77.34% <100.00%> (+0.21%) | :arrow_up: |
| src/modules/job-manager/alloc.c | 75.26% <83.33%> (-0.01%) | :arrow_down: |
| src/common/libschedutil/hello.c | 74.64% <86.95%> (+4.64%) | :arrow_up: |
| src/modules/job-manager/housekeeping.c | 82.46% <72.72%> (-0.21%) | :arrow_down: |

... and 10 files with indirect coverage changes


garlick commented 4 days ago

flux-framework/rfc#433 was updated to specify the (optional) free key instead of allocated as originally proposed.

Then schedutil was modified so that the hello callback subtracts the ranks in the free key (if present) from the R that it looks up in the KVS. R should still include the full original scheduling object.

The current behavior is preserved unless the scheduler sets the SCHEDUTIL_HELLO_PARTIAL_OK flag.
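As a rough picture of that subtraction, here is a sketch under the assumption that only the R_lite ranks need trimming; the actual libschedutil change may differ (for instance, it must also keep execution.nodelist and the rest of R consistent), and the helper name remove_free_ranks is made up for the example.

```c
/* Sketch (not the actual hello.c code): drop the ranks listed in the hello
 * response's optional "free" key from each R_lite entry of the R fetched
 * from the KVS.  Error handling and nodelist adjustment are omitted; the
 * scheduling/JGF section is left untouched, as described above. */
#include <stdlib.h>
#include <jansson.h>
#include <flux/idset.h>

/* Return a modified copy of R (an RFC 20 object) with 'free_ranks'
 * (an idset string such as "4-7", possibly NULL) removed from R_lite. */
json_t *remove_free_ranks (json_t *R, const char *free_ranks)
{
    json_t *copy = json_deep_copy (R);
    if (!free_ranks)
        return copy;                       /* nothing released: unchanged */

    struct idset *freed = idset_decode (free_ranks);
    json_t *rlite = json_object_get (json_object_get (copy, "execution"),
                                     "R_lite");
    size_t i = 0;
    while (i < json_array_size (rlite)) {
        json_t *entry = json_array_get (rlite, i);
        struct idset *ranks =
            idset_decode (json_string_value (json_object_get (entry, "rank")));

        idset_subtract (ranks, freed);     /* keep only still-held ranks */
        if (idset_count (ranks) == 0) {
            json_array_remove (rlite, i);  /* entry fully released */
        }
        else {
            char *s = idset_encode (ranks, IDSET_FLAG_RANGE);
            json_object_set_new (entry, "rank", json_string (s));
            free (s);
            i++;
        }
        idset_destroy (ranks);
    }
    idset_destroy (freed);
    return copy;
}
```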