Open garlick opened 1 week ago
I think this would work, but as I look over it I find myself wondering if we might be better off either with a list of things that have been released, or actually leaving this out of hello entirely and using a partial release after the hello response from the scheduler. The main reason is that this approach sends what is allocated as R, then we have to work out what to release by taking the difference between what is allocated and what needs to be removed before we can act on it. If instead the released portion is sent, either in hello or as a partial release, we can reuse the same code we already have for partial release without introducing a new code path into what's already proven somewhat complex or fragile.
That said, working through this as I write, it looks like the job-manager has the allocated resources in this format as pending. That makes me wonder if we should send that pending
R as the allocation R as part of the base protocol rather than sending the initial R. As it is in this PR we're sending both right? We certainly need the whole R in job tracking, but I'm not sure the scheduler actually needs it, and it might make the whole thing a bit easier on both ends.
leaving this out of hello entirely and using a partial release after the hello response from the scheduler
Duh! Why didn't I think of that? Let me explore that one and see how it goes.
In general, I'm in favor of making the job manager offload work from scheduler(s). It seems like your idea would make it "just work" wtihout any changes.
That makes me wonder if we should send that pending R as the allocation R as part of the base protocol rather than sending the initial R. As it is in this PR we're sending both right? We certainly need the whole R in job tracking, but I'm not sure the scheduler actually needs it, and it might make the whole thing a bit easier on both ends.
No, in the current proposal we are just sending the allocated idset (ranks) and some basic job metadata like the id in the RPC responses, then libschedutil looks up the job's original R (including JGF) from the KVS and passes it to the scheduler in its callback.
OK, actually, it's quite the pain to send a free after the hello is finished, because the scheduler has to send the ready
request before we can do that, which is handled in a different part of the job manager and pretty inconvenient to synchronize with.
Would sending a free
idset make things a little easier than the current proposed allocated
set? I'd be fine with that!
Could libschedutil do the work of creating the partial R from the original R and free
idset, then directly call the scheduler's free
callback if necessary after each hello
callback?
Edit: (sorry if this is a naive suggestion, I haven't gone back to look at the actual implementation before suggesting it)
Could libschedutil do the work of creating the partial R from the original R and free idset, then directly call the scheduler's free callback if necessary after each hello callback?
Oh good idea, that makes sense to me. That gets around the fact that the R fragment in the job manager is missing the JGF.
Yeah I think having the "free" idset and possibly also doing as @grondo suggested and having schedutil do the translation into a free call would work really well if it's not too difficult to factor that way.
Yeah I think having the "free" idset and possibly also doing as @grondo suggested and having schedutil do the translation into a free call would work really well if it's not too difficult to factor that way.
Great! I'll push a rework of this PR with those changes shortly.
ok, pushed those changes. This is built on top of #6450 currently - will rebase when that gets merged.
The RFC PR will need a small rework for free
instead of allocated
.
Attention: Patch coverage is 80.88235%
with 13 lines
in your changes missing coverage. Please review.
Project coverage is 83.63%. Comparing base (
03b0860
) to head (24436e3
).
🚨 Try these New Features:
flux-framework/rfc#433 was updated to specify the (optional) free
key instead of allocated
as originally proposed.
Then schedutil was modified so that the hello
callback subtracts the ranks in the free
key (if present) from the R that it looks up in the KVS. R should still include the full original scheduling
object.
The current behavior is preserved unless the scheduler sets the SCHEDUTIL_HELLO_PARTIAL_OK flag.
This is a proof of concept implementation of the changes to the scheduler hello protocol proposed in flux-framework/rfc#433, in which support is added for reloading the scheduler with housekeeping running and some nodes of job(s) already released.
The scheduler indicates it supports this by calling
schedutil_create()
with the SCHEDUTIL_HELLO_PARTIAL_OK flag. When it sets that, it agrees to parse an optionalallocated
key in each hello response. Theallocated
key is set to an idset representing the subset of ranks of R that are actually allocated. If missing, all ranks are assumed to be allocated.We need to get some feedback from @milroy, @trws, et al to make sure this approach works for fluxion. I thought working through this with sched-simple would be helpful to illustrate the idea.