flux-framework / rfc

Flux RFC project
https://flux-framework.readthedocs.io/projects/flux-rfc/

rfc20: need partial release guidance #230

Open garlick opened 4 years ago

garlick commented 4 years ago

RFC 20 describes the form of R version 1.

One thing that is missing is a discussion of how the exec system generates R fragments to support partial release of resources.

The exec system will operate at the granularity of shell instances, which currently map 1:1 with broker ranks or "execution targets". Since this is the same unit as the R_lite "rank", splitting this portion of R into fragments should not be challenging.
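As an illustration, splitting along R_lite entries might look like the following Python sketch. The R shape follows RFC 20's RV1 (`version`, `execution.R_lite` with per-rank entries); the helper function is hypothetical, not flux-core API, and ranks are shown as plain integers rather than idset strings for brevity:

```python
# Hypothetical sketch: split an RV1 R_lite section into one fragment
# per execution target (broker rank) for partial release.

def split_r_lite(R):
    """Yield one R fragment per R_lite entry (execution target)."""
    for entry in R["execution"]["R_lite"]:
        # Real R_lite entries use idset strings like "0-3" for "rank";
        # single integers are used here for simplicity.
        yield {
            "version": 1,
            "execution": {"R_lite": [entry]},
        }

R = {
    "version": 1,
    "execution": {
        "R_lite": [
            {"rank": 0, "children": {"core": "0-3"}},
            {"rank": 1, "children": {"core": "0-3"}},
        ]
    },
}

fragments = list(split_r_lite(R))
print(len(fragments))  # one fragment per execution target
```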

What if anything should be done with the (optional) scheduler dict?

If there is no way for the exec system to release portions of JGF, then is there a compelling reason to expose it directly in R? If it's only for scheduler bookkeeping, could the scheduler instead include some abbreviated id instead?

Edit: regardless of the details, the main issue is that this important use case of R isn't covered in the RFC.

dongahn commented 4 years ago

Thanks. Tagging @milroy, as he probably wants to pay special attention to this per his research.

dongahn commented 4 years ago

If there is no way for the exec system to release portions of JGF, then is there a compelling reason to expose it directly in R? If it's only for scheduler bookkeeping, could the scheduler instead include some abbreviated id instead?

I don't follow. Are you suggesting the scheduler add "rank" (or execution target) into each resource vertex?

dongahn commented 4 years ago

I don't follow. Are you suggesting the scheduler add "rank" (or execution target) into each resource vertex?

Looking at the code, JGF already adds "rank" to each vertex.

dongahn commented 4 years ago

The exec system will operate at the granularity of shell instances, which currently map 1:1 with broker ranks or "execution targets". Since this is the same unit as the R_lite "rank", splitting this portion of R into fragments should not be challenging.

I think an important design point would be to determine how exactly we deal with the R object on a partial release.

For the free RPC, it seems the pair of jobid and the list of execution targets being released should be sufficient for flux-sched to do its resource deallocation. (Some testing is needed to see how difficult or easy this is, though.)

But we probably don't want to keep the original R as-is on such a partial release. If this partially freed job continues to run across a scheduler reload event, the newly loaded scheduler will reconstruct the full allocation, not the partial allocation.

So it seems we have two choices:

  1. Upon partial release, manipulate the original R to carve out the released portion from either or both of the R_lite and JGF keys.

  2. Augment R (or introduce other metadata) to encode the released execution targets.

Now I suspect flux-core probably doesn't want to write graph code to manipulate JGF. Perhaps we can add a "released" key (or similar) somewhere, with the list of released execution targets as its value?

On a scheduler reload, such an augmented R (or R plus the released execution target metadata) could be passed to the hello callback to assist the scheduler with correct state reconstruction.
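A minimal sketch of that idea, with a hypothetical "released" key (the key name, its placement, and the plain-list form of the idset are all illustrative, not an agreed format):

```python
def remaining_targets(allocated, released):
    """Execution targets a reloaded scheduler should reconstruct."""
    return sorted(set(allocated) - set(released))

# The original allocation covered targets 0-3; targets 2 and 3 were
# partially released before the scheduler reload.
R = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0-3", "children": {"core": "0-3"}}],
        "released": [2, 3],  # hypothetical key, shown as a plain list
    },
}

print(remaining_targets([0, 1, 2, 3], R["execution"]["released"]))
# → [0, 1]
```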

dongahn commented 4 years ago

BTW, having the JGF in R is absolutely necessary for flux-sched to be able to reconstruct the scheduler state on reload and to construct the nested scheduler state when we use a reader beyond hwloc. For high-end systems, I believe we will need more than hwloc.

garlick commented 4 years ago

My main thought was that perhaps the integer exec targets (ranks) could be stand-ins for JGF subgraphs in R if the scheduler maintained a consistent internal mapping, including across restarts. The advantage would be allowing exec to work in a scheduler-neutral manner, and keeping R objects lean and operations on them simple.

As far as what appears in the KVS, I don't think we can modify the original R, since that should remain intact for provenance (what node did I run on?). We had discussed dropping R fragments into the KVS (maybe in a "shrink" subdirectory) as chunks are freed... Then the complementary "grow" directory could contain chunks that are added.

I dunno about passing resources down to a subinstance, but it seems like the common case for every job may not be to bootstrap a graph scheduler instance, and we care about job throughput, so should we think about other options?

If I've wandered into the weeds, apologies!

dongahn commented 4 years ago

My main thought was that perhaps the integer exec targets (ranks) could be stand-ins for JGF subgraphs in R if the scheduler maintained a consistent internal mapping, including across restarts.

I don't think ranks will work because node-local resources can be allocated. Graph vertex and edge IDs may become such stand-ins at the expense of added complexity. But this probably won't serve your needs?

We had discussed dropping R fragments into the KVS (maybe in a "shrink" subdirectory) as chunks are freed... Then the complementary "grow" directory could contain chunks that are added.

R fragments can contain only R_lite. Or this can even be a simpler "ranks" form, given the current granularity. Maybe we can revise the R_lite key so that a rank-list form is also a valid R. Then Original R - Freed Rs + Extended Rs can represent your current state and can be used in scheduler state reconstruction as well as elastic scheduling, I suppose.
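The Original R - Freed Rs + Extended Rs bookkeeping can be sketched as set arithmetic over execution target ids (a hypothetical illustration, treating each fragment as a bare list of target ids rather than an agreed R format):

```python
def current_allocation(original, freed_fragments, extended_fragments):
    """Fold freed and extended fragments into the current target set."""
    current = set(original)
    for f in freed_fragments:
        current -= set(f)   # Original R - Freed Rs
    for f in extended_fragments:
        current |= set(f)   # ... + Extended Rs
    return sorted(current)

# Targets 1 and 3 were freed in two partial releases; target 8 was
# added by a hypothetical "grow" fragment.
print(current_allocation([0, 1, 2, 3], [[1], [3]], [[8]]))
# → [0, 2, 8]
```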

dongahn commented 4 years ago

I dunno about passing resources down to a subinstance, but it seems like the common case for every job may not be to bootstrap a graph scheduler instance, and we care about job throughput, so should we think about other options?

Yeah, this is an important case. My thought on this has been to leverage our concept of "scheduler specialization" again.

Here I think it makes sense to use the full JGF-included R writer only at the system and internal-level instances. At the leaf level, where large numbers of jobs need to be scheduled and run, users can specialize the scheduler's emit behavior to R_lite only.

My conjecture is that using JGF at the internal levels will actually increase throughput compared to relying on the hwloc reader.

dongahn commented 4 years ago

The advantage would be allowing exec to work in a scheduler-neutral manner

If we go with the proposed Original R - Freed Rs + Extended R approach, would the optional scheduling key prevent exec from working consistently across different schedulers, though?

If I've wandered into the weeds, apologies!

dongahn commented 4 years ago

If I've wandered into the weeds, apologies!

These are really essential points to discuss at this point. Please keep your comments coming.

garlick commented 4 years ago

After reading my comments and your responses again, I just realized an error in my thinking: I was suggesting an execution target could be a stand-in for a JGF subtree in R, but that only works if whole nodes are allocated. Without more information in R, the scheduler receiving it during FREE or HELLO wouldn't know what subset of the execution target's resources were allocated to that job. Sorry about that.

R fragments can only contain R_lite. Or this can even be a simpler form of "ranks" given the current granularity. Maybe we can revise R_lite key to make rank list form also a valid R.

It is good if JGF does not need to be repeated in R fragments during "shrink", and if exec / job manager can just ignore it. That was one question I wanted to get clarified in the RFC.

In fact if that is all that is needed during partial release, the release events logged to the job eventlog in the KVS already contain this information and might be sufficient?
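For illustration, a partial-release entry in the job eventlog might look something like the following (field names here follow the general eventlog shape of timestamp/name/context; the exact context keys are an assumption, not a confirmed flux-core format):

```python
# Hypothetical partial-release entry from a job eventlog: the context
# names the idset of execution targets released and whether this is
# the final release for the job.
event = {
    "timestamp": 1234567890.0,
    "name": "release",
    "context": {"ranks": "2-3", "final": False},
}

print(event["name"], event["context"]["ranks"])
```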

dongahn commented 4 years ago

After reading my comments and your responses again, I just realized an error in my thinking: I was suggesting an execution target could be a stand-in for a JGF subtree in R, but that only works if whole nodes are allocated. Without more information in R, the scheduler receiving it during FREE or HELLO wouldn't know what subset of the execution target's resources were allocated to that job. Sorry about that.

Exactly!

It is good if JGF does not need to be repeated in R fragments during "shrink", and if exec / job manager can just ignore it. That was one question I wanted to get clarified in the RFC.

Yes.

In fact if that is all that is needed during partial release, the release events logged to the job eventlog in the KVS already contain this information and might be sufficient?

I think we still need this info passed to the FREE callback and also the hello callback for the scheduler reload event. If the eventlog is the ultimate source of this info and serves these two calls, it should be sufficient, I think. Like I said before, I need to do some work to see how easy or difficult this is to do in flux-sched, but I am pretty positive it can be done.

dongahn commented 4 years ago

Then the complementary "grow" directory could contain chunks that are added.

While we are here, maybe we can also hash this out a bit, as this is what @milroy will soon need.

I don't think adding an additional R is difficult. But what is currently difficult is how to do this under the original JOBID. In particular, flux job submit will always generate a new JOBID. Do you think there is an easy path to generating a new R under the same JOBID using the flux job submit / flux mini interface?

garlick commented 4 years ago

Could we keep this issue focused on what needs to be updated in RFC 20 to implement partial release, and open another issue for the grow case?

dongahn commented 4 years ago

Yes. Indeed that's what I was going to suggest anyway.

garlick commented 4 years ago

I think we still need this info passed to the FREE callback and also the hello callback for the scheduler reload event. If the eventlog is the ultimate source of this info and serves these two calls, it should be sufficient, I think. Like I said before, I need to do some work to see how easy or difficult this is to do in flux-sched, but I am pretty positive it can be done.

How about if we just send a free request to the scheduler as we do now, let libschedutil look up the full R before calling each free() callback as it does now (or cache it as an optimization), and add an idset to the request describing which exec target ranks are being released in that free message? The scheduler would then need to take the intersection of the idset and the original R to decide what to free internally.

For the hello handshake, we could also add an idset to the hello() callback that indicates which exec target ranks are still allocated if a subset? The scheduler would take the intersection of the idset and R to decide what to allocate internally.
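The intersection step in both cases might be sketched like this (a hypothetical illustration: ranks are plain integers and the idset is a Python set, standing in for real idset strings and the libschedutil callbacks):

```python
def ranks_to_free(R_lite, free_ranks):
    """Intersect the free request's execution-target idset with the
    job's original R_lite to decide what to free internally."""
    return [e for e in R_lite if e["rank"] in free_ranks]

R_lite = [
    {"rank": 0, "children": {"core": "0-3"}},
    {"rank": 1, "children": {"core": "0-3"}},
    {"rank": 2, "children": {"core": "0-3"}},
]

# The free message says targets 1 and 2 are being released.
print([e["rank"] for e in ranks_to_free(R_lite, {1, 2})])
# → [1, 2]
```

The hello-handshake case is the same operation with the complementary idset: intersect the still-allocated targets with R to decide what to re-allocate internally.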

Is it reasonable to change RV1 to describe partial release in terms of exec targets, not in terms of R subsets; and indicate that the scheduler dict is optional, opaque scheduler-specific data not to be tampered with or depended on by other system components? (Thus avoiding the implication that it has to be subdivided for partial release)

This is probably not a long term solution since it only covers a coarse granularity "shrink", but it may get us well past our TOSS milestone. I think the point of version 1 was to get the bare minimum defined, and it is a hard production requirement to be able to tolerate a hung node without tying up unrelated resources, so I think it fits.

grondo commented 4 years ago

Is it reasonable to change RV1 to describe partial release in terms of exec targets, not in terms of R subsets; and indicate that the scheduler dict is optional, opaque scheduler-specific data not to be tampered with or depended on by other system components?

Sorry to jump in late, but @garlick I feel your scheme posted above makes the most sense for an initial solution. It allows each consumer of R to make use of any ancillary data stored in the format, as long as we restrict partial release to one or more execution target ids, anyway.

Note, one of the early use cases for partial release (release of resources from execution targets as soon as the epilog completes, rather than waiting for the entire job to finish) will be for flux jobs or other tools to display the currently allocated resources for jobs. We should strive for a partial release/shrink format that makes this operation straightforward. Also, this might have implications for the accounting service, e.g. do sites want to charge for resources while the job epilog is running, and if so the accounting module may need to integrate over the shrink operation of partial release.

Here's a crazy idea: what if R was an eventlog instead of a static object? It could eventually have grow and shrink events and the exec system could watch this eventlog.
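That idea could be sketched as replaying an eventlog to recover the current allocation (event names and shapes here are hypothetical, just to show the fold):

```python
def replay(events):
    """Fold grow/shrink events into the current execution-target set."""
    current = set()
    for ev in events:
        if ev["name"] in ("alloc", "grow"):
            current |= set(ev["ranks"])
        elif ev["name"] in ("free", "shrink"):
            current -= set(ev["ranks"])
    return sorted(current)

print(replay([
    {"name": "alloc", "ranks": [0, 1, 2, 3]},
    {"name": "shrink", "ranks": [3]},
    {"name": "grow", "ranks": [8]},
]))
# → [0, 1, 2, 8]
```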

garlick commented 4 years ago

for flux jobs or other tools to display the currently allocated resources for jobs. We should strive for a partial release/shrink format that makes this operation straightforward.

Good point.

Here's a crazy idea: what if R was an eventlog instead of a static object? It could eventually have grow and shrink events and the exec system could watch this eventlog.

I always kind of thought of the alloc and free events in the main job eventlog as having an implicit, indirectly referenced R operand, and that grow would be represented by more allocs with explicit operands, and shrink by more frees with explicit operands. So in a way we already have something like that?

grondo commented 4 years ago

an implicit, indirectly referenced R operand

I guess my point was that then an individual R is not meaningful. Tools that want to determine the resources allocated to any job at any given time have to parse the job eventlog and load multiple Rs from the KVS to put together something sane.

However, I agree it is mostly the same conceptually.

Edit: Also, I guess it is the mechanism of taking the "implicitly referenced" R and making it explicit for all R users that we're trying to figure out in this issue? I guess I was thinking it would be nice if the R format itself could be directly amended by an append.

garlick commented 4 years ago

Makes sense. Or maybe R and its deltas just gets pulled into the eventlog alloc/free contexts?

Free range coffee discussion indicated!

grondo commented 4 years ago

Perhaps we could put a pin in this to be taken up when we're ready to do the actual work involved? Or is this issue on critical path for upcoming (within the week) milestone?

garlick commented 4 years ago

Yes, pinned!