flux-framework / flux-sched

Fluxion Graph-based Scheduler

Match support for a jobspec that contains same resource-typed siblings #941

Open · dongahn opened this issue 2 years ago

dongahn commented 2 years ago

As documented in https://github.com/flux-framework/flux-sched/blob/master/resource/utilities/README.md and https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu_impl.cpp#L232, our resource graph infrastructure cannot handle a jobspec with same resource-typed siblings, e.g.:

cluster[1]->node[1]->core[1]
          ->node[2]->core[40]

But this class of jobspec has many use cases, including the upcoming rabbit multi-tiered storage support.

We have brainstormed a path forward and this ticket is to document the progress.

The initial idea was to solve this by adding partial subtree loop traversal, but that approach appeared to have issues of its own.

The current idea is to support this via "jobspec splitting", where libjobspec splits a jobspec with same-typed siblings into multiple conforming jobspecs and traverses the graph with each, as an operation internal to the resource graph infrastructure -- this way it won't affect job allocation pipelines such as the ones documented in RFC 27.
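
For instance, splitting the example above would produce two conforming jobspecs, each traversed in turn (sketch):

cluster[1]->node[1]->core[1]
cluster[1]->node[2]->core[40]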

dongahn commented 2 years ago

Thinking again this morning, jobspec splitting won't work either, because the split jobspecs cannot express higher-level locality constraints like:

cluster[1]->rack[1]->node[1]->core[1]
                   ->node[2]->core[40]

Splitting this jobspec would let each per-sibling traversal choose a different rack, losing the constraint that both nodes sit under the same rack. So... I want to try a modified version of partial subtree loop traversal instead.

dongahn commented 2 years ago

@jameshcorbett: Hmmm, this looks quite involved to me. One path forward would actually be to combine both ideas: on-the-fly jobspec splitting and subtree loop traversal. I was trying to get a very simple case working, but the code change needed just to get to that point was too much for a short period of time.

The idea would be to split the jobspec on the fly when you have same resource-typed siblings and use the split points to modify the traversal -- i.e., calling DFV on each of the siblings.

For example, if you have a jobspec like the one in https://github.com/flux-framework/flux-sched/issues/941#issuecomment-1131896300:

When you visit a rack, you call DFV() with the split jobspec containing node[1]->core[1], along with the postorder dom_finish_vtx using the modified jobspec (rack[1]->node[1]->core[1]), so that the best node(s) under the rack are resolved before calling DFV with the next sibling.

Then you call DFV() with the other split jobspec containing node[2]->core[40], including dom_finish_vtx with rack[1]->node[2]->core[40], so that the best node(s) under the rack are resolved.
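
To make the control flow concrete, here is a minimal, self-contained C++ sketch of the split-then-DFV idea over a toy resource tree. This is not the dfu_impl API: request_t, vertex_t, and dfv() are hypothetical stand-ins, the postorder dom_finish_vtx/scoring step is elided, and small counts stand in for node[2]->core[40].

```cpp
// Hypothetical sketch only: toy stand-ins for Fluxion's jobspec and
// resource graph, not the real dfu_impl types or signatures.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct request_t {             // one level of a jobspec request tree
    std::string type;
    int count;
    std::vector<request_t> children;
};

struct vertex_t {              // one vertex of a toy resource graph
    std::string type;
    std::vector<std::shared_ptr<vertex_t>> children;
    bool claimed = false;      // stand-in for the coloring trick below
};

// True when a request level contains two or more same-typed siblings.
static bool has_same_typed_siblings (const std::vector<request_t> &reqs)
{
    for (size_t i = 1; i < reqs.size (); i++)
        if (reqs[i].type == reqs[0].type)
            return true;
    return false;
}

// Depth-first visit: match one request level against the children of u.
// On same-typed siblings, split on the fly and run one DFV per sibling so
// each resolves its own subtree before the next sibling is attempted.
// The postorder dom_finish_vtx step and backtracking (undoing claims on a
// failed match) are elided for brevity.
static bool dfv (vertex_t &u, const std::vector<request_t> &reqs)
{
    if (has_same_typed_siblings (reqs)) {
        for (const auto &sib : reqs)           // split point
            if (!dfv (u, std::vector<request_t>{ sib }))
                return false;
        return true;
    }
    for (const auto &req : reqs) {
        int matched = 0;
        for (auto &child : u.children) {
            if (matched == req.count)
                break;
            if (child->type != req.type || child->claimed)
                continue;                      // skip prior siblings' picks
            if (dfv (*child, req.children)) {
                child->claimed = true;
                matched++;
            }
        }
        if (matched < req.count)
            return false;
    }
    return true;
}

int main ()
{
    // A rack with 3 nodes, 2 cores each.
    vertex_t rack;
    rack.type = "rack";
    for (int n = 0; n < 3; n++) {
        auto node = std::make_shared<vertex_t> ();
        node->type = "node";
        for (int c = 0; c < 2; c++) {
            auto core = std::make_shared<vertex_t> ();
            core->type = "core";
            node->children.push_back (core);
        }
        rack.children.push_back (node);
    }
    // Same-typed siblings: node[1]->core[1] and node[2]->core[2].
    std::vector<request_t> reqs = {
        { "node", 1, { { "core", 1, {} } } },
        { "node", 2, { { "core", 2, {} } } }
    };
    std::cout << (dfv (rack, reqs) ? "matched" : "unmatched") << "\n";
}
```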

We will need to make sure we use a coloring trick so that subsequent loop traversals (beyond a one-pass traversal) can still be performed while avoiding the resources that were already resolved/enforced by the previous visit.
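
As a sketch of what that coloring could look like (hypothetical naming, assuming the traverser keeps a per-vertex claim record; this is not Fluxion's actual scheme):

```cpp
#include <cstdint>

// Hypothetical per-vertex color: which job and which split-jobspec
// (sibling) visit claimed this vertex.
struct visit_color_t {
    uint64_t jobid = 0;        // 0 == unclaimed
    int sibling_visit = -1;
};

// A vertex may be considered by (jobid, visit) when it is free, or when
// this very visit claimed it (allowing re-entrant loop traversal), but
// never when an earlier sibling visit of the same job resolved it.
static bool selectable (const visit_color_t &c, uint64_t jobid, int visit)
{
    return c.jobid == 0 || (c.jobid == jobid && c.sibling_visit == visit);
}
```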

Plus, there are also lots of corner cases we need to handle as we support things like the "first" policy, satisfiability, and prefix-omitted jobspecs...

If I do this work, I expect it will take at least 3-4 weeks and require quite a bit of architectural change at the traverser level. I don't mean to scare you, but to set some expectations here.

dongahn commented 2 years ago

Another potentially easier viable path would be to still use one-pass traversal with some match-test tweaks, which shouldn't require many architectural changes...

Currently, the main reason that same resource-typed sibling requests cannot be supported is that each sibling can match the same visiting vertex; especially with the globally best matching policies (all but "first"), the traverser will exhaustively search for the first request and disallow another sibling from matching/choosing.

We can avoid or resolve this problem by making the pre-order match test criterion more than just "type": for example, use type+property as the test criterion. I can also think of a few other ways to solve this problem -- maybe this is an easier way.
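
A minimal sketch of that widened test, assuming illustrative type/property fields rather than the actual dfu_impl match callbacks:

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

struct resource_req_t {
    std::string type;
    std::optional<std::string> property;  // e.g., "ssd" vs. "nvme"
};

struct resource_vtx_t {
    std::string type;
    std::vector<std::string> properties;
};

// Pre-order test: the visiting vertex must agree on type AND, when the
// request names one, on property.
static bool prematch (const resource_vtx_t &v, const resource_req_t &r)
{
    if (v.type != r.type)
        return false;
    if (!r.property)
        return true;
    return std::find (v.properties.begin (), v.properties.end (),
                      *r.property) != v.properties.end ();
}
```

With a test like this, two same-typed node siblings tagged with distinct properties (say, "ssd" vs. "nvme") would each match only their own vertices within a single pass.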

@jameshcorbett: let's discuss this scheme a bit today at some point.