flux-framework / rfc

Flux RFC project
https://flux-framework.readthedocs.io/projects/flux-rfc/

Advanced job spec examples #97

Closed dongahn closed 7 years ago

dongahn commented 7 years ago

We discussed this at yesterday's scheduler team meeting. As I now have my resource strawman integrated with @morrone's jobspec, we agreed to generate further discussion on some advanced use cases:

I want to make sure these cases can be specified to support real use cases and that the resulting jobspec objects can be used by the graph-based resource service for matching and scoring. Ultimately, the resulting examples will become the actual test cases for the upcoming resource-query utility.

lipari commented 7 years ago

Regarding bullet item 2 above, in a discussion I just had with @morrone, his view is that an association one vertex has with other hierarchies (beyond the dominant hierarchy) is supported under RFC 14 using a type. For example, if a requested node resource needed to receive 50 watts of power, it would look something like this:

  - type: node
    count: 4
    with:
      - type: power
        count: 50

The only conceptual bridge is that we're specifying a flow resource as a resource_vertex. When the matcher matches a node request that wants 50 watts of power, after the matcher discovers that there is no child type of the dominant hierarchy named "power", it has to look through its other hierarchies for an edge named "power". I believe that is the general idea.

dongahn commented 7 years ago

Thanks @lipari.

Because I don't know the best way to specify a flow resource request myself, allow me to expand on this to clarify the problem space.

It seems this specification is tailored for an up walk toward an auxiliary flow hierarchy.

A concern I have with using the up walk for general cases is that it is unclear if and how you can implement a meaningful flow-aware policy using this technique -- though there is a subset of policies you can enforce, for example, using only a specific named pdu (pdu1 or pdu2).

More specifically, say a power-aware matcher, through an up walk on the power hierarchy, evaluates that node2 has 50 watts available. Say its pdu has 60 watts available and the next-level powerpanel has 1 kW. So far so good, and node2 is evaluated to be "qualified."

Then, on to node3, which, say, shares the same pdu with node2. In this case, the matcher would again evaluate this node to be qualified because it has 50 watts along its path in the power supply hierarchy. But in this case, you can't select node2 and node3 at the same time because you don't have 100 watts along the power supply path.
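
For concreteness, the kind of auxiliary power hierarchy I have in mind in this scenario might look roughly like the following (names and capacities are illustrative only):

# illustrative power supply hierarchy (a resource description, not a jobspec)
- type: powerpanel            # 1 kW available
  with:
    - type: pdu               # pdu1: 60 W available
      with:
        - type: node          # node2: 50 W available along its path
        - type: node          # node3: also sees 50 W individually, but only 10 W remain once node2 takes 50 W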

Now, the scheduler can, of course, put some temporary staging info on the resource vertex in the power hierarchy and only allow the first match. But then, this will lead to many unsatisfiability cases.

So I currently believe that a richer power-aware policy can only be implemented when the matcher uses the power hierarchy as its dominant hierarchy (e.g., powerpanel0->pdu1->node...). For that case, the proposed form is not the most useful...

tpatki commented 7 years ago

There's also the scenario where power is not uniformly distributed among nodes, and is distributed based on the critical path of the application. The following example tries to capture that, with an 8-node job requesting 4 nodes at 150 W and 4 nodes at 200 W. This uses nodes as the dominant hierarchy (a power-based one is coming shortly).

resources:
  - type: slot
    with:
      - type: node
        count: 4
        with:
          - type: power
            count: 150
            unit: W
      - type: node
        count: 4
        with:
          - type: power
            count: 200
            unit: W
tpatki commented 7 years ago

Some examples that try to use power as the dominant hierarchy. I'm not sure yet what the granularity of power grouping will look like. Power-aware schedulers typically look at the full cluster's power, because they don't have a hierarchical view like Flux... I believe we can capture that as such, if we assume "cluster-power", "rack-power", and "node-power" as types.

1) 12 nodes spread across two racks with a non-uniform power request, 4 nodes with 100 W each and 8 nodes with 50 W each.

resources:
  - type: slot
    with:
      - type: rack-power
        count: 400
        unit: W
        with:
          - type: node
            count: 4
      - type: rack-power
        count: 400
        unit: W
        with:
          - type: node
            count: 8

2) Looking at the full cluster's power, no requirement of rack or node-level checks.

resources:
  - type: cluster-power
    count: 1600
    unit: W
    with:
      - type: node
        count: 16

3) Another example with 8 cores spread across 2 sockets on the same node, at 100 W.

- type: node-power
  count: 100
  unit: W
  with:
    - type: socket
      count: 2
      with:
        - type: core
          count: 4
lipari commented 7 years ago

@dongahn, I agree that the optimal solution for searching a dominant hierarchy for flow resources is to make flow part of the dominant hierarchy. The goal is to support any number of auxiliary hierarchies, and if the solution is to fuse them into one dominant hierarchy, then you will need to create a hierarchy that includes power, bandwidth, and ???

That's fine if you can pull that off elegantly. The only alternative I can see is to up-walk the auxiliary hierarchies and do the incremental staging activity like the current resrc prototype does.

We may need to modify RFC 14 to directly support graph associations. Right now, a matcher has to look for with resources in the dominant hierarchy, and only if they are not found does it look for any associated edges. A mod to RFC 14 would create a graph type. When a vertex includes a with to a graph type, it would cue the matcher to search its auxiliary graphs for that graph type and do the up walk for the requisite flow quantity or connectivity.
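
A rough sketch of what that might look like in a jobspec (the graph type and its keys below are hypothetical, just to make the idea concrete):

  - type: node
    count: 4
    with:
      - type: graph       # hypothetical: marks an association into an auxiliary graph
        name: power       # hypothetical key: which auxiliary hierarchy to up-walk
        count: 50
        unit: W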

dongahn commented 7 years ago

Some examples that try to use power as the dominant hierarchy. I'm not sure yet what the granularity of power grouping will look like. Power-aware schedulers typically look at the full cluster's power, because they don't have a hierarchical view like Flux... I believe we can capture that as such, if we assume "cluster-power", "rack-power", and "node-power" as types.

@tpatki: yeah, some variants of what you are showing here should work when you choose to use a power hierarchy as the dominant hierarchy of the matcher.

dongahn commented 7 years ago

I agree that the optimal solution for searching a dominant hierarchy for flow resources is to make flow part of the dominant hierarchy. The goal is to support any number of auxiliary hierarchies, and if the solution is to fuse them into one dominant hierarchy, then you will need to create a hierarchy that includes power, bandwidth, and ???

I don't think you want to fuse all hierarchies into one at all. There are some optimizations you could use to create a virtual hierarchy like that (which we talked a bit about in the past), but that wouldn't be the general case. But the scheduler can "choose" which hierarchy it wants to use as its dominant hierarchy. I was just asking: when this is indeed the case (as this will make it possible to write useful policies), what should an example jobspec look like...

dongahn commented 7 years ago

OK. Based on these examples above, let me create a few example power hierarchies and job specs to further our discussions.

tpatki commented 7 years ago

@dongahn, @morrone

I've been a bit confused about the representation in the jobspec. For example, my previous yaml spec (copied below) could be specified in two ways, depending on the values for "type".
The first assumes "node-power" is a type, and is of course more concise and easier to follow than the latter, although I'm not sure if we have a list of types somewhere or if we're looking at jobspec verbosity at all. I haven't looked at the generator/parser for yaml yet; these are just manual examples I could come up with.

For example:

resources:
  - type: node-power
    count: 100
    unit: W
    with:
      - type: socket
        count: 2
        with:
          - type: core
            count: 4

Could be something like:

resources:
  - type: cluster
    with:
      - type: power
        count: 100
        unit: W
        with:
          - type: node
            count: 1
            with:
              - type: socket
                count: 2
                with:
                  - type: core
                    count: 4
dongahn commented 7 years ago

As I understand it, type can be anything. This way, we can model many different resources that we cannot even conceive of today. When it comes down to my matcher, what it boils down to is a string match between a type of a jobspec object and a type in the visiting resource vertex.

When you use the power hierarchy as your dominant hierarchy, you will traverse this hierarchy in a depth-first search, and that's why whether the power type appears above or below the compute resources becomes relevant, if that makes sense.

tpatki commented 7 years ago

Okay, that's good to know that there are no restrictions on types and that my understanding was correct. So my earlier examples seem to be appropriate -- as they are looking at allocating power before allocating nodes, which makes power the dominant hierarchy.

dongahn commented 7 years ago

Well, there is one exception :-) Reserved type(s): currently slot.

tpatki commented 7 years ago

Yes. If I understand correctly, reserved type "slot" will be used for grouping resources at the same level -- like I did in example 1? Please do let me know if I got that example jobspec right.

dongahn commented 7 years ago

I wouldn't know how to contain a program to the rack-power resource though...

tpatki commented 7 years ago

I think for most power-related scheduling, either a direct power cap (RAPL) or an indirect power control loop using DVFS will be used. These are all node level techniques, but most algorithms for scheduling assume a "soft bound" on the rack and use a monitoring system to regularly ensure that this rack-level bound is not exceeded in a closed-loop form. This is because most systems have a physical rack-level PDU/circuit breaker.

As a first cut, we can ignore rack-power though, and only assume "cluster-power-->node-power". This should be sufficient for our implementation, as I'm assuming when we launch a scheduler at the parent-level, the "cluster-power" type will have the information about the power/nodes for that particular flux instance...

Maybe a more appropriate term would be "instance-power" instead of "cluster-power"

dongahn commented 7 years ago

rack-power can be easily modeled. But at the root of the power hierarchy, you won't likely have cluster-power, but more likely something like powerpanel-power, or in short, powerpanel.

Further, for rack-power, would you want that to be named rack-power or simply pdu with some unique id?

In any case, if the jobspec specifies these power quantities in a fully hierarchical fashion, the matcher will use that as an additional constraint. If they are partially specified (starting from node-power downwards), the matcher can use the specific matching policy of the matcher plugin. This would be analogous to the difference between a fully vs. partially hierarchical spec in the containment hierarchy: cluster[1]->rack[2]->node[3]->socket[1]->core[4] vs. simply node[3]->socket[1]->core[4].
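
For reference, here is a minimal YAML rendering of those two containment forms (a direct transcription of the shorthand above, just for illustration):

# fully hierarchical: cluster[1]->rack[2]->node[3]->socket[1]->core[4]
- type: cluster
  count: 1
  with:
    - type: rack
      count: 2
      with:
        - type: node
          count: 3
          with:
            - type: socket
              count: 1
              with:
                - type: core
                  count: 4

# partially hierarchical: node[3]->socket[1]->core[4]
- type: node
  count: 3
  with:
    - type: socket
      count: 1
      with:
        - type: core
          count: 4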

dongahn commented 7 years ago

As a first cut, we can ignore rack-power though, and only assume "cluster-power-->node-power". This should be sufficient for our implementation, as I'm assuming when we launch a scheduler at the parent-level, the "cluster-power" type will have the information about the power/nodes for that particular flux instance...

Maybe a more appropriate term would be "instance-power" instead of "cluster-power"

In a child instance, the scheduler can have the full power hierarchy information. The only difference would be that, what's allocated to the child becomes what's available for further scheduling.

tpatki commented 7 years ago

Ah, I see where you're coming from -- I think I'm confusing physical vs logical hierarchy again.

If it's a physical hierarchy that has "hard-bounds" that we never change (such as node TDP or PDU supply limit), then powerpanel -> pdu_id -> node -> socket makes sense, as we cannot exceed these bounds ever.

I was suggesting a more dynamic "soft-bound" or logical hierarchy: a common scenario when scheduling power is to move power between instances/jobs during their execution, as well as to change power allocations based on other jobs prior to launch.
I was imagining a scenario of borrowing or donating power between instances, jobs, or ranks within a job by simply updating the jobspec -- changing the "count" for "type: instance-power" or "type: rack-power" prior to launch or during a job's execution -- which is a common scenario in power scheduling. This of course will have to adhere to the hard constraints and should never exceed the powerpanel or pdu limit for that particular resource.

Question: Are we assuming that flux-instances that users/sysadmins launch will also have a jobspec?

dongahn commented 7 years ago

change power allocations based on other jobs prior to launch.

Could you expand on this a bit? Are you talking about choosing your power request between min and max based on the current scheduling state? If this is what you need, I believe the current logic in dealing with min/max for compute resource should just suffice.
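
For reference, a ranged power request along those lines might look something like the following sketch (the exact min/max range keys here are my assumption, not something settled in this thread):

- type: node
  count: 4
  with:
    - type: power
      count:
        min: 100      # assumed range syntax; exact keys TBD
        max: 200
      unit: W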

Question: Are we assuming that flux-instances that users/sysadmins launch will also have a jobspec?

Good question. Given a flux-instance is a program, I expect it will submit a jobspec...

tpatki commented 7 years ago

change power allocations based on other jobs prior to launch.

Could you expand on this a bit? Are you talking about choosing your power request between min and max based on the current scheduling state? If this is what you need, I believe the current logic in dealing with min/max for compute resource should just suffice.

Sure. There are two things at play when scheduling power. The first assumes that the user knows how much power they want, which can be specified with the min-max example that you talked about. This is, however, not a fair assumption because power is microarchitecture dependent, and users may not have access to power monitors. Also, because of manufacturing variability, processors on a cluster with the exact same microarchitecture can consume different amounts of power (~25% difference), and depending on which node the user's job gets scheduled on, they might not be able to give an accurate min-max.

This is why in power scheduling, typically, system overriding should be permitted. Hence, the second approach assumes you're optimizing for say energy savings or for maximizing performance under a power bound or for placement of power-hungry jobs on "good" nodes; and that the system scheduler has full control to override user requests for power.

For example, let us say a user requests 200 W on a node, and the system scheduler launches their job with 200 W of power. However, via regular monitoring, the system scheduler realizes that the user's job is consuming less than 100 W, even though 200 W were allocated. For a long-running job, this may result in significant underutilization of power, and the remaining 100 W could be transferred to another job that can utilize it better, or be used to launch a new job. In such a scenario, the min-max as currently defined (user request) is not sufficient. We may need to update the jobspec of a job that hasn't been launched yet to have an additional 100 W, and update the parent instance's spec to reduce those 100 W for the currently executing job. Not sure if this can be captured with min-max.

Thoughts?

dongahn commented 7 years ago

Also, because of manufacturing variability, processors with the exact same microarchitecture can consume different amounts of power (~25% difference)

I believe you should be able to group the compute nodes based on their power-performance efficiency (e.g., representing it as an additional level in your power hierarchy) for more effective scheduling. Such a modeling exercise will be very good to have in the near future.

Hence, the second approach assumes you're optimizing for say energy savings or for maximizing performance under a power bound or for placement of power-hungry jobs on "good" nodes; and that the system scheduler has full control to override user requests for power.

If your resource representation already has the notion of good nodes, many of these can be done in the near future?

However, via regular monitoring, the system scheduler realizes that the user's job is only consuming < 100 W but 200 W were allocated.

I believe this will require a dynamic scheduling capability + grow/shrink services in our task/execution service. RFC 8. We will want to be able to do this elastic scheduling/execution for many resource types beyond power. But it will take some time before we will be able to get to such an advanced topic.

So, my suggestion would be to keep this in our back pocket for now and make sure we can get the best mileage out of the current capabilities (min/max + flexible ways to represent resource relations). I'm glad you mentioned this so that we know we ultimately have to get to it.

tpatki commented 7 years ago

Also, because of manufacturing variability, processors with the exact same microarchitecture can consume different amounts of power (~25% difference)

I believe you should be able to group the compute nodes based on their power-performance efficiency (e.g., representing it as an additional level in your power hierarchy) for more effective scheduling. Such a modeling exercise will be very good to have in the near future.

Hence, the second approach assumes you're optimizing for say energy savings or for maximizing performance under a power bound or for placement of power-hungry jobs on "good" nodes; and that the system scheduler has full control to override user requests for power.

If your resource representation already has the notion of good nodes, many of these can be done in the near future?

This is a great suggestion! I can bin nodes based on efficiency into say k-groups, and then have a hierarchy with a type: efficiency-bin-<1,k> to address some of these issues.

I believe this will require a dynamic scheduling capability + grow/shrink services in our task/execution service. RFC 8. We will want to be able to do this elastic scheduling/execution for many resource types beyond power. But it will take some time before we will be able to get to such an advanced topic.

Sure, makes sense. We can limit it to the simple "hard-bound" cases I described before. One random question for future thinking, though: for RFC 8, it seems like we will need the ability to dynamically change the jobspec if needed. Not sure what protocol we'll use at that point.

dongahn commented 7 years ago

This is a great suggestion! I can bin nodes based on efficiency into say k-groups, and then have a hierarchy with a type: efficiency-bin-<1,k> to address some of these issues.

Sure. The beauty of Flux's resource model is exactly this flexibility, and for the "resource-selection" piece of our scheduling, we should think about how to adapt the resource representations in conjunction with adapting our algorithms to get what we want, which is a much different way of thinking than the traditional approach. The traditional approach is to adapt your algorithm to a pre-existing rigid resource representation...

BTW, the same grouping should work to address resource-aging issues, as the efficiency of compute/storage hardware is known to change as the hardware ages.

One random question for future thinking though, for RFC 8, it seems like we will need the ability to dynamically change the jobspec if needed. Not sure what protocol we'll use at that point.

This is an open question. For scheduling, Suraj and @SteVwonder have been looking at the resource scheduling coherence protocol for dynamic scheduling. But it's a bit premature.

Like I said, I am not sure if this should be done by changing the jobspec. I think a jobspec can have a way to specify whether the job should be subject to dynamic scheduling, but once the job gets an allocation, it is unclear whether the jobspec will play a further role.

dongahn commented 7 years ago
resources:
  - type: node-power
    count: 100
    unit: W
    with:
      - type: node
        count: 4
        with:
          - type: socket
            count: 2
            with:
              - type: core
                count: 4

Trying to test this easy case first and realizing this wouldn't work well. with would be a multiplicative edge, and this will try to get 400 nodes if I'm not mistaken.

tpatki commented 7 years ago

I'm confused. You mean 400 W, I think, not 400 nodes, which would be correct if you wanted 4 nodes with 100 W each. "type: node-power" will be a per-node power cap (so you can only allocate socket and core power, but not multiple nodes). So it shouldn't have "4" nodes under it.

The correct job spec to request 4 nodes, each with 100 W on a cluster will be something like:

resources:
  - type: cluster-power
    count: 400
    unit: W
    with:
      - type: node
        count: 4
        with:
          - type: node-power
            count: 100
            unit: W
            with:
              - type: socket
                count: 2
                with:
                  - type: core
                    count: 4
dongahn commented 7 years ago

This is because of the multiplicative effect of the with edge. It was discussed as part of @morrone's previous PR at https://github.com/flux-framework/rfc/pull/93#issuecomment-309891034.



tpatki commented 7 years ago

Let me edit the examples to follow our new plan. So from what I was thinking earlier:

cluster-power -> rack-power -> node-power -> socket-power 

to:

powerpanel -> pdu_id -> efficiency_grp_id -> node-power -> socket-power
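
A rough sketch of a jobspec following that revised hierarchy (the counts and the efficiency_grp type name are just placeholders):

resources:
  - type: powerpanel
    count: 1
    with:
      - type: pdu
        count: 1
        with:
          - type: efficiency_grp      # placeholder for an efficiency_grp_id bin
            count: 1
            with:
              - type: node-power
                count: 100
                unit: W
                with:
                  - type: socket-power
                    count: 50
                    unit: W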
tpatki commented 7 years ago

@dongahn: let me take a proper look at the multiplicative effect and come up with a better jobspec. I must have missed something in understanding how the jobspec is parsed.

dongahn commented 7 years ago

@tpatki: Parsing would be fine as your syntax looks okay. I think it is a semantics issue. Semantically, with is multiplicative.

  - type: rack
    count: 2
    with:
      - type: node
        count: 10

specifies that you request 20 nodes: 2 racks, each with 10 nodes. Likewise, if we naively interpreted your spec:

resources:
- type: cluster-power
  count: 400
  unit: W
  with: 
    - type: node
      count: 4

You specified 1600 nodes: 400 cluster powers, each with 4 nodes.

If we want to fit what you want using the current idiom set, something like this can be workable:

- type: cluster
  count: 1
  with:
    - type: power
      count: 400
    - type: rack
      count: 2
      - type: power
        count: 200 
      - type: node
        count: 2
        - type: power
          count: 100

But this would be a bit too much for a user to specify.

Overall, in terms of specifying power requirements, I find @lipari's original proposal a bit clearer given the expressiveness and limitations of our current jobspec. My concern was that it is not best suited when we want to use a power hierarchy as our dominant one. But maybe we can approach this problem by introducing a jobspec transformation step that transforms the spec to match the underlying dominant hierarchy representation...
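
As a hedged illustration of that transformation idea, a node-rooted request could be rewritten into a power-rooted form before matching; both shapes borrow from the examples in this thread, and the rewrite step itself is hypothetical:

# user-facing request, rooted in the containment hierarchy
- type: node
  count: 4
  with:
    - type: power
      count: 100
      unit: W

# hypothetical transformed request, rooted in the power hierarchy
- type: pdu
  count: 1
  with:
    - type: power
      count: 400        # 4 nodes x 100 W aggregated at the pdu level
      unit: W
    - type: node
      count: 4
      with:
        - type: power
          count: 100
          unit: W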

One question to you @tpatki: do you ever want to express higher-level power requirements at the user level (e.g., rack-power)? Or can this be internal to the scheduler match plugin?

tpatki commented 7 years ago

Hi Dong,

Maybe I'm not understanding this at all, but can you please explain how the second example is 1600 nodes? Note that "cluster-power" is different than "cluster". The unit for "cluster-power" is W -- the count of 400 is for power, not for nodes, at least the way I view it.

If we go bottom-up and multiply like you did, this would result in 4 nodes * 400 W, which is 1600 W. In this case, the top-level "cluster-power" count will have to be modified to be 100 W, so we can allocate a total of 400 W (not 1600 W). This is confusing to me, because I was thinking of "node-power" and "cluster-power" as two different "types" or power banks (which they should be). If we were to go down that route, we'd have to use "node-power" of 100 W, not "cluster-power".

So, your second example will look like:

resources:
- type: node-power
  count: 100
  unit: W
  with: 
    - type: node
      count: 4

Note that in that scenario, we may have a feasibility check problem (you can imagine cluster-power being rack-power instead, and not having enough power on the rack but having 4-nodes with 100 W each in the cluster).

With a purely multiplicative jobspec, there's no way I can think of to verify that resources can be sufficiently divided. If a user wants something specific such as "4 nodes on 2 efficiency-groups or 2 racks with 100 W power per node", there is no way to ensure that both constraints (node power of 100 W and racks having 400W each) are met with power as the dominant hierarchy.

The only way to accomplish this will be to resort to a dominant hierarchy of "cluster or racks or nodes", with "power" being a child of these types (with clause). These were Don's and my earlier examples, and your third example. But this will mean that we can't guarantee that power specifications will be met or use power as a dominant hierarchy. For resources like power, we need a closed-loop approach, and the current one is open-loop.

My initial understanding however was that we were taking a top-down or a depth-first approach, which would mean that we will first look if "cluster-power" of 400 W are available in the power-scheduler, then look at how many nodes are requested (4), and then come to the conclusion that each node gets 400/4 = 100 W.

lipari commented 7 years ago

My concern was that it is not best suited when we want to use a power hierarchy as our dominant one.

I want to suggest that there is minimal risk to always making the composite pool (containment) hierarchy the dominant hierarchy. As much as we strive to maintain generic-ness, the guts of virtually all job requests will be processing units and number of tasks. I can't think of a practical use case where a job request's primary focus is power or bandwidth with nodes, cores and GPUs as secondary. Hence, I don't think we'd close off any practical avenues of search and selection by decreeing that the dominant hierarchy will always be the containment one.

@dongahn, do you see a scenario where the power hierarchy is searched first, and processing units are auxiliary "up-walks"?

Another way to think of this is that the resources of a Flux instance are going to be divvied up and distributed to child jobs. These resources are going to be members of the containment hierarchy: of the 10 nodes in a flux instance, 5 will go to a child job, and 2 will go to the grandchild job. Hence, the primary schedulable entities come from the containment hierarchy. And I would expect that the dominant hierarchy for the parent, child, and grandchild schedulers would all be the same: the containment hierarchy.

dongahn commented 7 years ago

This is a good point, and thank you for asking this question. Unfortunately, I will just have to give a philosophical answer.

My general position is to make the common HPC case fast but still have extension points/customizability to make advanced cases possible.

I agree that what you described would be the common case for HPC for the time being, but I am not sure if this will continue to be the case over the next 30 years as we design software that should last a long time... And I am also not sure if this will be the case when Flux is brought to other domains of computing (big data clusters, etc.).

So I thought it makes sense to position ourselves so that we don't have to corner ourselves later down the road. If we can organize resources using the concept of subsystems or hierarchies (customizable) and let the scheduler choose one as its main hierarchy for its scheduling objective, I thought this would provide one knob to help with future-proofing. Being able to override the DFU traverser with something more sophisticated would be another extension point.

Maybe I am overthinking this, though.

tpatki commented 7 years ago

@dongahn, @lipari:

Can you please explain how Dong's third example will be parsed (pasted below)? That is, an example where two "types" are at the same level? This spec will allocate a cluster with "400 W AND 2 racks with 200 W", and will ensure that the two allocations are intersected (not additive), correct? If so, can you look at suggestion (2) below and tell me if that can be a valid spec?

- type: cluster
  count: 1
  with:
    - type: power
      count: 400
    - type: rack
      count: 2
      - type: power
        count: 200 
      - type: node
        count: 2
        - type: power
          count: 100

Two ways for power specifications:

1) We go with Don's initial format, which will result in verbose expressions (example 3 from Dong). Note that in most cases users won't be specifying this jobspec; this will be the jobspec for a system-level flux-instance, so this should be okay. We can also write a simple generator for this to make it easier for the user if needed.

2) We can add support for multiple "types" at the very first level in the jobspec and guarantee that both constraints are met (use an intersection or and of the two types when parsing or scheduling). This will make it harder to understand which hierarchy to parse first though. I'm not sure if this is a valid specification right now. This will solve the feasibility/multiplicative issue as well as deep-hierarchy/verbosity issue. I think we can also use "slot" for this grouping if needed to resolve the multiple-hierarchy issue.

An example would be something like:

resources:
  - type: cluster-power
    count: 400
    unit: W
  - type: node-power
    count: 100
    unit: W
    with:
      - type: node
        count: 4
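
A variant of the same request grouped under slot, per the note in (2) above (whether this is a valid specification is exactly the open question):

resources:
  - type: slot
    with:
      - type: cluster-power
        count: 400
        unit: W
      - type: node-power
        count: 100
        unit: W
        with:
          - type: node
            count: 4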
dongahn commented 7 years ago

Can you please explain how Dong's third example will be parsed (pasted below)?

Oops. withs are missing. Sorry about that.

- type: cluster
  count: 1
  with:
    - type: power
      count: 400
    - type: rack
      count: 2
      with:
      - type: power
        count: 200 
      - type: node
        count: 2
        with:
        - type: power
          count: 100
tpatki commented 7 years ago

@dongahn: My question wasn't about the missing with:, but more on the semantics of intersection of resources when matching.

In your example above, we assume that under with:, if two types (e.g., type: power and type: rack) are at the same level (not hierarchically organized using a with:), an intersection of the two allocations will be made. That is, the matcher/scheduler will not SEPARATELY allocate two racks (each with 200 W) and a cluster power of 400 W.

More specifically, what would happen to the jobspec below? Will this allocate 4 nodes AND 8 cores, or 4 nodes WITH 8 total cores (2 cores per node)?

resources:
  - type: cluster
    count: 1
    with:
      - type: node
        count: 4
      - type: core
        count: 8

I'm assuming the latter from your example (WITH operation), which means that we should be able to use the power-hierarchy example below to ensure that the feasibility check is met without having a multiplicative effect:

resources:
  - type: cluster-power
    count: 400
    unit: W
  - type: node-power
    count: 100
    unit: W
    with:
      - type: node
        count: 4
dongahn commented 7 years ago

@tpatki:

Ah.

resources:
  - type: cluster
    count: 1
    with:
      - type: node
        count: 4
      - type: core
        count: 8

Under the resource representation we have been thinking about, this will return "no match" because there is no cluster resource type with node and core types directly underneath it, unless I'm mistaken. I don't believe we allow a partial specification other than the omission of the prefix to the first-level resource types. The spec currently supports a forest of trees, and the first level would be the roots of these trees. So, if every node to be selected is assumed to be in the same cluster:

resources:
- type: node
  count: 4
  with:
    - type: socket
      count: 1
      with:
        - type: core
          count: 8
dongahn commented 7 years ago
resources:
  - type: cluster-power
    count: 400
    unit: W
  - type: node-power
    count: 100
    unit: W
    with:
      - type: node
        count: 4

I think what's a valid specification really goes back to what your resource graph looks like. We do graph matching; so to me, a jobspec in a sense is an abstract sub-graph that is used for the matching of two graphs. So, if your resource graph representation is simply:

node-power->node
cluster-power

Then, the matcher will match yours (except for that multiplicative effect). But it will be a different story whether this will achieve the desired scheduling policy.

I'm not sure if these help...

morrone commented 7 years ago

Question: Are we assuming that flux-instances that users/sysadmins launch will also have a jobspec?

I am not totally clear on the question, but I think the answer is no.

One might submit a jobspec to a flux instance (the "parent") and the application that the jobspec says to run is another flux instance (the "child"). But it would not be useful to say that the child flux instance "has" a jobspec. The jobspec was only used by the parent to assign resources and execute the child.

When the child flux instance runs, it will be able to get a full accounting of the resources allocated to it. But that block of information that describes its resources is not a jobspec. It is a resource description.

And remember that flux instances can be started in other ways in which a jobspec was never involved. For instance, if we use slurm to launch a flux instance, there was no jobspec used to start that instance. It "has" resources allocated to it, but it doesn't "have" a jobspec. It can accept jobspecs once it is running.

morrone commented 7 years ago

We do graph matching; so to me, a jobspec in a sense is an abstract sub-graph that is used for the matching of two graphs.

I think that I would just caution that the graph matching that you are doing is an implementation detail for one class of scheduler. It is not something either required or implied by the jobspec. From discussions we have had, it sounded likely that we would also have a simple high-throughput FIFO scheduler that would not necessarily employ graph-based algorithms.

dongahn commented 7 years ago

I think that I would just caution that the graph matching that you are doing is an implementation detail for one class of scheduler. It is not something either required or implied by the jobspec. From discussions we have had, it sounded likely that we would also have a simple high-throughput FIFO scheduler that would not necessarily employ graph-based algorithms.

Yes, good point. I can imagine the underlying resource representation that a FIFO scheduler uses being core[1024], which is one compute-core resource pool of size 1024, with the scheduler scheduling a jobspec against this core pool in aggregate (fast). While it is a degenerate case, I can still call this graph matching? A graph with one vertex is still a graph :-). Is this along the lines of what you are thinking, @morrone? Or are you expecting that the underlying resource representation can go completely astray from our flux resource model?

morrone commented 7 years ago

I suppose you can think of it as a degenerate graph...but if the implementation doesn't actually use any graph-related code in executing the scheduling algorithm then I'm not sure what purpose that mental model serves.

dongahn commented 7 years ago

Maybe I should have said "resource matching"; graph matching is one implementation of resource matching.

dongahn commented 7 years ago

As I thought about this a bit more, it seems a reasonable resource model for a flow resource like power and bandwidth would consist of two components: 1) distribution hardware and 2) its distribution capacity. And I think we can model 1) as a distinct resource type (e.g., pdu, edge-switch) and 2) as its child resource vertex. So for power, the distribution hardware would include power-panel, pdu and even certain compute hardware itself such as compute node; and each can have the maximum power (watt or volt amp) from which to draw. I think this can apply to other flow resource types.

Using this concept, the power hierarchy of a system can look like:

[figure: example power hierarchy of a system (image not preserved)]
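
Reconstructing from the description above, that hierarchy might look roughly like this (the original figure is not preserved; capacities are illustrative):

# illustrative system power hierarchy (a resource description, not a jobspec)
- type: powerpanel
  with:
    - type: power            # e.g., 400 W of distribution capacity
    - type: pdu
      with:
        - type: power        # e.g., 200 W per pdu
        - type: node
          with:
            - type: power    # e.g., 100 W per node
            - type: socket
              with:
                - type: core
                - type: gpu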

And if the scheduler chooses to use this as its dominant hierarchy, a jobspec like:

version: 1
resources:
  - type: node
    count: 2
    with:
      - type: power
        count: 100
      - type: socket
        count: 1
        with:
          - type: core
            count: 1
          - type: gpu
            count: 1

# a comment
attributes:
  system:
    duration: 1 hour
tasks:
  - command: app
    slot:
      type: node
    count:
      per_slot: 1

can be supported well, with a wide range of power scheduling policies implementable as a matcher callback plugin.

Further, if the system's people want to more explicitly control power allocation through these distribution units, a hierarchical specification like

version: 1
resources:
  - type: powerpanel
    count: 1
    with:
      - type: power
        count: 400
      - type: pdu
        count: 2
        with:
          - type: power
            count: 200
          - type: node
            count: 2
            with:
              - type: power
                count: 100
              - type: socket
                count: 1
                with:
                  - type: core
                    count: 1
                  - type: gpu
                    count: 1
# a comment
attributes:
  system:
    duration: 1 hour
tasks:
  - command: flux-start
    slot:
      type: node
    count:
      per_slot: 1

can also be supported.

The first node-level spec would also be satisfiable when the scheduler uses the containment hierarchy as its dominant hierarchy and power as an auxiliary. Like I said before, however, it will be a bit difficult to implement meaningful power policies.

Note that, under such a scheduler, the second spec will result in “no match.”

Unless I hear otherwise, these will be some of my test cases.

tpatki commented 7 years ago

Hi Dong,

Yes, this is precisely what I was getting at with a simpler example in my email to you yesterday. Changing the root to be a powerpanel instead of a cluster allows us to use the power hierarchy and supports both jobspecs. I think this will be more than sufficient for most policies. The only thing that needs to be ensured is that a jobspec such as the one below is rejected immediately with a "no match"; that was my question from yesterday, because I didn't know where this would occur in the current strawman code.

version: 1
resources:
  - type: powerpanel
    count: 1
    with:
      - type: power
        count: 400
      - type: pdu
        count: 2
        with:
          - type: power
            count: 200
          - type: node
            count: 2
            with:
              - type: power
                count: 300   <====
              - type: socket
                count: 1
                with:
                  - type: core
                    count: 1
                  - type: gpu
                    count: 1
dongahn commented 7 years ago

This can be invalidated either by the spec validator module or by your matcher callback plugin. For planner optimization, I will need to compute aggregates for a jobspec anyway, and this can be checked during that procedure, I guess.

morrone commented 7 years ago

As I thought about this a bit more, it seems a reasonable resource model for a flow resource like power and bandwidth would consist of two components

I'm a little confused about what just happened. You seem to have redefined the problem to eliminate any use of non-dominant hierarchies and up walks. Sure, if we have a single dominant tree-based hierarchy, the problem is much easier. But then, there is nothing "advanced" about it anymore, right? You redefined the problem to make it fit into the "basic" model. Power is now just another abstract resource, the way that any other thing could be an abstract resource.

The fact that the solution here was to eliminate up walks and interaction between a dominant and non-dominant hierarchy makes me again question whether we know of any real-world use case where the up walks are actually going to be useful in the current scheduler design.

dongahn commented 7 years ago

@morrone:

Good point and question, although I'm not as pessimistic as you are on the usefulness of the auxiliary hierarchy up walk, for two reasons: 1) in particular, with planner, I believe there will be a subset of policies one can implement (pending demonstrations) that can achieve certain scheduling objectives on both the dominant and auxiliary hierarchies; 2) perhaps more importantly, this can be our baseline from which to try more aggressive ways to achieve schedule optimization across dominant and auxiliary hierarchies. As I noted in response to @lipari yesterday, DFU is just one traversal type that I want to support initially. With a repeat (loop) capable traversal, the up walk can be more useful.

As I mentioned:

The first node-level spec would also be satisfiable when the scheduler uses the containment hierarchy as its dominant hierarchy and power as an auxiliary. Like I said before, however, it will be a bit difficult to implement meaningful power policies.

The reason I said "it will be a bit difficult" is that this is one of those things that needs to be tried once we have all of the infrastructure components in place. We are not there yet. BTW, I also want to mention that the general scheduling problem we are facing is known to be NP-hard, so we will always have to rely heavily on heuristics. There will be a fair amount of trial and error.

Oh, finally @morrone: I guess I'm also in general an optimist :-)

tpatki commented 7 years ago

@morrone:

Like Dong pointed out, a scheduling policy that does a "closed loop" optimization (allocate-monitor-reallocate based on heuristics) will need an aggregator (Planner API) and up-walk capability. This is because it will need to traverse the hierarchies top-down first (divisive) and then bottom-up (DFU-like) to make sure all constraints are met. I don't have a clear picture of how to do this in Flux yet, but this is already a "real" use case for power, as I pointed out earlier in this thread. Within the Argo project as well as the other power runtime ECP projects, there is a requirement to be able to move power dynamically between application phases. We're not targeting that use case just yet, so that we can gain confidence with the simpler open-loop (single-pass) approach first.

I think the upwalk will also be needed for multi-constraint policies that optimize for say power and IO.

@dongahn: I like to be optimistic too :)

tpatki commented 7 years ago

Here's an example where an up walk will be useful. We begin by walking down the node hierarchy and do an up walk to check power constraints (or vice versa). But I couldn't capture this in the jobspec properly (I need to think more about how to do that, and about when to utilize such a graph as opposed to when to keep it simple).

[figure: example hierarchy combining a node-hierarchy down walk with a power-hierarchy up walk (image not preserved)]

dongahn commented 7 years ago

I got most of the feedback I needed. Closing.