flux-framework / flux-sched

Fluxion Graph-based Scheduler

Summarize the findings from resource-query effort #269

Closed: dongahn closed this issue 7 years ago

dongahn commented 7 years ago

Creating this issue to capture the findings from my resource-query effort, a precursor to our resource comms module.

I will soon create a resource-query repo under my personal space and start doing performance and other studies with it throughout this week. Our ECP milestone is due at the end of this week.

I've already spoken with @lipari, as getting his early feedback on this, as well as on the planner PR (#268), should be very useful at this stage.

dongahn commented 7 years ago

Here is the diagram of our scheduling infrastructure, which I presented at a meeting; this high-level architecture has remained intact, and resource-query uses this infrastructure.

[figure: rearchitecting]

dongahn commented 7 years ago

resource-query usage:

usage: resource-query [OPTIONS…]

Command-line utility to select HPC resources in accordance with a
resource-matching policy, given a job specification written in
Flux's Canonical Job Specification (RFC 14).

Read in a resource-graph generation recipe written in the GRUG format
and populate a resource-graph data store representing the compute and
other HPC resources and their relationships (RFC 4).

Provide a simple command-line interface (cli) to allow users to allocate
or reserve the resource set in this resource-graph data store.
Traverse the resource graph in a predefined order for resource selection.
Currently, only one traversal type is supported: depth-first traversal on the
dominant subsystem and up-walk traversal on one or more auxiliary subsystems.

OPTIONS allow for using a predefined matcher that is configured
to use a different set of subsystems as its dominant and/or auxiliary
ones to perform matches on.

OPTIONS allow for instantiating a different resource-matching
selection policy--e.g., select resources with high or low IDs first.

OPTIONS also allow for exporting the filtered graph of the used matcher
in a selected graph format at the end of the cli session.

To see the cli commands, type "help" at the cli prompt, i.e.,
  % resource-query> help

OPTIONS:
    -h, --help
            Display the usage information

    -G, --grug=<genspec>.graphml
            GRUG resource graph generator specification file in graphml
            (default=conf/default)

    -S, --match-subsystems=<CA|IBA|IBBA|PFS1BA|PA|C+IBA|C+PFS1BA|C+PA|IB+IBBA|C+P+IBA|VA|V+PFS1BA|ALL>
            Set the test matcher to use. Available matchers are:
                CA: Containment Aware
                IBA: InfiniBand connection-Aware
                IBBA: InfiniBand Bandwidth-Aware
                PFS1BA: Parallel File System 1 Bandwidth-aware
                PA: Power-Aware
                C+IBA: Containment and InfiniBand connection-Aware
                C+PFS1BA: Containment and PFS1 Bandwidth-Aware
                C+PA: Containment and Power-Aware
                IB+IBBA: InfiniBand connection and Bandwidth-Aware
                C+P+IBA: Containment, Power and InfiniBand connection-Aware
                VA: Virtual Hierarchy-Aware 
                V+PFS1BA: Virtual Hierarchy and PFS1 Bandwidth-Aware 
                ALL: Aware of everything.
            (default=CA).

    -P, --match-policy=<low|high|locality>
            Set the resource match selection policy. Available policies are:
                high: Select resources with high ID first
                low: Select resources with low ID first
                locality: Select contiguous resources first in their ID space
            (default=high).

    -g, --graph-format=<dot|graphml>
            Specify the graph format of the output file
            (default=dot)

    -o, --output=<basename>
            Set the basename of the output file
            For AT&T Graphviz dot, <basename>.dot
            For GraphML, <basename>.graphml

dongahn commented 7 years ago

Generating Resources Using GraphML (GRUG)

Overview

GRUG is a GraphML-based language for specifying a resource-graph generation recipe. The resource service strawman can read in a GRUG file and populate its store of resource graph data conforming to Flux's resource model (RFC 4). The goal of GRUG is to help Flux scheduler plug-in developers easily determine the representation of this resource graph data (e.g., granularity of resource pools, relationships between resources, and subsystems/hierarchies used to organize the resources) that is best suited for their scheduling objectives and algorithms.
Without having to modify the source code of the resource service strawman, developers can quickly test various resource graph representations by modifying only the GRUG text file.

GraphML is an easy-to-use, XML-based graph specification language. GRUG uses the vanilla GraphML schema (http://graphml.graphdrawing.org) with no extension, so familiarity with GraphML is the only prerequisite for fluent use of GRUG; we found the available on-line GraphML materials particularly useful as a starting point.

GRUG

GRUG describes a resource-generation recipe as a graph. A vertex prescribes how the corresponding resource pool (or simply resource, for short) should be generated; an edge prescribes how the corresponding relationship between two resources should be generated. The edge properties also allow a small recipe graph to generate a large and complex resource graph store. A multiplicative edge has a scaling factor that generates the specified number of copies of resources of the target type. An associative edge allows a source resource to be associated with some of the already-generated resources in a specific manner.

The resource service strawman walks this recipe graph in depth-first order and emits and stores the corresponding resources and their relationship data into its resource graph store.
The recipe graph must be a forest of trees, where each tree represents a distinct resource hierarchy or subsystem. We use the terms hierarchy and subsystem interchangeably.

A conforming GRUG file is composed of two sections: 1) recipe graph definition and 2) recipe attributes declaration. We explain both in the following sections.
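
To show how the two sections fit together in a single file, here is a heavily abridged, illustrative skeleton. This is a sketch only: a conforming file must declare all of the attributes listed under Recipe Attributes Declaration below, and the full GRUG file at the end of this thread is the authoritative reference.

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">

    <!-- 1) recipe attributes declaration (abridged; a real GRUG declares all 16) -->
    <key id="type" for="node" attr.name="type" attr.type="string"/>
    <key id="basename" for="node" attr.name="basename" attr.type="string"/>
    <key id="subsystem" for="node" attr.name="subsystem" attr.type="string">
        <default>containment</default>
    </key>
    <key id="gen_method" for="edge" attr.name="gen_method" attr.type="string">
        <default>MULTIPLY</default>
    </key>
    <key id="multi_scale" for="edge" attr.name="multi_scale" attr.type="int">
        <default>1</default>
    </key>

    <!-- 2) recipe graph definition -->
    <graph id="tiny_cluster" edgedefault="directed">
        <node id="node">
            <data key="type">node</data>
            <data key="basename">node</data>
        </node>
        <node id="core">
            <data key="type">core</data>
            <data key="basename">core</data>
        </node>
        <!-- each node vertex generates 4 core vertices -->
        <edge id="node2core" source="node" target="core">
            <data key="multi_scale">4</data>
        </edge>
    </graph>
</graphml>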

Recipe Graph Definition

A recipe graph definition is expressed as GraphML's graph elements consisting of two nested elements: node and edge. A node element prescribes how to generate a resource pool, and an edge element prescribes how to generate relationships (RFC 4). For example, given the following definition,

<node id="socket">
     <data key="type">socket</data>
     <data key="basename">socket</data>
     <data key="size">1</data>
     <data key="subsystem">containment</data>
</node>

<node id="core">
    <data key="type">core</data>
    <data key="basename">core</data>
    <data key="size">1</data>
    <data key="subsystem">containment</data>
</node>

these node elements are the generation recipes for a socket and a compute-core resource (i.e., scalars), respectively, and both belong to the containment hierarchy.

<edge id="socket2core" source="socket" target="core">
    <data key="e_subsystem">containment</data>
    <data key="relation">contains</data>
    <data key="rrelation">in</data>
    <data key="gen_method">MULTIPLY</data>
    <data key="multi_scale">2</data>
</edge>

Here, this edge element is the generation recipe for the relationship between the socket and core resources. It specifies that for each socket resource, 2 new core resources will be generated (i.e., MULTIPLY with a scaling factor of 2); the forward relationship is contains and the reverse relationship is in. So each generated socket vertex ends up with two core children connected by contains edges (and in edges in the reverse direction).

A resource in one subsystem (e.g., the power hierarchy) can be associated with resources generated in another subsystem (e.g., the containment hierarchy); associative edges are used for this purpose.

<node id="pdu_power">
    <data key="type">pdu</data>
    <data key="basename">pdu</data>
    <data key="subsystem">power</data>
</node>

<edge id="powerpanel2pdu" source="powerpanel" target="pdu_power">
    <data key="e_subsystem">power</data>
    <data key="relation">drawn</data>
    <data key="rrelation">flows</data>
    <data key="gen_method">ASSOCIATE_IN</data>
    <data key="as_tgt_subsystem">containment</data>
</edge>

Here, this edge element is the generation recipe for the relationship between the powerpanel and pdu resources. It specifies that a powerpanel resource will be associated (i.e., ASSOCIATE_IN) with all of the pdu resources that have already been generated within the containment subsystem. The forward relationship is drawn and the reverse relationship is flows.
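
For reference, the powerpanel source vertex of this edge is not shown above; it would be defined by a node element along the following lines. This is a sketch, not the actual recipe, and assumes powerpanel is the root of the power subsystem.

<node id="powerpanel">
    <!-- illustrative sketch; the actual recipe is not shown in this thread -->
    <data key="root">1</data>
    <data key="type">powerpanel</data>
    <data key="basename">powerpanel</data>
    <data key="subsystem">power</data>
</node>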

Oftentimes, associating with all resources of a given type is too coarse. For cases where the hierarchical paths of the resources being associated can be used to narrow the association, the ASSOCIATE_BY_PATH_IN generation method can be used.

<edge id="pdu2node" source="pdu_power" target="node_power">
    <data key="e_subsystem">power</data>
    <data key="relation">drawn</data>
    <data key="rrelation">flows</data>
    <data key="gen_method">ASSOCIATE_BY_PATH_IN</data>
    <data key="as_tgt_uplvl">1</data>
    <data key="as_src_uplvl">1</data>
</edge>

Here, the method is similar to the previous one, except that the association is made only with the node resources whose hierarchical path at the parent level (i.e., as_tgt_uplvl=1) matches the hierarchical path of the source resource, also taken at the parent level (as_src_uplvl=1).

Recipe Attributes Declaration

This section appears right after the GraphML header and before the recipe graph definition section. To be valid GRUG, this section must declare all attributes for both the node and edge elements. Currently, there are 16 attributes that must be declared: 5 for node elements and 11 for edge elements. You are encouraged to define a default value for each attribute, which leads to more concise recipe definitions: a graph element inherits the default attribute values unless it specifically overrides them. The 16 attributes are listed in the following:

<!-- attributes for the recipe node elements -->
<key id="root" for="node" attr.name="root" attr.type="int"/>
<key id="type" for="node" attr.name="type" attr.type="string"/>
<key id="basename" for="node" attr.name="basename" attr.type="string"/>
<key id="size" for="node" attr.name="size" attr.type="long"/>
<key id="subsystem" for="node" attr.name="subsystem" attr.type="string"/>

<!-- attributes for the recipe edge elements -->
<key id="e_subsystem" for="edge" attr.name="e_subsystem" attr.type="string"/>
<key id="relation" for="edge" attr.name="relation" attr.type="string"/>
<key id="rrelation" for="edge" attr.name="rrelation" attr.type="string"/>
<key id="id_scope" for="edge" attr.name="id_scope" attr.type="int"/>
<key id="id_start" for="edge" attr.name="id_start" attr.type="int"/>
<key id="id_stride" for="edge" attr.name="id_stride" attr.type="int"/>
<key id="gen_method" for="edge" attr.name="gen_method" attr.type="string"/>
<key id="multi_scale" for="edge" attr.name="multi_scale" attr.type="int"/>
<key id="as_tgt_subsystem" for="edge" attr.name="as_tgt_subsystem" attr.type="string"/>
<key id="as_tgt_uplvl" for="edge" attr.name="as_tgt_uplvl" attr.type="int"/>
<key id="as_src_uplvl" for="edge" attr.name="as_src_uplvl" attr.type="int"/>

The root attribute specifies whether a resource is the root of a subsystem. If it is the root, 1 must be assigned.

id_scope, id_start and id_stride specify how the id field of a resource will be generated. The integer specified with id_scope defines the scope in which the resource id is generated: the id is local to the ancestor level given by id_scope. If id_scope is higher than the most distant ancestor, the id space becomes global.

For example, if id_scope=0, the id of the generated resource is local to its parent; if id_scope=1, the id is local to its grandparent. In a rack[1]->node[18]->socket[2]->core[8] configuration, if id_scope is 1, the id space of a core resource is local to the node level instead of the socket level. So the 16 cores in each node will have ids 0-15, instead of repeating 0-7 and 0-7, which would be the case if id_scope were 0.
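
As an illustrative sketch (the values are chosen to mirror the rack[1]->node[18]->socket[2]->core[8] example above, not taken from an actual GRUG file), a socket-to-core edge with node-local core ids could spell out the id-generation attributes like this:

<edge id="socket2core" source="socket" target="core">
    <data key="e_subsystem">containment</data>
    <data key="relation">contains</data>
    <data key="rrelation">in</data>
    <data key="gen_method">MULTIPLY</data>
    <data key="multi_scale">8</data>
    <!-- core ids are generated local to the node level: 0-15 per node -->
    <data key="id_scope">1</data>
    <data key="id_start">0</data>
    <data key="id_stride">1</data>
</edge>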

dongahn commented 7 years ago

GRUG Visualizer

The grug2dot utility can be used to generate a GraphViz dot file that renders the recipe graph. The dot file can be converted into SVG format by running dot -Tsvg output.dot -o output.svg:

Usage: grug2dot <genspec>.graphml
    Convert a GRUG resource-graph generator spec (<genspec>.graphml)
    to AT&T GraphViz format (<genspec>.dot). The output
    file only contains the basic information unless --more is given.

    OPTIONS:
    -h, --help
            Display this usage information

    -m, --more
            More information in the output file

dongahn commented 7 years ago

OK. I created the resource-query repo in my personal space and dumped the current code base there. (It likely won't build in your environment because of its planner and other dependencies.) If you're interested in building it, let me know.

dongahn commented 7 years ago

I will post some performance findings for the planner. In summary, I found no surprises and am pretty pleased with the initial results. All experiments were run on Quartz, and GCC 4.9.3 was used with -O3.

dongahn commented 7 years ago

[chart: add1dperf]

I measured how long it takes to add varying numbers of spans into a planner. Starting from 1024, I increased the span count all the way to a million (2^20), doubling it at each step.

In this particular case, I configured the planner to use only one resource type, so this is representative of the performance when the planner is used to keep track of the time windows of allocations and reservations of a resource vertex.

Different colors represent different overlap factors. Overlap(1) means that each span overlaps with no other spans (e.g., exclusive allocations); Overlap(10) means that each span overlaps with 9 other spans (e.g., shared allocations).

The worst case is adding 1 million non-overlapping spans to a planner object: ~2.5 seconds. Given that 1 million is much larger than the number of reservations we would ever add for our resources, this seems adequate.

dongahn commented 7 years ago

[chart: queryperf]

This is time-based query performance. I measured how long it takes to perform 1024 availability-check queries at randomly selected times when the planner is loaded with varying numbers of spans. I'm pleased to see the logarithmic trends. The worst case is 0.0016408 seconds -- this is when the planner is loaded with 1 million spans and the 1024 random queries are performed on it.

dongahn commented 7 years ago

[chart: removeperf]

This measures how long it takes to remove 1024 spans when the planner is loaded with varying numbers of spans.

dongahn commented 7 years ago

[chart: add5dperf]

One additional chart measuring planner add performance. In this case, the planner was configured to use 5 different resource types, which is representative of how I use it for the "scheduler-driven aggregate update" scheme: the additional planner object in a resource vertex keeps track of aggregate information on a configurable number of resource types under its subtree. The performance signature is almost identical to the 1D case.

dongahn commented 7 years ago

[chart: resourcebasedquery]

Finally, resource-based query performance. Here, I measured the time to answer the query "find me the earliest schedulable point given a resource-count requirement." Essentially, I varied the planner load factor all the way to one million spans, and for each of these configurations I performed 1024 resource-based queries using planner_avail_time_first() and planner_avail_next().

Resource-based queries aren't as fast as time-based queries, but nevertheless being able to complete 1K queries in less than 0.25 seconds on a planner loaded with a million spans (an extremely large case, IMO) should be sufficient for our purposes.

I think I understand where the time is going so I can optimize this even further if needed in the future.

dongahn commented 7 years ago

At this point, I'd say we have met our FY17 ECP milestone. I will spend some time tomorrow looking at the performance of resource-query, but I am happy as is. The milestone description and plan were:

Milestone Description: Extend flux-sched to introduce a scalable scheduling infrastructure
that can facilitate scalability and ease-of-writing of a wide range of advanced
scheduling algorithms and policies such as I/O bandwidth- or Power-aware scheduling. 
Systematically explore the effectiveness and scalability of Flux's hierarchical scheduling
model as the numbers of both jobs and resources as well as workload types vary.
Identify and address potential bottlenecks discovered as part of this exploration.

Milestone Execution Plan: 

1. Scalable scheduling infrastructure:
A. Develop Planner API using high-performance Linux red-black tree in order
to lay the foundation of scheduler-driven-aggregate update schemes as required
for scheduling over large resource graphs

B. Refactor the current resource graph APIs into visitor/matcher patterns
and implement scheduler-driven aggregate updates on top of them to prune
unnecessary descent walks for matching operations.
A and B will serve as our scalable scheduling infrastructure. 

2. Hierarchical scheduling exploration:
A. Set up workloads that expose a wide range of performance characteristics
ranging from high throughput loads to traditional high scale loads; Build flux instance
hierarchies with different tree shapes varying between very broad and shallow ones;
and schedule and execute the above workloads through these scheduler hierarchies

B. Measure performance/scalability of flux scheduling services and
the makespan of the workloads so as to explore the general effectiveness and scalability
of Flux's hierarchical scheduling model

C. Iterate 1 and 2; bottlenecks that are not fundamental to our hierarchical scheduling
model will be addressed.

springme commented 7 years ago

That’s good news. Thank you, Dong.

Becky

From: "Dong H. Ahn" notifications@github.com Reply-To: flux-framework/flux-sched reply@reply.github.com Date: Thursday, September 28, 2017 at 11:25 PM To: flux-framework/flux-sched flux-sched@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [flux-framework/flux-sched] Summarize the findings from resource-query effort (#269)

At this point, I'd say we have met our FY17 ECP milestone. I will spend some time tomorrow to look at the performance of resource-query tomorrow, but I am happy as is. The milestone description and the plan was:

Milestone Description: Extend flux-sched to introduce a scalable scheduling infrastructure

that can facilitate scalability and ease-of-writing of a wide range of advanced

scheduling algorithms and policies such as I/O bandwidth- or Power-aware scheduling.

Systematically explore the effectiveness and scalability of Flux's hierarchical scheduling

model as the numbers of both jobs and resources as well as workload types vary.

Identify and address potential bottlenecks discovered as part of this exploration.

Milestone Execution Plan:

  1. Scalable scheduling infrastructure:

A. Develop Planner API using high-performance Linux red-black tree in order

to lay the foundation of scheduler-driven-aggregate update schemes as required

for scheduling over large resource graphs

B. Refactor the current resource graph APIs into visitor/matcher patterns

and implement scheduler driven-aggregates updates on top of it to prune

unnecessary descent walks for matching operations.

A and B will serve as our scalable scheduling infrastructure.

  1. Hierarchical scheduling exploration:

A. Set up workloads that expose a wide range of performance characteristics

ranging from high throughput loads to traditional high scale loads; Build flux instance

hierarchies with different tree shapes varying between very broad and shallow ones;

and schedule and execute the above workloads through these scheduler hierarchies

B. Measure performance/scalability of flux scheduling services and

the makespan of the workloads so as to explore the general effectiveness and scalability

of Flux's hierarchical scheduling model

C. Iterate 1 and 2 bottlenecks that are not fundamental to our hierarchical scheduling

model will be addressed.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/flux-framework/flux-sched/issues/269#issuecomment-333041703, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AEdU0NSXTRhiCMgg0Ku80HOcKwcIS_Ldks5snI1cgaJpZM4PlCth.

dongahn commented 7 years ago

I wrote above: "I think I understand where the time is going so I can optimize this even further if needed in the future."

It turns out the query overheads are so small that it is pretty difficult to capture performance profiles with sampling-based profilers... I used both Linux perf and Mac OS X Instruments but got no insight. This probably means I shouldn't worry about query performance at this stage.

dongahn commented 7 years ago

resource-query should now build on TOSS3 systems (tested on Quartz; my changes are pushed).

dongahn commented 7 years ago

[chart: 2592-core]

This is my final chart on this ticket. I started measuring the performance of resource-query itself, which makes use of the planner and implements the proposed scheduler-driven aggregate update scheme.

This chart shows the amount of time it takes to allocate 2,592 one-core jobs (the jobspec is at the bottom of this comment) on a system with 2,592 compute cores (the GRUG file used to build this machine configuration is also below).

The good news is that this shows the potential of the scheduler-driven aggregate update scheme: each successive allocation takes increasingly less time as more and more cores are allocated, because the new update scheme allows the scheduler to prune its graph search.

However, I'm not entirely happy with the overall performance of resource-query as I scale up the machine configurations.

Lots and lots of things to do, not only in terms of adding all the features needed (e.g., slot support, corner cases, code cleanup...) but also in terms of general performance optimization of the graph code... sigh

Well, I am closing the door on this round of investigation and opening a new one; I will worry about all of this during the next round. Happy new fiscal year!

The GRUG file used to build this machine configuration:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
        http://graphml.graphdrawing.org/xmlns/1.1/graphml.xsd">

    <!-- resource pool vertex generation spec attributes -->
    <key id="root" for="node" attr.name="root" attr.type="int">
        <default>0</default>
    </key>
    <key id="type" for="node" attr.name="type" attr.type="string"/>
    <key id="basename" for="node" attr.name="basename" attr.type="string"/>
    <key id="unit" for="node" attr.name="unit" attr.type="string"/>
    <key id="size" for="node" attr.name="size" attr.type="long">
        <default>1</default>
    </key>
    <key id="subsystem" for="node" attr.name="subsystem" attr.type="string">
        <default>containment</default>
    </key>

    <!-- resource relationship generation attributes     -->
    <key id="e_subsystem" for="edge" attr.name="e_subsystem" attr.type="string">
        <default>containment</default>
    </key>
    <key id="relation" for="edge" attr.name="relation" attr.type="string">
        <default>contains</default>
    </key>
    <key id="rrelation" for="edge" attr.name="rrelation" attr.type="string">
        <default>in</default>
    </key>

    <!-- id generation method                             -->
    <key id="id_scope" for="edge" attr.name="id_scope" attr.type="int">
        <default>0</default>
    </key>
    <key id="id_start" for="edge" attr.name="id_start" attr.type="int">
        <default>0</default>
    </key>
    <key id="id_stride" for="edge" attr.name="id_stride" attr.type="int">
        <default>1</default>
    </key>

    <!-- resource gen method: multiply or associate-in   -->
    <key id="gen_method" for="edge" attr.name="gen_method" attr.type="string">
        <default>MULTIPLY</default>
    </key>
    <!-- argument (scaling factor) for multiply method   -->
    <key id="multi_scale" for="edge" attr.name="multi_scale" attr.type="int">
        <default>1</default>
    </key>
    <!-- 3 arguments for associate-in method             -->
    <key id="as_tgt_subsystem" for="edge" attr.name="as_tgt_subsystem"
             attr.type="string">
        <default>containment</default>
    </key>
    <key id="as_tgt_uplvl" for="edge" attr.name="as_tgt_uplvl" attr.type="int">
        <default>1</default>
    </key>
    <key id="as_src_uplvl" for="edge" attr.name="as_src_uplvl" attr.type="int">
        <default>1</default>
    </key>

    <!-- generation recipe for the medium cluster         -->
    <graph id="medium_coarse_cluster" edgedefault="directed">

        <!-- containment subsystem generation recipe    -->
        <node id="cluster">
            <data key="root">1</data>
            <data key="type">cluster</data>
            <data key="basename">medium-coarse-cluster</data>
        </node>
        <node id="rack">
            <data key="type">rack</data>
            <data key="basename">rack</data>
        </node>
        <node id="node">
            <data key="type">node</data>
            <data key="basename">node</data>
        </node>
        <node id="socket">
            <data key="type">socket</data>
            <data key="basename">socket</data>
        </node>
        <node id="core">
            <data key="type">core</data>
            <data key="basename">core</data>
        </node>
        <node id="memory">
            <data key="type">memory</data>
            <data key="basename">memory</data>
            <data key="size">32</data>
            <data key="unit">GB</data>
        </node>
        <node id="gpu">
            <data key="type">gpu</data>
            <data key="basename">gpu</data>
        </node>

        <edge id="cluster2rack" source="cluster" target="rack">
            <data key="multi_scale">4</data>
        </edge>
        <edge id="rack2node" source="rack" target="node">
            <data key="id_scope">1</data>
            <data key="multi_scale">18</data>
        </edge>
        <edge id="node2socket" source="node" target="socket">
            <data key="multi_scale">2</data>
        </edge>
        <edge id="socket2core" source="socket" target="core">
            <data key="id_scope">1</data>
            <data key="multi_scale">18</data>
        </edge>
        <edge id="socket2gpu" source="socket" target="gpu">
            <data key="id_scope">1</data>
            <data key="multi_scale">1</data>
        </edge>
        <edge id="socket2memory" source="socket" target="memory"/>
    </graph>
</graphml>
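
As a sanity check, the multi_scale values in this recipe yield 4 racks × 18 nodes × 2 sockets × 18 cores = 2,592 cores in the containment hierarchy, which matches the 2,592-core configuration used in the measurement above.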

The one-core jobspec used for each allocation:

version: 1
resources:
  - type: node
    count: 1
    with:
      - type: socket
        count: 1
        with:
          - type: core
            count: 1
            exclusive: true