flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

Analyze and improve performance of resource_query for Kube integration #507

Closed: dongahn closed this issue 5 years ago

dongahn commented 5 years ago

From @cmisale:

In the last two days I've been doing some performance evaluation of resource_query vs kube-scheduler. Results are not looking good.

I am measuring times both in Golang code and in the C++ resource query code. In particular, I'm measuring

  1. The time the default k8s scheduler takes to run predicates and priorities and produce a scheduling_result (the IP of the selected node)
  2. The time it takes my Go binding to:
    1. inspect the pod info (job requirements in terms of hardware) from the Flux manager
    2. create the gRPC message out of 2.1
    3. send the message to the resourcequery pod
    4. decode the jobspec on the resourcequery side
    5. write to a file the YAML created out of 2.4
    6. call cmd_match()
    7. send a string with the result back to the manager
  3. The time taken by the cmd_match() command
  4. The time taken by ctx->traverser->run

k8s provides a wrapper over Go functions to get execution times, so I am using that, and its results are in microseconds. In the C++ resource-query code, I am using std::chrono, and those times are in seconds.
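Concretely, the C++ side follows the usual std::chrono pattern, roughly like the sketch below (simplified for illustration, not the exact resource-query code; run_match here just stands in for the timed cmd_match() call):

    #include <chrono>
    #include <iostream>

    // Stand-in for the real cmd_match () call being timed.
    static void run_match ()
    {
        // ... ctx->traverser->run () and friends would execute here ...
    }

    int main ()
    {
        auto start = std::chrono::steady_clock::now ();
        run_match ();
        auto end = std::chrono::steady_clock::now ();

        // Report elapsed time in seconds, matching the "[INFO] elapsed time" lines below.
        std::chrono::duration<double> elapsed = end - start;
        std::cout << "[INFO] elapsed time: " << elapsed.count () << " seconds" << std::endl;
        return 0;
    }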

For k8s, scheduling a simple job with one pod (point 1) takes about 800 microseconds in the best case. For Flux, scheduling the exact same job (point 2.x) takes about 104000 microseconds in the best case.

I have done a breakdown of execution times following the list above. Take this particular run (the very last one I ran, a bit slower than usual, but the picture doesn't change):

Flux Elapsed time in Microseconds 123800
[Scheduler] Default Elapsed time in Microseconds 1267

The first line is the time reported at the Flux management level (i.e., the scheduler plugin), taken shortly after point 2.7, and it covers points 2-4; the second line, measured in the same place, is point 1. Considering only the Flux part, the times below mix seconds and microseconds, but I make it clear which is which:

[INFO] printing before cmd_match
[INFO] traverser run in COMMAND.CPP 0.116969 seconds
      ---------------core157[1:x]
      ---------------core158[1:x]
      ---------------gpu3[1:x]
      ---------------memory0[1938096128:x]
      ------------socket0[1:x]
      ---------172.17.1.1828[1:s]
      ------rack0[1:s]
      ---Kubernetes-cluster0[1:s]
[INFO] Elapsed time in COMMAND.CPP 0.117008 seconds
INFO: =============================
INFO: JOBID=1
INFO: RESOURCES=ALLOCATED
INFO: SCHEDULED AT=Now
INFO: =============================
[INFO] elapsed time: 0.118571 seconds
[RQ] Match in 118691 Microseconds (back to Go binding)
[RQ] ResqWrapper returned value 0
[GRPCServer] Fake request allocate on file /data/flux/yamlexample.yaml in 118711 Microseconds (point 2.6)

So there is of course overhead in all the gRPC calls and file creation (both the GRUG file and the YAML jobspec), but it is "negligible" compared with the match call. That overhead amounts to about 4-5k microseconds in total, which is still a lot, but it can be improved.

I tried to read the DFU code to figure out where to collect some additional execution times, but I can't do that properly without your help.

dongahn commented 5 years ago

@cmisale:

Great work! Let's tease this apart one piece at a time. I have to believe the end-to-end performance can be much better, but let's take it one step at a time.

  1. First of all, could you please paste your GRUG file in this ticket? I would like to see the size and granularity of the system you are testing this for.

  2. Could you also paste your jobspec in this ticket?

  3. Assuming the system is large, could you please add the following option to resource-query? It should improve the initialization time significantly: --reserve-vtx-vec=400000

  4. The reason the matching overhead is high is that we traverse the entire graph to find the best matching resources. What counts as best is determined by the chosen match policy -- you are probably using the default low-id-first policy.

    Now, searching a large graph is not very scalable, so we use a few techniques to mitigate that. For example, we aggressively prune the graph search (e.g., based on exclusivity and on subtree aggregates that higher-level resource vertices maintain). I assume you are not using the --prune-filters option. This means the default prune-by-aggregates will kick in, based on the number of available cores under the visiting vertex.

    In any case, with these kinds of optimization techniques, one has to measure match overhead by scheduling many jobs, not a single job. The more loaded the machine becomes, the better match performance gets, because pruning becomes more effective.

    So, can you change your evaluation so that, starting from a clean setup, you schedule as many jobs as the system can allocate, and then measure the average/min/max match performance? In resource-query, the stat command should report these numbers for you.

  5. If this works correctly, point 4 above should bring the overhead down by a factor of 2 or 3.

  6. Even with point 5, I think you will still have on the order of a 50x difference from k8s's default scheduler. I think the remaining gap can be tackled by trading a bit of schedule quality for performance and by using hierarchical scheduling (see the sketch after this list for the idea behind B):
    A. Our model supports varying levels of representation detail, so we can build the system model at a coarser level (once I look at your GRUG file I will see if I can make some concrete suggestions).
    B. One can create a match policy which tries to find the N best candidates instead of traversing the entire graph. I don't have a ready-made policy in hand, but let's see if this is needed. This would let one find the best match subject to the affordable performance.
    C. Use Flux's marquee divide-and-conquer hierarchical scheduling and divvy the scheduling load up across multiple child instances, each managing a subset of resources.
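To illustrate the idea behind B: such a policy would simply stop the graph search once it has gathered enough feasible candidates instead of scoring every vertex. Here is a rough conceptual sketch (this is not an actual Fluxion match callback; the n_best_policy interface is made up purely for illustration):

    #include <cstddef>
    #include <vector>

    // Hypothetical, illustration-only policy object: the traverser would report
    // each feasible candidate and ask the policy whether to keep searching.
    struct n_best_policy {
        explicit n_best_policy (std::size_t n) : m_n (n) {}

        // Record a candidate; return false to stop the search early once N
        // candidates have been found, trading schedule quality for match speed.
        bool add_candidate (int vertex_id)
        {
            m_candidates.push_back (vertex_id);
            return m_candidates.size () < m_n;
        }

        const std::vector<int> &candidates () const { return m_candidates; }

    private:
        std::size_t m_n;
        std::vector<int> m_candidates;
    };

The trade-off knob would be N: a large N approaches the quality of the exhaustive search, while a small N approaches the quick-placement behavior of the k8s default scheduler.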

IMHO, it would be best to pursue point 5 first and see where it leaves us. Then we can decide which techniques from point 6 we want to pursue further as our near-term effort.

cmisale commented 5 years ago

I had to rename the files to be able to attach them. Thanks for all the suggestions; I'm sure I'm not using resource_query the right way. I'll try points 3 and 4 in the next 1-2 days and come back with the numbers I collect.

grug.graphml.txt jobspec.yaml.txt

dongahn commented 5 years ago

Hmmm. @cmisale: I am not sure your GRUG file specifies the machine you want. What machine did you want to configure?

How many racks from the cluster?

How many compute nodes from each of the racks?

How many sockets per node, and how many cores and GPUs per socket?

dongahn commented 5 years ago

After removing

112a113
>           <data key="d17">false</data>

The machine seems to be configured as

cluster[1]->rack[1]->node[29]->socket[1]->core[159]
                                        ->memory[1]
                                        ->gpu[1]

Just want to confirm: did you want to build one socket with 159 cores and only one allocatable memory chunk?

dongahn commented 5 years ago

Graph output of your grug file is attached:

$ ../../resource/utilities/grug2dot --more grug.graphml
$ dot -Tpdf grug.dot -o grug.pdf

grug.pdf

dongahn commented 5 years ago

Regardless, I ran a session with your grug and jobspec on my Mac OS/X w/ docker, @cmisale:

INFO: Num. of Jobs Matched: 29
INFO: Min. Match Time: 0.0018239
INFO: Max. Match Time: 0.0365601
INFO: Avg. Match Time: 0.0186694

The average match time is much better than your measurement: 18669 usec, about 5.6x better, but still about 23x slower than the k8s best case.

The minimum match time, which should be closer to what a production environment sees when the system is loaded, is 1823 usec, about 2.3x worse than k8s.

But I'm seeing pretty large variability because I'm doing this on my Mac OS/X with Docker, and the numbers might be better on an exclusive machine. In fact, for some runs I saw a min time 2.2x better than the k8s number.

I think this shows the feasibility of doing complex scheduling without compromising performance too much.

dongahn commented 5 years ago

Reconfiguring the system to look more like a modern machine. In this case you can schedule more jobs, 58 of them, to fully load the system:

cluster[1]->rack[1]->node[29]->socket[2]->core[22]
                                        ->memory[1]
                                        ->gpu[1]
INFO: Num. of Jobs Matched: 58
INFO: Min. Match Time: 0.000919104
INFO: Max. Match Time: 0.0166061
INFO: Avg. Match Time: 0.00737557

The average match time is 7375 usec, about 14x better than your measurement and about 9.2x slower than k8s. The minimum match time is 919 usec, only about 15% slower than k8s.

If we write a new match policy callback, the average match time will converge to the minimum. But I'm not sure that's necessary to show the benefit. If it turns out it's needed, I can take a look.

I think it would be wise to agree on the machine configuration(s) and jobspec to test, run the "schedule jobs to fully load the system" metric with both Fluxion and k8s as an initial bakeoff, and then see whether we need to add more optimizations such as a new match policy.

cmisale commented 5 years ago

About the cluster layout: I am creating the graph based on the information I get from Kubernetes, that is:

I know that there are two sockets per node, but that is information I can't get from Kubernetes, so I set it to one until I manage to get the right value; I haven't spent time on that yet. Each node of the cluster is composed of:

Our cluster's software stack is being updated right now, so I can't do much testing for the moment.

dongahn commented 5 years ago

Ah, thanks @cmisale. Maybe we should build a few GRUG files that represent the target system at progressively finer granularity and gather performance numbers. Do you mind if I build a few of those, along with jobspecs to use? It shouldn't take me too much time. Once I do some validation, we can run some experiments under your k8s-flux setup?

cmisale commented 5 years ago

@dongahn that would be extremely helpful. Thank you. What other information do you need? For the jobspec, as soon as I have the cluster back I will provide a couple of them.

Question: if I want to use the prune-filter parameter, what value should I set?

dongahn commented 5 years ago

Question: if I want to use the prune-filter parameter, what value should I set?

I think the default will just do fine for you: ALL:core

dongahn commented 5 years ago

To be clear, prune-filters=ALL:core is set by default. This installs a prune filter -- a planner object embedded into each and every (thus ALL) resource vertex in the graph -- which keeps track of the available cores under the subtree rooted at that vertex. Using this, if your request cannot be satisfied by the resources in that subtree, we prune the graph descent there.

This is helpful if your jobspecs mostly request core-level resources:

Say your system is cluster[1]->rack[1]->node[29]->socket[2]->core[16], and you have many jobs requesting slot[1]->core[8].

Because none of your jobs requires exclusive access to higher-level resources like rack, node, or socket, our traverser would otherwise have to walk the entire graph to find the best 8 cores. But with prune-filters=ALL:core, these higher-level vertices also know how many cores are available beneath them by looking up the filter. (The filter is extremely scalable, and we also keep it updated in a very scalable way, which is referred to as scheduler-driven aggregate updates or scheduler-driven prune-filter updates.)

For instance, once 4 jobs are scheduled, each of the nodes and sockets under which those cores are allocated will have its prune filter updated, so that for the next job request we stop our depth-first descent at the corresponding node level and move on to the next node in the depth-first walk order.

Similarly, when you have an exclusive request on some high-level resource, prune-by-exclusivity will kick in, which is as efficient as the aggregate filter.
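To make the descent-and-prune logic concrete, here is a heavily simplified sketch (illustration only, not the actual Fluxion traverser code; vertex_t and avail_in_subtree are made-up names, and a real traverser evaluates the aggregate against the portion of the request that must be satisfied within the subtree):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Illustration-only vertex: each vertex caches, per pruned resource type
    // (e.g., "core"), how many units are still available in its subtree.
    struct vertex_t {
        std::string type;                                 // "cluster", "rack", "node", ...
        std::map<std::string, int64_t> avail_in_subtree;  // the "prune filter" aggregate
        std::vector<vertex_t *> children;
    };

    // Collect available LL-type vertices (e.g., cores) with a depth-first walk,
    // skipping any subtree whose aggregate cannot satisfy the request
    // (e.g., slot[1]->core[8] => needed == 8).
    static void dfs_match (const vertex_t &v, const std::string &ll_type,
                           int64_t needed, std::vector<const vertex_t *> &candidates)
    {
        auto it = v.avail_in_subtree.find (ll_type);
        int64_t avail = (it != v.avail_in_subtree.end ()) ? it->second : 0;

        if (v.type == ll_type) {          // at the LL level: is this unit free?
            if (avail > 0)
                candidates.push_back (&v);
            return;
        }
        if (avail < needed)               // prune-by-aggregates: skip this subtree
            return;
        for (const vertex_t *child : v.children)
            dfs_match (*child, ll_type, needed, candidates);
    }

With prune-filters=ALL:core, it is the "core" counters kept at the node, socket, and rack vertices that let the walk skip fully allocated subtrees, which is also why match times improve as the system fills up.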

dongahn commented 5 years ago

From resource-query --help:

    -p, --prune-filters=<HL-resource1:LL-resource1[,HL-resource2:LL-resource2...]...]>
            Install a planner-based filter at each High-Level (HL) resource
                vertex which tracks the state of the Low-Level (LL) resources
                in aggregate, residing under its subtree. If a jobspec requests
                1 node with 4 cores, and the visiting compute-node vertex has
                only a total of 2 available cores in aggregate at its
                subtree, this filter allows the traverser to prune a further descent
                to accelerate the search.
                Use the ALL keyword for HL-resource if you want LL-resource to be
                tracked at all of the HL-resource vertices. Examples:
                    rack:node,node:core
                    ALL:core,cluster:node,rack:node
                (default=ALL:core).

dongahn commented 5 years ago

Here is a GRUG with a high level of detail, so I called it kubecluster-high-LOD.graphml.

kubecluster-high-LOD.graphml.txt

The graphical representation of this kubecluster is:

high.pdf

Jobspecs on this graph can have global resource constraints such as "I request a node within a single rack". I used the following jobspec:

jobspec-high-LOD.yaml.txt

I then ran resource-query on it, which allowed 116 jobs with this jobspec to be scheduled. The performance was the following:

$ resource-query -G kubecluster-high-LOD.graphml --match-format=pretty_simple --match-policy=low -e

<CUT>

resource-query> stat
INFO: Num. of Jobs Matched: 116
INFO: Min. Match Time: 0.000698
INFO: Max. Match Time: 0.0249701
INFO: Avg. Match Time: 0.00848717

dongahn commented 5 years ago

Here is a GRUG with a medium level of detail, so I called it kubecluster-medium-LOD.graphml.

kubecluster-medium-LOD.graphml.txt

It removes the rack-level representation, and the memory detail is coarsened so that each allocatable memory chunk is 128GB instead of 64GB.

Jobspecs on this graph cannot have rack-level global resource constraints and also need to use the coarsened granularity for memory constraints:

jobspec-medium-LOD.yaml.txt

I then ran resource-query on it, which allowed 116 jobs with this jobspec to be scheduled. The performance was the following:

$ resource-query -G kubecluster-medium-LOD.graphml --match-format=pretty_simple --match-policy=low -e

<CUT>

resource-query> stat
INFO: Num. of Jobs Matched: 116
INFO: Min. Match Time: 0.000102997
INFO: Max. Match Time: 0.0169539
INFO: Avg. Match Time: 0.00795901

dongahn commented 5 years ago

I also tried a GRUG with even more representation coarsening but didn't get performance improvements. I think this is because further coarsening impeded our pruning capability.

dongahn commented 5 years ago

Not sure if I will have time, but if I do, we can look at a new match policy to show another way to trade schedule quality for performance. In the meantime, it would be good if you could gather some new numbers on your rig. You might also want to consider designing your experiments to measure our overheads when the system is 30%, 60%, and 80% loaded. I think the important overheads are those at 50% load and higher.

cmisale commented 5 years ago

Dong, thank you for all the jobspecs and GRUG files. Also, having something to compare against is gold. I still can't use the cluster, unfortunately, so for now I can only wait.

dongahn commented 5 years ago

This has been done, and I have a decent performance exploration of resource that I will be sharing with the team.