Closed dongahn closed 5 years ago
@cmisale:
Great work! Let's tease this apart one piece at a time. I have to believe the performance will end up much better, but let's take it one step at a time.
First of all, could you please paste your GRUG file in this ticket? I would like to see the size and granularity of the system you are testing.
Could you also paste your jobspec in this ticket?
Assuming the system is large, could you please add the following option to resource-query, which will improve the initialization time significantly: --reserve-vtx-vec=400000
The reason for a high overhead for matching is because we are traversing the entire graph to find the best matching resources. What's best is determined by the chosen match policy -- you probably use the default low-id first policy.
Now, searching a large graph is not very scalable, so we use a few techniques to mitigate that. For example, we aggressively prune the graph search (e.g., based on exclusivity and on subtree aggregates that higher-level resource vertices maintain). I assume you are not using the --prune-filters option. This means the default prune-by-aggregates will kick in, based on the number of available cores under the visiting vertex.
In any case, with these kinds of optimization techniques, one has to measure the match overhead in terms of scheduling many jobs, not a single job: the more loaded the machine becomes, the better the match performance, because more of the graph can be pruned.
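To make the prune-by-aggregates idea concrete, here is a minimal Python sketch (illustrative only, not Fluxion's actual data structures or API): each vertex caches the number of available cores in its subtree, and a depth-first descent is cut off whenever that aggregate cannot satisfy the request.

```python
# Sketch of prune-by-aggregates during a depth-first search.
# Illustrative names only -- not Fluxion's planner implementation.

class Vertex:
    def __init__(self, name, avail_cores=0, children=None):
        self.name = name
        self.avail_cores = avail_cores      # subtree aggregate (the "prune filter")
        self.children = children or []

    def recompute(self):
        """Leaf counts are authoritative; interior vertices sum their children."""
        if self.children:
            self.avail_cores = sum(c.recompute() for c in self.children)
        return self.avail_cores

def find_cores(vertex, ncores, visited):
    """DFS for a subtree that can satisfy `ncores`, pruning early."""
    visited.append(vertex.name)
    if vertex.avail_cores < ncores:         # prune: skip the whole subtree
        return None
    if not vertex.children:                 # leaf level: allocate here
        return vertex
    for child in vertex.children:
        found = find_cores(child, ncores, visited)
        if found:
            return found
    return None
```

The point of the sketch is that a node whose aggregate is too small is never descended into, which is exactly why match times improve as the system fills up.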
So, can you change your evaluation so that you schedule as many jobs as you can allocate after the clean setup is done, and then measure average/min/max match performance? In resource-query, the stat command should report these numbers for you.
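The kind of summary the stat command prints can be sketched in a few lines (a hypothetical helper, just to show the metric being asked for: per-job match times aggregated over many jobs, not a single match):

```python
# Summarize per-job match times the way resource-query's `stat`
# command reports them (njobs, min, max, avg). Illustrative helper only.

def match_stats(match_times):
    """match_times: list of per-job match times in seconds."""
    if not match_times:
        raise ValueError("no jobs matched")
    return {
        "njobs": len(match_times),
        "min": min(match_times),
        "max": max(match_times),
        "avg": sum(match_times) / len(match_times),
    }
```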
If it works correctly, point 4 above should bring the overheads down by a factor of 2 or 3.
With 6, I think you still have on the order of a 50x difference from K8s's default scheduler. I think the remaining performance difference can be tackled by making a performance vs. schedule-effectiveness tradeoff and by using hierarchical scheduling:
A. Our model supports varying levels of representation detail, so we can build the system model at a coarser level (once I look at your GRUG file I will see if I can make some concrete suggestions).
B. One can create a match policy that tries to find the N best candidates instead of traversing the entire graph. I don't have a ready-made policy in hand, but let's see if this is needed. This would allow one to find the best match subject to the affordable performance.
C. Use Flux's marquee divide-and-conquer hierarchical scheduling and divvy up the scheduling load across multiple child instances, each managing a subset of resources.
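The "N best" idea in point B can be sketched as an early-terminating scan (hypothetical names, not an existing Fluxion policy): stop traversing as soon as N feasible candidates are collected, then pick the best among only those.

```python
# Hypothetical "N-best" match policy sketch: stop the candidate scan as
# soon as N feasible vertices are found, then apply low-id-first among
# them. Trades match quality for bounded traversal cost.

def n_best_match(candidates, request_cores, n):
    """candidates: iterable of (vertex_id, avail_cores) in walk order."""
    feasible = []
    scanned = 0
    for vid, avail in candidates:
        scanned += 1
        if avail >= request_cores:
            feasible.append((vid, avail))
            if len(feasible) == n:      # early termination of the traversal
                break
    best = min(feasible, key=lambda c: c[0]) if feasible else None
    return best, scanned
```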
IMHO, it would be best to pursue point 5 first and see where it leaves us. Then we can decide which techniques from point 6 we want to pursue further as our near-term effort.
I had to rename the files to be able to attach them. Thanks for all the suggestions; I'm sure I'm not using resource-query the right way. I'll try points 3 and 4 in the next 1-2 days and come back with the numbers I collect.
Hmmm. @cmisale: I am not sure if your grug file is specifying the machine you want. What is the machine you wanted to configure?
How many racks from the cluster?
How many compute nodes from each of the racks?
How many sockets per node and how many cores and gpu per socket?
After removing
112a113
> <data key="d17">false</data>
The machine seems to be configured as
cluster[1]->rack[1]->node[29]->socket[1]->core[159]
->memory[1]
->gpu[1]
Just want to confirm that you wanted to build one socket with 159 cores and only one allocatable memory chunk?
Graph output of your grug file is attached:
$ ../../resource/utilities/grug2dot --more grug.graphml
$ dot -Tpdf grug.dot -o grug.pdf
Regardless, I ran a session with your grug and jobspec on my Mac OS X (with Docker), @cmisale:
INFO: Num. of Jobs Matched: 29
INFO: Min. Match Time: 0.0018239
INFO: Max. Match Time: 0.0365601
INFO: Avg. Match Time: 0.0186694
The average match time is much better than your measurement: 18669 usecs, 5.6x better than what you saw, but still 23x slower than the best case for K8s.
The minimum match time (the performance that would be closest to a loaded production environment) is 1823 usecs: about 2.3x worse than K8s.
But I'm getting pretty large variability because I'm doing this on my Mac OS X with Docker, and the numbers might be better on an exclusive machine. In fact, for some runs, I saw a min time 2.2x better than the K8s performance.
I think this shows it is feasible to do complex scheduling without compromising performance too much.
I also reconfigured the system to look more like modern machines. In this case, you can schedule more jobs: 58 jobs to fully load the system:
cluster[1]->rack[1]->node[29]->socket[2]->core[22]
->memory[1]
->gpu[1]
INFO: Num. of Jobs Matched: 58
INFO: Min. Match Time: 0.000919104
INFO: Max. Match Time: 0.0166061
INFO: Avg. Match Time: 0.00737557
The avg match time is 7375 usec, 14x better than your measurement and 9.2x slower than K8s. The min match time is 919 usec, only about 15% slower than K8s.
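For anyone following along, the factors quoted in this thread can be re-derived from the raw numbers (all times converted to microseconds; 919/800 works out to roughly a 15% slowdown):

```python
# Re-deriving the speedup/slowdown factors from the numbers reported
# in this thread (all times in microseconds).

original_flux = 104000   # cmisale's best-case Flux match time
k8s_best = 800           # kube-scheduler best case
avg_match = 7375         # Avg. Match Time from this run
min_match = 919          # Min. Match Time from this run

factor_vs_original = original_flux / avg_match      # ~14x better
factor_vs_k8s = avg_match / k8s_best                # ~9.2x slower
min_slowdown_pct = (min_match / k8s_best - 1) * 100 # ~15% slower
```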
If we write a new match policy callback, the average match time will converge to the min time. But I'm not sure if that's necessary to show the benefit. If it turns out, that's needed, I can take a look.
I think it would be wise to agree on the machine configuration(s) and jobspec to test, and run the "schedule jobs to fully load a system" metric with both Fluxion and K8s as an initial bakeoff; then we can see whether we need more optimizations such as a new match policy.
About the cluster layout: I am creating the graph based on the information I get from kubernetes, that is,
I know that there are two sockets per node, but that is information I can't get from Kubernetes. For this reason I set it to one until I manage to get the right value. I didn't spend time there yet. Each node of the cluster is composed of
Our cluster SW stack is being updated right now, so I can't do much testing for the moment.
Thanks @cmisale. Maybe we should build a few GRUG files that represent the target system at progressively finer granularity and gather performance numbers. Do you mind if I build a few of those along with jobspecs to use? It shouldn't take too much time for me. Once I do some validation, we can run some experiments under your k2s-flux?
@dongahn that would be extremely helpful. Thank you. What other information do you need? For the jobspec, as soon as I have the cluster back I will provide a couple of them.
Question: if I want to use the prune-filters parameter, what value should I set?
I think the default will just do fine for you: ALL:core
To be clear, prune-filters=ALL:core is set by default. This installs a prune filter (a planner object) embedded into each and every (thus ALL) resource vertex in the graph, which keeps track of the available cores under the subtree rooted at that vertex. Using this, if your request cannot be satisfied by the resources in the subtree, we prune the graph descent traversal.
This would be helpful if your jobspecs mostly request core-level resources:
Say, your system is cluster[1]->rack[1]->node[29]->socket[2]->core[16]
And you have many jobs requesting slot[1]->core[8]
Because none of your jobs requires exclusive access to higher-level resources like rack, node, or socket, our traverser would have to traverse the entire graph to find the best 8. But if you have prune-filters=ALL:core, these higher-level resources will also know how many cores are available under them by looking up the filter. (The filter is extremely scalable and we also perform this update very scalably; this is referred to as scheduler-driven aggregate updates or scheduler-driven prune-filter updates.)
For instance, once 4 jobs are scheduled, each of the nodes and sockets under which those cores are allocated will have its prune filter updated, so that we stop the depth-first descent for the next job request at the corresponding node level and move on to the next node in the depth-first walk order.
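The scheduler-driven update itself is simple in spirit: when an allocation lands at a leaf, every ancestor's aggregate is decremented, so the next traversal prunes earlier. A minimal sketch (illustrative, not Fluxion's planner API):

```python
# Sketch of a scheduler-driven prune-filter update: decrement the
# available-core aggregate along the whole root-to-leaf path of an
# allocation, so future traversals can prune higher up the tree.

def allocate(path, ncores):
    """path: list of vertex dicts from root to leaf, each with 'avail_cores'."""
    assert path[-1]["avail_cores"] >= ncores, "leaf cannot satisfy request"
    for vertex in path:                  # update the filter at every level
        vertex["avail_cores"] -= ncores
```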
Similarly, when you have an exclusive request on some high-level resource, prune-by-exclusivity will kick in, which is as efficient as the aggregates filter.
From resource-query --help:
-p, --prune-filters=<HL-resource1:LL-resource1[,HL-resource2:LL-resource2...]...]>
Install a planner-based filter at each High-Level (HL) resource
vertex which tracks the state of the Low-Level (LL) resources
in aggregate, residing under its subtree. If a jobspec requests
1 node with 4 cores, and the visiting compute-node vertex has
only a total of 2 available cores in aggregate at its
subtree, this filter allows the traverser to prune a further descent
to accelerate the search.
Use the ALL keyword for HL-resource if you want LL-resource to be
tracked at all of the HL-resource vertices. Examples:
rack:node,node:core
ALL:core,cluster:node,rack:node
(default=ALL:core).
Here is a GRUG with a high level of detail, so I called it kubecluster-high-LOD.graphml.
kubecluster-high-LOD.graphml.txt
The graphical representation of this kubecluster is:
Jobspecs on this graph can have global resource constraints such as "I request a node within a single rack". I used the following jobspec:
I then ran resource-query on it, which allowed 116 jobs of this jobspec to be scheduled. The performance was the following:
$ resource-query -G kubecluster-high-LOD.graphml --match-format=pretty_simple --match-policy=low -e
<CUT>
resource-query> stat
INFO: Num. of Jobs Matched: 116
INFO: Min. Match Time: 0.000698
INFO: Max. Match Time: 0.0249701
INFO: Avg. Match Time: 0.00848717
Here is a GRUG with a medium level of detail, so I called it kubecluster-medium-LOD.graphml.
kubecluster-medium-LOD.graphml.txt
It removes the rack-level representation, and the memory detail was coarsened so that each allocatable memory chunk is 128GB instead of 64GB.
Jobspecs on this graph cannot have rack-level global resource constraints and also need to be at a coarsened level for memory constraints:
I then ran resource-query on it, which allowed 116 jobs of this jobspec to be scheduled. The performance was the following:
$ resource-query -G kubecluster-medium-LOD.graphml --match-format=pretty_simple --match-policy=low -e
<CUT>
resource-query> stat
INFO: Num. of Jobs Matched: 116
INFO: Min. Match Time: 0.000102997
INFO: Max. Match Time: 0.0169539
INFO: Avg. Match Time: 0.00795901
I also tried a GRUG with even more representation coarsening but didn't get performance improvements. I think this is because further coarsening was impeding our pruning capabilities.
Not sure if I will have time, but if I do, we can look at a new match policy to show another way to trade off schedule quality and performance. In the meantime, it would be good if you could gather some new numbers on your rig. You might also want to design your experiments to measure our overheads when the system is 30%, 60%, and 80% loaded. I think the important overheads are at 50% load and higher.
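That experiment design can be sketched as a small bucketing helper (hypothetical names): record the system's load fraction at the moment each job is matched, then report the average match time per load bucket.

```python
# Sketch of the suggested experiment: pair each match time with the
# load fraction at the moment of the match, then average per load
# bucket (30%, 60%, 80% loaded). Illustrative helper only.

def bucket_by_load(samples, edges=(0.3, 0.6, 0.8)):
    """samples: (load_fraction, match_time_sec) pairs -> {edge: avg_time}."""
    buckets = {e: [] for e in edges}
    for load, t in samples:
        for e in reversed(edges):       # highest threshold the load reaches
            if load >= e:
                buckets[e].append(t)
                break
    return {e: (sum(ts) / len(ts) if ts else None) for e, ts in buckets.items()}
```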
Dong, thank you for all the jobspecs and the grug file. Also, having something to compare against is gold. I still can't use the cluster, unfortunately, so for now I can only wait.
This has been done and I have a decent perf exploration of resource I will be sharing with the team.
From @cmisale:
In the last two days I've been doing some performance evaluation of resource-query vs. kube-scheduler. The results are not looking good.
I am measuring times both in Golang code and in the C++ resource query code. In particular, I'm measuring
k8s provides a wrapper over Golang functions to get exec times, so I am using that one and the result is in microseconds. In the C++ resource query code, I am using std::chrono and times are in seconds.
For K8s, scheduling a simple job with a pod (point 1) takes about 800 microseconds in the best cases. For Flux, scheduling the exact same job (point 2.x) takes about 104000 microseconds in the best cases.
I have done a breakdown of exec times following the list above. Given this particular run (the very last I ran, a bit slower but things don't change):
Flux Elapsed time in Microseconds 123800 (this is the time reported at the Flux management level, i.e., the scheduler plugin, shortly after point 2.7, and it is measured in the same place as the following)
[Scheduler] Default Elapsed time in Microseconds 1267
The first line covers points 2-4, while the second line is point 1. Considering only the Flux part, the times are a mix of seconds and microseconds, but I make that clear:
So there is of course overhead in all the gRPC calls and file creation (both the grug and the YAML jobspec), but it is "negligible" relative to the match call. That overhead takes about 4-5k microseconds in total, which is still a lot but can be improved.
I tried to read the DFU code to figure out where to collect some other exec times, but I can't do that properly without your help.