flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

possible performance issue in `sched.resource-status` RPC #1137

Open grondo opened 4 months ago

grondo commented 4 months ago

As mentioned in a meeting yesterday, during preparation for scale testing it was noted that flux resource list response times were excessively slow when Fluxion was the loaded scheduler module. To probe the performance of this RPC, the following script was developed to test the response times of the RPCs used by flux resource list at different node counts. Results for sched-simple are included as well for comparison:

#!/bin/bash
PROG=$(basename $0)
VERBOSE=0

log() { 
    test $VERBOSE -eq 0 && return
    local fmt=$1;
    shift
    printf >&2 "$PROG: $fmt\n" $@
}
rpc() { flux python -c "import flux; flux.Flux().rpc(\"$1\").get()"; }

runtest() {
    SCHEDULER=$1
    NNODES=$2
    log "Starting test of ${SCHEDULER} with ${NNODES} nodes"
    log "Removing modules..."
    flux module remove -f sched-fluxion-qmanager
    flux module remove -f sched-fluxion-resource
    flux module remove -f sched-simple
    flux module remove resource

    log "Loading fake resources via config..."
    flux config load <<EOF
[resource]
noverify = true
norestrict = true

[[resource.config]]
hosts = "test[1-${NNODES}]"
cores = "0-63"
gpus = "0-8"
EOF

    log "Reloading resource module..."
    flux module load resource noverify

    log "Loading ${SCHEDULER} modules..."
    if test "$SCHEDULER" = "sched-simple"; then
        flux module load sched-simple
    else
        flux module load sched-fluxion-resource
        flux module load sched-fluxion-qmanager
    fi

    log "Timing sched.resource-status RPC"
    t0=$(date +%s.%N)
    rpc "sched.resource-status"
    t1=$(date +%s.%N)
    dt1=$(echo "$t1 - $t0" | bc -l)

    log "Timing resource.status RPC"
    t0=$(date +%s.%N)
    rpc "resource.status"
    t1=$(date +%s.%N)
    dt2=$(echo "$t1 - $t0" | bc -l)

    printf "%-13s %8s %28.3f %18.3f\n" $SCHEDULER $NNODES $dt1 $dt2
}

printf "%-13s %8s %28s %18s\n" \
       SCHEDULER NNODES "T(sched.resource-status)" "T(resource.status)"
for scheduler in sched-simple fluxion; do
    for nnodes in 128 256 512 1024 2048 4096 8192 16384; do
        runtest $scheduler $nnodes
    done
done

# vi: ts=4 sw=4 expandtab

And here are the results:

$ flux start ./rlist-test.sh 
SCHEDULER       NNODES     T(sched.resource-status) T(resource.status)
sched-simple       128                        0.173              0.185
sched-simple       256                        0.185              0.171
sched-simple       512                        0.202              0.185
sched-simple      1024                        0.246              0.161
sched-simple      2048                        0.457              0.160
sched-simple      4096                        0.917              0.161
sched-simple      8192                        1.827              0.166
sched-simple     16384                        3.641              0.167
fluxion            128                        0.254              0.160
fluxion            256                        0.426              0.168
fluxion            512                        0.876              0.158
fluxion           1024                        1.710              0.160
fluxion           2048                        3.439              0.157
fluxion           4096                        6.994              0.176
fluxion           8192                       14.307              0.161
fluxion          16384                       29.418              0.169

I would expect Fluxion to have a similar response time to sched-simple, but instead it is approximately 8x slower (and even 4s is probably too slow for production use). The other worry is that for most of this ~30s, the resource module is busy assembling the response and cannot work on scheduling jobs. On a production system it could be common for many resource listing queries per second to be issued to the scheduler, so this could keep the module occupied for a large percentage of the time.

As a first cut, it might be useful to think about caching the response to this RPC so it can be returned immediately while still valid.

grondo commented 4 months ago

Related #1039

milroy commented 3 months ago

Thanks for the helpful reproducer, @grondo. The first thing I notice is that while Fluxion is certainly slower than sched-simple, the majority of the time (>55%) is spent outside of finding the resources:

SCHEDULER       NNODES     T(sched.resource-status) T(resource.status)
sched-simple       128                        0.097              0.091
sched-simple       256                        0.101              0.086
sched-simple       512                        0.118              0.090
sched-simple      1024                        0.176              0.089
sched-simple      2048                        0.337              0.086
sched-simple      4096                        0.668              0.088
sched-simple      8192                        1.374              0.089
sched-simple     16384                        2.858              0.085
Elapsed times: R all 0.0302172 R down 0.0274029 R alloc 0.026943
fluxion            128                        0.185              0.100
Elapsed times: R all 0.068368 R down 0.0651569 R alloc 0.061074
fluxion            256                        0.395              0.088
Elapsed times: R all 0.119063 R down 0.103589 R alloc 0.102197
fluxion            512                        0.750              0.099
Elapsed times: R all 0.230068 R down 0.208475 R alloc 0.198806
fluxion           1024                        1.481              0.098
Elapsed times: R all 0.474467 R down 0.421471 R alloc 0.412955
fluxion           2048                        3.038              0.090
Elapsed times: R all 0.899421 R down 0.845332 R alloc 0.832985
fluxion           4096                        6.018              0.098
Elapsed times: R all 1.89333 R down 1.66922 R alloc 1.69468
fluxion           8192                       12.195              0.094
Elapsed times: R all 3.67655 R down 3.43581 R alloc 3.28851
fluxion          16384                       24.251              0.098

R all is the time to find all resources, R down is the time to find down resources, and R alloc is the time to find allocated resources (each Elapsed times line is printed during the Fluxion run reported on the row that follows it). I gathered the timings by wrapping, e.g.,

run_find (ctx, "status=up or status=down", "rv1_nosched", &R_all)
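
For reference, the wrapper amounts to something like the following minimal std::chrono sketch. run_find itself lives in Fluxion's resource module, so a stand-in callable is used here to keep the example self-contained; only the timing pattern is the point.

#include <chrono>
#include <functional>
#include <iostream>
#include <string>

// Time one call and report it, similar to the "Elapsed times" lines above.
// run_find () lives in Fluxion's resource module; any callable stands in for
// it here so the sketch compiles on its own.
static double timed (const std::string &label, const std::function<int ()> &fn)
{
    auto t0 = std::chrono::steady_clock::now ();
    int rc = fn ();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now () - t0;
    std::cerr << label << " (rc=" << rc << ") " << dt.count () << "s\n";
    return dt.count ();
}

int main ()
{
    // In the real module this would wrap, e.g.,
    //   run_find (ctx, "status=up or status=down", "rv1_nosched", &R_all);
    timed ("R all", [] () { return 0; });
    return 0;
}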

milroy commented 3 months ago

If I repeat the test twice in a row for each scheduler, the Fluxion find times decrease by over 50%:

SCHEDULER       NNODES     T(sched.resource-status)     T(sched.resource-status) T(resource.status)
sched-simple       128                        0.112                        0.115              0.102
sched-simple       256                        0.106                        0.127              0.102
sched-simple       512                        0.131                        0.130              0.095
sched-simple      1024                        0.201                        0.167              0.111
sched-simple      2048                        0.464                        0.254              0.108
sched-simple      4096                        0.703                        0.406              0.114
sched-simple      8192                        1.398                        0.624              0.084
sched-simple     16384                        2.820                        1.185              0.093
fluxion            128                        0.200                        0.185              0.090
fluxion            256                        0.399                        0.329              0.100
fluxion            512                        0.759                        0.428              0.092
fluxion           1024                        1.541                        0.758              0.093
fluxion           2048                        3.050                        1.418              0.091
fluxion           4096                        6.116                        2.699              0.089
fluxion           8192                       12.087                        5.355              0.093
fluxion          16384                       24.434                       10.514              0.110

Most of the relative speedup is likely due to Fluxion not needing to initialize the second time.

milroy commented 3 months ago

Here are the timings if I implement a cache on R all and R down (only traverse when resources change up or down states), but not R alloc:

SCHEDULER       NNODES     T(sched.resource-status)     T(sched.resource-status) T(resource.status)
sched-simple       128                        0.096                        0.095              0.083
sched-simple       256                        0.100                        0.098              0.086
sched-simple       512                        0.120                        0.123              0.095
sched-simple      1024                        0.176                        0.145              0.095
sched-simple      2048                        0.373                        0.225              0.084
sched-simple      4096                        0.713                        0.412              0.086
sched-simple      8192                        1.380                        0.651              0.088
sched-simple     16384                        2.843                        1.280              0.105
fluxion            128                        0.180                        0.116              0.090
fluxion            256                        0.390                        0.150              0.088
fluxion            512                        0.754                        0.198              0.088
fluxion           1024                        1.519                        0.298              0.092
fluxion           2048                        3.050                        0.537              0.103
fluxion           4096                        5.927                        0.910              0.102
fluxion           8192                       12.168                        1.771              0.089
fluxion          16384                       24.197                        3.412              0.095

There are probably some clever ways to cache R alloc as well.
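
For what it's worth, the cache I used amounts to roughly the following (a minimal sketch; the epoch counter and member names are illustrative, not Fluxion's actual data structures): the up/down event handler invalidates, and the RPC handler re-runs the find only when the cached copy is stale.

#include <cstdint>
#include <string>

// Illustrative cache for the R all / R down find results: the expensive
// traversal re-runs only when the up/down state of the graph has changed
// since the cached copy was produced.
struct find_cache_t {
    uint64_t epoch = 0;                 // bumped on every up/down transition
    uint64_t cached_epoch = UINT64_MAX; // epoch the cached string reflects
    std::string cached_R;               // serialized result (e.g. rv1_nosched)

    void invalidate () { ++epoch; }     // call from the up/down event handler

    template<typename Traverse>
    const std::string &get (Traverse &&traverse)
    {
        if (cached_epoch != epoch) {    // stale or never filled: re-run find
            cached_R = traverse ();
            cached_epoch = epoch;
        }
        return cached_R;
    }
};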

milroy commented 3 months ago

After further thought, a much better way to get the allocated state is just to query the root planner_multi. This completely avoids a traversal of the resource graph and scales very well. Here are the timings for using a planner_multi query in place of a find traversal:

SCHEDULER       NNODES     T(sched.resource-status)     T(sched.resource-status) T(resource.status)
sched-simple       128                        0.119                        0.097              0.084
sched-simple       256                        0.098                        0.097              0.081
sched-simple       512                        0.115                        0.113              0.092
sched-simple      1024                        0.174                        0.142              0.085
sched-simple      2048                        0.332                        0.211              0.082
sched-simple      4096                        0.772                        0.380              0.088
sched-simple      8192                        1.377                        0.619              0.084
sched-simple     16384                        2.712                        1.238              0.085
fluxion            128                        0.152                        0.085              0.096
fluxion            256                        0.329                        0.084              0.081
fluxion            512                        0.640                        0.083              0.081
fluxion           1024                        1.323                        0.083              0.084
fluxion           2048                        2.630                        0.090              0.082
fluxion           4096                        5.329                        0.086              0.081
fluxion           8192                       10.623                        0.081              0.100
fluxion          16384                       21.286                        0.081              0.095

With R all and R down caching plus the planner_multi query, Fluxion is actually faster than sched-simple here and scales very well. There are a couple of things to work out still.

The Flux config needs to specify pruning filters for the appropriate resources:

[resource]
noverify = true
norestrict = true

[[resource.config]]
hosts = "test[1-${NNODES}]"
cores = "0-63"
gpus = "0-8"

[sched-fluxion-resource]
prune-filters="ALL:node,ALL:core,ALL:gpu"

If they're not already in use, pruning filters should be configured in production, as they significantly accelerate matches and reservations.
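
For illustration, the allocated counts fall out of the planner aggregates as total minus available per resource type. This is only the arithmetic, not the actual planner_multi API; the struct below stands in for whatever the planner exposes for per-type totals and current availability.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for the per-type totals and current availability tracked by the
// root planner; allocated = total - available for each tracked type.
struct planner_totals {
    std::vector<std::string> types;     // e.g. {"node", "core", "gpu"}
    std::vector<int64_t> total;         // configured totals per type
    std::vector<int64_t> available;     // currently available per type
};

static void print_allocated (const planner_totals &p)
{
    for (std::size_t i = 0; i < p.types.size (); ++i)
        std::printf ("%s: %lld\n", p.types[i].c_str (),
                     static_cast<long long> (p.total[i] - p.available[i]));
}

int main ()
{
    // With the 4096-node test config above (64 cores, 9 gpus per node) and
    // everything allocated, this prints node: 4096, core: 262144, gpu: 36864.
    planner_totals p = { {"node", "core", "gpu"},
                         {4096, 262144, 36864},
                         {0, 0, 0} };
    print_allocated (p);
    return 0;
}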

The remaining difficulty has to do with formatting the output of planner_multi. The query just returns a vector of resource types and a vector of the aggregate available counts per type (which is exactly what resource list wants to display). However, the RPC response should be an Rv1 JSON object:

{"version":1,"execution":{"R_lite":[{"rank":"0-127","children":{"core":"0-63","gpu":"0-8"}}],"nodelist":["test[1-128]"],"starttime":0,"expiration":0}}

@grondo do you have any suggestions for coercing an output like nodes: 4096, cores: 262144, gpus: 36864 into the expected format?

grondo commented 3 months ago

Nice result!

do you have any suggestions for coercing an output like nodes: 4096, cores: 262144, gpus: 36864 into the expected format?

I do not think it is possible to convert aggregate counts like this into Rv1. flux resource needs resources in Rv1 format so it can show which nodes are up/down/allocated, etc.

milroy commented 3 months ago

That's true, there isn't any way to map the counts back to the allocated resources without a traversal. Unfortunately, updating allocation-to-resource mappings will require the roughly 3-second search at 16K nodes every time the allocations change (which will be every scheduler loop).

Two solutions come to mind right away. The first is caching the allocation searches and throttling the updates (e.g., updating mappings at most every X seconds); a sketch follows below. The other is to run find searches in a separate thread. The difficulty, I think, is that find is not a read-only operation on the resource graph: it can change the way match traversals execute. I can investigate how to refactor find to make it thread safe if we think that's a good route.
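
To make the throttling idea concrete, here is a minimal sketch (names illustrative, not actual Fluxion code): the cached R alloc is served unless it is older than a configurable validity window, so a burst of resource-status requests costs at most one traversal per window.

#include <chrono>
#include <string>

// Serve the cached R alloc unless it has aged past the validity window;
// otherwise re-run the expensive find and refresh the cache.
class throttled_alloc_cache {
  public:
    explicit throttled_alloc_cache (std::chrono::seconds window)
        : window_ (window) {}

    template<typename Traverse>
    const std::string &get (Traverse &&traverse)
    {
        auto now = std::chrono::steady_clock::now ();
        if (cached_R_.empty () || now - last_update_ >= window_) {
            cached_R_ = traverse ();    // the expensive R alloc find
            last_update_ = now;
        }
        return cached_R_;
    }

  private:
    std::chrono::seconds window_;
    std::chrono::steady_clock::time_point last_update_{};
    std::string cached_R_;
};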

Any other suggestions are most welcome.

milroy commented 3 months ago

We should, of course, devote time to improving the performance of traversals. That's complementary to the two ideas and will be needed anyway.

milroy commented 3 months ago

I analyzed the perf output and saw many of the same hotspots as @trws has reported in the past. In particular, the subsystem_selector_t string comparisons in this function: https://github.com/flux-framework/flux-sched/blob/3abccaab8a47c7cdd28750c9848bf33458d1f01a/resource/schema/resource_graph.hpp#L54

and this comparison in the traverser: https://github.com/flux-framework/flux-sched/blob/3abccaab8a47c7cdd28750c9848bf33458d1f01a/resource/traversers/dfu_impl.cpp#L46

If I short-circuit the searches and comparisons (because they will always be true for a single subsystem), find performance improves by about 25%:

SCHEDULER       NNODES     T(sched.resource-status)     T(sched.resource-status) T(resource.status)
sched-simple       128                        0.093                        0.092              0.096
sched-simple       256                        0.101                        0.116              0.122
sched-simple       512                        0.127                        0.138              0.084
sched-simple      1024                        0.206                        0.147              0.081
sched-simple      2048                        0.374                        0.214              0.086
sched-simple      4096                        0.674                        0.397              0.100
sched-simple      8192                        1.531                        0.870              0.169
sched-simple     16384                        3.701                        1.776              0.142
fluxion            128                        0.250                        0.182              0.145
fluxion            256                        0.410                        0.172              0.128
fluxion            512                        0.830                        0.227              0.134
fluxion           1024                        1.457                        0.272              0.111
fluxion           2048                        2.866                        0.408              0.116
fluxion           4096                        5.692                        0.675              0.086
fluxion           8192                       11.106                        1.347              0.090
fluxion          16384                       22.257                        2.553              0.108

That's with a cache on the R all and R down finds but not R alloc, and with two searches per configuration to show the timings without startup.
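
The short-circuit is roughly the following idea (a sketch of the concept, not the actual subsystem_selector_t code): when only one subsystem exists, skip the per-edge/per-vertex membership lookup entirely.

#include <map>
#include <string>

// When the graph has a single subsystem (the common "containment" case),
// every edge and vertex belongs to it, so the string lookup can be skipped.
struct subsystem_filter {
    std::string selected;       // subsystem the traversal cares about
    bool single_subsystem;      // true when only one subsystem is defined

    bool operator() (const std::map<std::string, bool> &membership) const
    {
        if (single_subsystem)   // nothing to compare against
            return true;
        auto it = membership.find (selected);
        return it != membership.end () && it->second;
    }
};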

grondo commented 3 months ago

If you think about the use case in a system instance, there are likely to be many users running flux resource list regularly, so caching the result even if it is only valid for a second makes sense to me here. A 2s response time for the RPC is probably reasonable; however, if many requests come in at roughly the same time, you probably do not want to interrupt the scheduler to return the same information.

I wonder if Fluxion could produce the R alloc as a side effect of the scheduler loop, and the RPC could always just return the most recent cached version?

I don't really understand the 20s initialization penalty here. Is it only a factor just after loading the module, or would there be a penalty each time a scheduling loop changes the graph?

Another idea would be for us to somehow move the resource-status RPC out of the scheduler module. The flux-core resource module knows the set of up/down/all resources, and the allocated resource set could be derived from the alloc protocol between the job-manager and scheduler. Perhaps akin to the job-list module, we need an "eventually consistent" module that answers queries about resources by watching events from the resource and job-manager modules. Just a thought and I'm not sure how good it is...

milroy commented 3 months ago

so caching the result even if it is only valid for a second makes sense to me here. Probably 2s response time for the RPC is reasonable, however, if many requests come in at roughly the same time, you probably do not want to interrupt the scheduler to return the same information.

I think the validity time would probably need to be extended to 10s or beyond to be usable for hundreds of system users. Even at 10s, that means about 25% of the scheduler time is absorbed by updating R alloc (roughly a 2.5s search every 10s at the 16K simulated size).

I wonder if Fluxion could produce the R alloc as a side effect of the scheduler loop, and the RPC could always just return the most recent cached version?

I was thinking along those lines, too. Fluxion could accumulate R alloc deltas during the scheduling loop and then add the difference to the previous R alloc. I think that requires support for graph operations (addition/graph union and difference should be enough) on the various match writer formats.
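
As a rough illustration of the delta idea (ranks standing in for the full per-job resource sets; the real version would have to work per match-writer format), each alloc/free patches a cached allocated set in place, so no traversal is needed to answer the status query.

#include <cstdint>
#include <set>

// Cached allocated set, patched with the deltas produced by each alloc/free
// instead of being recomputed by a graph traversal.
struct alloc_tracker {
    std::set<uint64_t> allocated;   // execution-target ranks currently allocated

    void on_alloc (const std::set<uint64_t> &ranks)  // scheduler loop granted R
    {
        allocated.insert (ranks.begin (), ranks.end ());
    }

    void on_free (const std::set<uint64_t> &ranks)   // job's resources released
    {
        for (uint64_t r : ranks)
            allocated.erase (r);
    }
};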

I don't really understand the 20s initialization penalty here. Is it only a factor just after loading the module, or would there be a penalty each time a scheduling loop changes the graph?

Sorry, my previous discussion wasn't very clear. The 20s times you see above include Fluxion initialization when loading the module plus the R all, R down, and R alloc search times. The ~15s initialization penalty (at the 16K node size) only needs to be paid once. At 16K simulated nodes, each of the R all, R down, and R alloc searches will take around 2.5s (with the optimizations I mentioned) when first searching or when a cache needs to be updated. For R all and R down I don't expect the cache will need to be updated often, so each of those ~2.5s penalties will happen infrequently.

Another idea would be for us to somehow move the resource-status RPC out of the scheduler module. The flux-core resource module knows the set of up/down/all resources, and the allocated resource set could be derived from the alloc protocol between the job-manager and scheduler.

I think this is a good idea, and I think it would include assembling some sort of graph deltas that could be used to update R all, R down, and R alloc in an eventually consistent manner.

milroy commented 3 months ago

I could put together a PR with the caching and optimizations I performed so far pretty quickly. Assembling graph deltas and moving the resource-status RPC out of Fluxion would have a bigger payoff but will take longer in my estimation. It will also supersede several of the changes I tried in this discussion.

Do you think it's helpful to make a PR with the changes so far?

grondo commented 3 months ago

I think this is a good idea, and I think it would include assembling some sort of graph deltas that could be used to update R all, R down, and R alloc in an eventually consistent manner.

I was thinking more of a service provided by flux-core which could operate on Rv1 directly. Since the flux resource user tool operates on Rv1 alone, it would not be necessary AFAICT to keep actual graph data up to date with deltas, only a representation in Rv1. We have a library to apply updates available in flux-core, so if the job-manager made alloc and free responses available to another module via some kind of journal, it could easily hold enough information to respond to status requests of this type.

Do you think it's helpful to make a PR with the changes so far?

It definitely would be beneficial!

milroy commented 3 months ago

it would not be necessary AFAICT to keep actual graph data up to date with deltas, only a representation in Rv1.

I think a lot in terms of graphs, so what I meant was that, to me, updating Rv1 is a graph transformation (of course it can be done without explicit graph operations). I don't think dealing with the Fluxion Boost resource graph will work well here, and I agree that operating on Rv1 directly is a good idea.

We have a library to apply updates available in flux-core so if the job-manager made alloc and free responses available to another module via some kind of journal, it could easily hold enough information to respond to status requests of this type.

That sounds like a very good approach! I'm happy to help with this if needed.

milroy commented 3 months ago

Even when I enable caching of all three outputs (all, down, and alloc), there is considerable overhead from flux resource list beyond just the RPC. With the R alloc cache set to update every 20s, here's flux resource list:

$ for i in {1..10}; do echo repetition $i; time -p flux resource list; done
repetition 1
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 11.05
user 8.54
sys 0.04
repetition 2
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 8.63
user 8.59
sys 0.03
repetition 3
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 8.78
user 8.74
sys 0.03
repetition 4
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 11.25
user 8.78
sys 0.03
repetition 5
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 8.96
user 8.92
sys 0.04
repetition 6
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 9.07
user 9.03
sys 0.03
repetition 7
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 11.34
user 8.80
sys 0.04
repetition 8
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 8.72
user 8.67
sys 0.04
repetition 9
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 9.22
user 9.17
sys 0.04
repetition 10
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       64        9 test1
 allocated      0        0        0 
      down  16383  1048512   147447 test[2-16384]
real 11.35
user 8.87
sys 0.04

Compare that with the RPC alone, which exhibits a maximum return time approximately equal to the R alloc find time:

$ for i in {1..11}; do echo repetition $i; time -p rpc "sched.resource-status"; sleep 2; done
repetition 1
real 2.57
user 0.09
sys 0.01
repetition 2
real 0.10
user 0.08
sys 0.01
repetition 3
real 0.09
user 0.08
sys 0.01
repetition 4
real 0.10
user 0.08
sys 0.01
repetition 5
real 0.09
user 0.07
sys 0.01
repetition 6
real 0.09
user 0.08
sys 0.01
repetition 7
real 0.11
user 0.09
sys 0.01
repetition 8
real 0.10
user 0.08
sys 0.02
repetition 9
real 0.11
user 0.09
sys 0.01
repetition 10
real 0.10
user 0.08
sys 0.01
repetition 11
real 2.64
user 0.08
sys 0.02

(Note the 2.64s time at the 11th repetition, once the 20s cache validity window has expired.)

milroy commented 3 months ago

@grondo in the group meeting yesterday I forgot to highlight the considerable additional overhead in flux resource list that I reported above, and wanted to make sure you'd seen it.

grondo commented 3 months ago

Yes, that is a known issue (though I'd forgotten how bad it was). The command is in Python and also has to go through gyrations to collect identical lines of output. Some improvements are probably required in the underlying rlist library as well.

I will try to open an issue if we don't have one already.

grondo commented 2 months ago

Since flux-framework/flux-core#5796 is being merged, solving this issue is not a priority.