grondo opened 4 months ago
Related #1039
Thanks for the helpful reproducer @grondo. The first thing I notice is that while Fluxion is certainly slower than sched-simple, the majority of the time (>55%) is spent outside of `find`ing the resources:
```
SCHEDULER     NNODES  T(sched.resource-status)  T(resource.status)
sched-simple     128  0.097                     0.091
sched-simple     256  0.101                     0.086
sched-simple     512  0.118                     0.090
sched-simple    1024  0.176                     0.089
sched-simple    2048  0.337                     0.086
sched-simple    4096  0.668                     0.088
sched-simple    8192  1.374                     0.089
sched-simple   16384  2.858                     0.085
Elapsed times: R all 0.0302172 R down 0.0274029 R alloc 0.026943
fluxion          128  0.185                     0.100
Elapsed times: R all 0.068368 R down 0.0651569 R alloc 0.061074
fluxion          256  0.395                     0.088
Elapsed times: R all 0.119063 R down 0.103589 R alloc 0.102197
fluxion          512  0.750                     0.099
Elapsed times: R all 0.230068 R down 0.208475 R alloc 0.198806
fluxion         1024  1.481                     0.098
Elapsed times: R all 0.474467 R down 0.421471 R alloc 0.412955
fluxion         2048  3.038                     0.090
Elapsed times: R all 0.899421 R down 0.845332 R alloc 0.832985
fluxion         4096  6.018                     0.098
Elapsed times: R all 1.89333 R down 1.66922 R alloc 1.69468
fluxion         8192 12.195                     0.094
Elapsed times: R all 3.67655 R down 3.43581 R alloc 3.28851
fluxion        16384 24.251                     0.098
```
`R all` is the time to find all resources, `R down` is the time for `down` resources, and `R alloc` is the time to find allocated resources. I gathered the timings by wrapping, e.g.,

```cpp
run_find (ctx, "status=up or status=down", "rv1_nosched", &R_all)
```
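For reference, the instrumentation is conceptually just a stopwatch around each call. A minimal sketch follows; the `time_call` helper is hypothetical scaffolding, not flux-sched code:

```cpp
#include <chrono>
#include <utility>

// Wrap any callable and return its elapsed wall-clock time in seconds.
template <typename F, typename... Args>
double time_call (F &&f, Args &&...args)
{
    auto t0 = std::chrono::steady_clock::now ();
    std::forward<F> (f) (std::forward<Args> (args)...);
    auto t1 = std::chrono::steady_clock::now ();
    return std::chrono::duration<double> (t1 - t0).count ();
}

// Hypothetical usage mirroring the wrapped call above:
//   double t_all = time_call (run_find, ctx,
//                             "status=up or status=down",
//                             "rv1_nosched", &R_all);
//   printf ("Elapsed times: R all %g ...\n", t_all);
```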
If I repeat the test twice in a row for each scheduler, the Fluxion `find` times decrease by over 50%:
```
SCHEDULER     NNODES  T(sched.resource-status)  T(sched.resource-status)  T(resource.status)
sched-simple     128  0.112                     0.115                     0.102
sched-simple     256  0.106                     0.127                     0.102
sched-simple     512  0.131                     0.130                     0.095
sched-simple    1024  0.201                     0.167                     0.111
sched-simple    2048  0.464                     0.254                     0.108
sched-simple    4096  0.703                     0.406                     0.114
sched-simple    8192  1.398                     0.624                     0.084
sched-simple   16384  2.820                     1.185                     0.093
fluxion          128  0.200                     0.185                     0.090
fluxion          256  0.399                     0.329                     0.100
fluxion          512  0.759                     0.428                     0.092
fluxion         1024  1.541                     0.758                     0.093
fluxion         2048  3.050                     1.418                     0.091
fluxion         4096  6.116                     2.699                     0.089
fluxion         8192 12.087                     5.355                     0.093
fluxion        16384 24.434                    10.514                     0.110
```
Most of the relative speedup is likely due to Fluxion not needing to initialize the second time.
Here are the timings if I implement a cache on `R all` and `R down` (only traverse when resources change `up` or `down` states), but not `R alloc`:
```
SCHEDULER     NNODES  T(sched.resource-status)  T(sched.resource-status)  T(resource.status)
sched-simple     128  0.096                     0.095                     0.083
sched-simple     256  0.100                     0.098                     0.086
sched-simple     512  0.120                     0.123                     0.095
sched-simple    1024  0.176                     0.145                     0.095
sched-simple    2048  0.373                     0.225                     0.084
sched-simple    4096  0.713                     0.412                     0.086
sched-simple    8192  1.380                     0.651                     0.088
sched-simple   16384  2.843                     1.280                     0.105
fluxion          128  0.180                     0.116                     0.090
fluxion          256  0.390                     0.150                     0.088
fluxion          512  0.754                     0.198                     0.088
fluxion         1024  1.519                     0.298                     0.092
fluxion         2048  3.050                     0.537                     0.103
fluxion         4096  5.927                     0.910                     0.102
fluxion         8192 12.168                     1.771                     0.089
fluxion        16384 24.197                     3.412                     0.095
```
There are probably some clever ways to cache `R alloc` as well.
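The caching just described is conceptually simple. Here is a minimal sketch, where all names are hypothetical and `expensive_find` stands in for the real `find` traversal: invalidate on up/down events, re-traverse only when dirty.

```cpp
#include <string>

// Stand-in for the expensive find traversal producing a serialized R.
extern std::string expensive_find (const std::string &criteria);

struct find_cache_t {
    std::string R_all;    // cached "status=up or status=down" result
    std::string R_down;   // cached "status=down" result
    bool dirty = true;    // set whenever a vertex changes up/down state
};

// Hook this into the up/down event path to invalidate the cache.
void on_status_change (find_cache_t &cache)
{
    cache.dirty = true;
}

// Called from the resource-status RPC handler: traverse only when needed.
const find_cache_t &get_status (find_cache_t &cache)
{
    if (cache.dirty) {
        cache.R_all = expensive_find ("status=up or status=down");
        cache.R_down = expensive_find ("status=down");
        cache.dirty = false;
    }
    return cache;
}
```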
After further thought, a much better way to get the allocated state is just to query the root `planner_multi`. This completely avoids a traversal of the resource graph and scales very well. Here are the timings for using a `planner_multi` query in place of a `find` traversal:
```
SCHEDULER     NNODES  T(sched.resource-status)  T(sched.resource-status)  T(resource.status)
sched-simple     128  0.119                     0.097                     0.084
sched-simple     256  0.098                     0.097                     0.081
sched-simple     512  0.115                     0.113                     0.092
sched-simple    1024  0.174                     0.142                     0.085
sched-simple    2048  0.332                     0.211                     0.082
sched-simple    4096  0.772                     0.380                     0.088
sched-simple    8192  1.377                     0.619                     0.084
sched-simple   16384  2.712                     1.238                     0.085
fluxion          128  0.152                     0.085                     0.096
fluxion          256  0.329                     0.084                     0.081
fluxion          512  0.640                     0.083                     0.081
fluxion         1024  1.323                     0.083                     0.084
fluxion         2048  2.630                     0.090                     0.082
fluxion         4096  5.329                     0.086                     0.081
fluxion         8192 10.623                     0.081                     0.100
fluxion        16384 21.286                     0.081                     0.095
```
With `R all` and `R down` caching plus the `planner_multi` query, Fluxion is actually faster than sched-simple here and scales very well. There are a couple of things still to work out.
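For context, the query amounts to reading aggregate totals and availability off the root vertex's planner at the current time. Below is a rough sketch; the planner functions are from flux-sched's planner API, but treat the exact names and signatures here as approximate rather than authoritative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

extern "C" {
typedef struct planner_multi planner_multi_t;
// Approximate declarations (see resource/planner/ in flux-sched):
size_t planner_multi_resources_len (planner_multi_t *ctx);
int64_t planner_multi_resource_total_at (planner_multi_t *ctx, unsigned int i);
int64_t planner_multi_avail_resources_at (planner_multi_t *ctx, int64_t at,
                                          unsigned int i);
}

// Aggregate allocated count per tracked resource type (e.g. node, core,
// gpu, per the pruning filters configured below): total minus currently
// available. O(#types) planner lookups, no resource graph traversal.
std::vector<int64_t> allocated_counts (planner_multi_t *root, int64_t now)
{
    std::vector<int64_t> alloc;
    size_t n = planner_multi_resources_len (root);
    for (unsigned int i = 0; i < n; i++) {
        int64_t total = planner_multi_resource_total_at (root, i);
        int64_t avail = planner_multi_avail_resources_at (root, now, i);
        alloc.push_back (total - avail);
    }
    return alloc;
}
```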
The Flux config needs to specify pruning filters for the appropriate resources:
```toml
[resource]
noverify = true
norestrict = true

[[resource.config]]
hosts = "test[1-${NNODES}]"
cores = "0-63"
gpus = "0-8"

[sched-fluxion-resource]
prune-filters = "ALL:node,ALL:core,ALL:gpu"
```
If they're not in use already, pruning filters should be configured in production, as they will significantly accelerate matches and reservations.
The remaining difficulty has to do with formatting the output of `planner_multi`. The query just returns a vector of resources and a vector of the aggregate available counts per resource (which is exactly what `resource list` wants). However, the RPC response should be a JSON object:

```json
{"version":1,"execution":{"R_lite":[{"rank":"0-127","children":{"core":"0-63","gpu":"0-8"}}],"nodelist":["test[1-128]"],"starttime":0,"expiration":0}}
```
@grondo do you have any suggestions for coercing an output like `nodes: 4096, cores: 262144, gpus: 36864` into the expected format?
Nice result!
> do you have any suggestions for coercing an output like nodes: 4096, cores: 262144, gpus: 36864 into the expected format?
I do not think it is possible to convert aggregate counts like this into Rv1. `flux resource` needs resources in Rv1 format so it can show which nodes are up/down/allocated, etc.
That's true, there isn't any way to map the counts back to the allocated resources without a traversal. Unfortunately, updating allocation-to-resource mappings will require the approximately 3s search (at 16K nodes) every time the allocations change, which will be every scheduler loop.
The two solutions I can think of right away are caching the allocation searches and then throttling the updates (e.g., updating mappings at most every X seconds; see the sketch below). The other option would be to start a thread for `find` searches. The difficulty, I think, is that `find` is not a read-only operation on the resource graph: it can change the way match traversals execute. I can investigate how to refactor `find` to make it thread safe if we think that's a good route.

Any other suggestions are most welcome.
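To illustrate the first idea: throttling could be as simple as serving a cached result unless it has aged out, so at most one expensive traversal happens per interval regardless of the request rate. A hypothetical sketch, where `expensive_alloc_find` stands in for the real search:

```cpp
#include <chrono>
#include <string>

// Stand-in for the ~3s allocated-resources find traversal.
extern std::string expensive_alloc_find ();

struct alloc_cache_t {
    std::string R_alloc;
    std::chrono::steady_clock::time_point updated {};
    std::chrono::seconds max_age {10};   // "at most every X seconds"
};

// RPC handler path: refresh the cache only if it is stale.
const std::string &get_R_alloc (alloc_cache_t &c)
{
    auto now = std::chrono::steady_clock::now ();
    if (c.R_alloc.empty () || now - c.updated > c.max_age) {
        c.R_alloc = expensive_alloc_find ();
        c.updated = now;
    }
    return c.R_alloc;
}
```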
We should, of course, devote time to improving the performance of traversals. That's complementary to the two ideas and will be needed anyway.
I analyzed the `perf` output and saw many of the same hotspots as @trws has reported in the past. In particular, the `subsystem_selector_t` string comparisons in this function: https://github.com/flux-framework/flux-sched/blob/3abccaab8a47c7cdd28750c9848bf33458d1f01a/resource/schema/resource_graph.hpp#L54 and this comparison in the traverser: https://github.com/flux-framework/flux-sched/blob/3abccaab8a47c7cdd28750c9848bf33458d1f01a/resource/traversers/dfu_impl.cpp#L46

If I short-circuit the searches and comparisons (because they will be `true` for a single subsystem), performance improves about 25% for `find`:
```
SCHEDULER     NNODES  T(sched.resource-status)  T(sched.resource-status)  T(resource.status)
sched-simple     128  0.093                     0.092                     0.096
sched-simple     256  0.101                     0.116                     0.122
sched-simple     512  0.127                     0.138                     0.084
sched-simple    1024  0.206                     0.147                     0.081
sched-simple    2048  0.374                     0.214                     0.086
sched-simple    4096  0.674                     0.397                     0.100
sched-simple    8192  1.531                     0.870                     0.169
sched-simple   16384  3.701                     1.776                     0.142
fluxion          128  0.250                     0.182                     0.145
fluxion          256  0.410                     0.172                     0.128
fluxion          512  0.830                     0.227                     0.134
fluxion         1024  1.457                     0.272                     0.111
fluxion         2048  2.866                     0.408                     0.116
fluxion         4096  5.692                     0.675                     0.086
fluxion         8192 11.106                     1.347                     0.090
fluxion        16384 22.257                     2.553                     0.108
```
That's with a cache on the `R all` and `R down` `find`s, but not `R alloc`, and with two searches to show the timings without startup.
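The short-circuit itself is just the observation that with a single subsystem every per-edge subsystem comparison must succeed. An illustrative sketch only, not the actual flux-sched code linked above:

```cpp
#include <string>

struct edge_t {
    std::string subsystem;   // subsystem name stored per edge
};

// With multiple subsystems, each edge is filtered by a string comparison;
// with a single subsystem the comparison is always true and can be skipped.
bool edge_in_subsystem (const edge_t &e, const std::string &wanted,
                        bool single_subsystem)
{
    if (single_subsystem)
        return true;               // short-circuit: avoids a string compare
    return e.subsystem == wanted;  // original per-edge comparison
}
```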
If you think about the use case in a system instance, there are likely to be many users running `flux resource list` regularly, so caching the result, even if it is only valid for a second, makes sense to me here. A 2s response time for the RPC is probably reasonable; however, if many requests come in at roughly the same time, you probably do not want to interrupt the scheduler to return the same information.
I wonder if Fluxion could produce the `R alloc` as a side effect of the scheduler loop, and the RPC could always just return the most recent cached version?
I don't really understand the 20s initialization penalty here. Is it only a factor just after loading the module, or would there be a penalty after each scheduling loop changes the graph?
Another idea would be for us to somehow move the resource-status RPC out of the scheduler module. The flux-core `resource` module knows the set of up/down/all resources, and the allocated resource set could be derived from the alloc protocol between the job-manager and scheduler. Perhaps, akin to the `job-list` module, we need an "eventually consistent" module that answers queries about resources by watching events from the `resource` and `job-manager` modules. Just a thought, and I'm not sure how good it is...
> so caching the result, even if it is only valid for a second, makes sense to me here. A 2s response time for the RPC is probably reasonable; however, if many requests come in at roughly the same time, you probably do not want to interrupt the scheduler to return the same information.
I think the validity time would probably need to be extended to 10s or beyond to be usable for hundreds of system users. Even at 10s, that means about 25% of the scheduler time is absorbed by updating `R alloc` (edit: at the 16K simulated size).
> I wonder if Fluxion could produce the R alloc as a side effect of the scheduler loop, and the RPC could always just return the most recent cached version?
I was thinking along those lines, too. Fluxion could accumulate `R alloc` deltas during the scheduling loop and then add the difference to the previous `R alloc`. I think that requires support for graph operations (addition/graph union and difference should be enough) on the various match writer formats.
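Roughly, the accumulation would look like the sketch below, where `union_R` and `difference_R` stand in for the graph union/difference operations that would need to be added to the writer formats (all names hypothetical):

```cpp
#include <string>
#include <vector>

// Stand-ins for set operations on serialized R (to be added per writer).
extern std::string union_R (const std::string &a, const std::string &b);
extern std::string difference_R (const std::string &a, const std::string &b);

struct alloc_state_t {
    std::string R_alloc;               // last published allocated set
    std::vector<std::string> allocs;   // R granted since last publish
    std::vector<std::string> frees;    // R released since last publish
};

// Fold the deltas accumulated during the scheduling loop into the
// previous R alloc instead of re-traversing the graph.
void publish_R_alloc (alloc_state_t &s)
{
    for (auto &r : s.allocs)
        s.R_alloc = union_R (s.R_alloc, r);
    for (auto &r : s.frees)
        s.R_alloc = difference_R (s.R_alloc, r);
    s.allocs.clear ();
    s.frees.clear ();
}
```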
> I don't really understand the 20s initialization penalty here. Is it only a factor just after loading the module, or would there be a penalty after each scheduling loop changes the graph?
Sorry, my previous discussion wasn't very clear. The 20s times you see above include Fluxion initialization when loading the module plus the `R all`, `R down`, and `R alloc` search times. The ~15s initialization penalty (at the 16K node size) only needs to be paid once. At 16K simulated nodes, each of the `R all`, `R down`, and `R alloc` searches will take around 2.5s (with the optimizations I mentioned) when first searching or when a cache needs to be updated. For `R all` and `R down` I don't expect the cache will need to be updated often, so each of those ~2.5s penalties will happen infrequently.
> Another idea would be for us to somehow move the resource-status RPC out of the scheduler module. The flux-core resource module knows the set of up/down/all resources, and the allocated resource set could be derived from the alloc protocol between the job-manager and scheduler.
I think this is a good idea, and I think it would include assembling some sort of graph deltas that could be used to update `R all`, `R down`, and `R alloc` in an eventually consistent manner.
I could put together a PR with the caching and optimizations I performed so far pretty quickly. Assembling graph deltas and moving the resource-status RPC out of Fluxion would have a bigger payoff but, in my estimation, will take longer. It will also supersede several of the changes I tried in this discussion.

Do you think it's helpful to make a PR with the changes so far?
> I think this is a good idea, and I think it would include assembling some sort of graph deltas that could be used to update R all, R down, and R alloc in an eventually consistent manner.
I was thinking more of a service provided by flux-core which could operate on Rv1 directly. Since the `flux resource` user tool operates on Rv1 alone, it would not be necessary AFAICT to keep actual graph data up to date with deltas, but only a representation in Rv1. We have a library to apply updates available in flux-core, so if the job-manager made alloc and free responses available to another module via some kind of journal, it could easily hold enough information to respond to status requests of this type.
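A rough sketch of that shape, in terms of the flux-core C API: stream job events from the job-manager and fold alloc/free R fragments into a cached "allocated" Rv1 set. The `job-manager.events-journal` topic exists in flux-core, but the payload keys unpacked here are illustrative assumptions, and the rlist-style set arithmetic is left as comments.

```cpp
#include <flux/core.h>
#include <jansson.h>
#include <string.h>

// Journal continuation: apply each event's R fragment to the cached set.
static void journal_cb (flux_future_t *f, void *arg)
{
    const char *name;
    json_t *R = NULL;
    // Payload keys are illustrative assumptions, not the real protocol.
    if (flux_rpc_get_unpack (f, "{s:s s?o}", "name", &name, "R", &R) == 0) {
        if (R && !strcmp (name, "alloc"))
            ; // rlist union: add R to the cached allocated Rv1 set
        else if (R && !strcmp (name, "free"))
            ; // rlist difference: remove R from the cached set
    }
    flux_future_reset (f);  // re-arm the streaming future for the next event
}

// Subscribe to the job-manager journal with a streaming RPC.
int watch_journal (flux_t *h)
{
    flux_future_t *f;
    if (!(f = flux_rpc (h, "job-manager.events-journal", "{}",
                        FLUX_NODEID_ANY, FLUX_RPC_STREAMING)))
        return -1;
    return flux_future_then (f, -1., journal_cb, NULL);
}
```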
> Do you think it's helpful to make a PR with the changes so far?
It definitely would be beneficial!
> it would not be necessary AFAICT to keep actual graph data up to date with deltas, but only a representation in Rv1.
I think a lot in terms of graphs, so what I meant was that updating Rv1 is, to me, a graph transformation (of course, it can be done without explicit graph operations). I don't think dealing with the Fluxion Boost resource graph will work well here, and I agree that operating on Rv1 directly is a good idea.
> We have a library to apply updates available in flux-core, so if the job-manager made alloc and free responses available to another module via some kind of journal, it could easily hold enough information to respond to status requests of this type.
That sounds like a very good approach! I'm happy to help with this if needed.
Even when I enable caching of all three outputs (`all`, `down`, and `alloc`), there is considerable overhead from `flux resource list` beyond just the RPC. Setting the `R alloc` cache update to every 20s, here's `flux resource list`:
```
$ for i in {1..10}; do echo repetition $i; time -p flux resource list; done
repetition 1
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.05
user 8.54
sys 0.04
repetition 2
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.63
user 8.59
sys 0.03
repetition 3
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.78
user 8.74
sys 0.03
repetition 4
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.25
user 8.78
sys 0.03
repetition 5
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.96
user 8.92
sys 0.04
repetition 6
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 9.07
user 9.03
sys 0.03
repetition 7
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.34
user 8.80
sys 0.04
repetition 8
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 8.72
user 8.67
sys 0.04
repetition 9
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 9.22
user 9.17
sys 0.04
repetition 10
STATE NNODES NCORES NGPUS NODELIST
free 1 64 9 test1
allocated 0 0 0
down 16383 1048512 147447 test[2-16384]
real 11.35
user 8.87
sys 0.04
```
Compare that with the RPC itself, which exhibits a maximum return time approximately equal to the `R alloc` `find` time:
```
$ for i in {1..11}; do echo repetition $i; time -p rpc "sched.resource-status"; sleep 2; done
repetition 1
real 2.57
user 0.09
sys 0.01
repetition 2
real 0.10
user 0.08
sys 0.01
repetition 3
real 0.09
user 0.08
sys 0.01
repetition 4
real 0.10
user 0.08
sys 0.01
repetition 5
real 0.09
user 0.07
sys 0.01
repetition 6
real 0.09
user 0.08
sys 0.01
repetition 7
real 0.11
user 0.09
sys 0.01
repetition 8
real 0.10
user 0.08
sys 0.02
repetition 9
real 0.11
user 0.09
sys 0.01
repetition 10
real 0.10
user 0.08
sys 0.01
repetition 11
real 2.64
user 0.08
sys 0.02
```
(Note the 2.64s time at the 11th iteration.)
@grondo I forgot to highlight the considerable additional overhead involved in `flux resource list` (reported above) during the group meeting yesterday, and wanted to make sure you'd seen it.
Yes, that is a known issue (though I'd forgotten how bad it was). The command is in Python and also has to go through gyrations to collect identical lines of output. Some improvements are probably required in the underlying rlist library as well.
I will try to open an issue if we don't have one already.
Since flux-framework/flux-core#5796 is being merged, solving this issue is not a priority.
As mentioned in a meeting yesterday, during preparation for scale testing it was noted that `flux resource list` response times were excessively slow when Fluxion was the loaded scheduler module. To probe the performance of this RPC, the following script was developed, which can test the response times of the RPCs used by `flux resource list` at different node sizes. Results for `sched-simple` are included as well for comparison. And here are the results:
I would expect Fluxion to have a response time similar to sched-simple's, but instead it is approximately 8x slower (and even 4s is probably too slow for production use). The other worry is that for most of this ~30s, the resource module is busy assembling the response and cannot work on scheduling jobs. On a production system it could be common for many resource-listing queries per second to be issued to the scheduler, so this could keep the module occupied for a large percentage of the time.
As a first cut, it might be useful to think about making a cached response to this RPC that could be sent immediately when valid.