flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

String interner for subsystems and resource types and initial usage #1246

Closed trws closed 3 months ago

trws commented 4 months ago

This is a bit of a large patch. Most of it is in the "add Catch2" commit, because that checks in an external dependency so that our builds will work on systems that don't have Catch2 installed. I recommend installing Catch2 version 3+ in the base distro for faster builds, but it's not required. There will be a follow-up PR that adds the appropriate entries to the install scripts, dockerfiles, and so forth to make that happen for CI and our testenv containers.

The actual sched changes here are the addition of an interned string type that looks like this:

// create an interned_string type named subsystem_t that is a uint8_t and has a key of zero
// embedded in the type signature
using subsystem_t = interned_string<0, uint8_t>;

// add a string to the subsystem_t table
subsystem_t s("node");

The type is comparable, sortable, and will return a constant reference to the std::string on .get(). You can also access the id through .id() and generate ranges of these things. There is also a wrapper around boost::small_vector that uses interned_strings as keys to simulate a very small map without having to store the keys.

Why go through all of this? My many_point test, copied below, went from an end-to-end time of 49 seconds to 34 seconds. Throughput of jobs ingested from `submit -cc... --wait-event=start` went from ~30 to ~50 jobs per second, and the stop_explore method went from 38% of resource cycles to 0.2%. We got less of a memory usage benefit than I hoped, I think largely because there are still a number of std::map<> instances, where alignment causes the savings to be relatively small for now. Right now it's about an 8-10% reduction in base memory usage for sched, and I don't think there's any impact on per-allocation usage.

perf-test-many-point.sh

```sh
#!/usr/bin/env bash
#
test_description=' '

. `dirname $0`/sharness.sh

export TEST_UNDER_FLUX_QUORUM=1
export TEST_UNDER_FLUX_START_MODE=leader

rpc() {
	flux python -c \
		"import flux, json; print(flux.Flux().rpc(\"$1\").get_str())"
}

test_under_flux 16380 system

test_expect_success 'unload sched-simple' '
	flux module remove -f sched-simple
'

test_expect_success 'update configuration' '
	flux config load <<-'EOF'
	[[resource.config]]
	hosts = "fake[0-16380]"
	cores = "0-63"
	gpus = "0-3"

	[[resource.config]]
	hosts = "fake[0-9999]"
	properties = ["compute"]

	[[resource.config]]
	hosts = "fake[10000-16000]"
	properties = ["test"]

	[[resource.config]]
	hosts = "fake[16001-16380]"
	properties = ["debug"]

	[sched-fluxion-qmanager]
	queue-policy = "easy"

	[sched-fluxion-resource]
	match-policy = "firstnodex"
	prune-filters = "ALL:core,ALL:gpu,cluster:node,rack:node"
	match-format = "rv1_nosched"
	EOF
'

test_expect_success 'reload resource with monitor-force-up' '
	flux module reload -f resource noverify monitor-force-up
'

#test_expect_success 'drain a few nodes' '
#	flux resource drain 1-100 test with drained nodes
#'

test_expect_success 'load fluxion modules' '
	flux module load sched-fluxion-resource &&
	flux module load sched-fluxion-qmanager
'

test_expect_success 'wait for fluxion to be ready' '
	time flux python -c \
		"import flux, json; print(flux.Flux().rpc(\"sched.resource-status\").get_str())"
'

perf record --call-graph=fp -p $(pgrep flux-broker-0) -o $PERF_OUT sleep 60 &

test_expect_success 'create a set of jobs with disparate end time points' '
	# time=300 && \
	# for i in $(seq 1 300) ; do \
	#	echo submitting $time && \
	flux submit --cc=300-600 \
		--quiet \
		--requires="host:fake[{cc}]" \
		-N 1 --exclusive \
		-t "{cc}s" \
		--progress \
		--jps \
		--wait-event=start \
		--setattr=exec.test.run_duration="600s" \
		sleep 600
	# time=$((time + 1)) ; \
	# done
'

test_expect_success 'submit one more job that holds the node the next needs until after all other time points' '
	holding_job=$(flux submit \
		--job-name=holder \
		--requires="host:fake[5]" \
		-N 1 --exclusive \
		-t "600s" \
		--progress \
		--jps \
		--wait-event=start \
		--setattr=exec.test.run_duration="600s" \
		sleep 600
	)
'
# PID="$!"

test_expect_success 'create a set of 2 currently unreservable jobs' '
	flux jobs -a
	flux resource info
	flux jobs -a
	flux submit --progress --jps --quiet -N1 \
		--requires="host:fake[5]" \
		--flags=waitable \
		--setattr=exec.test.run_duration=0.1s \
		sleep 0.1
	flux submit --progress --jps --quiet -N1 \
		--requires="host:fake[5]" \
		--flags=waitable \
		--setattr=exec.test.run_duration=0.1s \
		sleep 0.1
'
# kill -2 $PID
# wait $PID
# cp perf.data /Users/scogland1/Repos/flux/flux-sched/build/mine

test_expect_success 'get match stats' '
	flux module stats sched-fluxion-resource
	# time flux job wait -av || true
'

# flux resource undrain 1-100 && \
test_expect_success 'get match stats' '
	flux jobs -Ano "{id} {duration}" &&
	flux job cancel $holding_job &&
	(time flux job wait -av || true) &&
	flux jobs -Ano "{id} {duration}" &&
	flux dmesg &&
	flux jobs -a &&
	rpc sched-fluxion-resource.stats-get | jq
'

test_expect_success 'unload fluxion' '
	flux module remove sched-fluxion-qmanager &&
	flux module remove sched-fluxion-resource &&
	flux module load sched-simple
'

test_done
```
trws commented 4 months ago

I'm not sure what's up with the CI for jammy and focal: it builds fine and then seems to just sit when it should start the check. Looking into it.

trws commented 4 months ago

Unfortunately the focal and jammy failures are legitimate. I'm digging into why right now, but I have a feeling this is an old libstdc++ or old boost biting us.

trws commented 4 months ago

More data here. While the failures are legitimate, it seems like they're being caused by a difference in how gcc is linking the libraries on these distros. If I compile the exact same code with clang on jammy it works, with gcc 11 (the default on jammy) it fails.

trws commented 4 months ago

I was making a mistake: I was linking the interner but not the .so that actually initialized some of the interned strings into some of the objects. Most of the time this was fine because the extern symbols are required to deduplicate and resolve correctly, but some corner case made that not happen. šŸ¤·

This is more correct and should hopefully pass everywhere.

trws commented 4 months ago

@milroy, this is finally set and everything passes, even focal (though it needed a newer compiler; at least we can kinda limp along with one still in the focal repos).

milroy commented 4 months ago

@trws I noticed the updates and build failures. Should I hold off on a review until everything is green, or should I review this as is?

trws commented 4 months ago

I'll fix it up this morning. The edit was meant to just shift from a vector to an unordered map, to avoid a danger of over-allocation if someone chooses a very high group ID, but I accidentally backed out a change needed to tolerate focal's ancient compiler.

trws commented 4 months ago

Ok, the build failures should be fixed. As part of that, this is now stacked on top of #1256, but it should be ready for review.

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 75.3%. Comparing base (73b09b4) to head (49bb9de). Report is 124 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           master   #1246     +/-   ##
========================================
+ Coverage    74.4%   75.3%    +0.9%
========================================
  Files         104     107       +3
  Lines       14940   15128     +188
========================================
+ Hits        11121   11405     +284
+ Misses       3819    3723      -96
```

[see 32 files with indirect coverage changes](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1246/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework)