egraphs-good / egglog

egraphs + datalog!
https://egraphs-good.github.io/egglog/
MIT License

Benchmark examples with codspeed #443

Closed · saulshanabrook closed 1 month ago

saulshanabrook commented 1 month ago

This PR adds benchmarking to CI by running all of our example files with CodSpeed.

It will comment on new PRs, reporting how much they affect the performance of running the example files.

Hopefully it will not only help us see how PRs affect performance, but also give us some profiling through CodSpeed to diagnose where things speed up or slow down.
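For anyone who hasn't seen a CodSpeed setup for Rust, the general shape is roughly the sketch below: a criterion-style benchmark that loads each example file and times one run of it. This is only an illustration, not the exact harness in this PR; `codspeed-criterion-compat` is CodSpeed's drop-in replacement for criterion, and the `tests/` directory and `run_example` helper are assumptions standing in for wherever the examples live and however they actually get executed.

```rust
// Illustrative sketch only, not the exact harness in this PR.
// Assumes the `codspeed-criterion-compat` crate, a drop-in replacement for
// criterion that reports to CodSpeed when run under its instrumentation in CI.
use codspeed_criterion_compat::{criterion_group, criterion_main, Criterion};
use std::fs;

// Hypothetical stand-in for however an example program is actually executed
// (e.g. parsing the file and running it against a fresh e-graph).
fn run_example(source: &str) {
    let _ = source;
}

fn example_benchmarks(c: &mut Criterion) {
    // Assumed location of the example files; one benchmark per `.egg` file.
    for entry in fs::read_dir("tests").unwrap() {
        let path = entry.unwrap().path();
        if path.extension().and_then(|e| e.to_str()) == Some("egg") {
            let name = path.file_stem().unwrap().to_string_lossy().into_owned();
            let source = fs::read_to_string(&path).unwrap();
            c.bench_function(&name, |b| b.iter(|| run_example(&source)));
        }
    }
}

criterion_group!(benches, example_benchmarks);
criterion_main!(benches);
```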

saulshanabrook commented 1 month ago

I think this needs an admin on the egraphs-good org to approve the app in CodSpeed; I had to "request" adding it to this org.

I would also need admin permissions on this repo to add the CODSPEED_TOKEN secret, or someone else who already has them can create one and add it here!

saulshanabrook commented 1 month ago

In response to some offline questions about the downsides of adding CodSpeed:

It adds some overall complexity by depending on an external service, and it will add some noise to PRs about their performance impacts. It may also increase the time it takes to run CI, from maybe 2 mins to 4 mins? If we find this too detrimental, we can exclude some of our longer-running examples from benchmarking.
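As a concrete illustration of that last point (purely a sketch; the two names are just the slowest entries in the report further down), such an exclusion could be a small skip list in the harness that gates the `bench_function` call:

```rust
// Sketch only: if CI time becomes a problem, the harness could skip the
// slowest examples. These two are the longest-running entries in the
// CodSpeed report below (~5.5 s and ~4.2 s respectively).
const SKIP: &[&str] = &["eggcc-extraction", "math-microbenchmark"];

fn should_benchmark(name: &str) -> bool {
    !SKIP.contains(&name)
}
```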

Performance measurement is also not an exact science, so it could give us false assurance when the impact might be different in a different environment. It also means we are judging performance based on the example files, which might not be representative of performance-sensitive workloads.

Overall, though, I would judge the additional signal to be worth those costs: it helps us assess the impact of PRs on performance, especially those meant to make things faster. I think it will also encourage us to keep representative example files.

I have found CodSpeed to be invaluable in the Python package for diagnosing the impact of large changes on end-to-end performance, and it has helped me refine those changes so that the impacts are reasonable. Its built-in flamegraphs have also helped me narrow down where the slowdowns occur and see upstream slowdowns caused by egglog changes. So adding this instrumentation closer to the source would help us get that information sooner.

codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Congrats! CodSpeed is installed 🎉

🆕 87 new benchmarks were detected.

You will start to see performance impacts in the reports once the benchmarks are run from your default branch.

Detected benchmarks

- `antiunify` (2.1 ms)
- `array` (26.8 ms)
- `bdd` (15.2 ms)
- `before-proofs` (1.6 ms)
- `birewrite` (1.3 ms)
- `bitwise` (678.3 µs)
- `bool` (1.4 ms)
- `calc` (5.4 ms)
- `combinators` (18.4 ms)
- `combined-nested` (1 ms)
- `container-rebuild` (2 ms)
- `cyk` (11.2 ms)
- `cykjson` (338.4 ms)
- `datatypes` (582.1 µs)
- `delete` (642.5 µs)
- `eggcc-extraction` (5.5 s)
- `eqsat-basic` (1.6 ms)
- `eqsolve` (31.1 ms)
- `f64` (935.7 µs)
- `fail_wrong_assertion` (1.3 ms)
- `fibonacci` (1.6 ms)
- `fibonacci-demand` (2.1 ms)
- `fusion` (43.7 ms)
- `herbie` (286.2 ms)
- `herbie-tutorial` (12.7 ms)
- `i64` (350.2 µs)
- `include` (1.2 ms)
- `integer_math` (12.2 ms)
- `intersection` (1.9 ms)
- `interval` (2.7 ms)
- `knapsack` (5.7 ms)
- `lambda` (145 ms)
- `levenshtein-distance` (15.1 ms)
- `list` (4.8 ms)
- `map` (650.5 µs)
- `math` (36.7 ms)
- `math-microbenchmark` (4.2 s)
- `matrix` (11.5 ms)
- `merge-during-rebuild` (1 ms)
- `merge-saturates` (3 ms)
- `name-resolution` (1.2 ms)
- `path` (1 ms)
- `path-union` (1.1 ms)
- `pathproof` (1.5 ms)
- `points-to` (2 ms)
- `primitives` (523.4 µs)
- `prims` (5.4 ms)
- `push-pop` (648 µs)
- `rational` (890.2 µs)
- `repro-define` (693.2 µs)
- `repro-desugar-143` (8.6 ms)
- `repro-empty-query` (618.4 µs)
- `repro-equal-constant` (670 µs)
- `repro-equal-constant2` (654.4 µs)
- `repro-noteqbug` (745.2 µs)
- `repro-primitive-query` (681.2 µs)
- `repro-querybug` (929.3 µs)
- `repro-querybug2` (669.1 µs)
- `repro-querybug3` (2 ms)
- `repro-querybug4` (711.8 µs)
- `repro-should-saturate` (650.2 µs)
- `repro-silly-panic` (918.8 µs)
- `repro-typechecking-schedule` (462.1 µs)
- `repro-unsound` (269.3 ms)
- `repro-unsound-htutorial` (847.3 µs)
- `repro-vec-unequal` (768.6 µs)
- `resolution` (4.1 ms)
- `rw-analysis` (39.8 ms)
- `schedule-demo` (2.2 ms)
- `semi_naive_set_function` (41.5 ms)
- `set` (2.7 ms)
- `stratified` (901.7 µs)
- `string` (539.3 µs)
- `string_quotes` (477.1 µs)
- `subsume` (1.5 ms)
- `test-combined` (1.3 ms)
- `test-combined-steps` (2.9 ms)
- `towers-of-hanoi` (5.3 ms)
- `tricky-type-checking` (11.6 ms)
- `type-constraints-tests` (578.4 µs)
- `typecheck` (5.8 ms)
- `typeinfer` (400.7 ms)
- `unification-points-to` (7.8 ms)
- `unify` (1.1 ms)
- `unstable-fn` (5.7 ms)
- `until` (2.7 ms)
- `vec` (975.2 µs)

saulshanabrook commented 1 month ago

This is ready for review now. There are lots of tradeoffs between how accurate we want the benchmarks to be and how long they take overall, but I tried to find a good balance for now.

Currently, the benchmarks CI task takes ~9 minutes to run. Why does it take so long? CodSpeed instruments the code to measure CPU cycles, which gives more reliable benchmarks on flaky hardware like CI runners:

> CodSpeed instruments your benchmarks to measure the performance of your code. A benchmark will be run only once and the CPU behavior will be simulated. This ensures that the measurement is as accurate as possible, taking into account not only the instructions executed but also the cache and memory access patterns. The simulation gives us an equivalent of the CPU cycles that includes cache and memory access.

To keep the timings reasonable, I reduced the number of times the eggcc-extraction schedules run from 5 to 2. At 2 runs it takes ~4 minutes for that benchmark alone; if I switch it to even 3 runs it takes 46 minutes, which seems too long. One question is whether eggcc-extraction is doing similar enough work at, say, 2 runs as opposed to 3. Here is a screenshot of the different profiles, the 3-run version on the left (taking around 10x as long) and the 2-run version on the right:

[screenshot: profile comparison, 3 runs (left) vs 2 runs (right)]

They look similar enough that judging relative performance differences from the shorter run should still be reasonable.

Here is a rough outline of where time is spent in the benchmarks currently:

[image: rough breakdown of where time is spent across the benchmarks]

I hope that even if this isn't a perfect representation of production workloads, it can still be helpful as we iterate on PRs. If your PR is minor, you can always merge it without waiting for the benchmarks to finish.

saulshanabrook commented 1 month ago

I had increased the "threshold" CodSpeed uses to decide whether a performance change is meaningful to 10%. I have brought it back down to 5% so it can alert us to more changes:

[screenshot: CodSpeed threshold setting]

I think the main thing it affects is the comments on PRs. If you click on the report, you can see a granular breakdown of all changes.

Alex-Fischman commented 1 month ago

@saulshanabrook Are you sure about this last change? The website says that the default is 10%, and 10% seems more in line with the variance that we're seeing.

saulshanabrook commented 1 month ago

Yeah, I'm open to that! It just seems like a 5% speedup on the long-running benchmarks is actually rather significant... and if we are trying to improve them, we might need many smaller performance improvements to get there. I don't see much variability at all on the larger benchmarks.