Closed: saulshanabrook closed this pull request 4 weeks ago
Comparing saulshanabrook:update-python-example (0082005) with main (b0db068)
6 untouched benchmarks, 2 new benchmarks
Benchmark | main | saulshanabrook:update-python-example | Change
---|---|---|---
python_array_optimize (new) | N/A | 6.6 s | N/A
stresstest_large_expr (new) | N/A | 2.9 s | N/A
This seems wrong to me. We want to be able to catch regressions in our whole codebase, not just in 6 slow tests, right?
I think the issue is that the shorter tests have too much run-to-run variability to benchmark reliably. Alternatively, we could try a deterministic memory allocator for our benchmarks; there's a thread I link to above that outlines the issue.
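Concretely, the allocator experiment could be tiny. A minimal sketch, assuming the benchmarks run in an ordinary Rust binary and using mimalloc purely as an example of a single fixed allocator (no idea yet whether it's actually stable enough for CodSpeed):

```rust
// Pin the benchmark binary to one fixed global allocator so allocation
// behaviour doesn't depend on the platform's default allocator.
// mimalloc is only an illustrative choice here.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```

Everything the benchmarks allocate would then go through that one allocator, so at least the allocator itself stops being a source of run-to-run differences.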
My sense is that if a change doesn't affect our larger benchmarks it's probably fine to pull in, as long as they are representative enough of all the code.
If the large tests are representative then this is a good change; I just didn't think that they were.
Looking at the codspeed regressions on a different PR (#448) I'm seeing that a lot of the variance comes from the parser.
Aren't most of the regressions from short-running programs uninteresting because many are one-off costs like parsing, compilation, and bookkeeping? We care more about performance-critical components like query/apply/rebuild, which only become prominent in longer-running programs.
I think the old python example is a good stress test for the type checker. Maybe we can still keep it but under a different name?
I think that we should increase the regression threshold back to 10%: look at this report. Even on a benchmark that takes >10 ms, the rebuild method still gets significantly slower. (Or we should fix rebuild to not be randomly slow? Unclear how possible that is.)
Another option is I could "ignore" all benchmarks that aren't long running: https://docs.codspeed.io/features/ignoring-benchmarks/
That way they still get recorded and can be viewed manually but won't influence the reports.
The only downside here is that if we add additional short examples, they will have to be manually added to the ignore list in the UI.
That last option seems strictly better than the solution in this PR to me: opt-out vs. opt-in. Hopefully it would motivate us to create better benchmarks in the future.
OK sounds good! I will update this PR to just add the Python benchmark (and keep the old one) and then just ignore the smaller benchmarks for now :)
OK I think this is ready to merge
In this PR we change the benchmarks to only run long-running examples instead of all of them. This is to reduce variance caused by nondeterminism in the memory allocator, which is especially prominent in short benchmarks (https://github.com/oxc-project/backlog/issues/89). It also updates the Python example to more accurately reflect the current workload: optimizing and then converting to a final Python string, all with CSE (so the file is smaller than before). The final output string should look something like this: