Closed: ibraheemdev closed this 1 week ago
12 new benchmarks were detected.
You will start to see performance impacts in the reports once the benchmarks are run from your default branch.
Hmm, it doesn't look like the benchmarks are running correctly under CodSpeed; the performance report shows the resolve/install benchmarks completing in microseconds.
Hey @ibraheemdev, I am a co-founder at @CodSpeedHQ!
> Hmm, it doesn't look like the benchmarks are running correctly under CodSpeed; the performance report shows the resolve/install benchmarks completing in microseconds.
Yes, running arbitrary executables in a benchmark with CodSpeed will not give relevant results, since most of the compute happens in a new process that is not instrumented. It would be best to call the underlying library functions directly, without relying on the built executable.
For example, calling https://github.com/astral-sh/uv/blob/2af80c28a8e6a2da755ab78f3ea7b028e8b1510c/crates/uv/src/commands/pip_compile.rs#L52 instead of https://github.com/ibraheemdev/uv/blob/4ebdc40f60562c05559ac6331abe1a56275e2c8b/crates/bench/benches/uv.rs#L41-L42.
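To illustrate the point, here is a minimal stdlib-only sketch (the function names are hypothetical stand-ins, not uv's actual API): a benchmark that spawns a subprocess only measures spawn/wait overhead under instrumentation, since the child process's work is invisible to the profiler, whereas an in-process call exercises code the instrumented runner can actually see.

```rust
use std::process::Command;
use std::time::Instant;

// Roughly what the original bench did: spawn the built binary.
// Under CodSpeed's instrumentation, the child process's work is not
// measured, so only the cheap spawn/wait overhead shows up.
fn bench_via_subprocess() {
    let status = Command::new("true") // stand-in for the `uv` executable
        .status()
        .expect("failed to spawn");
    assert!(status.success());
}

// Preferred: call the library function in-process, so the instrumented
// benchmark runner executes (and measures) the interesting work itself.
fn resolve_in_process(requirements: &str) -> usize {
    // Hypothetical stand-in for e.g. the pip_compile entry point.
    requirements.split_whitespace().count()
}

fn main() {
    let t = Instant::now();
    bench_via_subprocess();
    println!("subprocess: {:?}", t.elapsed());

    let t = Instant::now();
    let n = resolve_in_process("black jupyter flask");
    println!("in-process: {:?} ({n} requirements)", t.elapsed());
}
```

The same reasoning applies to any profiler-based harness, not just CodSpeed: work done in a forked child is outside the instrumented process.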
Hope that helps a bit!
@adriencaccia Thanks! I suspected we would have to do this eventually, but didn't realize CodSpeed doesn't support instrumenting external commands at all.
Nice, thank you! Open to giving this a shot. Do we have any sense of what the variance/noise will look like?
I tested it out on my fork at https://github.com/adriencaccia/uv/pull/1, and I have the following variance results for 101 runs on the same commit:
Found 101 runs for adriencaccia/uv (fca26cde1b54f7467267ca4dff7a9b9cb6f10d29)
| benchmark | average | standardDeviation | varianceCoefficient | range | rangeCoefficient |
| --- | --- | --- | --- | --- | --- |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-incompatible]` | 1.1 µs | 27.3 ns | 2.5% | 55.6 ns | 5.1% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-compatible]` | 2.5 µs | 27.3 ns | 1.1% | 55.6 ns | 2.2% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-compatible]` | 2.6 µs | 27.3 ns | 1.0% | 55.6 ns | 2.1% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-incompatible]` | 1.8 µs | 13.6 ns | 0.7% | 27.8 ns | 1.5% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-short-extension]` | 2.6 µs | 13.6 ns | 0.5% | 27.8 ns | 1.1% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-long-extension]` | 2.6 µs | 13.6 ns | 0.5% | 27.8 ns | 1.1% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-compatible]` | 12 µs | 13.6 ns | 0.1% | 27.8 ns | 0.2% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-incompatible]` | 12.2 µs | 13.6 ns | 0.1% | 27.8 ns | 0.2% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_build_platform_tags::build_platform_tags[burntsushi-archlinux]` | 6.3 ms | 13.6 ns | 0.0% | 27.8 ns | 0.0% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-incompatible]` | 26.3 µs | 0 ns | 0.0% | 0 s | 0.0% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-compatible]` | 21 µs | 0 ns | 0.0% | 0 s | 0.0% |
It is fairly stable, so you should be able to set a low regression threshold of around 5%.
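For reference, the columns in these tables can be reproduced from raw run times. This is a sketch with made-up sample numbers, not CodSpeed's actual implementation: the variance coefficient is the standard deviation as a percentage of the mean, and the range coefficient is the min-to-max spread as a percentage of the mean.

```rust
/// Compute (mean, std dev, variance coefficient %, range, range coefficient %)
/// over a set of benchmark run times.
fn stats(samples: &[f64]) -> (f64, f64, f64, f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Population variance: mean squared deviation from the mean.
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std_dev = var.sqrt();
    let min = samples.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = samples.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    (mean, std_dev, std_dev / mean * 100.0, range, range / mean * 100.0)
}

fn main() {
    // e.g. five runs of the same benchmark, in milliseconds (made-up numbers)
    let runs = [15.0, 15.2, 14.9, 15.4, 15.0];
    let (mean, sd, cv, range, rc) = stats(&runs);
    println!("avg {mean:.2} ms, stddev {sd:.3} ms, CV {cv:.1}%, range {range:.1} ms ({rc:.1}%)");
}
```

A low variance coefficient across identical-commit runs is what justifies a tight regression threshold: any change larger than a few CVs is very unlikely to be noise.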
Awesome, thanks so much Adrien!
Hm, I don't see the resolver benchmarks there. I'd expect the distribution filename benches to be very stable, but the resolver ones are probably less so.
Looks like there's something wrong and the resolver benches are missing on the latest commit.
I forgot to use the `codspeed-criterion-compat` shim in the `uv` benchmarks, but it looks like the crate doesn't support async runs.
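A common workaround when a harness only accepts synchronous closures is to drive the async function to completion with a blocking executor inside the bench body (with tokio, `Runtime::block_on`). The idea can be shown with a stdlib-only executor; this is a hedged sketch of the technique, not what this PR ended up doing:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// A waker that unparks the thread that is blocking on the future.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal blocking executor: poll the future on the current thread,
/// parking until the waker fires, until it resolves.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker: Waker = Arc::new(ThreadWaker(thread::current())).into();
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    // Hypothetical stand-in for an async resolver call.
    async fn resolve() -> u32 {
        42
    }
    // The synchronous bench body can now call the async function.
    let result = block_on(resolve());
    println!("resolved: {result}");
}
```

In practice you would reuse a single runtime across iterations rather than a hand-rolled executor, so that runtime construction cost does not pollute the measurement.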
@ibraheemdev let me know when you want me to run variance checks again on the new benchmarks.
@adriencaccia can you run them now? I'm also curious why benchmarks seem to be running ~15x slower on CodSpeed than locally, is it using an aggregate time instead of per-run?
> @adriencaccia can you run them now?
Alright, I started them. Will post the results once they are done!
> I'm also curious why benchmarks seem to be running ~15x slower on CodSpeed than locally, is it using an aggregate time instead of per-run?
This is because we run the code under Valgrind, which adds a 4x to 10x overhead, sometimes more. But that is how we get those consistent measurements and flamegraphs!
Results with the new benchmarks:
Found 101 runs for adriencaccia/uv (7bbc18a361ba078e21186db90b98d6b88b3a8a7c)
| benchmark | average | standardDeviation | varianceCoefficient | range | rangeCoefficient |
| --- | --- | --- | --- | --- | --- |
| `crates/bench/benches/uv.rs::uv::resolve_warm_black::resolve_warm_black` | 15.3 ms | 527.5 µs | 3.4% | 2.3 ms | 15.0% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-incompatible]` | 1.1 µs | 27.5 ns | 2.5% | 55.6 ns | 5.1% |
| `crates/bench/benches/uv.rs::uv::resolve_warm_jupyter::resolve_warm_jupyter` | 366.5 ms | 4.2 ms | 1.1% | 22.8 ms | 6.2% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-short-compatible]` | 2.5 µs | 27.5 ns | 1.1% | 55.6 ns | 2.2% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-compatible]` | 2.6 µs | 27.5 ns | 1.0% | 55.6 ns | 2.1% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_tag_compatibility::wheelname_tag_compatibility[flyte-long-incompatible]` | 1.9 µs | 13.7 ns | 0.7% | 27.8 ns | 1.5% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-short-extension]` | 2.6 µs | 13.7 ns | 0.5% | 27.8 ns | 1.1% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing_failure::wheelname_parsing_failure[flyte-long-extension]` | 2.6 µs | 13.7 ns | 0.5% | 27.8 ns | 1.1% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-compatible]` | 12 µs | 13.7 ns | 0.1% | 27.8 ns | 0.2% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-short-incompatible]` | 12.2 µs | 13.7 ns | 0.1% | 27.8 ns | 0.2% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_build_platform_tags::build_platform_tags[burntsushi-archlinux]` | 6.3 ms | 13.7 ns | 0.0% | 27.8 ns | 0.0% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-incompatible]` | 26.3 µs | 0 ns | 0.0% | 0 s | 0.0% |
| `crates/bench/benches/distribution_filename.rs::distribution_filename::benchmark_wheelname_parsing::wheelname_parsing[flyte-long-compatible]` | 21 µs | 0 ns | 0.0% | 0 s | 0.0% |
Indeed, it seems that `crates/bench/benches/uv.rs::uv::resolve_warm_black::resolve_warm_black` is a bit more inconsistent.
Maybe we should remove the Black test? Its variance seems much higher than the Jupyter test's.
I wonder why that is. The variance shouldn't be that different between the two.
@zanieb It's probably that the actual resolve step is faster, so the benchmark is more influenced by other factors (file I/O, etc.)
I'm going to go ahead and merge this with just the jupyter benchmark. We'll see how consistent/useful the reports are.
Summary
Runs the resolver benchmarks in CI with CodSpeed.