Performance triage by bisecting the range

kotlarmilos commented 1 year ago

The idea is to improve performance triage for complex cases (i.e. https://github.com/dotnet/perf-autofiling-issues/issues/11539 and https://github.com/dotnet/perf-autofiling-issues/issues/11536) by bisecting the range and identifying a commit that caused the regression. Such tool could input commit range and microbenchmark grouping, and could output a commit that caused the regression with logs of a searched path.

The tool could be made available for local environment where it is limited to a particular os and architecture, which I consider as a low effort. Also, it could be added as an extension to the dashboard where for a particular set of microbenchmarks it can be executed, which I consider as a mid to high effort with broader scope and easy to use approach, but with security concerns.

I would like to obtain feedback before making it available for local environments. If such tool is considered as useful in the dashboard, I am ready to help.

sblom commented 1 year ago

Would love to see something like this! Are you imagining enhancing benchmarks_ci.py to do this?

Would be very interesting to try to factor this as a git bisect predicate so that git can do the bisection work, but we can help it narrow in, although I worry that without perfect test stability we might confuse git bisect.

kotlarmilos commented 1 year ago

It is a good idea to enhance benchmarks_ci.py. I agree that it would be interesting to factor it as a git bisect. For regressions with an order of magnitude difference even without perfect test stability we might get correct results.

I can propose changes to benchmarks_ci.py in a draft PR.

matouskozak commented 1 year ago

I would like to revive the discussion about the triage script.

I think that with recent advancements by @LoopedBard3 on https://github.com/dotnet/performance/blob/main/scripts/benchmarks_local.py. We could utilize this script to binary search thru commits to identify the source of regression (or improvement).

My idea for the script would be: Input: baseline hash, compare hash, filter for microbenchmarks Output: hash of commit causing regression

Algorithm steps:

Measure the magnitude of regression between baseline and compare hash. We could use the ResultsComparer tool https://github.com/dotnet/performance/tree/main/src/tools/ResultsComparer for measuring. The tool can save results to csv which could be processed by the script later.
Using binary search algorithm, try to find commit causing the same magnitude of regression as computed in 1.
- It is possible that the results between commits are not monotonous which could break the binary search algorithm. Small variance could be detected and ignored, if there would be bigger spikes than the script should either abort or "split" itself to work on subintervals.

Further ideas:

Reverting the found commit to verify the cause of regression.
Checking out new repository and building from zero for each tested commit might be too time consuming. Having 3 repos (left, mid, right) and just using git checkout <commit> without cleaning artifacts could make the search faster, however it could cause some unwanted behavior. I would make this an optional feature for user to decide.

What do you think? @sblom @kotlarmilos @LoopedBard3

kotlarmilos commented 1 year ago

Measure the magnitude of regression between baseline and compare hash. We could use the ResultsComparer tool https://github.com/dotnet/performance/tree/main/src/tools/ResultsComparer for measuring. The tool can save results to csv which could be processed by the script later.

Good idea. There is already --diff command that allows for easy comparison.

Using binary search algorithm, try to find commit causing the same magnitude of regression as computed in 1.

I would like to explore the possibility of implementing it using git bisect (https://git-scm.com/docs/git-bisect). Additionally, I think we can extend the script for local testing to support this feature.

@matouskozak @LoopedBard3 I suggest we schedule a call to discuss it further.

matouskozak commented 1 year ago

I selected a few benchmarks from Mono perf issues that regressed in the past and have stable results to be suitable for testing the bisecting algorithm.

The first set comes from https://github.com/dotnet/perf-autofiling-issues/issues/21818 (linux-x64, AOT):

System.Memory.ReadOnlySpan

baseline: 9d08b24d743d0c57203a55c3d0c6dd1bc472e57e compare: f6c592995a4f8526508da33761230a3850d942ff expected result: 4a09c82215399c27f52277a8db7178270410c693

System.Globalization.Tests.StringSearch

baseline: da4e544809b3b10b7db8dc170331988d817240d7 compare: f6c592995a4f8526508da33761230a3850d942ff expected result: 4a09c82215399c27f52277a8db7178270410c693

The second set comes from https://github.com/dotnet/perf-autofiling-issues/issues/15795 (linux-arm64, AOT):

System.Tests.Perf_DateTimeOffset, System.Tests.Perf_DateTime, System.Globalization.Tests.Perf_DateTimeCultureInfo

baseline: 0fc78e62bb0b8d824efe7421983702b97a60def6 compare: 86b48d7c6f081c12dcc9c048fb53de1b78c9966f expected result: ef1ba771347dc1d7d626907e4731b8f2e3cf78b3

The third set comes from https://github.com/dotnet/perf-autofiling-issues/issues/14570 (linux-arm64, AOT):

System.Runtime.Intrinsics.Tests.Perf_Vector128Of, System.Runtime.Intrinsics.Tests.Perf_Vector128Of, System.Runtime.Intrinsics.Tests.Perf_Vector128Float

baseline: dce343987e81ce6cf045d983ce62d8e117d2c142 compare: c6cc201665cb3c4d61851392bd8f8d8093688500 expected result: 9bf28fe5a0bc2344bac261e07b27e6c55033318f

The last set comes from https://github.com/dotnet/perf-autofiling-issues/issues/22688 (linux-x64, JIT):

This set represents improvements (in contrast with previous three that represented regression)

System.Text.Json.Serialization.Tests.WriteJson

baseline: 736dabeca728ccf8b911d96d1b4c575b4d0db7d2 compare: fc5c29692fc1a92426b7d1ce8c501e7696062bb6 expected result: 169e22c8f9f00719d87f0674954fee688b556b4a

kotlarmilos commented 1 year ago

Thanks @matouskozak! Let's create mock data based on your input and prototype the bisecting process using a moving average filter-like approach. @LoopedBard3 since you worked on the script for local benchmarking, feel free to reflect on missing features (comparer, bisect, etc).

dotnet / performance