dotnet/performance

Memory Randomization #1602

Open · adamsitnik opened this issue 3 years ago

adamsitnik commented 3 years ago

The Problem

The problem that we are facing with our current setup is a set of benchmarks that:

Goal

The goal of the study was to see how randomizing memory alignment could improve long-term benchmark stability. What might seem surprising is that we don't want a flat distribution when running locally; we want the full picture across all possible scenarios (aligned and unaligned).

The idea comes from the Stabilizer project, which @AndyAyersMS suggested as a possible solution a few years ago when we started discussing this problem.

Implementation

The "randomization" has been implemented as a new, optional feature in BenchmarkDotNet in the following way:

All the changes that have been required to introduce this feature in BenchmarkDotNet can be seen here
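In outline, when the feature is enabled, the engine allocates a random-sized dummy object and then re-runs the benchmark's setup, so the buffers that the setup allocates land at different heap offsets across iteration sets (the stack side is covered below). A simplified sketch of the idea, with hypothetical helper names; this is not the actual BenchmarkDotNet code:

```csharp
using System;

public class HeapRandomizationSketch
{
    private static readonly Random Random = new Random();
    private static byte[] _padding; // kept alive so it actually occupies heap space

    // Hypothetical engine step: allocate a random-sized object before the
    // benchmark's GlobalSetup runs, so whatever the setup allocates starts
    // at a slightly different (randomly shifted) address each time.
    public static void RandomizeHeapBeforeSetup(Action globalSetup)
    {
        _padding = new byte[Random.Next(32)];
        globalSetup(); // the benchmark's input buffers now get a random alignment
    }
}
```

The feature is opt-in per job; it is exposed as a job setting (the `WithMemoryRandomization` job extension), shown in a fuller config sketch later in this issue.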

To make the existing benchmarks work with the new feature, the following changes were required:

Methodology

The following methodology was used:

Observations

Managed memory buffers

Randomization has a very strong effect on benchmarks that use contiguous managed memory buffers, like arrays or spans, as input and perform some simple operations on them. Example:

https://github.com/dotnet/performance/blob/8aed638c9ee65c034fe0cca4ea2bdc3a68d2a6b5/src/benchmarks/micro/libraries/System.Memory/Span.cs#L74-L75
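The linked benchmark operates on a buffer allocated in its setup; a hypothetical benchmark of the same shape (names made up here, not the linked code) illustrates the pattern:

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class SpanAlignmentExample
{
    private byte[] _array;

    [GlobalSetup]
    public void Setup() => _array = new byte[512]; // where the GC places this buffer now matters

    // The measured work is one simple pass over contiguous memory, so the
    // buffer's offset relative to cache-line boundaries shows up directly
    // in the reported time.
    [Benchmark]
    public void Reverse() => new Span<byte>(_array).Reverse();
}
```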

With randomization disabled, we always get a flat distribution:

-------------------- Histogram --------------------
[19.319 ns ; 19.739 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
---------------------------------------------------

With Randomization enabled, the distribution becomes multimodal:

-------------------- Histogram --------------------
[17.531 ns ; 19.318 ns) | @@@@@@@@@
[19.318 ns ; 20.833 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@
[20.833 ns ; 22.428 ns) | @
[22.428 ns ; 23.943 ns) |
[23.943 ns ; 25.662 ns) |
[25.662 ns ; 27.177 ns) | @@@
---------------------------------------------------

This better represents what we see in the historical data:

[chart: historical results for this benchmark]

The reporting system currently represents each run as a single point on the chart, using the average of all its iterations. If we switch to the median, we should be able to flatten the charts. We can't display every iteration result as a separate point: we run every benchmark multiple times per day (each run produces 15-20 iteration results), and existing charting libraries can't handle that much data (a few hundred data points per day over a year, the current .NET release cycle).
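To illustrate why the median flattens the charts: with a bimodal set of iteration results, the mean shifts with the mix of fast and slow modes, while the median stays on the dominant mode. A standalone sketch with made-up numbers:

```csharp
using System;
using System.Linq;

public class MedianVsMean
{
    public static void Main()
    {
        // Made-up iteration results (ns): most land near 19.3, some near 26
        // (a second alignment mode). Run 2 just hits the slow mode once more.
        double[] run1 = { 19.2, 19.3, 19.3, 19.4, 19.4, 26.0 };
        double[] run2 = { 19.2, 19.3, 19.3, 19.4, 26.0, 26.1 };

        Console.WriteLine($"mean:   {run1.Average():F2} -> {run2.Average():F2}"); // 20.43 -> 21.55, chart jumps
        Console.WriteLine($"median: {Median(run1):F2} -> {Median(run2):F2}");     // 19.35 -> 19.35, chart stays flat
    }

    private static double Median(double[] values)
    {
        double[] sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
    }
}
```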

The more inputs a benchmark uses, the bigger the difference. The best example is copying one array to another:

https://github.com/dotnet/performance/blob/2bc578a6b60bf794d58669e94cc8bcd30cbbf5a8/src/benchmarks/micro/libraries/System.Collections/CopyTo.cs#L35-L36

Randomization disabled:

-------------------- Histogram --------------------
[87.431 ns ; 89.140 ns) | @@@@@@@@@@@@@
---------------------------------------------------

Randomization enabled:

-------------------- Histogram --------------------
[ 79.009 ns ; 183.239 ns) | @@@@@@@@@@
[183.239 ns ; 287.047 ns) | @
[287.047 ns ; 391.277 ns) | @@@@@@@@
[391.277 ns ; 501.821 ns) | @
---------------------------------------------------

We have a few modes here: neither or both arrays aligned, and only the source or only the destination aligned.

Stack memory

Allocating random-sized stack memory affects CPU-bound benchmarks that don't use any managed memory as input. An example:

https://github.com/dotnet/performance/blob/2bc578a6b60bf794d58669e94cc8bcd30cbbf5a8/src/benchmarks/micro/libraries/System.Numerics.Vectors/Perf_Matrix4x4.cs#L105-L106
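Conceptually, the engine reserves a random number of bytes on the stack before invoking the workload, so the workload's frame (and any stack-resident locals, like the Matrix4x4 values here) gets a different alignment each time. A rough sketch of the idea, not the actual BenchmarkDotNet code:

```csharp
using System;

public class StackRandomizationSketch
{
    private static readonly Random Random = new Random();

    public static void RunWithRandomStackOffset(Action workload)
    {
        // Reserve 0-31 bytes so that everything below this frame, including
        // the workload's locals, shifts to a different stack alignment.
        Span<byte> padding = stackalloc byte[Random.Next(32)];
        if (!padding.IsEmpty)
            padding[0] = 1; // touch the memory so it is not optimized away

        workload();
    }
}
```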

Before this change, a single run with BenchmarkDotNet would always produce a flat distribution:

-------------------- Histogram --------------------
[17.139 ns ; 17.937 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
---------------------------------------------------

But with randomization enabled, the distribution is not flat anymore:

-------------------- Histogram --------------------
[17.313 ns ; 17.859 ns) | @@@@@@@@@@@@@@@@@@@@@@
[17.859 ns ; 18.315 ns) | @@@@@@@@@@
[18.315 ns ; 19.027 ns) | @@@
---------------------------------------------------

This is much closer to what we see in the historical data.

The benchmark seems to be stable from a high-level perspective:

[chart: benchmark trend, high-level view]

But it has a 0.5-1 ns variance when we zoom in:

[chart: zoomed-in view of the same trend]

Not a Silver Bullet

There are some unstable benchmarks, like System.Collections.CreateAddAndClear.Stack, where randomizing memory does not help. We still get a flat distribution, but the reported time changes slightly (most probably because the memory is not perfectly warmed up, or because the benchmark depends on code alignment):

Before:

-------------------- Histogram --------------------
[1.846 us ; 1.960 us) | @@@@@@@@@@@@@@@
---------------------------------------------------

After:

-------------------- Histogram --------------------
[2.087 us ; 2.204 us) | @@@@@@@@@@@@@@@
---------------------------------------------------

BenchmarkDotNet

For a given distribution:

-------------------- Histogram --------------------
[ 79.009 ns ; 183.239 ns) | @@@@@@@@@@
[183.239 ns ; 287.047 ns) | @
[287.047 ns ; 391.277 ns) | @@@@@@@@
[391.277 ns ; 501.821 ns) | @
---------------------------------------------------

The 391.277 ns ; 501.821 ns bucket would typically be recognized by BenchmarkDotNet as an upper outlier and removed. For benchmarks with randomization enabled, outlier removal is therefore disabled.

The performance repo is configured to run up to 20 iterations. To get a full picture, we need to run more iterations for benchmarks with randomization enabled.

Moreover, the default BenchmarkDotNet heuristic stops the benchmarking when the results are stable or after 100 iterations, whichever comes first. This is why this setting should not be enabled in BenchmarkDotNet by default. A sketch of how a benchmark could opt in is shown below.
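Putting the pieces from this section together, a per-benchmark opt-in could look roughly like the config below. `WithMemoryRandomization` is the job extension this feature adds; the exact `OutlierMode` namespace has moved between BenchmarkDotNet versions (it lives in Perfolizer in recent ones), so treat the usings as an approximation:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using Perfolizer.Mathematics.OutlierDetection;

public class MemoryRandomizationConfig : ManualConfig
{
    public MemoryRandomizationConfig()
    {
        AddJob(Job.Default
            .WithMemoryRandomization(true)             // opt in; the feature is off by default
            .WithMaxIterationCount(40)                 // more than the usual 20, to sample more alignments
            .WithOutlierMode(OutlierMode.DontRemove)); // slow alignment modes are signal here, not outliers
    }
}
```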

Summary

Randomization is not a silver bullet that solves all our problems, and it should not be enabled by default.

It can help a lot with unstable benchmarks that use contiguous managed memory buffers like arrays or spans for input and perform some simple operations on them. Examples:

So far, CPU-bound benchmarks with a 0-1 ns variance between runs have not been unstable enough to cause problems for the auto-filing bot, and I believe that we should not enable this feature for them until it becomes a problem.

Proposed order of actions

@DrewScoggins @AndyAyersMS @kunalspathak @tannergooding if you agree with my findings then these are the steps that need to be taken to enable the randomization: (I can do all of them except the last one)

cc @andreyakinshin @billwert @Lxiamail @jeffhandley @danmosemsft

jkotas commented 3 years ago

change the Reporting System to use Medians instead of Averages

I do not follow the reasoning behind this. Does this carry a risk of missing regressions for the situations in the upper half of the spectrum?

adamsitnik commented 3 years ago

Does this carry a risk of missing regressions for the situations in the upper half of the spectrum?

Using Medians would only help to flatten the charts (visualization).

When it comes to missing regressions, it depends on how the auto-filing bot is implemented. If it takes only medians into consideration then, as you said, it might miss regressions in less common cases. The same is true for averages.

In a perfect world, the bot should take all the values into consideration. In the past, @DrewScoggins mentioned that it would take too much time to fetch all the numbers from the database.

@DrewScoggins do you believe that we could optimize the DB queries and use the full data? How would it affect the time it takes to detect regressions? Are we talking about going from 5 minutes to half an hour, or more like from an hour to ten hours? Maybe we could run it just once per week?

kunalspathak commented 3 years ago

This is a great study and we should definitely go ahead with memory randomization. During my several investigations (the most recent in https://github.com/dotnet/runtime/issues/44457#issuecomment-727063983), I have seen cases where the disassembly remains unchanged but the performance fluctuates, which could happen because of memory alignment.

If others agree to the memory randomization, can you list all the benchmarks (exact benchmark names) that improved, regressed, or stabilized because of this change? That list will act as a guide for us whenever we see any improvements/regressions in benchmarks, to check whether they are impacted by memory alignment. I am not sure how you ran the benchmarks, but I would try it on the perf lab hardware (all available OS/arch combinations) and use ResultComparer on the results to see how memory alignment impacts benchmarks in various configurations. It would be good to document that too, so we can refer to it in the future. Lastly, this might reset the baseline of several benchmarks, and our automated bug-filing mechanism (@DrewScoggins) should take that into account instead of flagging them.

DrewScoggins commented 3 years ago

In a perfect world, the bot should take all the values into consideration. In the past, @DrewScoggins mentioned that it would take too much time to fetch all the numbers from the database.

@DrewScoggins do you believe that we could optimize the DB queries and use the full data? How would it affect the time it takes to detect regressions? Are we talking about going from 5 minutes to half an hour, or more like from an hour to ten hours? Maybe we could run it just once per week?

The analysis engine really has two parts: comparing two builds to decide whether we have a regression, and building a trend line for particular tests.

On the trend generation side, I am not sure exactly how much time downloading the full results would add to the analysis. I do know that we would have to pull down about 8x as much data as we do when using the summarized data row. Overall, I think the better approach would be to add the median as something that we calculate when we insert the summarized data row. We could then use the median when appropriate and use the average everywhere else when building our trend graphs.

We could, however, switch over to using the full test data for the comparison between the two builds. This is actually five builds, as we also look at the three builds before the comparison point to provide extra confidence, but that does not materially change much. Doing this would not only allow for average- or median-based analysis, but also more advanced tests like Mann-Whitney. We would also not have to worry about the performance, as these five builds represent only about 1/8th of the total data that we pull down.
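For reference, the Mann-Whitney comparison mentioned above needs the full samples from both builds, but the statistic itself is cheap to compute. An illustrative sketch (no tie correction or p-value machinery):

```csharp
public static class MannWhitney
{
    // U statistic: the number of (baseline, candidate) pairs where the
    // candidate measurement is greater, counting ties as one half.
    // A U far from baseline.Length * candidate.Length / 2 suggests a shift.
    public static double UStatistic(double[] baseline, double[] candidate)
    {
        double u = 0;
        foreach (double b in baseline)
            foreach (double c in candidate)
                u += c > b ? 1 : (c == b ? 0.5 : 0);
        return u;
    }
}
```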

My recommendation is that we start adding the median as a part of the summarized results data row, and make sure we have some way to mark a test as needing to use the median instead of the average as the comparison metric. This will have the smallest impact while getting us where we need to go.

AndyAyersMS commented 3 years ago

We expect memory randomization to produce broader/more complex distributions on the tests where it matters, so summarizing those distributions with one number and then making inferences based on it does seem problematic. At a minimum we could also look at the standard deviation or IQR or some other non-parametric measure of breadth -- that would be two numbers per run instead of N.
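For concreteness, those two numbers per run could be the median plus the IQR; a sketch of such a summary (a hypothetical helper, not part of the current reporting system):

```csharp
using System;
using System.Linq;

public static class RunSummary
{
    // Summarize one run's iteration results with two non-parametric numbers:
    // the median (center) and the interquartile range (breadth).
    public static (double Median, double Iqr) Summarize(double[] iterations)
    {
        double[] sorted = iterations.OrderBy(v => v).ToArray();
        return (Percentile(sorted, 50), Percentile(sorted, 75) - Percentile(sorted, 25));
    }

    // Linear-interpolation percentile over pre-sorted data.
    private static double Percentile(double[] sorted, double p)
    {
        double rank = p / 100.0 * (sorted.Length - 1);
        int lo = (int)Math.Floor(rank);
        int hi = (int)Math.Ceiling(rank);
        return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
    }
}
```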

AndyAyersMS commented 3 years ago

Just to echo what @jkotas said above...

As a thought experiment, imagine we have a method that behaves differently depending on alignment (say it has a top-level alignment-dependent branch that chooses approach A or approach B). Say A is chosen with higher likelihood, and is faster than B. Will we be able to spot regressions in B? A contrived sketch of such a method is shown below.
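Purely illustrative; the alignment check and the two paths stand in for whatever the JIT or the library actually does:

```csharp
public static unsafe class AlignmentDependentSum
{
    // Top-level alignment-dependent branch: approach A for 32-byte-aligned
    // buffers (the common, fast case), approach B otherwise.
    public static long Sum(byte[] data)
    {
        fixed (byte* p = data)
        {
            if ((ulong)p % 32 == 0)
                return SumAligned(p, data.Length);   // approach A: hit most of the time
            return SumUnaligned(p, data.Length);     // approach B: rare and slower
        }
    }

    private static long SumAligned(byte* p, int length)
    {
        long sum = 0; // stand-in for an aligned SIMD loop
        for (int i = 0; i < length; i++) sum += p[i];
        return sum;
    }

    private static long SumUnaligned(byte* p, int length)
    {
        long sum = 0; // stand-in for a scalar fallback
        for (int i = 0; i < length; i++) sum += p[i];
        return sum;
    }
}
```

If most iterations take approach A, a regression confined to B only shows up in the upper tail of the distribution, which a single mean or median per run will not surface.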