dotnet / BenchmarkDotNet

Powerful .NET library for benchmarking
https://benchmarkdotnet.org
MIT License

BenchmarkDotNet as a performance tests runner. #155

Open ig-sinicyn opened 8 years ago

ig-sinicyn commented 8 years ago

Hi!

As promised, here is a report on adopting BenchmarkDotNet as a performance test runner.

Bad part:

Good part: it finally works and covers almost all of our use cases:)

Let's start with a short intro describing what perftests are and what they are not.

First, benchmarks and perftests ARE NOT the same.

The difference is like the one between Olympic running shoes and hiking boots. There are some similar parts, but the use cases are obviously different:)

Performance tests are not a tool for finding the sandbox-winner method. On the contrary, they are aimed at proving that in real-world conditions the code will not break the limits set in the test. As with all other tests, perftests will be run on different machines, under different workloads, and they still have to produce repeatable results.

This means you cannot use absolute timings to set the limits for perftests. There's no point in comparing a 0.1 sec run on a tablet with a 0.05 sec run on a dedicated test server (under the same conditions the latter code may actually be 10x slower).

So you have to include a reference (or baseline) method in the benchmark and compare all other benchmark methods using a relative-to-the-baseline execution time metric. This approach is known as competition perf-testing, and it is used in all performance tests we write.
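For readers not familiar with the baseline feature: in plain BenchmarkDotNet a baseline is declared with `[Benchmark(Baseline = true)]`, and all other methods get a relative ("Scaled"/"Ratio") column. A minimal sketch (class and method names are illustrative, not part of the proposal in this issue):

```c#
// Minimal sketch of a baseline-relative benchmark in plain BenchmarkDotNet.
// Names and spin counts are illustrative.
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class BaselineCompetition
{
    [Benchmark(Baseline = true)]
    public void Baseline() => Thread.SpinWait(10000);

    // Reported relative to Baseline (the "Scaled"/"Ratio" column),
    // so a limit can be phrased as "no more than ~3.5x the baseline".
    [Benchmark]
    public void Candidate() => Thread.SpinWait(30000);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<BaselineCompetition>();
}
```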

Second, you cannot use averages to compare the results.

Averages result in too optimistic estimates; percentiles to the rescue. To keep it short, some links: http://www.goland.org/average_percentile_services/ https://msdn.microsoft.com/en-us/library/bb924370.aspx http://apmblog.dynatrace.com/2012/11/14/why-averages-suck-and-percentiles-are-great/

Also, I'd recommend the Average vs. Percentiles section from the awesome "Writing High-Performance .NET Code" book. To be honest, the entire Chapter 1 is worth reading.
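To make the percentile point concrete, here is a tiny illustration of a nearest-rank percentile over raw timings (not BenchmarkDotNet's actual implementation, just a sketch of the idea):

```c#
// Illustrative nearest-rank percentile; a p95 limit catches the slow tail
// that an average of the same measurements would hide.
using System;
using System.Linq;

static class PercentileSketch
{
    public static double Percentile(double[] timings, double p)
    {
        if (timings.Length == 0)
            throw new ArgumentException("Empty sample", nameof(timings));
        var sorted = timings.OrderBy(t => t).ToArray();
        // Nearest-rank: the smallest value covering p percent of the sample.
        int rank = Math.Max(1, (int)Math.Ceiling(p / 100.0 * sorted.Length));
        return sorted[rank - 1];
    }
}

// Usage:
// double p95 = PercentileSketch.Percentile(measurements, 95);
```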

Third, you have to set BOTH upper and lower limits.

You DO want to detect situations like "Code B unexpectedly runs 1000x faster", believe me. 100 times out of 100, "Code B" was broken somehow.

Fourth, you will have a LOT of perftests.

Our usual ratio is one perftest per 20 unit tests or so. That's not a goal, of course, just statistics from our real-world projects.

Let's say you have a few hundred perftests. This means they should be FAST. The usual time limit is 5-10 secs for large tests and 1-2 secs for smaller ones. No one will wait for an hour:)

Fifth, all perftests should be auto-annotated.

Yes, there should be an option (configurable via app.config) to collect the statistics and update the source with them. Also, the benchmark should rerun automatically with the new limits and loosen them if they are too tight. This removes the run-set-limits-repeat loop and boosts productivity by an order of magnitude. As I've said above, there will be a lot of perftests.

And there should be a way to store the annotations as attributes in the code or as a separate xml file. The latter is mandatory in case the tests are auto-generated (yes, we had those).

And the last but not least:

You should not monitor execution time only. Memory allocations and GC counts should be checked too, as they have an influence on the performance of the entire app.
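As an illustration of the kind of System.GC-based check meant here (a rough sketch, not the actual perftest infrastructure):

```c#
// Rough sketch: fail if the workload triggers any Gen 2 collections.
// The real infrastructure would collect such statistics in-process per benchmark run.
using System;

static class GcLimitSketch
{
    public static void AssertNoGen2Collections(Action workload)
    {
        int before = GC.CollectionCount(2);
        workload();
        int after = GC.CollectionCount(2);
        if (after > before)
            throw new InvalidOperationException(
                $"Workload triggered {after - before} Gen 2 collection(s).");
    }
}
```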

OK, looks like that's all:)

Oops, one more: the perftests SHOULD be compatible with any unit-testing framework. Different projects use different testing libraries, and it would be silly to require yet another one just to run the perftests.

And now, the great news:

Our current perftest implementation covers almost all of the above requirements, and it's almost stable enough to be merged into BenchmarkDotNet.

If you're interested in it, of course:) The code is in https://github.com/rsdn/CodeJam/tree/master/Main/tests-performance/BenchmarkDotNet and the example tests are in https://github.com/rsdn/CodeJam/tree/master/Main/tests-performance/CalibrationBenchmarks aaand it's kinda working :cool:

The main showstopper

We need the ability to use a custom toolchain. It looks like it will allow us to enable in-process test running much faster than waiting for #140 to be closed:)

Also, I've delayed the implementation of memory-related limits until we are sure all other parts are working fine. We definitely need the ability to collect GC statistics directly from the benchmark process. It'll allow us to use the same System.GC API we're using for monitoring in production.

When all of this is done, I'm going to start a discussion about merging the competition tests infrastructure into BenchmarkDotNet.

At the end

A list of things that are not critical but should definitely be included in the Bench.Net codebase:

  1. Percentile and scaled percentile columns.
  2. The API to group a Summary's benchmarks by the same conditions (same job and same parameters). Use case: we have a benchmark with different [Params()], and there's no sense in comparing results from Count = 100 with results from Count = 1000. You already have a similar check in BaselineDiffColumn:

              var baselineBenchmark = summary.Benchmarks.
                 Where(b => b.Job.GetFullInfo() == benchmark.Job.GetFullInfo()).
                 Where(b => b.Parameters.FullInfo == benchmark.Parameters.FullInfo).
                 FirstOrDefault(b => b.Target.Baseline);

    I propose to extract it into a public API, something like the following (a hypothetical usage sketch follows this list):

         /// <summary>
         /// Groups benchmarks being run under same conditions (job+parameters)
         /// </summary>
         public static ILookup<KeyValuePair<IJob, ParameterInstances>, Benchmark> SameConditionBenchmarks(this Summary summary)
             => summary.Benchmarks.ToLookup(b => new KeyValuePair<IJob, ParameterInstances>(b.Job, b.Parameters));
  3. API to get a BenchmarkReport from a Summary and a Benchmark. There was Summary.Reports in 0.9.3, but in 0.9.5 its type was changed from a Dictionary<> to an array.
  4. Ability to report benchmark errors from the analysers. Use case: the unit test analyser should report an error if the perf test does not fit into the timing limits. Currently I just throw an exception, but it does not fit well into the design of BenchmarkDotNet.
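
A hypothetical usage sketch for the grouping API proposed in item 2 (assuming the SameConditionBenchmarks extension above):

```c#
// Hypothetical usage: find the baseline within each same-conditions group
// instead of filtering the whole Summary.Benchmarks list every time.
foreach (var group in summary.SameConditionBenchmarks())
{
    var baseline = group.FirstOrDefault(b => b.Target.Baseline);
    if (baseline == null)
        continue;
    // ... compare the remaining benchmarks in `group` against `baseline`
}
```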

Whoa! That's all for now.

Any questions / suggestions are welcome:)

mattwarren commented 8 years ago

This is great, thanks so much for doing this, I think it'll be a great addition to BenchmarkDotNet

I need more time to look at it deeply, but on first glance it looks great!

With regards to:

Memory allocations and GC counts should be checked too, as they have an influence on the performance of the entire app.

and

Also, I've delayed the implementation of memory-related limits until we are sure all other parts are working fine. We definitely need the ability to collect GC statistics directly from the benchmark process. It'll allow us to use the same System.GC API we're using for monitoring in production.

We already have this: if you enable the GCDiagnoser you will get some extra columns in the summary statistics, e.g.

[screenshot: summary table with additional GC/allocation columns]
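
For reference, enabling the diagnoser looks roughly like this (a sketch using the present-day attribute; the exact diagnoser name and namespace differed in the 0.9.x versions discussed here):

```c#
// Sketch: the memory diagnoser adds Gen 0/1/2 collection counts and
// allocated-bytes columns to the summary table.
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class AllocationBenchmarks
{
    [Benchmark]
    public byte[] Allocate() => new byte[1024];
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<AllocationBenchmarks>();
}
```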

Or does this not meet your needs?

ig-sinicyn commented 8 years ago

@mattwarren, I'm not sure the GC diagnoser will work if the benchmark is run in-process.

If it will, or if it's easy to adapt the diagnoser for this (very specific, in my opinion) scenario, then it's definitely the way to go!

Anyway, I want to wait until all other things are more or less stable, just to save myself from doing useless work:)

mattwarren commented 8 years ago

@ig-sinicyn I was just wondering how this was going?

Do you need any help from any of us, or is it just a case of finding the time to do it (like we all seem to be struggling with!!)

ig-sinicyn commented 8 years ago

@mattwarren Well, almost no issues at all. The main trouble is that there's a lot of things to be wired together to make the entire idea work. And all of them should work fine, or the test runner would be useless.

As far as I can see there's no production-grade perftesting suite for .Net. So, sometimes I had to reinvent the wheel based on our experience in perftesting. Try-fix-repeat all the way down:)

Current results are quite promising. Most tests run in 5-7 seconds, they are accurate enough, and there's a lot of diagnostics to prevent some typical errors, e.g. environment validation and the ability to rerun the tests to detect 'on-the-edge' limits (cases when a test occasionally does not fit into the limits).

For example, the output for the case "the perftest failed because it was run under x86, not x64" looks like this:

Test Name:  RunCaseAggInlineEffective
Result Message: 
Execution failed, details below.
Errors:
    * Run #3: Method Test01Auto runs faster than 15.22x baseline. Actual ratio: 11.232x
    * Run #3: Method Test02NoInline runs faster than 16.34x baseline. Actual ratio: 11.04x
    * Run #3: Method Test03AggressiveInline runs faster than 7.21x baseline. Actual ratio: 5.28x
Warnings:
    * Run #3: Job X64_Jit-RyuJit_Warmup3_Target10_Process1_IterationTime10, property Platform: The current process is not run as x64.
    * Run #3: Job X64_Jit-RyuJit_Warmup3_Target10_Process1_IterationTime10, property Jit: The current setup does not support RyuJit.
    * Run #3: The benchmark was run 3 times (read log for details). Consider to adjust competition setup.
Diagnostic messages:
    * Run #1: Requesting 1 run(s): Competition validation failed.
    * Run #2: Requesting 1 run(s): Competition validation failed.

The output should be changed to something more readable, but at least it works:)

mattwarren commented 8 years ago

As far as I can see there's no production-grade perftesting suite for .Net. So, sometimes I had to reinvent the wheel based on our experience in perftesting. Try-fix-repeat all the way down:)

Yeah, that's probably true; the only one I've seen is NBench, but it's pretty new. Being able to do this with BenchmarkDotNet would be great!

Test Name:  RunCaseAggInlineEffective
Result Message: 
Execution failed, details below.
Errors:
    * Run #3: Method Test01Auto runs faster than 15.22x baseline. Actual ratio: 11.232x
    * Run #3: Method Test02NoInline runs faster than 16.34x baseline. Actual ratio: 11.04x
    * Run #3: Method Test03AggressiveInline runs faster than 7.21x baseline. Actual ratio: 5.28x
Warnings:
    * Run #3: Job X64_Jit-RyuJit_Warmup3_Target10_Process1_IterationTime10, property Platform: The current process is not run as x64.
    * Run #3: Job X64_Jit-RyuJit_Warmup3_Target10_Process1_IterationTime10, property Jit: The current setup does not support RyuJit.
    * Run #3: The benchmark was run 3 times (read log for details). Consider to adjust competition setup.
Diagnostic messages:
    * Run #1: Requesting 1 run(s): Competition validation failed.
    * Run #2: Requesting 1 run(s): Competition validation failed.

BTW That output looks fantastic, just the sort of thing you'd want to see.

ig-sinicyn commented 8 years ago

@mattwarren Yeah, I've checked NBench. At the moment it's missing a lot of features; as far as I can remember there was no baseline support, and it has a very weird API for specifying limits for the benchmarks:


    [PerfBenchmark(Description = "Test to ensure that a minimal throughput test can be rapidly executed.", 
        NumberOfIterations = 3, RunMode = RunMode.Throughput, 
        RunTimeMilliseconds = 1000, TestMode = TestMode.Test)]
    [CounterThroughputAssertion("TestCounter", MustBe.GreaterThan, 10000000.0d)]
    [MemoryAssertion(MemoryMetric.TotalBytesAllocated, MustBe.LessThanOrEqualTo, ByteConstants.ThirtyTwoKb)]
    [GcTotalAssertion(GcMetric.TotalCollections, GcGeneration.Gen2, MustBe.ExactlyEqualTo, 0.0d)]
    public void Benchmark()
    {
        _counter.Increment();
    }

I'm pretty sure that attribute annotations can be made less verbose. At least I will try hard to do it:)

adamsitnik commented 8 years ago

@ig-sinicyn is there anything left for us to do to help you with this task?

ig-sinicyn commented 8 years ago

@adamsitnik Actually, no. Thanks for asking! :)

The code for the first version is complete (by complete I mean ready-to-ship code quality); it has been running in dogfooding for the last two weeks without a problem. The last feature - app.config support - is ready but not pushed to the repo yet; that will be done tomorrow after code review.

After that I'll create beta nuget packages (will post here), complete the docs and samples, and create a request-for-comments thread here and on the RSDN forum.

Two main issues for now:

adamsitnik commented 8 years ago

I'm glad to hear that! Can't wait to start using it as well!

Daniel-Svensson commented 7 years ago

What is the status of this issue? Any chance of being able to use it soon, and if so, do you have an example setup to look at?

This sounds almost exactly like what we are looking for. We would like to integrate it with our TeamCity build server, which should work fine as long as the results can be accessed.

ig-sinicyn commented 7 years ago

@Daniel-Svensson

Pretty much the same:(

Good news: we have one in dogfooding since July, no major issues discovered yet.

Bad news: until now I had no free time (literally) to finish it. Today my team finally shipped a major version of the product we were working on, and I hope I'll have more time for my pet projects. If everything goes well I'll release a beta of the perftests shortly after the release of Bench.Net v0.10.

If you do not want to wait, feel free to grab the sources from here. Note that the code may not include some fixes & updates, and there'll be breaking changes after upgrading to Bench.Net 0.10.

As an example - tests running on AppVeyor (search for perftest; AppVeyor does not sort or group test results).

adamsitnik commented 7 years ago

@ig-sinicyn what is blocking you? Do you have some features unfinished, or is it something else? Maybe we could somehow help you?

ig-sinicyn commented 7 years ago

@adamsitnik no blockers actually.

The github version produces slightly unstable results when run on AppVeyor / low-end notebooks, and I want to wait for the 0.10 release before backporting. Actually, I do hope the 0.10 release will allow us to use the standard engine and remove our implementation entirely.

Vannevelj commented 7 years ago

Is there any update on this?

ig-sinicyn commented 7 years ago

@Vannevelj yep. It works and is (almost) stable. The sad part is, I'm very busy this year and there's a lot of work left before making a public announcement. Most of the TODOs are related to documentation and samples, so I may release a public beta if you're interested in it.

Vannevelj commented 7 years ago

@ig-sinicyn Definitely interested, so if you find the time to do it, that would be great.

ig-sinicyn commented 7 years ago

@Vannevelj Okay, here we go. Please note this beta targets BDN 0.10.5 and .NET 4.6; I'll update it to 0.10.6 over the weekend. Nuget packages: https://www.nuget.org/packages?q=CodeJam.PerfTests (install the one for the unit test framework you are using).

Intro: https://github.com/rsdn/CodeJam/blob/master/PerfTests/docs/Intro.md

Small teaser. The code:

```c#
// A perf test class.
[Category("PerfTests: NUnit examples")]
[CompetitionAnnotateSources] // Opt-in feature: source annotations.
[CompetitionBurstMode]       // Use this for large-loops benchmark
public class SimplePerfTest
{
    private const int Count = CompetitionRunHelpers.BurstModeLoopCount;

    // Perf test runner method.
    [Test]
    public void RunSimplePerfTest() => Competition.Run(this);

    // Baseline competition member.
    // All relative metrics will be compared with metrics of the baseline method.
    [CompetitionBaseline]
    public void Baseline() => Thread.SpinWait(Count);

    // Competition member #1. Should take ~3x more time to run.
    [CompetitionBenchmark]
    public void SlowerX3() => Thread.SpinWait(3 * Count);

    // Competition member #2. Should take ~5x more time to run.
    [CompetitionBenchmark]
    public void SlowerX5() => Thread.SpinWait(5 * Count);

    // Competition member #3. Should take ~7x more time to run.
    [CompetitionBenchmark]
    public void SlowerX7() => Thread.SpinWait(7 * Count);
}
```

The first run (should take ~20 seconds) annotates the sources:

```c#
// Baseline competition member.
// All relative metrics will be compared with metrics of the baseline method.
[CompetitionBaseline]
[GcAllocations(0)]
public void Baseline() => Thread.SpinWait(Count);

// Competition member #1. Should take ~3x more time to run.
[CompetitionBenchmark(2.76, 3.28)]
[GcAllocations(0)]
public void SlowerX3() => Thread.SpinWait(3 * Count);

// Competition member #2. Should take ~5x more time to run.
[CompetitionBenchmark(4.74, 5.73)]
[GcAllocations(0)]
public void SlowerX5() => Thread.SpinWait(5 * Count);

// Competition member #3. Should take ~7x more time to run.
[CompetitionBenchmark(6.72, 7.34)]
[GcAllocations(0)]
public void SlowerX7() => Thread.SpinWait(7 * Count);
```

Then, comment out the `[CompetitionAnnotateSources]` attribute aaand (check the test output):

```
BenchmarkDotNet=v0.10.5, OS=Windows 10.0.15063
Processor=Intel Core i7-3537U CPU 2.00GHz (Ivy Bridge), ProcessorCount=4
Frequency=2435874 Hz, Resolution=410.5303 ns, Timer=TSC
  [Host] : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2046.0

Job=CompetitionAnyCpu  Platform=AnyCpu  Force=False
Toolchain=InProcessToolchain  InvocationCount=1  LaunchCount=1
RunStrategy=Throughput  TargetCount=300  UnrollFactor=1
WarmupCount=100  AdjustMetrics=True

   Method |      Mean |    StdDev | Scaled | Scaled-StdDev | GcAllocations |
--------- |----------:|----------:|------- |-------------- |-------------- |
 Baseline |  79.27 us |  2.261 us |   1.00 |          0.00 |           0 B |
 SlowerX3 | 233.26 us | 10.288 us |   2.94 |         0.154 |           0 B |
 SlowerX5 | 391.40 us | 18.450 us |   4.93 |          0.28 |           0 B |
 SlowerX7 | 537.59 us | 26.017 us |   6.78 |          0.38 |           0 B |

============= SimplePerfTest =============
// ? CodeJam.Examples.PerfTests.SimplePerfTest, ConsoleApplication1
---- Run 1, total runs (expected): 1 -----
// ? #1.1 02.541s, Informational@Analyser: All competition metrics are ok.
```

P.S. Please note this is the first public beta and there may be glitches here and there. Feel free to file an issue or ask for help in our gitter chat if you catch one:)

bonesoul commented 6 years ago

Any updates on this?

AndreyAkinshin commented 6 years ago

@bonesoul, still working on this. I guess that the first version of performance testing API will be available in March.

ndrwrbgs commented 6 years ago

Has there been progress on this since February? @bonesoul ?

danielloganking commented 5 years ago

@AndreyAkinshin @ig-sinicyn I've not seen anything announcing a performance testing API for BenchmarkDotNet, though you said one would be released (possibly in March 2018). Is there any progress on this, or some place to help make this a reality?

natiki commented 5 years ago

I would also like to use BenchmarkDotNet with NUnit, similar to what NBench provides. As NBench is not keeping up with NUnit progress, I was hoping BenchmarkDotNet would be ;-)

AndreyAkinshin commented 5 years ago

Hey everyone! I'm sorry that this feature takes so much time. Unfortunately, it's not so easy to implement a reliable performance runner that works with different kinds of benchmarks. Such a system should:

- have an extremely low false-positive rate (if we get too many false alarms, the performance tests will become untrustworthy);
- have a low false-negative rate (if we miss most of the performance degradations, the system will become useless);
- execute as few iterations as possible (for macrobenchmarks which take minutes, it doesn't make sense to execute 15-30 iterations each time);
- work with different kinds of distributions (including multimodal distributions with huge variance and extremely high outliers).

Right now I'm working on such a system at JetBrains for our own suite of performance tests. And I have some good news: after a dozen unsuccessful attempts, I finally managed to create such a system (hopefully). Currently, we are testing it internally and fixing different bugs. Once we get a stable, reliable version, I will backport all the features to BenchmarkDotNet. Currently, I don't have any ETA, but I definitely want to finish it this year. Once again, sorry that it takes much more time than I expected. I just don't want to release a performance testing API which works only for a limited set of simple benchmarks. Thanks again for your patience.

natiki commented 5 years ago

@AndreyAkinshin Which JB tool will this surface in?

AndreyAkinshin commented 5 years ago

@natiki Rider.

abelbraaksma commented 4 years ago

Currently, I don't have any ETA, but I definitely want to finish it this year.

@AndreyAkinshin It'd make an awesome Xmas present ;). If there's anything we can do to help, let us know. I have a TeamCity configuration where I've (also) been unsuccessful at creating reliable perf unit tests (I'm now just charting the results: if they go up, it's bad, if they go down, it's good, but that's certainly not suitable as a performance unit test).

I understand this is complex and a lot of work, but if you need help weeding out some bugs, maybe we can set up a feature branch and go from there for a while until it is considered stable enough to include in BDN?

ndrwrbgs commented 4 years ago

Hello folks, just doing a casual ping. {Insert some relaxing joke about Christmas presents} :)

AndreyAkinshin commented 4 years ago

@abelbraaksma @ndrwrbgs sorry for the delay, the performance runner is still in progress. I take important steps forward every month, but it still does not work as well as I would like.

abelbraaksma commented 4 years ago

@AndreyAkinshin, that's good to hear! Is there something we can do to help? Maybe run it against our own test sets?

AndreyAkinshin commented 4 years ago

@abelbraaksma that will be much appreciated as soon as I finish the approach that works fine on my set of test cases.

bonesoul commented 4 years ago

watching this also 👍

natiki commented 4 years ago

@AndreyAkinshin @ig-sinicyn just wondering how this has come along?

I played with https://github.com/rsdn/CodeJam/tree/master/PerfTests%5BWIP%5D but unfortunately I need a .NET Core version. Well, actually something that also targets .NET Standard 2.0, as we are still on .NET Core 2.x.

AndreyAkinshin commented 4 years ago

@natiki currently, I'm working on some mathematical approaches that should help make the future performance runner reliable. You can find some of my recent results in my blog: https://aakinshin.net/posts/harrell-davis-double-mad-outlier-detector/ https://aakinshin.net/posts/nonparametric-effect-size/

ndrwrbgs commented 3 years ago

Last I touched this, I was willing to handle the statistical analysis myself; the blocker was that BenchmarkDotNet wouldn't let me run in a unit test context. It seems from your results that you've already removed that blocker and are holding off shipping it until it has full out-of-the-box polish. Is it possible to expose the "run inside a unit test context" functionality as is, to get more hands into the kitchen, so to speak, for making a reliable statistical analysis of the results?

AndreyAkinshin commented 3 years ago

@ndrwrbgs, sorry, but the reliable statistical analysis is still in progress ("reliable" is the hardest part). Meanwhile, I continue publishing blog posts related to the subject: https://aakinshin.net/posts/weighted-quantiles/ https://aakinshin.net/posts/gumbel-mad/ https://aakinshin.net/posts/kde-bw/ https://aakinshin.net/posts/misleading-histograms/ https://aakinshin.net/posts/qrde-hd/ https://aakinshin.net/posts/lowland-multimodality-detection/

Most of the suggested approaches are already implemented in perfolizer, but it's not enough to provide reliable out-of-the-box performance checks. (This is not about polishing, I still have some research tasks that should be finished first).

ndrwrbgs commented 3 years ago

I think you read my message too quickly and missed the point; you seemed to reply to what I stated was NOT the purpose, so I'll repeat myself.

Last I checked, the library physically could not run from a unit test context. It seems you're polishing the analysis of the OUTPUT of the library, which suggests you have changes locally that would let it run and could unblock many of us; we could handle the outputs ourselves for our unique use cases as mathematicians and statisticians. Could you comment on this?


Lonli-Lokli commented 3 years ago

Is there any working sample for performance tests based on BenchmarkDotNet? E.g. running a sample web app with a predefined set of tests, producing some report with one codebase (e.g. version 1 on netcore 3), with the possibility to run the same set on the same virtual hardware on net5?

AndreyAkinshin commented 3 years ago

@ndrwrbgs sorry for misreading your question, my bad.

While we do not recommend executing benchmarks from a unit test context (because a unit test runner may introduce some performance side effects), you can definitely do that. If you have any problems, please file a separate issue with all the details (the unit test framework name, its version, how you run the tests, etc.).
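
A sketch of what that can look like with NUnit (MyBenchmarks is a hypothetical placeholder benchmark class; whether this is advisable depends on the side effects mentioned above):

```c#
// Sketch: invoking BenchmarkDotNet from an NUnit test and failing the test
// if the run had critical validation errors. MyBenchmarks is hypothetical.
using BenchmarkDotNet.Running;
using NUnit.Framework;

[TestFixture]
public class PerfSmokeTests
{
    [Test]
    public void Benchmarks_run_without_critical_errors()
    {
        var summary = BenchmarkRunner.Run<MyBenchmarks>();
        Assert.IsFalse(summary.HasCriticalValidationErrors, "Benchmark run failed validation");
    }
}
```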

AndreyAkinshin commented 3 years ago

@Lonli-Lokli

  1. The "performance tests" feature is still in progress, there are no workable samples for now.
  2. You can execute the same set of benchmarks against multiple runtimes using Job (see an example in the README and the sketch after this list). Next, you can manually process the results, or programmatically get the raw data from Summary (which is returned by BenchmarkRunner.Run) and process it any way you want.
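
A sketch of item 2, running the same benchmarks on .NET Core 3.1 and .NET 5 (runtime monikers and the benchmark body are examples; adjust to the SDKs installed on your machine):

```c#
// Sketch: one benchmark class executed on two runtimes; the Summary returned
// by BenchmarkRunner.Run exposes the raw reports for custom processing.
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[SimpleJob(RuntimeMoniker.NetCoreApp31)]
[SimpleJob(RuntimeMoniker.Net50)]
public class CrossRuntimeBenchmarks
{
    [Benchmark]
    public string Format() => string.Format("{0:N2}", 1234.5678);
}

public class Program
{
    public static void Main()
    {
        var summary = BenchmarkRunner.Run<CrossRuntimeBenchmarks>();
        // summary.Reports contains the per-benchmark measurements.
    }
}
```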
p-bojkowski commented 3 years ago

Has anyone tried this?

Continuous Benchmark (.NET)

bourquep commented 1 year ago

I recommend reading Measuring performance using BenchmarkDotNet - Part 3 Breaking Builds, which introduces this tool:

https://github.com/NewDayTechnology/benchmarkdotnet.analyser

It documents how to run benchmarks during CI and how to fail the build if perf has degraded beyond a certain threshold.