dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET 5 performance regression (~10%) from .NET Core 3.1 #36907

Open Sergio0694 opened 4 years ago

Sergio0694 commented 4 years ago

Description

I have a .NET Standard 2.0 library implementing a high performance interpreter for the brainf*ck language. At the end of the day it is essentially just a Turing machine, so all the operations being performed are basically accesses to memory, incrementing values, looking up tables, etc. It's all "self-contained", with no external APIs being invoked, IO operations or anything.
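For context, the hot path of an interpreter like this is essentially one tight dispatch loop over the program's operators. The snippet below is purely illustrative (it is not the actual Brainf_ckSharp implementation, just the general shape of such a loop):

```csharp
// Illustrative sketch of a brainf*ck-style dispatch loop (C# top-level program).
// Not the actual Brainf_ckSharp code: just the kind of memory/branch-heavy
// switch-based execution the benchmarks below are exercising.
byte[] memory = new byte[30000];
int pointer = 0;

foreach (char op in "+++>++<-")
{
    switch (op)
    {
        case '>': pointer++; break;
        case '<': pointer--; break;
        case '+': memory[pointer]++; break;
        case '-': memory[pointer]--; break;
        // '[', ']', '.' and ',' omitted for brevity
    }
}
```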

I've run some benchmarks on .NET Core 2.1, .NET Core 3.1 and .NET 5 (Preview 4) and noticed that .NET 5 is consistently slower than .NET Core 3.1 in almost all cases. Especially in the Mandelbrot test case, which is the most intensive, .NET 5 is almost one second slower, so about 10% slower.

I find this very surprising, as .NET 5 is supposed to be much more optimized than .NET Core 3.1.

NOTE: the solution is not open source for now, but I'd be more than happy to give access to the repo to anyone in the team wanting to take a look, just ping me and I'll send you the invite.

Pinging @EgorBo as you seemed interested in this when I shared some older benchmarks about this on Twitter (here) back when .NET 5 preview 1 had just been released, and cc. @tannergooding.

Configuration

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19619
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
  [Host]        : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT

Regression?

This is a regression in speed from .NET Core 3.1 to .NET 5 Preview 4, on the same machine.

Data

|       Runtime |       Name |             Mean |          Error |         StdDev |
|-------------- |----------- |-----------------:|---------------:|---------------:|
| .NET Core 2.1 |   Division |         56.82 us |       0.810 us |       0.758 us |
| .NET Core 3.1 |   Division |         41.06 us |       0.588 us |       0.521 us |
| .NET Core 5.0 |   Division |         42.95 us |       0.511 us |       0.453 us |
| .NET Core 2.1 |  Fibonacci |     21,298.13 us |     398.437 us |     372.698 us |
| .NET Core 3.1 |  Fibonacci |     13,197.50 us |      65.613 us |      58.164 us |
| .NET Core 5.0 |  Fibonacci |     15,302.14 us |      83.934 us |      78.512 us |
| .NET Core 2.1 | HelloWorld |         17.61 us |       0.296 us |       0.341 us |
| .NET Core 3.1 | HelloWorld |         16.09 us |       0.311 us |       0.394 us |
| .NET Core 5.0 | HelloWorld |         15.68 us |       0.305 us |       0.510 us |
| .NET Core 2.1 | Mandelbrot | 10,221,974.39 us | 131,372.509 us | 116,458.307 us |
| .NET Core 3.1 | Mandelbrot |  7,799,742.74 us |  65,822.677 us |  58,350.088 us |
| .NET Core 5.0 | Mandelbrot |  8,544,272.25 us |   8,136.804 us |   7,611.172 us |
| .NET Core 2.1 |   Multiply |      1,204.57 us |      14.566 us |      12.912 us |
| .NET Core 3.1 |   Multiply |        792.31 us |       5.899 us |       4.926 us |
| .NET Core 5.0 |   Multiply |        790.11 us |       1.301 us |       1.087 us |
| .NET Core 2.1 |        Sum |         52.86 us |       0.590 us |       0.493 us |
| .NET Core 3.1 |        Sum |         39.53 us |       0.620 us |       0.580 us |
| .NET Core 5.0 |        Sum |         37.39 us |       0.470 us |       0.417 us |

category:cq theme:jit-block-layout skill-level:expert cost:large

Lxiamail commented 4 years ago

@Sergio0694 thank you for reporting the issue. From your reported results, it looks like Division, Fibonacci, and Mandelbrot may have a potential regression on .NET 5, while the other cases show improvements. We can repro the Fibonacci regression on .NET 5 and will investigate the root cause. For the other cases, we don't have a repro showing regressions on .NET 5. Can you share your microbenchmark tests with me so that we can do further investigation?

Sergio0694 commented 4 years ago

Hey @Lxiamail - thank you for chiming in!

I'm very confused by what you said here:

"We can repo the Fibonacci regression on .NET5 and will investigate the root cause"

What do you mean with "we can repo the Fibonacci regression"? All these tests were done specifically with my interpreter for that language as I mentioned, that "Fibonacci" test was not just the classic Fibonacci method written in C# 🤔

Anyway, I've sent you an invite to be added as a collaborator to my private repo Brainf_ckSharp. To repro the issue, you need to switch to the repro/net5-regression branch. There you'll find the Brainf_ckSharp.Profiler project (in the "Profiling" folder in the solution). It's already configured to run with both .NET Core 3.1 and .NET 5.

Let me know if there's anything else I can help you with, thanks again! 😊

AndyAyersMS commented 4 years ago

"Fibonacci" test was not just the classic Fibonacci method written in C#

We got our Fibonaccis crossed. We have an internal benchmark with that name that looks like it might have regressed.

Let me know if there's anything else I can help you with

Have you done any profiling to try and pin down where things got slower?

You can add me as a collaborator.

Sergio0694 commented 4 years ago

Ah, that explains it, sorry for the confusion! 😄

@AndyAyersMS I've added you as a collaborator as well! As mentioned before, the repro is in the repro/net5-regression branch, in the Brainf_ckSharp.Profiler project (under the "Profiling" folder).

I can say that 99.8% of the CPU time is spent in this function:

Brainf_ckSharp.Brainf_ckInterpreter.Release.Run<TExecutionContext>(
    ref TExecutionContext executionContext,
    ref Brainf_ckOperation opcodes,
    ref int jumpTable,
    ref Range functions,
    ref ushort definitions,
    ref StackFrame stackFrames,
    ref int depth,
    ref int totalOperations,
    ref int totalFunctions,
    ref StdinBuffer stdin,
    StdoutBuffer stdout,
    CancellationToken executionToken)
    where TExecutionContext : struct, IMachineStateExecutionContext
{ }

It's in the main Brainf_ckSharp project in the solution. Hope this helps, let me know if there's anything else I can do on my end!

Thank you again for your time to look into this!

Lxiamail commented 4 years ago

What do you mean with "we can repo the Fibonacci regression"?

Sorry for the typo. I meant repro. :)

AndyAyersMS commented 4 years ago

Here's what I get locally for preview3 vs 3.1.4.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20216.6
  [Host]        : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT

|  Method |           Job |       Runtime |       Name |            Mean |          Error |         StdDev |          Median |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------- |-------------- |-------------- |----------- |----------------:|---------------:|---------------:|----------------:|-------:|-------:|------:|----------:|
| Release | .NET Core 3.1 | .NET Core 3.1 |   Division |        46.72 us |       1.221 us |       3.505 us |        45.69 us | 0.4272 | 0.0610 |     - |   1.98 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |   Division |        47.10 us |       0.932 us |       1.818 us |        46.78 us | 0.4272 | 0.0610 |     - |   1.98 KB |
|         |               |               |            |                 |                |                |                 |        |        |       |           |
| Release | .NET Core 3.1 | .NET Core 3.1 |  Fibonacci |    15,390.92 us |     278.227 us |     320.406 us |    15,340.11 us |      - |      - |     - |   1.88 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |  Fibonacci |    16,290.62 us |     304.622 us |     687.583 us |    16,217.32 us |      - |      - |     - |   1.88 KB |
|         |               |               |            |                 |                |                |                 |        |        |       |           |
| Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 9,198,119.60 us | 178,346.549 us | 505,939.475 us | 9,036,080.90 us |      - |      - |     - |  34.92 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 9,107,098.79 us |  85,771.875 us |  91,774.945 us | 9,107,802.05 us |      - |      - |     - |   56.3 KB |

So Fibonacci seems to have a clear regression; the other two are less definite.

I'll also try this on my coffee lake machine with preview 4.

Sergio0694 commented 4 years ago

It's really interesting that the results seem to be different compared to my notebook with the i7-8750H, and between tests as well. The code is pretty much always the same, the only difference is just the script being executed. Very curious to know what the cause of this regression is! 😊

Also if it helps, I've run the benchmark on my desktop as well:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.264 (2004/?/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
  [Host]        : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT

|  Method |           Job |       Runtime |       Name |             Mean |         Error |        StdDev |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------- |-------------- |-------------- |----------- |-----------------:|--------------:|--------------:|-------:|-------:|------:|----------:|
| Release | .NET Core 3.1 | .NET Core 3.1 |   Division |         58.02 us |      0.425 us |      0.377 us | 0.4272 | 0.0610 |     - |   1.98 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |   Division |         57.44 us |      0.218 us |      0.193 us | 0.4272 | 0.0610 |     - |   1.98 KB |
|         |               |               |            |                  |               |               |        |        |       |           |
| Release | .NET Core 3.1 | .NET Core 3.1 |  Fibonacci |     18,769.40 us |    188.645 us |    176.458 us |      - |      - |     - |   1.88 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |  Fibonacci |     19,512.48 us |     65.521 us |     54.713 us |      - |      - |     - |   1.88 KB |
|         |               |               |            |                  |               |               |        |        |       |           |
| Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 10,193,671.39 us | 56,002.394 us | 49,644.663 us |      - |      - |     - |  34.92 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 10,567,216.99 us | 55,330.619 us | 51,756.296 us |      - |      - |     - |  34.92 KB |

Here it looks like both Fibonacci and Mandelbrot have a regression on .NET 5 🤔

Lxiamail commented 4 years ago

v5.0 Preview4 on my desktop machine:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
  [Host]        : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT

|  Method |           Job |       Runtime |       Name |            Mean |          Error |         StdDev |          Median |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------- |-------------- |-------------- |----------- |----------------:|---------------:|---------------:|----------------:|-------:|-------:|------:|----------:|
| Release | .NET Core 3.1 | .NET Core 3.1 |   Division |        56.76 us |       1.134 us |       2.781 us |        56.35 us | 0.2441 |      - |     - |   1.98 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |   Division |        59.53 us |       1.188 us |       3.067 us |        59.58 us | 0.2441 | 0.0610 |     - |   1.98 KB |
| Release | .NET Core 3.1 | .NET Core 3.1 |  Fibonacci |    18,712.15 us |     361.648 us |     816.299 us |    18,670.31 us |      - |      - |     - |   1.88 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |  Fibonacci |    20,263.92 us |     405.230 us |     799.885 us |    20,200.51 us |      - |      - |     - |   1.88 KB |
| Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 9,068,163.27 us | 179,311.793 us | 389,809.080 us | 8,912,338.50 us |      - |      - |     - |  34.92 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 8,844,831.10 us | 112,542.738 us |  93,978.283 us | 8,821,120.60 us |      - |      - |     - |  34.92 KB |

Lxiamail commented 4 years ago

v5.0 preview4 on my laptop:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
  [Host]        : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT

|  Method |           Job |       Runtime |       Name |             Mean |          Error |           StdDev |           Median |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|-------- |-------------- |-------------- |----------- |-----------------:|---------------:|-----------------:|-----------------:|-------:|-------:|------:|----------:|
| Release | .NET Core 3.1 | .NET Core 3.1 |   Division |         53.78 us |       1.883 us |         5.524 us |         54.77 us | 0.4272 | 0.0610 |     - |   1.98 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |   Division |         54.75 us |       2.043 us |         5.861 us |         54.89 us | 0.4272 | 0.0610 |     - |   1.98 KB |
| Release | .NET Core 3.1 | .NET Core 3.1 |  Fibonacci |     17,393.43 us |     484.839 us |     1,383.273 us |     18,087.78 us |      - |      - |     - |   1.88 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 |  Fibonacci |     19,879.41 us |     659.548 us |     1,783.126 us |     20,369.69 us |      - |      - |     - |   1.88 KB |
| Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot |  9,702,039.52 us | 320,848.461 us |   910,193.682 us |  9,478,173.00 us |      - |      - |     - |  34.92 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 12,128,826.70 us | 454,169.554 us | 1,310,383.007 us | 11,903,840.25 us |      - |      - |     - |  34.92 KB |

AndyAyersMS commented 4 years ago

On a lot of these runs the noise level is fairly high, so getting a clear picture of what is going on may be a challenge. My Kaby Lake seems to have low noise levels, so I'm going to start drilling in there.

Sergio0694 commented 4 years ago

I'm really surprised by how different the regression (or not) is between different systems, across the various test cases. That benchmark for Mandelbrot in the last run that @Lxiamail posted looks particularly gnarly, from 9.7s to 12.1s (+2.5s), that's over 20% 😶

@AndyAyersMS Thanks again for investigating this. On a side note (not an expert, just trying to learn as much as I possibly can), could you elaborate what you mean by "noise level"? Does that just refer to the variance between different runs, or is it a term that has a specific meaning in this context?

AndyAyersMS commented 4 years ago

could you elaborate what you mean by "noise level"? Does that just refer to the variance between different runs

Yes, the variance.

AndyAyersMS commented 4 years ago

As you said, all the time is spent in the Run method.

Comparing disassembly from 3.1 and 5.0 the main difference I see is the code is laid out differently. I haven't tried matching up all the parts yet to see if there are any big discrepancies in path lengths, but the 5.0 code is a bit more compact, so I suspect that analysis will (in general) favor 5.0.

My hunch is that this method's perf is very sensitive to branch alignments. The different inputs cause the branches in the method to be taken in different patterns and with different frequencies, so they can see different impacts; some benchmarks might run faster, others slower. And the impact can also vary from chip to chip as we've seen above.

I'll have to look with vtune or similar to see if this holds up -- basically if I'm right, we'd expect to see similar instructions retired numbers but different total cycles (and hence different IPC, instructions per clock), indicating the perf difference between 3.1 and 5.0 is attributable to micro-architectural stalls.

Sergio0694 commented 4 years ago

That's very interesting, thank you for the update and for sharing all those details, I appreciate it!

I'm curious to know if that idea you had is correct, I wouldn't have imagined this regression could've been caused by such low-level, architecture-specific implementation details 😄

If that's the case (as I understand it, the JIT will usually tend to favor smaller codegen, since that results in better caching and better performance for the majority of methods), do you think this is just an unwanted but unavoidable regression with .NET 5, or could there be a fix to avoid this?

As a possible workaround, assuming the slowdown is indeed caused by branch alignment, do you reckon just shuffling the order of those switch cases could help? I don't mean in code, as the JIT would still generate the same asm, but actually altering the order of the constants representing each possible operation, e.g. moving the less frequently used ones first, in the hope that the others would end up being pushed down a bit in the codegen and maybe get closer to alignment again? 🤔
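To illustrate the idea, something like the following (the constant names here are hypothetical, not the actual Brainf_ckSharp ones):

```csharp
// Hypothetical operator constants: reordering these values reorders the
// switch targets in the generated code, which can shift where each case's
// code lands relative to alignment boundaries. Purely a sketch of the idea.
public static class Operators
{
    public const byte PrintChar = 0; // less frequent operations first...
    public const byte ReadChar  = 1;
    public const byte Increment = 2; // ...more frequent ones later, hopefully
    public const byte Decrement = 3; // pushing their code further down
    public const byte MoveRight = 4;
    public const byte MoveLeft  = 5;
}
```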

Thank you again for your time! 😊

AndyAyersMS commented 4 years ago

I wouldn't have imagined this regression could've been caused by such low level and architecture specific implementation details

These micro-architectural issues can surface whenever code is structured such that overall performance depends critically on just a few branches in the program.

I don't think there is a lot we can act on here for 5.0 so I'm going to mark this as future. When I have time (or if you have time) we should look at the behavior of this code using a low-level profiler like VTune to see if the hypothesis above is correct.

do you reckon just shuffling the order of those switch cases could help?

It might help, but it's somewhat of a random optimization strategy, since we don't know for sure this is the problem.

Sergio0694 commented 4 years ago

Hey Andy, thank you so much for the update! 😊

This does seem like something tricky to investigate (and arguably pretty niche too), so I understand it makes sense to just move the milestone to after 5.0, that's fair! I also plan to open source the whole repo eventually, so that should make things simpler in case anyone else wants to peek at the code and try things out too! E.g. I know Egor Bo was interested in this too, but he was understandably not enthusiastic about the issue not having a publicly available repro to just try out without hassle.

Regarding VTune, I'm afraid I won't be of much help in that area as I've never used that before 😅 I mean I can definitely give it a try, but given that I'm not familiar with it at all I'd imagine whatever I could figure out from there would easily be done by you in a fraction of the time, and with much more accuracy, given your expertise in this area.

Really looking forward to seeing how the investigation into this issue goes though, and I'm very curious to know whether at the end of the day it'll be possible to somehow tweak this in the codegen, or manually from the code somehow. Thanks again for your time!

Sergio0694 commented 4 years ago

A small update - I just discovered that BenchmarkDotNet supports hardware counters in custom benchmarks, when adding the extra BenchmarkDotNet.Diagnostics.Windows package. I gave it a try on the two Fibonacci and Mandelbrot tests, and got this:

|       Runtime |       Name |        Mean |     Error |    StdDev |
|-------------- |----------- |------------:|----------:|----------:|
| .NET Core 3.1 |  Fibonacci |    12.30 ms |  0.233 ms |  0.218 ms |
| .NET Core 5.0 |  Fibonacci |    14.89 ms |  0.150 ms |  0.133 ms |
| .NET Core 3.1 | Mandelbrot | 7,541.28 ms |  9.967 ms |  8.836 ms |
| .NET Core 5.0 | Mandelbrot | 8,004.61 ms | 18.301 ms | 15.283 ms |

|       Runtime | BranchInstructions/Op | CacheMisses/Op | BranchMispredictions/Op |
|-------------- |----------------------:|---------------:|------------------------:|
| .NET Core 3.1 |            12,856,994 |          3,443 |                  13,125 |
| .NET Core 5.0 |            12,919,642 |          3,549 |                  13,160 |
| .NET Core 3.1 |         3,006,009,403 |        732,341 |              69,946,334 |
| .NET Core 5.0 |         5,613,692,830 |      1,249,280 |              54,387,663 |

In particular I'm very confused to see the .NET 5 run for the Mandelbrot test case showing almost twice the branch instructions (even though it's exactly the same code..?), and most importantly, almost twice the cache misses..? 🤔

Not sure exactly what to make of this, but thought it might be interesting to share in case it helps!
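For reference, enabling these counters just takes an attribute on the benchmark class once the BenchmarkDotNet.Diagnostics.Windows package is installed (requires running from an elevated prompt; the class and method names below are illustrative, not the actual Brainf_ckSharp.Profiler types):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

// Collects ETW hardware counters per benchmark; counter availability
// depends on the CPU. Names here are illustrative placeholders.
[HardwareCounters(
    HardwareCounter.BranchInstructions,
    HardwareCounter.BranchMispredictions,
    HardwareCounter.CacheMisses)]
public class InterpreterBenchmarks
{
    [Benchmark]
    public void Fibonacci()
    {
        // run the interpreter on the Fibonacci script here
    }
}
```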

NOTE (for those with repo access): you can see the results here are better than the original benchmarks I posted, this is because I made some further optimizations to the interpreter in the meantime (compared to the repro/net5-regression branch shared earlier). This bench was run from master, specifically from commit b0f8b6b8969235144fc8ce4d77539252f0188ee8.

AndyAyersMS commented 4 years ago

Comparing disassembly from 3.1 and 5.0 the main difference I see is the code is laid out differently.

This is likely the explanation for the increased number of branches executed. Note branch mispredictions are down substantially. So it might be from extra unconditional branches.

Instructions retired is probably the most interesting initial stat; I think BDN lets you measure that too.

Sergio0694 commented 4 years ago

This is likely the explanation for the increased number of branches executed. Note branch mispredictions are down substantially. So it might be from extra unconditional branches.

Oh that makes sense, I thought that counter only referred to conditional branches, so I was confused as I didn't get why .NET 5 should've had more of them, thanks! Could this hypothesis also explain the increased number of cache misses on .NET 5? I'm thinking, if it does have more jumps, the code could generate more cache misses in the instruction cache?

I've re-run the benchmarks adding the "instructions retired" counter as you asked, and... I'm more confused than I was before 😶

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.20161
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]        : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT

|  Method |           Job |       Runtime |       Name |        Mean |      Error |     StdDev | InstructionRetired/Op |
|-------- |-------------- |-------------- |----------- |------------:|-----------:|-----------:|----------------------:|
| Release | .NET Core 3.1 | .NET Core 3.1 |  Fibonacci |    12.49 ms |   0.223 ms |   0.209 ms |               170,287 |
| Release | .NET Core 5.0 | .NET Core 5.0 |  Fibonacci |    14.74 ms |   0.035 ms |   0.030 ms |                 2,108 |
| Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 7,691.65 ms | 152.269 ms | 192.572 ms |           876,819,115 |
| Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 8,057.37 ms |  14.214 ms |  11.097 ms |            57,590,810 |

I'm seeing the .NET 5 runs report less than a tenth of the instructions retired per op, I'm really not sure what to make of this 🤔

Again this is the first time I try out this sort of performance investigation, so if you have time feel free to share any thoughts or comments you might have, I'm really interested in hearing your take on all this and learning more about the topic 😊


EDIT: I've also run the same benchmarks again using the "monitoring" strategy, just to gather more info.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.20161
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT
  Job-SKEYKV : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  Job-ONQUXV : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT

RunStrategy=Monitoring

|  Method |        Job |       Runtime |       Name |        Mean |     Error |    StdDev | InstructionRetired/Op |
|-------- |----------- |-------------- |----------- |------------:|----------:|----------:|----------------------:|
| Release | Job-SKEYKV | .NET Core 3.1 |  Fibonacci |    12.86 ms |  1.636 ms |  1.082 ms |             9,108,139 |
| Release | Job-ONQUXV | .NET Core 5.0 |  Fibonacci |    16.92 ms |  0.456 ms |  0.302 ms |               106,496 |
| Release | Job-SKEYKV | .NET Core 3.1 | Mandelbrot | 7,089.00 ms | 34.911 ms | 23.092 ms |         1,479,251,870 |
| Release | Job-ONQUXV | .NET Core 5.0 | Mandelbrot | 7,159.35 ms | 50.153 ms | 33.173 ms |           775,731,493 |

I'm not sure exactly how to interpret these results, but I'm sure they might make more sense to you. Hope this helps!
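For reference, the "monitoring" strategy used in that second run is enabled per job, e.g. like this (class and method names are illustrative):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

// RunStrategy.Monitoring runs a fixed number of full iterations without the
// usual pilot/overhead stages, which is useful when collecting hardware
// counters over long-running benchmarks. Names here are placeholders.
[SimpleJob(RunStrategy.Monitoring)]
public class MonitoredBenchmarks
{
    [Benchmark]
    public void Mandelbrot()
    {
        // run the interpreter on the Mandelbrot script here
    }
}
```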

Sergio0694 commented 4 years ago

Wanted to provide an update for this. I've tested this again, this time with .NET 5 RC1, and unfortunately it seems the regression is still there, and actually worse than before: in my Mandelbrot test case I'm now seeing a 25% performance delta between .NET Core 3.1 and .NET 5 😥

You can see the Mandelbrot test case goes from about 9.1s to over 11.7s when switching from .NET Core 3.1 to .NET 5 RC1. All the other benchmarks got worse as well, though with less of a dramatic difference. That might also be because they're much shorter in general, so the difference is less apparent, not sure.


BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.508 (2004/?/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
  Job-ADEFHG : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
  Job-EVTURJ : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
  Job-NTOTVU : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
  Job-BWYVBU : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT

|        Job |       Runtime | RunStrategy | UnrollFactor |       Name |             Mean |         Error |        StdDev |
|----------- |-------------- |------------ |-------------:|----------- |-----------------:|--------------:|--------------:|
| Job-ADEFHG | .NET Core 3.1 |  Throughput |           16 |   Division |         47.22 μs |      0.406 μs |      0.380 μs |
| Job-EVTURJ | .NET Core 5.0 |  Throughput |           16 |   Division |         52.66 μs |      0.107 μs |      0.100 μs |
| Job-ADEFHG | .NET Core 3.1 |  Throughput |           16 |  Fibonacci |     17,569.52 μs |     98.708 μs |     92.331 μs |
| Job-EVTURJ | .NET Core 5.0 |  Throughput |           16 |  Fibonacci |     18,236.68 μs |     15.414 μs |     14.419 μs |
| Job-ADEFHG | .NET Core 3.1 |  Throughput |           16 | HelloWorld |         15.43 μs |      0.026 μs |      0.023 μs |
| Job-EVTURJ | .NET Core 5.0 |  Throughput |           16 | HelloWorld |         15.78 μs |      0.012 μs |      0.011 μs |
| Job-NTOTVU | .NET Core 3.1 |  Monitoring |            1 | Mandelbrot |  9,131,380.76 μs | 38,620.570 μs | 25,545.116 μs |
| Job-BWYVBU | .NET Core 5.0 |  Monitoring |            1 | Mandelbrot | 11,767,363.84 μs | 83,055.091 μs | 54,935.800 μs |
| Job-ADEFHG | .NET Core 3.1 |  Throughput |           16 |   Multiply |        818.13 μs |      4.411 μs |      3.910 μs |
| Job-EVTURJ | .NET Core 5.0 |  Throughput |           16 |   Multiply |        902.35 μs |      1.995 μs |      1.866 μs |
| Job-ADEFHG | .NET Core 3.1 |  Throughput |           16 |        Sum |         43.45 μs |      0.225 μs |      0.210 μs |
| Job-EVTURJ | .NET Core 5.0 |  Throughput |           16 |        Sum |         45.11 μs |      0.049 μs |      0.046 μs |

Sergio0694 commented 4 years ago

Small update - I made the repo open source, it's at https://github.com/Sergio0694/Brainf_ckSharp, so anyone can just clone it and run the benchmarks if interested. Those are in the Brainf_ckSharp.Profiler project, and I recommend commenting out the Debug method in the Brainf_ckBenchmarkBase class (here), as that benchmark takes a long time and it's not the one used to report the performance regression in this issue. All the previous benchmarks shown here were just for the Release method in that class.

EDIT: in case it's useful, I prepared a gist here with the full disassembly of the Run method (the hot path for the interpreter with the regression) both with .NET Core 3.1 (done through BenchmarkDotNet with [DisassemblyDiagnoser]) and with .NET 5 (done with disasmo with a local checked build of .NET 5 from the dotnet/runtime release/5.0 branch). Hope this helps! 😄
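For anyone wanting to reproduce the .NET Core 3.1 disassembly through BenchmarkDotNet, that just takes the diagnoser attribute on the benchmark class (class and method names below are illustrative):

```csharp
using BenchmarkDotNet.Attributes;

// Dumps the JIT-generated assembly for each benchmark into the
// BenchmarkDotNet artifacts folder. Names here are placeholders.
[DisassemblyDiagnoser]
public class DisassemblyBenchmarks
{
    [Benchmark]
    public void Release()
    {
        // interpreter hot path (the Run method) would be invoked here
    }
}
```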

AndyAyersMS commented 4 years ago

It's possible the work @kunalspathak is doing might help by aligning some of those internal branches.

Sergio0694 commented 3 years ago

As mentioned to @AndyAyersMS on Discord, leaving here a repro branch and some repro steps in case it helps 🙂

adnan-kamili commented 3 years ago

After upgrading to ASP.NET 5.0 from 3.1, our Web API response time almost doubled, from ~40ms to >80ms. The only change we made was upgrading the packages. I am surprised that, while the internet is flooded with posts about the improved performance of .NET 5.0, we saw degraded performance.


danmoseley commented 3 years ago

@adnan-kamili if I'm not mistaken your issue is not connected with this one. Could you please open an issue in dotnet/aspnetcore where they can best help you? You should not see regressed perf and we would like to figure out why.

cjlotz commented 3 years ago

@adnan-kamili did you open an issue in dotnet/aspnetcore? If so, can you please provide the link so that we can follow the details? Thanks.

adnan-kamili commented 3 years ago

@cjlotz Our application is very big, I am not sure if it would help as we can't share any sample code to replicate the issue.

cjlotz commented 3 years ago

@adnan-kamili I'm more interested to know whether you managed to work around the performance regression and what you did?

danmoseley commented 3 years ago

@adnan-kamili an issue over there could still be helpful even if you can't share repro code; e.g., they can ask about your configuration, what changed, what you're seeing, etc. Others may notice the same problem, and a picture may emerge. Either way, it's probably not connected with the codegen issue here (although nothing's impossible 🙂).

In short though -- we don't want 5.0 to be slower for you, we want it to be faster. So we do want to hear about it, and they're the right people to help you.

adnan-kamili commented 3 years ago

We didn't do anything to work around the performance regression. Just waiting for .NET 6.0, maybe that will improve the performance.

I have opened an issue for the same:

https://github.com/dotnet/aspnetcore/issues/29866

LeszekKalibrate commented 1 year ago

Is this fixed yet?