Sergio0694 opened this issue 4 years ago
@Sergio0694 thank you for reporting the issue. From your reported results, it looks like Division, Fibonacci, and Mandelbrot may have potential regressions on .NET 5, while the other cases show improvements. We can repro the Fibonacci regression on .NET 5 and will investigate the root cause. For the other cases, we don't have a repro showing regressions on .NET 5. Can you share your microbenchmark tests with me so that we can do further investigation?
Hey @Lxiamail - thank you for chiming in!
I'm very confused by what you said here:
"We can repo the Fibonacci regression on .NET5 and will investigate the root cause"
What do you mean by "we can repo the Fibonacci regression"? As I mentioned, all these tests were done specifically with my interpreter for that language; that "Fibonacci" test was not just the classic Fibonacci method written in C# 🤔
Anyway, I've sent you an invite to be added as a collaborator to my private repo Brainf_ckSharp.
To repro the issue, you need to switch to the repro/net5-regression branch.
There you'll find the Brainf_ckSharp.Profiler project (in the "Profiling" folder in the solution).
It's already configured to run with both .NET Core 3.1 and .NET 5.
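In case it's useful as a quick reference, the multi-runtime setup there is along these lines (a simplified sketch; the class and method names here are illustrative rather than the exact profiler code, and the project has to multi-target netcoreapp3.1 and net5.0):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Each [Benchmark] method is measured once per configured runtime,
// so the 3.1 vs 5.0 numbers come from the same benchmark assembly.
[SimpleJob(RuntimeMoniker.NetCoreApp31)]
[SimpleJob(RuntimeMoniker.NetCoreApp50)]
[MemoryDiagnoser]
public class InterpreterBenchmarks
{
    [Benchmark]
    public void Release()
    {
        // Run the interpreter on the script under test (Division, Fibonacci, Mandelbrot, ...).
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<InterpreterBenchmarks>();
}
```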
Let me know if there's anything else I can help you with, thanks again! 😊
"Fibonacci" test was not just the classic Fibonacci method written in C#
We got our Fibonaccis crossed. We have an internal benchmark with that name that looks like it might have regressed.
Let me know if there's anything else I can help you with
Have you done any profiling to try and pin down where things got slower?
You can add me as a collaborator.
Ah, that explains it, sorry for the confusion! 😄
@AndyAyersMS I've added you as a collaborator as well!
As mentioned before, the repro is in the repro/net5-regression branch, in the Brainf_ckSharp.Profiler project (under the "Profiling" folder).
I can say that 99.8% of the CPU time is spent in this function:
Brainf_ckSharp.Brainf_ckInterpreter.Release.Run<TExecutionContext>(
ref TExecutionContext executionContext,
ref Brainf_ckOperation opcodes,
ref int jumpTable,
ref Range functions,
ref ushort definitions,
ref StackFrame stackFrames,
ref int depth,
ref int totalOperations,
ref int totalFunctions,
ref StdinBuffer stdin,
StdoutBuffer stdout,
CancellationToken executionToken)
where TExecutionContext : struct, IMachineStateExecutionContext
{ }
It's in the main Brainf_ckSharp project in the solution.
Hope this helps, let me know if there's anything else I can do on my end!
Thank you again for taking the time to look into this!
What do you mean by "we can repo the Fibonacci regression"?
Sorry for the typo. I meant repro. :)
Here's what I get locally for preview3 vs 3.1.4.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.3.20216.6
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
.NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
.NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.21406, CoreFX 5.0.20.21406), X64 RyuJIT
| Method | Job | Runtime | Name | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------- |-------------- |-------------- |----------- |----------------:|---------------:|---------------:|----------------:|-------:|-------:|------:|----------:|
| Release | .NET Core 3.1 | .NET Core 3.1 | Division | 46.72 us | 1.221 us | 3.505 us | 45.69 us | 0.4272 | 0.0610 | - | 1.98 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Division | 47.10 us | 0.932 us | 1.818 us | 46.78 us | 0.4272 | 0.0610 | - | 1.98 KB |
| | | | | | | | | | | | |
| Release | .NET Core 3.1 | .NET Core 3.1 | Fibonacci | 15,390.92 us | 278.227 us | 320.406 us | 15,340.11 us | - | - | - | 1.88 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Fibonacci | 16,290.62 us | 304.622 us | 687.583 us | 16,217.32 us | - | - | - | 1.88 KB |
| | | | | | | | | | | | |
| Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 9,198,119.60 us | 178,346.549 us | 505,939.475 us | 9,036,080.90 us | - | - | - | 34.92 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 9,107,098.79 us | 85,771.875 us | 91,774.945 us | 9,107,802.05 us | - | - | - | 56.3 KB |
So Fibonacci seems to have a clear regression; the other two are less definite.
I'll also try this on my Coffee Lake machine with Preview 4.
It's really interesting that the results seem to be different compared to my notebook with the i7-8750H, and between tests as well. The code is pretty much always the same; the only difference is the script being executed. Very curious to know what the cause of this regression is! 😊
Also if it helps, I've run the benchmark on my desktop as well:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.264 (2004/?/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
.NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
.NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
| Method | Job | Runtime | Name | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------- |-------------- |-------------- |----------- |-----------------:|--------------:|--------------:|-------:|-------:|------:|----------:|
| Release | .NET Core 3.1 | .NET Core 3.1 | Division | 58.02 us | 0.425 us | 0.377 us | 0.4272 | 0.0610 | - | 1.98 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Division | 57.44 us | 0.218 us | 0.193 us | 0.4272 | 0.0610 | - | 1.98 KB |
| | | | | | | | | | | |
| Release | .NET Core 3.1 | .NET Core 3.1 | Fibonacci | 18,769.40 us | 188.645 us | 176.458 us | - | - | - | 1.88 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Fibonacci | 19,512.48 us | 65.521 us | 54.713 us | - | - | - | 1.88 KB |
| | | | | | | | | | | |
| Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 10,193,671.39 us | 56,002.394 us | 49,644.663 us | - | - | - | 34.92 KB |
| Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 10,567,216.99 us | 55,330.619 us | 51,756.296 us | - | - | - | 34.92 KB |
Here it looks like both Fibonacci and Mandelbrot have a regression on .NET 5 🤔
v5.0 Preview4 on my desktop machine:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
.NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
.NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
Method | Job | Runtime | Name | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
Release | .NET Core 3.1 | .NET Core 3.1 | Division | 56.76 us | 1.134 us | 2.781 us | 56.35 us | 0.2441 | - | - | 1.98 KB |
Release | .NET Core 5.0 | .NET Core 5.0 | Division | 59.53 us | 1.188 us | 3.067 us | 59.58 us | 0.2441 | 0.0610 | - | 1.98 KB |
Release | .NET Core 3.1 | .NET Core 3.1 | Fibonacci | 18,712.15 us | 361.648 us | 816.299 us | 18,670.31 us | - | - | - | 1.88 KB |
Release | .NET Core 5.0 | .NET Core 5.0 | Fibonacci | 20,263.92 us | 405.230 us | 799.885 us | 20,200.51 us | - | - | - | 1.88 KB |
Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 9,068,163.27 us | 179,311.793 us | 389,809.080 us | 8,912,338.50 us | - | - | - | 34.92 KB |
Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 8,844,831.10 us | 112,542.738 us | 93,978.283 us | 8,821,120.60 us | - | - | - | 34.92 KB |
v5.0 preview4 on my laptop:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.4.20258.7
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
.NET Core 3.1 : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
.NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.25106, CoreFX 5.0.20.25106), X64 RyuJIT
Method | Job | Runtime | Name | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
Release | .NET Core 3.1 | .NET Core 3.1 | Division | 53.78 us | 1.883 us | 5.524 us | 54.77 us | 0.4272 | 0.0610 | - | 1.98 KB |
Release | .NET Core 5.0 | .NET Core 5.0 | Division | 54.75 us | 2.043 us | 5.861 us | 54.89 us | 0.4272 | 0.0610 | - | 1.98 KB |
Release | .NET Core 3.1 | .NET Core 3.1 | Fibonacci | 17,393.43 us | 484.839 us | 1,383.273 us | 18,087.78 us | - | - | - | 1.88 KB |
Release | .NET Core 5.0 | .NET Core 5.0 | Fibonacci | 19,879.41 us | 659.548 us | 1,783.126 us | 20,369.69 us | - | - | - | 1.88 KB |
Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 9,702,039.52 us | 320,848.461 us | 910,193.682 us | 9,478,173.00 us | - | - | - | 34.92 KB |
Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 12,128,826.70 us | 454,169.554 us | 1,310,383.007 us | 11,903,840.25 us | - | - | - | 34.92 KB |
On a lot of these runs the noise level is fairly high, so getting a clear picture of what is going on may be a challenge. My Kaby Lake machine seems to have low noise levels, so I'm going to start drilling in there.
I'm really surprised by how different the regression (or not) is between different systems, across the various test cases. That benchmark for Mandelbrot in the last run that @Lxiamail posted looks particularly gnarly, from 9.7s to 12.1s (about +2.4s) - that's over 20% 😶
@AndyAyersMS Thanks again for investigating this. On a side note (not an expert, just trying to learn as much as I possibly can), could you elaborate what you mean by "noise level"? Does that just refer to the variance between different runs, or is it a term that has a specific meaning in this context?
could you elaborate what you mean by "noise level"? Does that just refer to the variance between different runs
Yes, the variance.
As you said, all the time is spent in the Run method.
Comparing disassembly from 3.1 and 5.0, the main difference I see is that the code is laid out differently. I haven't tried matching up all the parts yet to see if there are any big discrepancies in path lengths, but the 5.0 code is a bit more compact, so I suspect that analysis will (in general) favor 5.0.
My hunch is that this method's perf is very sensitive to branch alignments. The different inputs cause the branches in the method to be taken in different patterns and with different frequencies, so they can see different impacts; some benchmarks might run faster, others slower. And the impact can also vary from chip to chip as we've seen above.
I'll have to look with VTune or similar to see if this holds up -- basically, if I'm right, we'd expect to see similar instructions-retired numbers but different total cycles (and hence different IPC, instructions per clock), indicating the perf difference between 3.1 and 5.0 is attributable to micro-architectural stalls.
That's very interesting, thank you for the update and for sharing all those details, I appreciate it!
I'm curious to know if that idea you had is correct; I wouldn't have imagined this regression could've been caused by such low-level and architecture-specific implementation details 😄
If that's the case (to my understanding the JIT will usually tend to favor smaller codegen, as that results in better caching and better performance for the majority of methods), do you think this is just an unwanted but unavoidable regression with .NET 5, or could there be a fix to avoid this?
As a possible workaround, assuming the slowdown is indeed caused by branch alignment, do you reckon just shuffling the order of those switch cases could help? I don't mean in code, as the JIT would still generate the same asm, but actually altering the order of the constants representing each possible operation, e.g. moving the less frequently used ones first, in the hope that the others would end up being pushed down a bit in the codegen and maybe get closer to alignment again? 🤔
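Just to clarify what I mean, here's a rough sketch of that kind of reordering (the operator names and values are made up for illustration and don't match the actual Brainf_ckSharp constants):

```csharp
// Illustrative only: the idea is to assign the rarely-executed operators the lowest
// values so their switch arms tend to be emitted first, nudging the hot arms further
// down in the generated code and hopefully onto better-aligned addresses.
public static class Operators
{
    // Cold operations first
    public const byte FunctionStart = 0;
    public const byte FunctionEnd = 1;
    public const byte PrintChar = 2;
    public const byte ReadChar = 3;

    // Hot operations (pointer moves, cell increments, loops) last
    public const byte Plus = 4;
    public const byte Minus = 5;
    public const byte MoveRight = 6;
    public const byte MoveLeft = 7;
    public const byte LoopStart = 8;
    public const byte LoopEnd = 9;
}
```

Whether this would actually shift the hot blocks depends on how RyuJIT lays out the switch, so it would need to be verified against the disassembly.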
Thank you again for your time! 😊
I wouldn't have imagined this regression could've been caused by such low-level and architecture-specific implementation details
These micro-architectural issues can surface whenever code is structured such that overall performance depends critically on just a few branches in the program.
I don't think there is a lot we can act on here for 5.0 so I'm going to mark this as future. When I have time (or if you have time) we should look at the behavior of this code using a low-level profiler like VTune to see if the hypothesis above is correct.
do you reckon just shuffling the order of those switch cases could help?
It might help, but it's somewhat of a random optimization strategy, since we don't know for sure this is the problem.
Hey Andy, thank you so much for the update! 😊
This does seem like something tricky to investigate (and arguably pretty niche too), so I understand it makes sense to just move the milestone to after 5.0, that's fair! I also plan to eventually open source the whole repo, so that should also make things simpler in case anyone else wants to peek at the code and try things out too. E.g. I know Egor Bo was interested in this too, but he was understandably not enthusiastic about the issue not having a publicly available repro to just try out without hassle.
Regarding VTune, I'm afraid I won't be of much help in that area as I've never used that before 😅 I mean I can definitely give it a try, but given that I'm not familiar with it at all I'd imagine whatever I could figure out from there would easily be done by you in a fraction of the time, and with much more accuracy, given your expertise in this area.
Really looking forward to seeing how the investigation into this issue goes though, and I'm very curious to know whether at the end of the day it'll be possible to somehow tweak this in the codegen, or manually from the code somehow. Thanks again for your time!
A small update - I just discovered that BenchmarkDotNet supports hardware counters in custom benchmarks when adding the extra BenchmarkDotNet.Diagnostics.Windows package. I gave it a try on the Fibonacci and Mandelbrot tests, and got this:
Runtime | Name | Mean | Error | StdDev |
---|---|---|---|---|
.NET Core 3.1 | Fibonacci | 12.30 ms | 0.233 ms | 0.218 ms |
.NET Core 5.0 | Fibonacci | 14.89 ms | 0.150 ms | 0.133 ms |
.NET Core 3.1 | Mandelbrot | 7,541.28 ms | 9.967 ms | 8.836 ms |
.NET Core 5.0 | Mandelbrot | 8,004.61 ms | 18.301 ms | 15.283 ms |
Runtime | Name | BranchInstructions/Op | CacheMisses/Op | BranchMispredictions/Op |
---|---|---|---|---|
.NET Core 3.1 | Fibonacci | 12,856,994 | 3,443 | 13,125 |
.NET Core 5.0 | Fibonacci | 12,919,642 | 3,549 | 13,160 |
.NET Core 3.1 | Mandelbrot | 3,006,009,403 | 732,341 | 69,946,334 |
.NET Core 5.0 | Mandelbrot | 5,613,692,830 | 1,249,280 | 54,387,663 |
In particular I'm very confused to see the .NET 5 run for the Mandelbrot test case showing almost twice the branch instructions (even though it's exactly the same code..?), and most importantly, almost twice the cache misses..? 🤔
Not sure exactly what to make of this, but thought it might be interesting to share in case it helps!
NOTE (for those with repo access): you can see the results here are better than the original benchmarks I posted; this is because I made some further optimizations to the interpreter in the meantime (compared to the repro/net5-regression branch shared earlier). This bench was run from master, specifically from commit b0f8b6b8969235144fc8ce4d77539252f0188ee8.
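For reference, enabling those counters only needed something along these lines (a simplified sketch; the class name and benchmark bodies are illustrative, not the actual profiler code):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

// The counter names come from BenchmarkDotNet's HardwareCounter enum and require the
// BenchmarkDotNet.Diagnostics.Windows package; collection goes through ETW, so the
// benchmarks typically need to run from an elevated prompt.
[HardwareCounters(
    HardwareCounter.BranchInstructions,
    HardwareCounter.CacheMisses,
    HardwareCounter.BranchMispredictions)]
public class InterpreterBenchmarks
{
    [Benchmark]
    public void Fibonacci() { /* run the Fibonacci script */ }

    [Benchmark]
    public void Mandelbrot() { /* run the Mandelbrot script */ }
}
```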
Comparing disassembly from 3.1 and 5.0, the main difference I see is that the code is laid out differently.
This is likely the explanation for the increased number of branches executed. Note branch mispredictions are down substantially. So it might be from extra unconditional branches.
Instructions retired is probably the most interesting initial stat; I think BDN lets you measure that too.
This is likely the explanation for the increased number of branches executed. Note branch mispredictions are down substantially. So it might be from extra unconditional branches.
Oh that makes sense, I thought that counter only referred to conditional branches, so I was confused as I didn't get why .NET 5 should've had more of them, thanks! Could this hypothesis also explain that increased number of cache misses on .NET 5? I'm thinking, if it does have more jumps, the code could generate more cache misses in the instruction cache?
I've re-run the benchmarks adding the "instructions retired" counter as you asked, and... I'm more confused than I was before 😶
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.20161
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT
.NET Core 3.1 : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
.NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT
Method | Job | Runtime | Name | Mean | Error | StdDev | InstructionRetired/Op |
---|---|---|---|---|---|---|---|
Release | .NET Core 3.1 | .NET Core 3.1 | Fibonacci | 12.49 ms | 0.223 ms | 0.209 ms | 170,287 |
Release | .NET Core 5.0 | .NET Core 5.0 | Fibonacci | 14.74 ms | 0.035 ms | 0.030 ms | 2,108 |
Release | .NET Core 3.1 | .NET Core 3.1 | Mandelbrot | 7,691.65 ms | 152.269 ms | 192.572 ms | 876,819,115 |
Release | .NET Core 5.0 | .NET Core 5.0 | Mandelbrot | 8,057.37 ms | 14.214 ms | 11.097 ms | 57,590,810 |
I'm seeing the .NET 5 runs report a tenth or less of the instructions retired per op compared to .NET Core 3.1; I'm really not sure what to make of this 🤔
Again, this is the first time I've tried out this sort of performance investigation, so if you have time feel free to share any thoughts or comments you might have; I'm really interested in hearing your take on all this and learning more about the topic 😊
EDIT: I've also run the same benchmarks again using the "monitoring" strategy, just to gather more info.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.20161
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT
Job-SKEYKV : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
Job-ONQUXV : .NET Core 5.0.0 (CoreCLR 5.0.20.27801, CoreFX 5.0.20.27801), X64 RyuJIT
RunStrategy=Monitoring
Method | Job | Runtime | Name | Mean | Error | StdDev | InstructionRetired/Op |
---|---|---|---|---|---|---|---|
Release | Job-SKEYKV | .NET Core 3.1 | Fibonacci | 12.86 ms | 1.636 ms | 1.082 ms | 9,108,139 |
Release | Job-ONQUXV | .NET Core 5.0 | Fibonacci | 16.92 ms | 0.456 ms | 0.302 ms | 106,496 |
Release | Job-SKEYKV | .NET Core 3.1 | Mandelbrot | 7,089.00 ms | 34.911 ms | 23.092 ms | 1,479,251,870 |
Release | Job-ONQUXV | .NET Core 5.0 | Mandelbrot | 7,159.35 ms | 50.153 ms | 33.173 ms | 775,731,493 |
A few points:
I mean, I'm not sure how to interpret these results but I'm sure they might make more sense to you, hope this helps!
Wanted to provide an update for this. I've tested again, this time with .NET 5 RC1, and unfortunately it seems the regression is still there and is actually worse than before - in my Mandelbrot test case I'm now seeing a 25% performance delta between .NET Core 3.1 and .NET 5 😥
You can see the Mandelbrot test case goes from about 9.1s to over 11.7s when switching from .NET Core 3.1 to .NET 5 RC1. All the other benchmarks got worse as well, though with less of a dramatic difference. That might also be because they're much shorter in general, so the difference is less apparent, not sure.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.508 (2004/?/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-rc.1.20452.10
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
Job-ADEFHG : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
Job-EVTURJ : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
Job-NTOTVU : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
Job-BWYVBU : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
Job | Runtime | RunStrategy | UnrollFactor | Name | Mean | Error | StdDev |
---|---|---|---|---|---|---|---|
Job-ADEFHG | .NET Core 3.1 | Throughput | 16 | Division | 47.22 μs | 0.406 μs | 0.380 μs |
Job-EVTURJ | .NET Core 5.0 | Throughput | 16 | Division | 52.66 μs | 0.107 μs | 0.100 μs |
Job-ADEFHG | .NET Core 3.1 | Throughput | 16 | Fibonacci | 17,569.52 μs | 98.708 μs | 92.331 μs |
Job-EVTURJ | .NET Core 5.0 | Throughput | 16 | Fibonacci | 18,236.68 μs | 15.414 μs | 14.419 μs |
Job-ADEFHG | .NET Core 3.1 | Throughput | 16 | HelloWorld | 15.43 μs | 0.026 μs | 0.023 μs |
Job-EVTURJ | .NET Core 5.0 | Throughput | 16 | HelloWorld | 15.78 μs | 0.012 μs | 0.011 μs |
Job-NTOTVU | .NET Core 3.1 | Monitoring | 1 | Mandelbrot | 9,131,380.76 μs | 38,620.570 μs | 25,545.116 μs |
Job-BWYVBU | .NET Core 5.0 | Monitoring | 1 | Mandelbrot | 11,767,363.84 μs | 83,055.091 μs | 54,935.800 μs |
Job-ADEFHG | .NET Core 3.1 | Throughput | 16 | Multiply | 818.13 μs | 4.411 μs | 3.910 μs |
Job-EVTURJ | .NET Core 5.0 | Throughput | 16 | Multiply | 902.35 μs | 1.995 μs | 1.866 μs |
Job-ADEFHG | .NET Core 3.1 | Throughput | 16 | Sum | 43.45 μs | 0.225 μs | 0.210 μs |
Job-EVTURJ | .NET Core 5.0 | Throughput | 16 | Sum | 45.11 μs | 0.049 μs | 0.046 μs |
Small update - I made the repo open source, it's at https://github.com/Sergio0694/Brainf_ckSharp, so anyone can just clone it and run the benchmarks if interested. Those are in the Brainf_ckSharp.Profiler project, and I recommend commenting out the Debug method in the Brainf_ckBenchmarkBase class (here), as that benchmark takes a long time and it's not the one used to report the performance regression in this issue. All the previous benchmarks shown here were just for the Release method in that class.
EDIT: in case it's useful, I prepared a gist here with the full disassembly of the Run method (the hot path for the interpreter with the regression), both with .NET Core 3.1 (done through BenchmarkDotNet with [DisassemblyDiagnoser]) and with .NET 5 (done with Disasmo and a local checked build of .NET 5 from the dotnet/runtime release/5.0 branch). Hope this helps! 😄
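For anyone who wants to reproduce the BenchmarkDotNet side of that comparison, the setup looks roughly like this (a simplified sketch with illustrative names; the .NET 5 disassembly in the gist came from Disasmo instead):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

// Exports the JIT disassembly of the benchmarked methods (and their callees, up to
// the configured call depth) for each runtime, alongside the timing results.
[DisassemblyDiagnoser]
[SimpleJob(RuntimeMoniker.NetCoreApp31)]
[SimpleJob(RuntimeMoniker.NetCoreApp50)]
public class RunDisassembly
{
    [Benchmark]
    public void Release()
    {
        // Invoke the interpreter here so the Run<TExecutionContext> hot loop
        // shows up in the exported disassembly.
    }
}
```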
It's possible the work @kunalspathak is doing might help by aligning some of those internal branches.
As mentioned to @AndyAyersMS on Discord, leaving here a repro branch and some repro steps in case it helps 🙂
- Check out the dotnet-issue-36907 branch
- Open the Brainf_ckSharp_Net5RegressionRepro.slnf file (no UWP stuff and tests)
- Run the Brainf_ckSharp.Profiler project in the Profiling solution directory

After upgrading to ASP.NET 5.0 from 3.1 our Web API response time almost doubled from ~40ms to >80ms. The only change we made was that we upgraded the packages. I am surprised that, while the internet is flooded with posts about the improved performance of .NET 5.0, we saw degraded performance.
@adnan-kamili if I'm not mistaken your issue is not connected with this one. Could you please open an issue in dotnet/aspnetcore where they can best help you? You should not see regressed perf and we would like to figure out why.
@adnan-kamili did you open an issue in dotnet/aspnetcore? If so, can you please provide the link so that we can follow the details? Thanks.
@cjlotz Our application is very big, I am not sure if it would help as we can't share any sample code to replicate the issue.
@adnan-kamili I'm more interested to know whether you managed to work around the performance regression and what you did?
@adnan-kamili an issue over there could still be helpful even if you can't share repro code, e.g., they can ask about your configuration, what changed, and what you're seeing. Others may notice the same problem, and a picture may emerge. Either way, it's probably not connected with the codegen issue here (although nothing's impossible 🙂).
In short though -- we don't want 5.0 to be slower for you, we want it to be faster. So we do want to hear about it, and they're the right people to help you.
We didn't do anything to work around the performance regression. Just waiting for .NET 6.0, maybe that will improve the performance.
I have opened an issue for the same:
Is this fixed yet?
Description
I have a .NET Standard 2.0 library implementing a high performance interpreter for the brainf*ck language. At the end of the day it is essentially just a Turing machine, so all the operations being performed are basically accesses to memory, incrementing values, looking up tables, etc. It's all "self-contained", with no external APIs being invoked, IO operations or anything.
I've run some benchmarks on .NET Core 2.1, .NET Core 3.1, and .NET 5 (Preview 4) and noticed that .NET 5 is consistently slower than .NET Core 3.1 in almost all cases. Especially in the Mandelbrot test case, which is the most intensive, .NET 5 is almost one second slower, so about 10% slower.
I find this very surprising, as .NET 5 is supposed to be much more optimized than .NET Core 3.1.
Pinging @EgorBo as you seemed interested in this when I shared some older benchmarks about this on Twitter (here) back when .NET 5 preview 1 had just been released, and cc. @tannergooding.
Configuration
Regression?
This is a regression in speed from .NET Core 3.1 to .NET 5 Preview 4, on the same machine.
Data
category:cq theme:jit-block-layout skill-level:expert cost:large