Open mrange opened 4 years ago
The extra no-ops should be gone in current master after https://github.com/dotnet/coreclr/pull/26740/ .
cc @adamsitnik
We should verify perf is recovered. Let's take a look as part of .NET 5.
@erozenfeld Can you verify if any work remains here?
I still see a regression with a fairly recent P8 build:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
[Host] : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
Job-DTXTLF : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
Job-XPSVFA : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
Job-KQQRIJ : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT
MaxIterationCount=20 MinIterationCount=15 WarmupCount=30
| Method | Job | Runtime | Toolchain | Count | Mean | Error | StdDev | Ratio | RatioSD |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|------------:|------------:|------:|--------:|
| SimplePushStreamTest | Job-DTXTLF | .NET 4.8 | net48 | 100 | 372.9 ns | 4.34 ns | 4.06 ns | 1.00 | 0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 | 100 | 406.1 ns | 9.33 ns | 9.58 ns | 1.09 | 0.03 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 | netcoreapp50 | 100 | 439.1 ns | 23.10 ns | 26.60 ns | 1.18 | 0.08 |
| | | | | | | | | | |
| SimplePushStreamTest | Job-DTXTLF | .NET 4.8 | net48 | 10000 | 29,770.6 ns | 531.09 ns | 470.80 ns | 1.00 | 0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,609.8 ns | 402.64 ns | 356.93 ns | 1.03 | 0.02 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 | netcoreapp50 | 10000 | 37,312.9 ns | 1,136.50 ns | 1,167.10 ns | 1.25 | 0.04 |
@briansull if you haven't started looking at this, I'll take a look.
Sure, go ahead and take a look
Suspect this is some residual overhead from tiered compilation. Disabling that gives
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
[Host] : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
Job-FZFFZC : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
Job-FIZZFG : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
Job-HDIAOD : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT
MaxIterationCount=20 MinIterationCount=15 WarmupCount=30
| Method | Job | Runtime | Toolchain | Count | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest | Job-FZFFZC | .NET 4.8 | net48 | 100 | 375.1 ns | 4.80 ns | 4.71 ns | 1.00 | 0.00 | 272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 | 100 | 388.9 ns | 15.25 ns | 15.66 ns | 1.03 | 0.04 | 272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 | 100 | 389.1 ns | 7.26 ns | 6.79 ns | 1.04 | 0.02 | 273 B |
| | | | | | | | | | | |
| SimplePushStreamTest | Job-FZFFZC | .NET 4.8 | net48 | 10000 | 31,360.3 ns | 545.98 ns | 510.71 ns | 1.00 | 0.00 | 272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,618.3 ns | 217.74 ns | 193.02 ns | 0.98 | 0.02 | 272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 | 10000 | 31,787.3 ns | 412.13 ns | 385.51 ns | 1.01 | 0.02 | 273 B |
Differential profiles of 4.8 vs 5.0 show identical codegen in all the hot methods, eg:
;; 4.8
;; Program+PerformanceRegression+Tests+SimplePushStreamTest@42-6.Invoke(Int64)
00007ff8`f2816040 488b4908 mov rcx,qword ptr [rcx+8]
00007ff8`f2816044 48ffc2 inc rdx
00007ff8`f2816047 488b01 mov rax,qword ptr [rcx]
00007ff8`f281604a 488b4040 mov rax,qword ptr [rax+40h]
00007ff8`f281604e 488b4020 mov rax,qword ptr [rax+20h]
00007ff8`f2816052 48ffe0 jmp rax
;; 5.0
;; Assembly listing for method SimplePushStreamTest@42-6:Invoke(long):Unit:this
G_M35068_IG01:
;; bbWeight=1 PerfScore 0.00
G_M35068_IG02:
mov rcx, gword ptr [rcx+8]
inc rdx
mov rax, qword ptr [rcx]
mov rax, qword ptr [rax+64]
mov rax, qword ptr [rax+32]
;; bbWeight=1 PerfScore 8.25
G_M35068_IG03:
rex.jmp rax
cc @kouvel
Going to relabel this as VM.
This seems to have something to do with JIT timing. I have seen some cases before where different loop alignment led to noticeable regressions when tiering is enabled. It could also be better at random, but naturally the regressions are more noticed. Adjusting timing of rejits by changing the call count threshold seems to change the perf significantly. Code locality may also be relevant, as I noticed that using R2R'ed runtime binaries versus using IL-only runtime binaries also affects perf significantly regardless of tiering. Tiering timings have changed between 3.1 and 5.0, typically it would be for the better, though there may be benchmarks where slight differences in timing may realize as larger differences in performance. I suspect this is not a regression due to a bug but rather a regression due to chance. It would need more investigation to determine the root causes, and if my theory is right, to determine how to reduce the chance factor and make it more deterministic.
I see large differences in time spent in a function when the function's code crosses a cache line boundary. For small functions it may help to align them such that they would fit within a cache line.
I see, I was also thinking something like this:
0 < size <= 16
then align at any 16-byte offset in a cache line16 < size <= 32
then align at offsets 0, 16, or 32, but not at 4832 < size <= 48
then align at offsets 0 or 1648 < size <= 64
then align at offset 0 onlyMight allow for a bit more density but #2249 also mentioned that crossing a 32-byte boundary has perf issues for loops, so 32-byte alignment may work better in some cases.
Hi.
I saw a performance regression going from .NET 4.8 to dotnet core 3.1. It's small so in practice this might not hurt most users but I thought it's better to create an issue than keeping mum.
I noticed it when discussing my other issue: https://github.com/dotnet/runtime/issues/2191 so the code will be similar although I don't think this is tail call related but I don't know for sure of course.
When setting up a simple push stream pipeline
Benchmark dotnet reports:
.NET 4.8 performs between 10% to 20% faster than dotnet core 3.1.
I dug a bit into the jitted assembler and found the following differences
It seems that in dotnet core there's an extra nop at the start of each method. I suspected tiered compilation but after much messing about trying to disable tiered compilation it's either unrelated or I wasn't able to disable tiered compilation.
It surprises me that the nop adds this much overhead but I can't spot anything else of significance.
The code is here: https://github.com/mrange/TryNewDisassembler/tree/fsharpPerformanceRegression
And here:
category:cq theme:optimization skill-level:intermediate cost:medium