dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Minor performance regression going to dotnet core 3.1 from .NET 4.8 #31613

Open mrange opened 4 years ago

mrange commented 4 years ago

Hi.

I saw a performance regression going from .NET 4.8 to .NET Core 3.1. It's small, so in practice it might not hurt most users, but I thought it better to create an issue than to keep mum.

I noticed it while discussing my other issue (https://github.com/dotnet/runtime/issues/2191), so the code is similar, although I don't think this is tail-call related; I can't be sure, of course.

When setting up a simple push stream pipeline like this:

// Minimalistic PushStream
//  A PushStream accepts a receiver function that will be called
//  with each value in the PushStream
type 'T PushStream = ('T -> unit) -> unit
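
// For illustration only (not part of the original code): a hand-written
// PushStream that pushes the values 1, 2 and 3 into whatever receiver it is
// given; passing printfn "%d" as the receiver prints each value as it is pushed.
let oneTwoThree : int PushStream =
  fun receive ->
    receive 1
    receive 2
    receive 3
let printOneTwoThree () = oneTwoThree (printfn "%d")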

module PushStream =
  let inline zero      ()       = LanguagePrimitives.GenericZero
  let inline push      r v      = r v

  // Creates a PushStream with all integers from b to e (inclusive)
  let inline fromRange b e    r = for i = b to e do push r i
  // Maps all values in ps using mapping function f
  let inline map       f   ps r = ps (fun v -> push r (f v))
  // Filters all values in ps using filter function f
  let inline filter    f   ps r = ps (fun v -> if f v then push r v)
  // Sums all values in ps
  let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

[<DisassemblyDiagnoser>]
type Benchmarks () =
  [<Params (10000, 100)>] 
  member val public Count = 100 with get, set

  [<Benchmark>]
  member x.SimplePushStreamTest () =
    PushStream.fromRange  0 x.Count
    |> PushStream.map     int64
    |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
    |> PushStream.map     ((+) 1L)
    |> PushStream.sum

BenchmarkDotNet reports:

$ dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
...
|               Method |       Runtime |     Toolchain | Count |        Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest |      .NET 4.8 |         net48 |   100 |    400.6 ns |   3.92 ns |   3.67 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 |   100 |    439.3 ns |   4.35 ns |   4.07 ns |  1.10 |    0.02 |     273 B |
|                      |               |               |       |             |           |           |       |         |           |
| SimplePushStreamTest |      .NET 4.8 |         net48 | 10000 | 33,542.5 ns | 143.25 ns | 133.99 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | .NET Core 3.1 | netcoreapp3.1 | 10000 | 39,449.8 ns | 259.08 ns | 242.35 ns |  1.18 |    0.01 |     273 B |

.NET 4.8 performs between 10% and 20% faster than .NET Core 3.1.

I dug a bit into the jitted assembly and found the following differences:

--- dotnetcore.asm
+++ net48.asm
@@ -1,4 +1,4 @@
-; dotnet core 3.1
+; .net v48

 ; PushStream.fromRange  0 x.Count
 LOOP:
@@ -12,7 +12,6 @@
 jne     LOOP

 ; PushStream.map     int64
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 movsxd  rdx,edx
 mov     rax,qword ptr [rcx]
@@ -21,8 +20,7 @@
 jmp     rax

 ; PushStream.filter  (fun v -> (v &&& 1L) = 0L)
-nop     dword ptr [rax+rax]
-mov     eax,edx
+mov     rax,rdx
 test    al,1
 jne     BAILOUT
 mov     rcx,qword ptr [rcx+8]
@@ -35,7 +33,6 @@
 ret

 ; PushStream.map     ((+) 1L)
-nop     dword ptr [rax+rax]
 mov     rcx,qword ptr [rcx+8]
 inc     rdx
 mov     rax,qword ptr [rcx]
@@ -44,11 +41,9 @@
 jmp     rax

 ; PushStream.sum
-nop     dword ptr [rax+rax]
 mov     rax,qword ptr [rcx+8]
 mov     rcx,rax
 add     rdx,qword ptr [rax+8]
 mov     qword ptr [rcx+8],rdx
 xor     eax,eax
 ret
-

It seems that in .NET Core there's an extra nop at the start of each method. I suspected tiered compilation, but after much messing about trying to disable it, either it's unrelated or I simply wasn't able to disable tiered compilation.
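
For reference, the standard .NET Core 3.x knobs for turning tiering off are an environment variable (set before launching BenchmarkDotNet, so the generated benchmark processes inherit it) or a <TieredCompilation>false</TieredCompilation> property in the project file. A sketch of the environment-variable route, not necessarily what was tried above:

$ export COMPlus_TieredCompilation=0    # on a Windows command prompt: set COMPlus_TieredCompilation=0
$ dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1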

It surprises me that the nop adds this much overhead but I can't spot anything else of significance.

The code is here: https://github.com/mrange/TryNewDisassembler/tree/fsharpPerformanceRegression

And here:

module PerformanceRegression =
  open System
  open System.Linq
  open System.Diagnostics

  // Minimalistic PushStream
  //  A PushStream accepts a receiver function that will be called
  //  with each value in the PushStream
  type 'T PushStream = ('T -> unit) -> unit

  module PushStream =
    let inline zero      ()       = LanguagePrimitives.GenericZero
    let inline push      r v      = r v

    // Creates a PushStream with all integers from b to e (inclusive)
    let inline fromRange b e    r = for i = b to e do push r i
    // Maps all values in ps using mapping function f
    let inline map       f   ps r = ps (fun v -> push r (f v))
    // Filters all values in ps using filter function f
    let inline filter    f   ps r = ps (fun v -> if f v then push r v)
    // Sums all values in ps
    let inline sum           ps   = let mutable s = zero () in ps (fun v -> s <- s + v); s

  module Tests =
    open BenchmarkDotNet.Attributes
    open BenchmarkDotNet.Configs
    open BenchmarkDotNet.Jobs
    open BenchmarkDotNet.Horology
    open BenchmarkDotNet.Running
    open BenchmarkDotNet.Diagnostics.Windows.Configs

    [<DisassemblyDiagnoser>]
    type Benchmarks () =
      [<Params (10000, 100)>] 
      member val public Count = 100 with get, set

      [<Benchmark>]
      member x.SimplePushStreamTest () =
        PushStream.fromRange  0 x.Count
        |> PushStream.map     int64
        |> PushStream.filter  (fun v -> (v &&& 1L) = 0L)
        |> PushStream.map     ((+) 1L)
        |> PushStream.sum

    let run argv = 
      let job = Job.Default
                    .WithWarmupCount(30)
                    .WithIterationTime(TimeInterval.FromMilliseconds(250.0)) // the default is 0.5s per iteration, which is slightly too much for us
                    .WithMinIterationCount(15)
                    .WithMaxIterationCount(20)
                    .AsDefault()
      let config = DefaultConfig.Instance.AddJob(job)
      let b = BenchmarkSwitcher [|typeof<Benchmarks>|]
      let summary = b.Run(argv, config)
      printfn "%A" summary

// Run with: dotnet run -c Release -f netcoreapp3.1 --filter '*' --runtimes net48 netcoreapp3.1
[<EntryPoint>]
let main argv =
  PerformanceRegression.Tests.run argv
  0

category:cq theme:optimization skill-level:intermediate cost:medium

jkotas commented 4 years ago

The extra no-ops should be gone in current master after https://github.com/dotnet/coreclr/pull/26740/.

jkotas commented 4 years ago

cc @adamsitnik

AndyAyersMS commented 4 years ago

We should verify perf is recovered. Let's take a look as part of .NET 5.

BruceForstall commented 4 years ago

@erozenfeld Can you verify if any work remains here?

AndyAyersMS commented 4 years ago

I still see a regression with a fairly recent Preview 8 build:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
  Job-DTXTLF : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  Job-XPSVFA : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-KQQRIJ : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT

MaxIterationCount=20  MinIterationCount=15  WarmupCount=30

|               Method |        Job |       Runtime |     Toolchain | Count |        Mean |       Error |      StdDev | Ratio | RatioSD |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|------------:|------------:|------:|--------:|
| SimplePushStreamTest | Job-DTXTLF |      .NET 4.8 |         net48 |   100 |    372.9 ns |     4.34 ns |     4.06 ns |  1.00 |    0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 |   100 |    406.1 ns |     9.33 ns |     9.58 ns |  1.09 |    0.03 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 |  netcoreapp50 |   100 |    439.1 ns |    23.10 ns |    26.60 ns |  1.18 |    0.08 |
|                      |            |               |               |       |             |             |             |       |         |
| SimplePushStreamTest | Job-DTXTLF |      .NET 4.8 |         net48 | 10000 | 29,770.6 ns |   531.09 ns |   470.80 ns |  1.00 |    0.00 |
| SimplePushStreamTest | Job-XPSVFA | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,609.8 ns |   402.64 ns |   356.93 ns |  1.03 |    0.02 |
| SimplePushStreamTest | Job-KQQRIJ | .NET Core 5.0 |  netcoreapp50 | 10000 | 37,312.9 ns | 1,136.50 ns | 1,167.10 ns |  1.25 |    0.04 |

AndyAyersMS commented 4 years ago

@briansull if you haven't started looking at this, I'll take a look.

briansull commented 4 years ago

Sure, go ahead and take a look

AndyAyersMS commented 4 years ago

Suspect this is some residual overhead from tiered compilation. Disabling that gives:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.959 (1909/November2018Update/19H2)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20359.6
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT DEBUG
  Job-FZFFZC : .NET Framework 4.8 (4.8.4180.0), X64 RyuJIT
  Job-FIZZFG : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-HDIAOD : .NET Core 5.0.0 (CoreCLR 5.0.20.35809, CoreFX 5.0.20.35809), X64 RyuJIT

MaxIterationCount=20  MinIterationCount=15  WarmupCount=30

|               Method |        Job |       Runtime |     Toolchain | Count |        Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------- |----------- |-------------- |-------------- |------ |------------:|----------:|----------:|------:|--------:|----------:|
| SimplePushStreamTest | Job-FZFFZC |      .NET 4.8 |         net48 |   100 |    375.1 ns |   4.80 ns |   4.71 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 |   100 |    388.9 ns |  15.25 ns |  15.66 ns |  1.03 |    0.04 |     272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 |   100 |    389.1 ns |   7.26 ns |   6.79 ns |  1.04 |    0.02 |     273 B |
|                      |            |               |               |       |             |           |           |       |         |           |
| SimplePushStreamTest | Job-FZFFZC |      .NET 4.8 |         net48 | 10000 | 31,360.3 ns | 545.98 ns | 510.71 ns |  1.00 |    0.00 |     272 B |
| SimplePushStreamTest | Job-FIZZFG | .NET Core 3.1 | netcoreapp3.1 | 10000 | 30,618.3 ns | 217.74 ns | 193.02 ns |  0.98 |    0.02 |     272 B |
| SimplePushStreamTest | Job-HDIAOD | .NET Core 5.0 | netcoreapp5.0 | 10000 | 31,787.3 ns | 412.13 ns | 385.51 ns |  1.01 |    0.02 |     273 B |

Differential profiles of 4.8 vs 5.0 show identical codegen in all the hot methods, e.g.:

;; 4.8
;; Program+PerformanceRegression+Tests+SimplePushStreamTest@42-6.Invoke(Int64)

00007ff8`f2816040 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff8`f2816044 48ffc2          inc     rdx
00007ff8`f2816047 488b01          mov     rax,qword ptr [rcx]
00007ff8`f281604a 488b4040        mov     rax,qword ptr [rax+40h]
00007ff8`f281604e 488b4020        mov     rax,qword ptr [rax+20h]
00007ff8`f2816052 48ffe0          jmp     rax

;; 5.0
;; Assembly listing for method SimplePushStreamTest@42-6:Invoke(long):Unit:this

G_M35068_IG01:
                        ;; bbWeight=1    PerfScore 0.00
G_M35068_IG02:
       mov      rcx, gword ptr [rcx+8]
       inc      rdx
       mov      rax, qword ptr [rcx]
       mov      rax, qword ptr [rax+64]
       mov      rax, qword ptr [rax+32]
                        ;; bbWeight=1    PerfScore 8.25
G_M35068_IG03:
       rex.jmp  rax

cc @kouvel

Going to relabel this as VM.

kouvel commented 4 years ago

This seems to have something to do with JIT timing. I have seen cases before where different loop alignment led to noticeable regressions when tiering is enabled. It could just as easily turn out better at random, but naturally the regressions are the ones that get noticed. Adjusting the timing of rejits by changing the call count threshold seems to change the perf significantly.

Code locality may also be relevant: I noticed that using R2R'ed runtime binaries versus IL-only runtime binaries also affects perf significantly, regardless of tiering. Tiering timings have changed between 3.1 and 5.0, typically for the better, though there may be benchmarks where slight differences in timing show up as larger differences in performance.

I suspect this is not a regression due to a bug but rather a regression due to chance. It would need more investigation to determine the root causes and, if my theory is right, to figure out how to reduce the chance factor and make things more deterministic.
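
For anyone who wants to poke at this theory, these are (as far as I know) the standard CoreCLR environment knobs for the two factors mentioned above; the values are placeholders to experiment with, not ones used in this investigation:

$ export COMPlus_TC_CallCountThreshold=<n>   # how many calls before a method is promoted to tier 1
$ export COMPlus_ReadyToRun=0                # ignore the precompiled R2R code in the runtime binaries and JIT everything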

kouvel commented 4 years ago

I see large differences in time spent in a function when the function's code crosses a cache line boundary. For small functions it may help to align them such that they would fit within a cache line.

AndyAyersMS commented 4 years ago

#2249 added 32-byte alignment for tier1 methods with loops; might be interesting to try this for all tier1 methods.

kouvel commented 4 years ago

I see, I was also thinking something like this:

Might allow for a bit more density but #2249 also mentioned that crossing a 32-byte boundary has perf issues for loops, so 32-byte alignment may work better in some cases.