dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

RyuJIT: Tight loop performance regression #71628

Open ptr-null opened 2 years ago

ptr-null commented 2 years ago

Description

Consider the following method containing tight loop:

int[] array = new int[1000000];

public int Sum()
{
    int sum = 0;

    for (int i = 0; i < array.Length; i++)
        sum += array[i];

    return sum;
}

The method is modified then by adding a short check before the loop, as follows:

long state = 1;

public int Sum2()
{
    int sum = 0;

    if (Interlocked.Read(ref state) == 0)
        return sum;

    for (int i = 0; i < array.Length; i++)
        sum += array[i];

    return sum;
}

It is reasonable to expect (given that array is sufficiently long) that the check added should not affect method performance dramatically, right?

But it does:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
Intel Core i5-4690 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.301
  [Host]     : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  DefaultJob : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
Method Mean Error StdDev Ratio RatioSD
Sum 590,207.1 ns 9,395.29 ns 8,328.68 ns 1.00 0.00
Sum2 1,943,186.9 ns 34,793.55 ns 32,545.90 ns 3.29 0.06

Modified method becomes ~3x slower, which suggests that the loop is performed slower.

Analysis

Indeed, it is. Inspecting IL does not expose any difference (loop bodies are identical in both cases). Code generated by RyuJIT for the loop is different however.

In the first case, the sum is accumulated into eax register:

...
add       eax,[r8+r9*4+10]        // sum += array[i]
...

Where as in the second one it is accumulated into r9d register, which is loaded from and saved to stack on each iteration of the loop:

...
mov       r9d,[rsp+24]            // sum <-- [stack]
add       r9d,[rcx+r8*4+10]       // sum += array[i]
...
mov       [rsp+24],r9d            // [stack] <-- sum
...

There is one additional jump in the second case also. See complete generated codes at the end.

Regression

As one may see from the benchmark results below, this is the regression actually:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
Intel Core i5-4690 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.301
  [Host]             : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  .NET 6.0           : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  .NET 5.0           : .NET 5.0.17 (5.0.1722.21314), X64 RyuJIT
  .NET Core 3.1      : .NET Core 3.1.26 (CoreCLR 4.700.22.26002, CoreFX 4.700.22.26801), X64 RyuJIT
  .NET Core 2.2      : .NET Core 2.2.8 (CoreCLR 4.6.28207.03, CoreFX 4.6.28208.02), X64 RyuJIT
  .NET Framework 4.8 : .NET Framework 4.8 (4.8.4515.0), X64 RyuJIT
Method Runtime Mean Error StdDev Ratio RatioSD
Sum .NET 6.0 587,219.9 ns 11,480.13 ns 10,738.52 ns 1.00 0.00
Sum2 .NET 6.0 1,943,794.5 ns 15,662.22 ns 13,078.66 ns 3.33 0.04
Sum .NET 5.0 584,441.7 ns 9,659.41 ns 9,035.41 ns 1.00 0.00
Sum2 .NET 5.0 1,955,357.4 ns 18,500.14 ns 16,399.89 ns 3.35 0.06
Sum .NET Core 3.1 597,269.8 ns 8,984.42 ns 8,404.03 ns 1.00 0.00
Sum2 .NET Core 3.1 1,987,871.5 ns 35,514.92 ns 33,220.67 ns 3.33 0.09
Sum .NET Core 2.2 585,353.0 ns 7,527.64 ns 7,041.36 ns 1.00 0.00
Sum2 .NET Core 2.2 576,804.1 ns 2,975.21 ns 2,637.45 ns 0.99 0.01
Sum .NET Framework 4.8 580,485.7 ns 5,634.76 ns 5,270.76 ns 1.00 0.00
Sum2 .NET Framework 4.8 591,283.8 ns 8,772.34 ns 8,205.65 ns 1.02 0.02

Both methods were on par in the .NET Framework 4.8 and .NET Core 2.2

Generated code

.NET 6.0.6 (6.0.622.26707), X64 RyuJIT

; Sum()
       sub       rsp,28
       xor       eax,eax
       xor       edx,edx
       mov       rcx,[rcx+8]
       cmp       dword ptr [rcx+8],0
       jle       short M00_L01
       nop       dword ptr [rax]
       nop       dword ptr [rax]
M00_L00:
       mov       r8,rcx
       cmp       edx,[r8+8]
       jae       short M00_L02
       movsxd    r9,edx
       add       eax,[r8+r9*4+10]
       inc       edx
       cmp       [rcx+8],edx
       jg        short M00_L00
M00_L01:
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 67
; Sum2()
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+24],eax
       cmp       [rcx],ecx
       lea       rdx,[rcx+10]
       xor       r8d,r8d
       xor       eax,eax
       lock cmpxchg [rdx],r8
       test      rax,rax
       jne       short M00_L00
       xor       eax,eax
       add       rsp,28
       ret
M00_L00:
       xor       eax,eax
       mov       rdx,[rcx+8]
       cmp       dword ptr [rdx+8],0
       jle       short M00_L04
M00_L01:
       mov       rcx,rdx
       cmp       eax,[rcx+8]
       jae       short M00_L05
       movsxd    r8,eax
       mov       r9d,[rsp+24]
       add       r9d,[rcx+r8*4+10]
       inc       eax
       cmp       [rdx+8],eax
       jg        short M00_L03
M00_L02:
       mov       eax,r9d
       add       rsp,28
       ret
M00_L03:
       mov       [rsp+24],r9d
       jmp       short M00_L01
M00_L04:
       mov       r9d,[rsp+24]
       jmp       short M00_L02
M00_L05:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 106

category:cq theme:register-allocator skill-level:expert cost:large impact:large

gfoidl commented 2 years ago

Beside the codegen you should write the method like

public int Sum()
{
    if (Interlocked.Read(ref _state) == 0)
    {
        return 0;
    }

    int sum = 0;
    int[] array = _array;

    for (int i = 0; i < array.Length; ++i)
        sum += array[i];

    return sum;
}

It's faster then, and codegen will be as expected.

So key points are:

ghost commented 2 years ago

This issue has been marked needs-author-action and may be missing some important information.

JulieLeeMSFT commented 2 years ago

@ptr-null please try what @gfoidl suggested and let us know if it removes the regression. cc @dotnet/jit-contrib.

kunalspathak commented 2 years ago

This is an epic problem of LSRA's including resolution moves in the middle of the loop where spill/reload happens. The generated code that gfoidl suggested is much better, but regardless the LSRA problem should be handled in runtime.

ptr-null commented 2 years ago

@gfoidl, the example I used here is quite reduced form of the real code piece (still reproducing the issue). Workaround is not something to worry about (I have several for my real problem). Thanks for the array-trick hint though. I've seen it before, but forgot happily. Those range checks are especially annoying to see in my case, because of I have array as readonly field of the instance (hello to https://github.com/dotnet/runtime/issues/11797). The other annoying thing I noticed is this != null added (https://github.com/dotnet/runtime/issues/44087 once again). cc @JulieLeeMSFT

kunalspathak commented 1 year ago

This is likely won't get time during .NET 8, but I will mark it as Pri3 for .NET 8.

kunalspathak commented 2 months ago

This falls in the category of "resolution phase" of LSRA noted in https://github.com/dotnet/runtime/issues/47194.