JIT Performance regression between 1.1 and 2.0 and between 2.0 and 2.1 #10466

Open saucecontrol opened 6 years ago

saucecontrol commented 6 years ago

I've been trying to come up with a simple repro for a perf regression I've been fighting, and I think I have it trimmed down as much as possible while still showing the real-world impact. What I have here is a stripped-down version of my Blake2b hashing algorithm implementation.

The mixSimplified method here shows the problem. Code size has grown with each new JIT since 1.1, and performance has dropped.
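
For reference, Blake2b's mixing is built from quarter-rounds like the one below (the standard G function; Ror is an illustrative helper, not code from the repro). The real method unrolls many of these over 16 interdependent ulong working variables, which is what pressures the register allocator:

static ulong Ror(ulong v, int n) => (v >> n) | (v << (64 - n));

// One quarter-round: each operation feeds the next, so a, b, c, and d
// all stay live across the entire step.
static void G(ref ulong a, ref ulong b, ref ulong c, ref ulong d, ulong x, ulong y)
{
    a += b + x; d = Ror(d ^ a, 32);
    c += d;     b = Ror(b ^ c, 24);
    a += b + y; d = Ror(d ^ a, 16);
    c += d;     b = Ror(b ^ c, 63);
}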

| Runtime | x86 Code Size | x64 Code Size |
| --- | ---: | ---: |
| netcoreapp1.1 | 1534 | 577 |
| netcoreapp2.0 | 2906 | 602 |
| netcoreapp2.1 | 2956 | 604 |
| netcoreapp3.1 | 3057 | 632 |
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18362
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
  [Host]        : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.13 (CoreCLR 4.6.27618.02, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.14 (CoreCLR 4.6.28207.04, CoreFX 4.6.28208.01), 64bit RyuJIT
  netcoreapp3.1 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
| Runtime | Platform | Mean | Error | StdDev | Scaled |
| --- | --- | ---: | ---: | ---: | ---: |
| netcoreapp1.1 | X64 | 1.002 ms | 0.0090 ms | 0.0084 ms | 1.00 |
| netcoreapp2.0 | X64 | 1.039 ms | 0.0182 ms | 0.0152 ms | 1.04 |
| netcoreapp2.1 | X64 | 1.039 ms | 0.0200 ms | 0.0178 ms | 1.04 |
| netcoreapp3.1 | X64 | 1.154 ms | 0.0222 ms | 0.0256 ms | 1.15 |
| netcoreapp1.1 | X86 | 2.944 ms | 0.0154 ms | 0.0144 ms | 1.00 |
| netcoreapp2.0 | X86 | 5.780 ms | 0.0375 ms | 0.0313 ms | 1.96 |
| netcoreapp2.1 | X86 | 6.545 ms | 0.0329 ms | 0.0292 ms | 2.22 |
| netcoreapp3.1 | X86 | 6.768 ms | 0.0422 ms | 0.0374 ms | 2.30 |

JitDisasm and JitDump output for all versions are here: simplifiedhash_jitdumps.zip

category:cq theme:register-allocator skill-level:expert cost:large

janvorli commented 6 years ago

cc: @dotnet/jit-contrib

AndyAyersMS commented 6 years ago

I'll take a quick look. Given that it's an x86 regression, my guess is it's similar to what we see over in dotnet/runtime#9833.

saucecontrol commented 6 years ago

Thanks, @AndyAyersMS. I did some further optimization of that inner loop code in my main project that altered results slightly. Don't know if it will help track down the regression. Basically, I added more local variables to hold the data block being hashed. Of course those end up getting spilled, but having them and the accumulator variables in the same stack memory area speeds things up slightly. I also re-ordered some of the operations to reduce register dependencies.

Anyway, the updated version of the function now performs the same on 1.1 and 2.0 on the x64 runtimes but has a consistent 4% slowdown in 2.1. x86 shows the same major slowdown between 1.1 and 2.0 and again between 2.0 and 2.1.

| Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | Allocated |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: |
| Blake2bFast | netcoreapp1.1 | X64 | True | 10.69 ms | 0.0147 ms | 0.0123 ms | 1.00 | 0 B |
| Blake2bFast | netcoreapp2.0 | X64 | Default | 10.68 ms | 0.0610 ms | 0.0571 ms | 1.00 | 0 B |
| Blake2bFast | netcoreapp2.1 | X64 | Default | 11.14 ms | 0.0383 ms | 0.0359 ms | 1.04 | 0 B |
| Blake2bFast | netcoreapp1.1 | X86 | True | 69.34 ms | 0.4273 ms | 0.3997 ms | 1.00 | 0 B |
| Blake2bFast | netcoreapp2.0 | X86 | Default | 108.07 ms | 0.6709 ms | 0.6275 ms | 1.56 | 0 B |
| Blake2bFast | netcoreapp2.1 | X86 | Default | 129.12 ms | 0.7999 ms | 0.7483 ms | 1.86 | 0 B |

AndyAyersMS commented 6 years ago

One would hope x86 would be only ~2x slower than x64 here, since the primary data type is long/ulong. But no...

Looking at x86, 2.0 vs 2.1, there is a large code size regression. Likely RA (register allocation).

2.0
; Total bytes of code 49767, prolog size 31 for method Blake2bContext:mixManualInline(int):this

2.1
; Total bytes of code 60608, prolog size 31 for method Blake2bContext:mixManualInline(int):this

Suffice it to say, this is something of a torture test for an x86 register allocator, leading to long stretches of stack shuffling:

       89B574FCFFFF mov      dword ptr [ebp-38CH], esi
       8BB578FCFFFF mov      esi, dword ptr [ebp-388H]
       81F66BBD41FB xor      esi, 0xFB41BD6B
       89B578FCFFFF mov      dword ptr [ebp-388H], esi
       8BB574FCFFFF mov      esi, dword ptr [ebp-38CH]
       81F6ABD9831F xor      esi, 0x1F83D9AB
       89B574FCFFFF mov      dword ptr [ebp-38CH], esi
       8BB578FCFFFF mov      esi, dword ptr [ebp-388H]
       89B594FEFFFF mov      dword ptr [ebp-16CH], esi
       8BB574FCFFFF mov      esi, dword ptr [ebp-38CH]
       89B590FEFFFF mov      dword ptr [ebp-170H], esi
       8BF1         mov      esi, ecx
       03B5F4FEFFFF add      esi, dword ptr [ebp-10CH]
       89B574FCFFFF mov      dword ptr [ebp-38CH], esi
       8BF2         mov      esi, edx
       13B5F0FEFFFF adc      esi, dword ptr [ebp-110H]
       89B578FCFFFF mov      dword ptr [ebp-388H], esi
       8B75F0       mov      esi, dword ptr [ebp-10H]
       8B8574FCFFFF mov      eax, dword ptr [ebp-38CH]
       0306         add      eax, dword ptr [esi]
       898574FCFFFF mov      dword ptr [ebp-38CH], eax
       8B8578FCFFFF mov      eax, dword ptr [ebp-388H]
       134604       adc      eax, dword ptr [esi+4]
       8BB574FCFFFF mov      esi, dword ptr [ebp-38CH]

The methods are too big to make sense of as is (especially given the semi-arbitrary IG breaks) so we should cut them down to a representative sample. Probably just one round of mixing would be plenty to surface the challenges the jit must overcome.

Also need to look into what kind of code jit32 produced in 1.1.

AndyAyersMS commented 6 years ago

@saucecontrol Unrolled hash computations are often difficult cases for the allocator because everything ends up highly interdependent. But clearly we have a lot of room for improvement on x86 both on a release/release basis and by comparison to where we are on x64.

Speaking of comparisons, do you have a native (C/C++) implementation that you use as a reference point? It also would be interesting to see how far off our x86/x64 performance is from pure native code.

saucecontrol commented 6 years ago

Sure do. There's a project called Blake2.Bench in that same repo that has comparisons with the reference C version called via PInvoke. My C# version started out with about the same performance on x64, but after optimization, it's quite a bit faster. Performance on x86 has never been as good, but 1.1 did admirably.

Here's another run of that last bench with the native results included:

| Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | Allocated |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: |
| Blake2bFast | netcoreapp1.1 | X64 | True | 10.73 ms | 0.0682 ms | 0.0638 ms | 1.00 | 0 B |
| Blake2bFast | netcoreapp2.0 | X64 | Default | 10.79 ms | 0.0524 ms | 0.0490 ms | 1.01 | 0 B |
| Blake2bFast | netcoreapp2.1 | X64 | Default | 11.19 ms | 0.0609 ms | 0.0569 ms | 1.04 | 0 B |
| Blake2bRefNative | netcoreapp2.1 | X64 | Default | 16.12 ms | 0.0334 ms | 0.0279 ms | 1.50 | 0 B |
| Blake2bFast | netcoreapp1.1 | X86 | True | 69.10 ms | 0.2009 ms | 0.1781 ms | 1.00 | 0 B |
| Blake2bFast | netcoreapp2.0 | X86 | Default | 108.28 ms | 0.6423 ms | 0.5694 ms | 1.57 | 0 B |
| Blake2bFast | netcoreapp2.1 | X86 | Default | 128.76 ms | 0.3564 ms | 0.3159 ms | 1.86 | 0 B |
| Blake2bRefNative | netcoreapp2.1 | X86 | Default | 50.00 ms | 0.2553 ms | 0.2132 ms | 0.72 | 0 B |

Since register allocation is such an issue, I reckon it's expected that x86 would be more than 2x slower, since it has fewer than half the registers to work with. Combine that with the decomposed ulong calculations, and 4x slower is probably about right as a target.

saucecontrol commented 6 years ago

I don't know if this is useful or not, but I also have an implementation of the Blake2s algorithm in the main project in that repo. It's the exact same thing but operates on uint words instead of ulong and does 10 mixing rounds instead of 12.

| Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | Allocated |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: |
| Blake2sFast | netcoreapp1.1 | X64 | True | 17.06 ms | 0.2094 ms | 0.1959 ms | 1.00 | 0 B |
| Blake2sFast | netcoreapp2.0 | X64 | Default | 17.34 ms | 0.1110 ms | 0.1038 ms | 1.02 | 0 B |
| Blake2sFast | netcoreapp2.1 | X64 | Default | 17.23 ms | 0.0149 ms | 0.0107 ms | 1.01 | 0 B |
| Blake2sRefNative | netcoreapp2.1 | X64 | Default | 25.89 ms | 0.0216 ms | 0.0168 ms | 1.52 | 0 B |
| Blake2sFast | netcoreapp1.1 | X86 | True | 67.97 ms | 0.5795 ms | 0.5420 ms | 1.00 | 0 B |
| Blake2sFast | netcoreapp2.0 | X86 | Default | 26.92 ms | 0.0720 ms | 0.0638 ms | 0.40 | 0 B |
| Blake2sFast | netcoreapp2.1 | X86 | Default | 30.73 ms | 0.2364 ms | 0.2211 ms | 0.45 | 0 B |
| Blake2sRefNative | netcoreapp2.1 | X86 | Default | 32.10 ms | 0.0524 ms | 0.0464 ms | 0.47 | 0 B |

Interesting that RyuJIT-32 does so much better than jit32 here between 1.1 and 2.0, but there's a definite regression between 2.0 and 2.1.

AndyAyersMS commented 6 years ago

Core 1.1 used the older "jit32" JIT for x86; Core 2.0 and later use RyuJit.

Generally speaking, the switch to RyuJit improved performance for x86. But not always.

I think the perf issues in the initial long version of your code that show up in 2.0 and 2.1 are related to how RyuJit handles 64-bit long computations on x86. The main challenge is that these need to be decomposed into two 32-bit computations, and various complications arise depending on when and how that's done.

Long decomposition doubles the number of things that deserve registers (vs. the same computation on x64), while at the same time the allocator has to cope with x86 having only ~6 allocatable registers where x64 has ~13. Twice the demand, half the supply.
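
A minimal sketch of what that costs in practice (hypothetical method; the add/adc pairs match the dump above):

static ulong Accumulate(ulong acc, ulong x)
{
    // x64: a single 64-bit add in one register.
    // x86: RyuJit decomposes the ulong into two 32-bit halves -- `add` for
    // the low halves, `adc` (add-with-carry) for the high halves -- so every
    // live ulong occupies a register pair.
    return acc + x;
}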

I don't have anything specific to call out just yet though, or a good explanation of why it started off poorly and has gotten worse.

saucecontrol commented 6 years ago

> Core 1.1 used the older "jit32" JIT for x86; Core 2.0 and later use RyuJit.

Ohhhhhhh… I feel like I knew that at one point and it fell out of my brain. Doesn't help that BDN will only run Job.RyuJitX86 with their netcore toolchains 😉. That does at least explain some of it.

saucecontrol commented 6 years ago

I found another case that might be related and is much simpler to follow.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC
.NET Core SDK=2.1.301
  [Host]        : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
  net46         : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3110.0
  netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 32bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 32bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
| Method | Job | Runtime | IsBaseline | Mean | Error | StdDev | Scaled |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: |
| FixedPoint | net46 | Clr | Default | 17.10 us | 0.0742 us | 0.0694 us | 1.00 |
| FixedPoint | netcoreapp1.1 | Core | True | 17.15 us | 0.0985 us | 0.0921 us | 1.00 |
| FixedPoint | netcoreapp2.0 | Core | Default | 18.37 us | 0.0978 us | 0.0915 us | 1.07 |
| FixedPoint | netcoreapp2.1 | Core | Default | 18.56 us | 0.1381 us | 0.1292 us | 1.08 |

This one shows a consistent 6-8% regression between jit32 and RyuJIT on this machine, and more like 10-20% on another machine I tested.

using System;
using BenchmarkDotNet.Attributes;

class GreyConverter
{
    unsafe static void ToGreyFixedPoint(byte* prgb, byte* pgrey, int cb)
    {
        // BT.601 luma weights in Q15 fixed point; they evaluate to 0x2646,
        // 0x4B23, and 0x0E98 -- the imul constants visible in the dumps below.
        const ushort scale = 1 << 15;
        const ushort round = 1 << 14;
        const ushort rw = (ushort)(0.299f * scale + 0.5f);
        const ushort gw = (ushort)(0.587f * scale + 0.5f);
        const ushort bw = (ushort)(0.114f * scale + 0.5f);

        byte* end = prgb + cb;
        while (prgb < end)
        {
            int val = prgb[0] * rw + prgb[1] * gw + prgb[2] * bw;
            *pgrey = (byte)((val + round) >> 15);

            prgb += 3;
            pgrey++;
        }
    }

    byte[] Grey;
    byte[] Rgb;

    public GreyConverter()
    {
        Grey = new byte[16384];
        Rgb = new byte[Grey.Length * 3];

        new Random(42).NextBytes(Rgb);
    }

    [Benchmark]
    unsafe public void FixedPoint()
    {
        fixed (byte* prgb = &Rgb[0], pgrey = &Grey[0])
            ToGreyFixedPoint(prgb, pgrey, Rgb.Length);
    }
}

Here's the BDN assembly dump from the legacy JIT:

081f9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
081f9697 8b7d08          mov     edi,dword ptr [ebp+8]
081f969a 03f9            add     edi,ecx
081f969c 3bcf            cmp     ecx,edi
081f969e 7334            jae     081f96d4
081f96a0 0fb611          movzx   edx,byte ptr [ecx]
081f96a3 69d246260000    imul    edx,edx,2646h
081f96a9 0fb64101        movzx   eax,byte ptr [ecx+1]
081f96ad 69c0234b0000    imul    eax,eax,4B23h
081f96b3 03d0            add     edx,eax
081f96b5 0fb64102        movzx   eax,byte ptr [ecx+2]
081f96b9 69c0980e0000    imul    eax,eax,0E98h
081f96bf 03d0            add     edx,eax
081f96c1 81c200400000    add     edx,4000h
081f96c7 c1fa0f          sar     edx,0Fh
081f96ca 8816            mov     byte ptr [esi],dl
081f96cc 83c103          add     ecx,3
081f96cf 46              inc     esi
081f96d0 3bcf            cmp     ecx,edi
081f96d2 72cc            jb      081f96a0
081f96d4 5e              pop     esi

And RyuJIT:

05649198 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
0564919e 8bc1            mov     eax,ecx
056491a0 034508          add     eax,dword ptr [ebp+8]
056491a3 3bc8            cmp     ecx,eax
056491a5 7336            jae     056491dd
056491a7 0fb619          movzx   ebx,byte ptr [ecx]
056491aa 69f346260000    imul    esi,ebx,2646h
056491b0 0fb65901        movzx   ebx,byte ptr [ecx+1]
056491b4 69fb234b0000    imul    edi,ebx,4B23h
056491ba 03f7            add     esi,edi
056491bc 0fb65902        movzx   ebx,byte ptr [ecx+2]
056491c0 69fb980e0000    imul    edi,ebx,0E98h
056491c6 03f7            add     esi,edi
056491c8 81c600400000    add     esi,4000h
056491ce 8bde            mov     ebx,esi
056491d0 c1fb0f          sar     ebx,0Fh
056491d3 881a            mov     byte ptr [edx],bl
056491d5 83c103          add     ecx,3
056491d8 42              inc     edx
056491d9 3bc8            cmp     ecx,eax
056491db 72ca            jb      056491a7
056491dd 5b              pop     ebx

The two differences that jump out at me: 1) RyuJIT uses an extra register and has that extra mov ebx, esi; 2) jit32 sets up the first two movzx/imul pairs with different registers so they can run in parallel, whereas RyuJIT uses ebx for both.

mikedn commented 6 years ago

> jit32 sets up the first two movzx/imul pairs with different registers so they can run in parallel, whereas RyuJIT uses ebx for both.

The fact that the same register is used for both shouldn't make any difference; CPUs handle such cases via register renaming.

> RyuJIT uses an extra register and has that extra mov ebx, esi

That may be. In general, register-to-register moves should have very low cost, in many cases being handled by register renaming and thus generating no actual μops. Still, they can degrade performance, if only because of the increased code size.

saucecontrol commented 6 years ago

Hmmm... I hadn't thought about register renaming, but if the CPU is doing it in both cases, I'm surprised a 2-byte code-size increase would have such a large perf impact. I'll add more cases if I see anything different come up.

AndyAyersMS commented 6 years ago

The perf difference might be caused by loop-top alignment; note the backedge target addresses:

;;; legacy jit
081f96d2 72cc            jb      081f96a0

;;; ryujit
056491db 72ca            jb      056491a7

There is a perf benefit to having frequent branch targets at addresses == 0 mod 16 (or, on recent cores, mod 32); here the legacy jit's loop top 081f96a0 is 16-byte aligned, while RyuJit's 056491a7 is not.

Not sure if the legacy jit just gets lucky here, or whether it would pad above the loop given a slightly different example.

saucecontrol commented 6 years ago

Looks like the alignment was purely coincidental. I added an extra increment before the loop to offset each of them by 1, and the RyuJIT version still came out 8% slower with its loop top 16-byte aligned.
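
The increment is just a dummy pointer bump ahead of the loop, along these lines (a sketch; the `inc ecx` it produces is visible at the top of the jit32 listing below):

prgb++;                  // dummy 1-byte inc: shifts every later instruction address by 1
byte* end = prgb + cb;
while (prgb < end)
{
    // ... loop body unchanged ...
}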

| Method | Job | IsBaseline | Mean | Error | StdDev | Scaled |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| FixedPoint | netcoreapp1.1 | True | 17.37 us | 0.0726 us | 0.0679 us | 1.00 |
| FixedPoint | netcoreapp2.1 | Default | 18.78 us | 0.1559 us | 0.1458 us | 1.08 |

jit32 asm

074e9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
074e9697 41              inc     ecx
074e9698 8b7d08          mov     edi,dword ptr [ebp+8]
074e969b 03f9            add     edi,ecx
074e969d 3bcf            cmp     ecx,edi
074e969f 7334            jae     074e96d5
074e96a1 0fb611          movzx   edx,byte ptr [ecx]
074e96a4 69d246260000    imul    edx,edx,2646h
074e96aa 0fb64101        movzx   eax,byte ptr [ecx+1]
074e96ae 69c0234b0000    imul    eax,eax,4B23h
074e96b4 03d0            add     edx,eax
074e96b6 0fb64102        movzx   eax,byte ptr [ecx+2]
074e96ba 69c0980e0000    imul    eax,eax,0E98h
074e96c0 03d0            add     edx,eax
074e96c2 81c200400000    add     edx,4000h
074e96c8 c1fa0f          sar     edx,0Fh
074e96cb 8816            mov     byte ptr [esi],dl
074e96cd 83c103          add     ecx,3
074e96d0 46              inc     esi
074e96d1 3bcf            cmp     ecx,edi
074e96d3 72cc            jb      074e96a1
074e96d5 5e              pop     esi

and RyuJIT

084e91a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
084e91a6 41              inc     ecx
084e91a7 8bc1            mov     eax,ecx
084e91a9 034508          add     eax,dword ptr [ebp+8]
084e91ac 3bc8            cmp     ecx,eax
084e91ae 7336            jae     084e91e6
084e91b0 0fb619          movzx   ebx,byte ptr [ecx]
084e91b3 69f346260000    imul    esi,ebx,2646h
084e91b9 0fb65901        movzx   ebx,byte ptr [ecx+1]
084e91bd 69fb234b0000    imul    edi,ebx,4B23h
084e91c3 03f7            add     esi,edi
084e91c5 0fb65902        movzx   ebx,byte ptr [ecx+2]
084e91c9 69fb980e0000    imul    edi,ebx,0E98h
084e91cf 03f7            add     esi,edi
084e91d1 81c600400000    add     esi,4000h
084e91d7 8bde            mov     ebx,esi
084e91d9 c1fb0f          sar     ebx,0Fh
084e91dc 881a            mov     byte ptr [edx],bl
084e91de 83c103          add     ecx,3
084e91e1 42              inc     edx
084e91e2 3bc8            cmp     ecx,eax
084e91e4 72ca            jb      084e91b0
084e91e6 5b              pop     ebx
erozenfeld commented 6 years ago

On newer processors, crossing a 32-byte boundary inside a loop makes a difference because of the µop cache. That's definitely the case for loops that are 32 bytes or smaller. Here the loops are larger than 32 bytes, but they cross a 32-byte boundary once in the jit32 case and twice in the RyuJit case. That may affect perf. Can you try aligning the loops at 32 bytes in both cases?

See section 9.3 of http://www.agner.org/optimize/microarchitecture.pdf for details about the µop cache and 32-byte alignment.

saucecontrol commented 6 years ago

I managed to get both loop tops 32-byte aligned, but I'm not seeing any changes to the perf numbers. And the perf delta holds up even if I force the jit32 output to be misaligned.

| Method | Job | IsBaseline | Mean | Error | StdDev | Scaled |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| FixedPoint | netcoreapp1.1 | True | 17.36 us | 0.1503 us | 0.1406 us | 1.00 |
| FixedPoint | netcoreapp2.1 | Default | 18.64 us | 0.1099 us | 0.0974 us | 1.07 |

jit32

07a59690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a59697 8b7d08          mov     edi,dword ptr [ebp+8]
07a5969a 03f9            add     edi,ecx
07a5969c 3bcf            cmp     ecx,edi
07a5969e 7334            jae     07a596d4
07a596a0 0fb611          movzx   edx,byte ptr [ecx]
07a596a3 69d246260000    imul    edx,edx,2646h
07a596a9 0fb64101        movzx   eax,byte ptr [ecx+1]
07a596ad 69c0234b0000    imul    eax,eax,4B23h
07a596b3 03d0            add     edx,eax
07a596b5 0fb64102        movzx   eax,byte ptr [ecx+2]
07a596b9 69c0980e0000    imul    eax,eax,0E98h
07a596bf 03d0            add     edx,eax
07a596c1 81c200400000    add     edx,4000h
07a596c7 c1fa0f          sar     edx,0Fh
07a596ca 8816            mov     byte ptr [esi],dl
07a596cc 83c103          add     ecx,3
07a596cf 46              inc     esi
07a596d0 3bcf            cmp     ecx,edi
07a596d2 72cc            jb      07a596a0
07a596d4 5e              pop     esi

RyuJIT

07a191a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a191a9 81c100010000    add     ecx,100h
07a191af 81c200010000    add     edx,100h
07a191b5 83c010          add     eax,10h
07a191b8 8d440110        lea     eax,[ecx+eax+10h]
07a191bc 3bc8            cmp     ecx,eax
07a191be 7336            jae     07a191f6
07a191c0 0fb619          movzx   ebx,byte ptr [ecx]
07a191c3 69f346260000    imul    esi,ebx,2646h
07a191c9 0fb65901        movzx   ebx,byte ptr [ecx+1]
07a191cd 69fb234b0000    imul    edi,ebx,4B23h
07a191d3 03f7            add     esi,edi
07a191d5 0fb65902        movzx   ebx,byte ptr [ecx+2]
07a191d9 69fb980e0000    imul    edi,ebx,0E98h
07a191df 03f7            add     esi,edi
07a191e1 81c600400000    add     esi,4000h
07a191e7 8bde            mov     ebx,esi
07a191e9 c1fb0f          sar     ebx,0Fh
07a191ec 881a            mov     byte ptr [edx],bl
07a191ee 83c103          add     ecx,3
07a191f1 42              inc     edx
07a191f2 3bc8            cmp     ecx,eax
07a191f4 72ca            jb      07a191c0
07a191f6 5b              pop     ebx
sgf commented 6 years ago

json-benchmark (Java & .NET): Java's DSL-JSON is so fast...

The benchmark includes nearly all of the fastest .NET JSON libraries, like Jil and NetJSON, and hardly any .NET JSON library scores higher.

When will the CLR be faster than the JVM?

shahid-pk commented 6 years ago

@sgf that benchmark is not using .NET Core, though. It might still be slower, but just pointing that out.

saucecontrol commented 5 years ago

I saw there were some recent LSRA changes and decided to revisit this to see if there was any improvement. I managed to distill my original repro down to a much more reasonable size, with only half a round of hash mixing and with just enough math to touch all the variables. Current code is here and the current numbers look like this:


BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
  [Host]        : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.10 (CoreCLR 4.6.26906.01, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp3.0 : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
| Method | Job | Platform | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: |
| SimplifiedHash | netcoreapp1.1 | X64 | 1.054 ms | 0.0135 ms | 0.0126 ms | 1.00 | 0.00 |
| SimplifiedHash | netcoreapp2.0 | X64 | 1.070 ms | 0.0157 ms | 0.0147 ms | 1.02 | 0.02 |
| SimplifiedHash | netcoreapp2.1 | X64 | 1.070 ms | 0.0170 ms | 0.0159 ms | 1.02 | 0.02 |
| SimplifiedHash | netcoreapp3.0 | X64 | 1.068 ms | 0.0137 ms | 0.0128 ms | 1.01 | 0.02 |
| SimplifiedHash | netcoreapp1.1 | X86 | 3.005 ms | 0.0224 ms | 0.0199 ms | 1.00 | 0.00 |
| SimplifiedHash | netcoreapp2.0 | X86 | 5.881 ms | 0.0203 ms | 0.0170 ms | 1.96 | 0.01 |
| SimplifiedHash | netcoreapp2.1 | X86 | 6.705 ms | 0.0364 ms | 0.0322 ms | 2.23 | 0.02 |
| SimplifiedHash | netcoreapp3.0 | X86 | 6.739 ms | 0.0519 ms | 0.0485 ms | 2.24 | 0.02 |

The x64 regression between 1.1 and 2.0 is barely visible, but there is a code size increase to back it up. The x86 regression is quite clear both in the numbers and in the asm. I went ahead and captured JitDisasm and JitDump output for the simplified mixing method on every runtime version (3.0 is from current master) to make this easier to review. Dumps are here: jitdumps.zip

AndyAyersMS commented 5 years ago

I get similar numbers:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview4-010539
  [Host]        : .NET Core 3.0.0-preview4-27427-5 (CoreCLR 4.6.27427.71, CoreFX 4.7.19.12613), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.2 (CoreCLR 4.6.25211.01, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
  netcoreapp3.0 : .NET Core 3.0.0-preview4-27427-5 (CoreCLR 4.6.27427.71, CoreFX 4.7.19.12613), 64bit RyuJIT
| Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: |
| SimplifiedHash | netcoreapp1.1 | X64 | True | 1.259 ms | 0.0142 ms | 0.0111 ms | 1.00 | 0.00 |
| SimplifiedHash | netcoreapp2.0 | X64 | Default | 1.342 ms | 0.0126 ms | 0.0099 ms | 1.07 | 0.01 |
| SimplifiedHash | netcoreapp2.1 | X64 | Default | 1.316 ms | 0.0229 ms | 0.0214 ms | 1.05 | 0.02 |
| SimplifiedHash | netcoreapp3.0 | X64 | Default | 1.351 ms | 0.0141 ms | 0.0125 ms | 1.07 | 0.01 |
| SimplifiedHash | netcoreapp1.1 | X86 | True | 4.063 ms | 0.0655 ms | 0.0613 ms | 1.00 | 0.00 |
| SimplifiedHash | netcoreapp2.0 | X86 | Default | 8.563 ms | 0.0485 ms | 0.0379 ms | 2.11 | 0.03 |
| SimplifiedHash | netcoreapp2.1 | X86 | Default | 8.597 ms | 0.0266 ms | 0.0236 ms | 2.12 | 0.03 |
| SimplifiedHash | netcoreapp3.0 | X86 | Default | 8.534 ms | 0.1033 ms | 0.0915 ms | 2.10 | 0.04 |

For the latest RyuJit on x86, we've gotten ourselves into a situation where the only available register for long stretches of code is esi. Not surprisingly, this leads to a lot of shuffling...

       lea      esi, [ecx+176]
       mov      dword ptr [ebp-D8H], esi
       mov      esi, dword ptr [esi]
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-D8H]

G_M44156_IG04:
       mov      esi, dword ptr [esi+4]
       mov      dword ptr [ebp-F0H], esi
       mov      esi, dword ptr [ebp-ECH]
       mov      dword ptr [ebp-78H], esi
       mov      esi, dword ptr [ebp-F0H]
       mov      dword ptr [ebp-7CH], esi
       lea      esi, [ecx+184]
       mov      dword ptr [ebp-DCH], esi
       mov      esi, dword ptr [esi]
       mov      dword ptr [ebp-F0H], esi
       mov      esi, dword ptr [ebp-DCH]
       mov      esi, dword ptr [esi+4]
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-F0H]
       mov      dword ptr [ebp-80H], esi
       mov      esi, dword ptr [ebp-ECH]
       mov      dword ptr [ebp-84H], esi
       lea      esi, [ecx+192]
       mov      dword ptr [ebp-E0H], esi
       mov      esi, dword ptr [esi]
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-E0H]
       mov      esi, dword ptr [esi+4]
       mov      dword ptr [ebp-F0H], esi
       mov      esi, dword ptr [ebp-ECH]
       xor      esi, 0xD1FFAB1E
       mov      dword ptr [ebp-ECH], esi
       mov      esi, dword ptr [ebp-F0H]
       xor      esi, 0xD1FFAB1E

I think part of the issue here is that when we decompose a long GT_IND, we save the address to a temp, even if that address is something we could trivially recompute. So we see

       lea      esi, [ecx+176]                  // address of low part of some long 
       mov      dword ptr [ebp-D8H], esi        // spill the address
       mov      esi, dword ptr [esi]            // load low part
       mov      dword ptr [ebp-ECH], esi        // spill the low part
       mov      esi, dword ptr [ebp-D8H]        // reload the address

G_M44156_IG04:
       mov      esi, dword ptr [esi+4]          // load the high part
       mov      dword ptr [ebp-F0H], esi        // spill the high part

when we could just as easily have the following, and save a bit of register pressure:

       mov      esi, dword ptr [ecx + 176]      // load low part
       mov      dword ptr [ebp-ECH], esi        // spill the low part

G_M44156_IG04:
       mov      esi, dword ptr [ecx + 180]      // load the high part directly
       mov      dword ptr [ebp-F0H], esi        // spill the high part

The other thing I saw was that when a long load is dead but might cause an exception, we don't simplify it to something like a null check the way jit32 does. Again, doing so would save a register.
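
For illustration (a hypothetical case, not code from the repro):

unsafe static void SideEffectOnly(ulong* p)
{
    // The loaded value is dead, but the dereference must still fault on a bad
    // pointer. jit32 can shrink this to a cheap probe of [p]; RyuJit keeps the
    // full decomposed 64-bit load, tying up a register pair for nothing.
    ulong unused = *p;
}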

Let me look into long decomposition and see how hard it would be to teach it not to create a temp for simple address forms.

AndyAyersMS commented 5 years ago

I have a prototype here: OptimizeDecomposeInd... it helps somewhat, but we're still nowhere close to jit32.

PMI Diffs for D:\repos\Blake2Fast\tests\JitRegression\bin\Release\netcoreapp3.0\publish\JitRegression.dll for x86 default jit
Summary:
(Lower is better)
Total bytes of diff: -181 (-2.09% of base)
    diff is an improvement.
Top file improvements by size (bytes):
        -181 : JitRegression.dasm (-2.09% of base)
1 total files with size differences (1 improved, 0 regressed), 0 unchanged.
Top method improvements by size (bytes):
        -168 (-5.65% of base) : JitRegression.dasm - Blake2bContext:mixSimplified(int,int)
          -6 (-1.19% of base) : JitRegression.dasm - Blake2bContext:Update(ref):this
          -3 (-4.11% of base) : JitRegression.dasm - Blake2bContext:addLength(int):this
          -2 (-0.45% of base) : JitRegression.dasm - Blake2bContext:Init(int,ref):this
          -2 (-0.70% of base) : JitRegression.dasm - Blake2bContext:Finish():ref:this
Top method improvements by size (percentage):
        -168 (-5.65% of base) : JitRegression.dasm - Blake2bContext:mixSimplified(int,int)
          -3 (-4.11% of base) : JitRegression.dasm - Blake2bContext:addLength(int):this
          -6 (-1.19% of base) : JitRegression.dasm - Blake2bContext:Update(ref):this
          -2 (-0.70% of base) : JitRegression.dasm - Blake2bContext:Finish():ref:this
          -2 (-0.45% of base) : JitRegression.dasm - Blake2bContext:Init(int,ref):this
5 total methods with size differences (5 improved, 0 regressed), 36 unchanged.

Can do the same for stores, but since those are all at the end, I don't expect it to have a large impact in this case.

There's also a reasonably broad impact across the framework assemblies...

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x86 default jit
Summary:
(Lower is better)
Total bytes of diff: -13859 (-0.05% of base)
    diff is an improvement.
Top file regressions by size (bytes):
          20 : System.Net.WebSockets.dasm (0.06% of base)
Top file improvements by size (bytes):
       -1621 : System.Private.CoreLib.dasm (-0.04% of base)
       -1526 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.07% of base)
       -1146 : CommandLine.dasm (-0.41% of base)
        -826 : Microsoft.CodeAnalysis.dasm (-0.07% of base)
        -791 : System.Memory.dasm (-0.38% of base)
83 total files with size differences (82 improved, 1 regressed), 46 unchanged.
Top method regressions by size (bytes):
         381 (11.88% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecDiv(byref,byref)
         137 ( 6.24% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Analysis.TraceGarbageCollector:Calculate():this
          89 ( 4.55% of base) : System.Private.CoreLib.dasm - Grisu3:TryDigitGenShortest(byref,byref,byref,struct,byref,byref):bool
          64 ( 5.74% of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.Compilation:ConstructModuleSerializationProperties(ref,ref,struct):ref:this
          46 (12.23% of base) : System.Linq.Parallel.dasm - LongMinMaxAggregationOperatorEnumerator`1[Double][System.Double]:MoveNextCore(byref):bool:this
Top method improvements by size (bytes):
        -183 (-4.11% of base) : System.Private.Uri.dasm - System.Uri:CheckAuthorityHelper(int,ushort,ushort,byref,byref,ref,byref):ushort:this
        -137 (-2.20% of base) : System.Reflection.Metadata.dasm - System.Reflection.PortableExecutable.PEHeader:.ctor(byref):this
         -65 (-7.58% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int64,Int64][System.Int64,System.Int64]:ToString():ref:this
         -47 (-1.88% of base) : System.Private.Xml.dasm - System.Xml.Serialization.XmlSerializationWriterCodeGen:WriteEnumMethod(ref):this
         -43 (-17.77% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int64,Int64][System.Int64,System.Int64]:Equals(ref):bool:this
Top method regressions by size (percentage):
          13 (24.07% of base) : System.Linq.Expressions.dasm - <>c__3`2[Int32,Int64][System.Int32,System.Int64]:<RunOnEmptyStack>b__3_0(ref):ref:this
          46 (12.23% of base) : System.Linq.Parallel.dasm - LongMinMaxAggregationOperatorEnumerator`1[Double][System.Double]:MoveNextCore(byref):bool:this
         381 (11.88% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecDiv(byref,byref)
          27 ( 9.68% of base) : System.Linq.Parallel.dasm - System.Linq.Parallel.RepeatEnumerable`1[Int64][System.Int64]:GetPartitions(int):ref:this
           6 ( 7.50% of base) : System.Collections.Immutable.dasm - System.Linq.ImmutableArrayExtensions:First(struct):long
Top method improvements by size (percentage):
         -11 (-37.93% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Etlx.TraceModuleFile:get_ImageEnd():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[__Canon,Int64][System.__Canon,System.Int64]:get_index():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int32,Int64][System.Int32,System.Int64]:get_index():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Double,Int64][System.Double,System.Int64]:get_index():long:this
         -15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:get_index():long:this
4642 total methods with size differences (4418 improved, 224 regressed), 221321 unchanged.
AndyAyersMS commented 5 years ago

Will need to spend more time looking into this to figure out what else must happen to get x86 up to par (and perhaps arm32 too).

Moving this to future as I don't see any quick resolution that would fit into 3.0.

saucecontrol commented 4 years ago

Was just giving this another look with 3.1 RTM, and whatever was going wrong with LSRA on x86 is now impacting x64 as well. The code size really exploded in 3.1.

| Runtime | x86 Code Size | x64 Code Size |
| --- | ---: | ---: |
| netcoreapp1.1 | 1534 | 577 |
| netcoreapp2.0 | 2906 | 602 |
| netcoreapp2.1 | 2956 | 604 |
| netcoreapp3.1 | 3926 | 2505 |

And the benchmarks are showing the expected perf drop.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18362
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
  [Host]        : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
  netcoreapp1.1 : .NET Core 1.1.13 (CoreCLR 4.6.27618.02, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.14 (CoreCLR 4.6.28207.04, CoreFX 4.6.28208.01), 64bit RyuJIT
  netcoreapp3.1 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
| Runtime | Platform | Mean | Error | StdDev | Scaled |
| --- | --- | ---: | ---: | ---: | ---: |
| netcoreapp1.1 | X64 | 1.002 ms | 0.0090 ms | 0.0084 ms | 1.00 |
| netcoreapp2.0 | X64 | 1.039 ms | 0.0182 ms | 0.0152 ms | 1.04 |
| netcoreapp2.1 | X64 | 1.039 ms | 0.0200 ms | 0.0178 ms | 1.04 |
| netcoreapp3.1 | X64 | 1.154 ms | 0.0222 ms | 0.0256 ms | 1.15 |
| netcoreapp1.1 | X86 | 2.944 ms | 0.0154 ms | 0.0144 ms | 1.00 |
| netcoreapp2.0 | X86 | 5.780 ms | 0.0375 ms | 0.0313 ms | 1.96 |
| netcoreapp2.1 | X86 | 6.545 ms | 0.0329 ms | 0.0292 ms | 2.22 |
| netcoreapp3.1 | X86 | 6.768 ms | 0.0422 ms | 0.0374 ms | 2.30 |

JitDisasm and JitDump output for all versions are here: simplifiedhash_jitdumps.zip

AndyAyersMS commented 4 years ago

cc @CarolEidt -- some interesting stress cases for LSRA

mikedn commented 4 years ago

The x64 disassembly for 3.1 shows that tier 0 was used to generate code...

; Tier-0 compilation
; compiler->opts.MinOpts() is true
saucecontrol commented 4 years ago

D'oh. Been a while since that got me.

The updated code sizes still show another increase, but nothing nearly so extreme. The benchmark numbers were correct.

| Runtime | x86 Code Size | x64 Code Size |
| --- | ---: | ---: |
| netcoreapp1.1 | 1534 | 577 |
| netcoreapp2.0 | 2906 | 602 |
| netcoreapp2.1 | 2956 | 604 |
| netcoreapp3.1 | 3057 | 632 |

And here are the updated dumps: simplifiedhash_jitdumps.zip

mikedn commented 4 years ago

Thanks. A quick look over the disassembly diff indicates this is more likely an issue with null-check elimination; I see lots of extra null checks in various places, like here:

-       mov      r14, qword ptr [rcx+144]
-       mov      r15, qword ptr [rcx+152]
-       mov      r12, qword ptr [rcx+160]
-       mov      r13, qword ptr [rcx+168]
-       mov      rbx, qword ptr [rcx+176]
-       mov      rdi, qword ptr [rcx+184]
-       mov      rsi, 0xD1FFAB1E
-       xor      rsi, qword ptr [rcx+192]
+       mov      rbp, rdx
+       cmp      dword ptr [rcx], ecx
+       mov      r14, qword ptr [rcx+136]
+       cmp      dword ptr [rcx], ecx
+       mov      r15, qword ptr [rcx+144]
+       cmp      dword ptr [rcx], ecx
+       mov      r12, qword ptr [rcx+152]
+       cmp      dword ptr [rcx], ecx
+       mov      r13, qword ptr [rcx+160]
+       cmp      dword ptr [rcx], ecx
+       mov      rbx, qword ptr [rcx+168]
+       cmp      dword ptr [rcx], ecx
+       mov      rdi, qword ptr [rcx+176]
+       cmp      dword ptr [rcx], ecx
+       mov      rsi, qword ptr [rcx+184]
saucecontrol commented 4 years ago

Thanks for looking into it. You're right, of course.

I actually ran across that issue a few months back and updated the main project (from which the sample is distilled) to replace the struct pointer argument with an interior pointer to the hash state. Running the benchmarks against the main project, there doesn't appear to be a new regression in 3.1... but the drop-off from legacy JIT is still severe.
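
Roughly, that change looks like this (a sketch with illustrative names, not the actual Blake2Fast code):

// Before: a pointer to the whole context struct; the fixed-buffer field
// accesses through it are what picked up the extra null checks.
unsafe static void mixCore(Blake2bContext* s) { /* s->h[0], s->h[1], ... */ }

// After: an interior pointer (managed ref) straight to the hash state words.
static void mixCore(ref ulong h) { /* h, Unsafe.Add(ref h, 1), ... */ }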

CarolEidt commented 4 years ago

The x86 issue is the same as dotnet/runtime#8846, and is something I plan to be working on soon (see dotnet/runtime#9399). The former is a loop, and may well need additional work beyond what's proposed in dotnet/runtime#9399, but it suffers from the same fundamental issue: the register allocator will use a bad register that's free (even one that will immediately have to be spilled) rather than spill a register that's occupied.

mikedn commented 4 years ago

I've looked a bit more into those null checks and the reason they're not removed is pretty hilarious:

    [ 0] 161 (0x0a1) ldarg.0
    [ 1] 162 (0x0a2) ldflda 0400000A
    [ 1] 167 (0x0a7) ldflda 04000014
    [ 1] 172 (0x0ac) ldc.i4.2 2              ; constant 
    [ 2] 173 (0x0ad) conv.i                   ; constant 
    [ 2] 174 (0x0ae) ldc.i4.8 8              ; constant 
    [ 3] 175 (0x0af) mul                       ; constant 
    [ 2] 176 (0x0b0) add
    [ 1] 177 (0x0b1) ldind.i8
    [ 1] 178 (0x0b2) stloc.s 10
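
For context, that IL is what a fixed-buffer read compiles to -- roughly this source pattern (a sketch; the field names follow the dump, the buffer length is illustrative):

unsafe struct Blake2bContext
{
    public fixed ulong h[8];   // 'h' and 'FixedElementField' in the trees below
}

// s->h[2] emits: ldflda h; ldflda FixedElementField; 2 (conv.i) * 8; add; ldind.i8
unsafe static ulong ReadWord(Blake2bContext* s) => s->h[2];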

So the C# compiler somehow manages to emit a constant tree and the JIT gets rid of it too late:

fgMorphTree BB01, STMT00018 (before)
               [000183] -A-XG-------              *  ASG       long  
               [000182] D------N----              +--*  LCL_VAR   long   V12 loc10        
               [000181] *--XG-------              \--*  IND       long  
               [000180] ---XG-------                 \--*  ADD       long  
               [000174] ---XG-------                    +--*  ADDR      long  
               [000173] ---XG-------                    |  \--*  FIELD     long   FixedElementField
               [000172] ---XG-------                    |     \--*  ADDR      long  
               [000171] ---XG-------                    |        \--*  FIELD     struct h
               [000170] ------------                    |           \--*  LCL_VAR   long   V00 arg0         
               [000179] ------------                    \--*  MUL       long  
               [000176] ------------                       +--*  CAST      long <- int
               [000175] ------------                       |  \--*  CNS_INT   int    2
               [000178] ------------                       \--*  CAST      long <- int
               [000177] ------------                          \--*  CNS_INT   int    8

Before calling fgAddFieldSeqForZeroOffset:
               [000173] ---XG--N----              *  IND       long  
               [000172] ---XG-------              \--*  ADDR      long  
               [000171] ---XG-------                 \--*  FIELD     struct h
               [000170] ------------                    \--*  LCL_VAR   long   V00 arg0         

fgAddFieldSeqForZeroOffset for Fseq[FixedElementField]
addr (Before)
               [000172] ---XG-------                ADDR      long  
     (After)
               [000172] ---XG-------                ADDR      long   Zero Fseq[FixedElementField]
Before explicit null check morphing:
               [000171] ---XG--N----              *  FIELD     struct h
               [000170] ------------              \--*  LCL_VAR   long   V00 arg0         
After adding explicit null check:
               [000171] ---XG--N----              *  IND       struct
               [000694] ---X--------              \--*  COMMA     long  
               [000690] ---X---N----                 +--*  NULLCHECK byte  
               [000689] ------------                 |  \--*  LCL_VAR   long   V00 arg0         

The reason for the null check is mac->m_allConstantOffsets being false due to that constant tree that's not yet a CNS_INT.

I haven't checked how this worked before 3.0, nor whether there's a quick fix. It looks to me like the main reason for the MAC mess is the existence of ADDR nodes, so... back to the "get rid of ADDR" work.

The generated code also contains a bunch of extra LEAs that I suspect my trivial forward-substitution experiment may eliminate; I'll check that another time.

AndyAyersMS commented 4 years ago

@CarolEidt you might want to look at this one again, as at the heart of it all there are likely LSRA challenges (made worse by clunky long decomposition).

Hash computations tend to be pretty stressful for the allocator.