Performance regression of XXHash128 for large buffers with .NET 8

xoofx commented 1 year ago

Hey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now:

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2
  Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

|          Method |  Runtime |          data |     Mean |    Error |   StdDev | Ratio |
|---------------- |--------- |-------------- |---------:|---------:|---------:|------:|
| XXHash128Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us |  1.00 |
|       XXHash128 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us |  1.16 |
|       XXHash128 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us |  1.66 |

I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

Hey, just wanted to double check the performance of .NET 8 with XXHash128 from PR #77944, and running with a large 1MB buffer seems to be significantly slower than the .NET 7 version now: ``` BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2) AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores .NET SDK=8.0.100-preview.4.23260.5 [Host] : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT AVX2 Job-YPKMNH : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2 Job-XOTNUZ : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2 | Method | Runtime | data | Mean | Error | StdDev | Ratio | |----------- |--------- |-------------- |---------:|---------:|---------:|------:| | XXH3Native | .NET 7.0 | Byte[1048576] | 27.33 us | 0.016 us | 0.015 us | 1.00 | | XXH3 | .NET 7.0 | Byte[1048576] | 31.67 us | 0.100 us | 0.093 us | 1.16 | | XXH3 | .NET 8.0 | Byte[1048576] | 45.37 us | 0.165 us | 0.154 us | 1.66 | ``` I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.

Author:	xoofx
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

stephentoub commented 1 year ago

with XXHash128

XXH3

The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?

cc: @EgorBo

xoofx commented 1 year ago

The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?

Sorry, udpated, the naming was from an old benchmark, but implementation is using XXHash128

EgorBo commented 1 year ago

@xoofx could you please share your benchmark (which version of System.IO.Hashing are you using btw?)? I tried locally with this:

public static IEnumerable<byte[]> TestData()
{
    yield return new byte[1024*1024];
}

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public byte[] Test(byte[] data) => System.IO.Hashing.XxHash128.Hash(data);

and .net 8.0 is notably faster:

AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]   : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  ShortRun : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

| Method |          data |     Mean |    Error |   StdDev |
|------- |-------------- |---------:|---------:|---------:|
|   Test | Byte[1048576] | 14.30 us | 0.150 us | 0.008 us |

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
  [Host]   : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
  ShortRun : .NET 7.0.3 (7.0.323.6910), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

| Method |          data |     Mean |    Error |   StdDev |
|------- |-------------- |---------:|---------:|---------:|
|   Test | Byte[1048576] | 21.53 us | 1.083 us | 0.059 us |

xoofx commented 1 year ago

This is relatively similar but I'm using XxHash128.HashToUInt128:

    [Benchmark()]
    [ArgumentsSource(nameof(Data))]
    public UInt128 XXH3(byte[] data)
    {
        return XxHash128.HashToUInt128(data);
    }

    public IEnumerable<byte[]> Data()
    {
        yield return Enumerable.Range(0, 1 << 20).Select(x => (byte)x).ToArray();
    }

Looking at the generated assembly - with Disasmo - between .NET 7 and .NET 8, this code XxHashShared.ScrambleAccumulator256 is generating quite something weird for .NET 8.

The .NET 7 version is:

; Assembly listing for method System.IO.Hashing.c(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 2 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       4883EC78             sub      rsp, 120
       C5F877               vzeroupper 

G_M000_IG02:                ;; offset=0007H
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C5F573D02F           vpsrlq   ymm1, ymm0, 47
       C5FDEFC1             vpxor    ymm0, ymm0, ymm1
       C4C17DEF00           vpxor    ymm0, ymm0, ymmword ptr[r8]
       C5FD11442420         vmovupd  ymmword ptr[rsp+20H], ymm0
       C5FD100579000000     vmovupd  ymm0, ymmword ptr[reloc @RWD00]
       C5FD110424           vmovupd  ymmword ptr[rsp], ymm0
       488B442420           mov      rax, qword ptr [rsp+20H]
       480FAF0424           imul     rax, qword ptr [rsp]
       4889442440           mov      qword ptr [rsp+40H], rax
       488B442428           mov      rax, qword ptr [rsp+28H]
       480FAF442408         imul     rax, qword ptr [rsp+08H]
       4889442448           mov      qword ptr [rsp+48H], rax
       488B442430           mov      rax, qword ptr [rsp+30H]
       480FAF442410         imul     rax, qword ptr [rsp+10H]
       4889442450           mov      qword ptr [rsp+50H], rax
       488B442438           mov      rax, qword ptr [rsp+38H]
       480FAF442418         imul     rax, qword ptr [rsp+18H]
       4889442458           mov      qword ptr [rsp+58H], rax
       C5FD10442440         vmovupd  ymm0, ymmword ptr[rsp+40H]
       C5FD1102             vmovupd  ymmword ptr[rdx], ymm0
       C5FD1002             vmovupd  ymm0, ymmword ptr[rdx]
       C5FD1101             vmovupd  ymmword ptr[rcx], ymm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0080H
       C5F877               vzeroupper 
       4883C478             add      rsp, 120
       C3                   ret      

RWD00   dq  000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h

; Total bytes of code 136

while the .NET 8 version is generating:

; Assembly listing for method System.IO.Hashing.XxHashShared:ScrambleAccumulator256(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 19 single block inlinees; 13 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       sub      rsp, 312
       vzeroupper 

G_M000_IG02:                ;; offset=000AH
       vmovups  ymm0, ymmword ptr [rdx]
       vpsrlq   ymm1, ymm0, 47
       vpxor    ymm0, ymm0, ymm1
       vpxor    ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rsp+100H], ymm0
       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rsp+E0H], ymm0
       vmovups  xmm1, xmmword ptr [rsp+100H]
       vmovaps  xmmword ptr [rsp+B0H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+E0H]
       vmovaps  xmmword ptr [rsp+A0H], xmm1
       mov      rax, qword ptr [rsp+B0H]
       mov      qword ptr [rsp+90H], rax
       mov      rax, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+88H], rax
       mov      rax, qword ptr [rsp+90H]
       imul     rax, qword ptr [rsp+88H]
       mov      qword ptr [rsp+98H], rax
       mov      rax, qword ptr [rsp+98H]
       mov      r8, qword ptr [rsp+B8H]
       mov      qword ptr [rsp+78H], r8
       mov      r8, qword ptr [rsp+A8H]
       mov      qword ptr [rsp+70H], r8
       mov      r8, qword ptr [rsp+78H]
       imul     r8, qword ptr [rsp+70H]
       mov      qword ptr [rsp+80H], r8
       mov      r8, qword ptr [rsp+80H]
       mov      qword ptr [rsp+60H], rax
       mov      qword ptr [rsp+68H], r8
       vmovups  ymmword ptr [rsp+C0H], ymm0
       vmovaps  xmm0, xmmword ptr [rsp+60H]
       vmovups  xmm1, xmmword ptr [rsp+110H]
       vmovaps  xmmword ptr [rsp+50H], xmm1
       vmovups  xmm1, xmmword ptr [rsp+D0H]
       vmovaps  xmmword ptr [rsp+40H], xmm1
       mov      rax, qword ptr [rsp+50H]
       mov      qword ptr [rsp+30H], rax
       mov      rax, qword ptr [rsp+40H]
       mov      qword ptr [rsp+28H], rax
       mov      rax, qword ptr [rsp+30H]
       imul     rax, qword ptr [rsp+28H]
       mov      qword ptr [rsp+38H], rax
       mov      rax, qword ptr [rsp+38H]
       mov      r8, qword ptr [rsp+58H]
       mov      qword ptr [rsp+18H], r8
       mov      r8, qword ptr [rsp+48H]
       mov      qword ptr [rsp+10H], r8
       mov      r8, qword ptr [rsp+18H]
       imul     r8, qword ptr [rsp+10H]
       mov      qword ptr [rsp+20H], r8
       mov      r8, qword ptr [rsp+20H]
       mov      qword ptr [rsp], rax
       mov      qword ptr [rsp+08H], r8
       vinserti128 ymm0, ymm0, xmmword ptr [rsp], 1
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx

G_M000_IG03:                ;; offset=0178H
       vzeroupper 
       add      rsp, 312
       ret      

RWD00   dq  000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h

; Total bytes of code 387

The .NET 8 version is going a bit crazy, loading/storing multiple times the same register to the stack, taking an intermediate route with some xmm registers... quite weird...

This code is then called from the loop in XxHashShared.Accumulate and the impact there is not great.

xoofx commented 1 year ago

But even in .NET 7 the Vector256<ulong> * Vector256<ulong> is going back to scalar which is quite unfortunate. Would require a specialized code path with 32bits mult to keep it vectorized. I can make a separate PR maybe?

EgorBo commented 1 year ago

I can make a separate PR maybe?

Sounds great 🙂

Vector256<ulong> Test(Vector256<ulong> a, Vector256<ulong> b) => a * b;

emits on my machine:

; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
G_M3664_IG01:
       vzeroupper 
                        ;; size=3 bbWeight=1 PerfScore 1.00

G_M3664_IG02:
       vmovups  ymm0, ymmword ptr [rdx]
       vpmullq  ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx
                        ;; size=17 bbWeight=1 PerfScore 24.25

G_M3664_IG03:
       vzeroupper 
       ret      
                        ;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 24

but it needs AVX-512 to present, if it does not, I get:

; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]:this
G_M62636_IG01:
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 248
       vzeroupper 
       vmovaps  xmmword ptr [rsp+E0H], xmm6
       mov      rbx, rdx
       mov      rsi, r8
       mov      rdi, r9
                        ;; size=32 bbWeight=4 PerfScore 32.00

G_M62636_IG02:
       vmovups  xmm0, xmmword ptr [rsi]
       vmovaps  xmmword ptr [rsp+D0H], xmm0
       vmovups  xmm0, xmmword ptr [rdi]
       vmovaps  xmmword ptr [rsp+C0H], xmm0
       mov      rdx, qword ptr [rsp+D0H]
       mov      qword ptr [rsp+B0H], rdx
       mov      rdx, qword ptr [rsp+C0H]
       mov      qword ptr [rsp+A8H], rdx
       mov      rdx, qword ptr [rsp+B0H]
       imul     rdx, qword ptr [rsp+A8H]
       mov      qword ptr [rsp+B8H], rdx
       mov      rbp, qword ptr [rsp+B8H]
       mov      rdx, qword ptr [rsp+D8H]
       mov      qword ptr [rsp+98H], rdx
       mov      rdx, qword ptr [rsp+C8H]
       mov      qword ptr [rsp+90H], rdx
       mov      rcx, qword ptr [rsp+98H]
       mov      rdx, qword ptr [rsp+90H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+A0H], rax
       mov      rdx, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+80H], rbp
       mov      qword ptr [rsp+88H], rdx
       vmovaps  xmm6, xmmword ptr [rsp+80H]
       vmovups  xmm0, xmmword ptr [rsi+10H]
       vmovaps  xmmword ptr [rsp+70H], xmm0
       vmovups  xmm0, xmmword ptr [rdi+10H]
       vmovaps  xmmword ptr [rsp+60H], xmm0
       mov      rdx, qword ptr [rsp+70H]
       mov      qword ptr [rsp+50H], rdx
       mov      rdx, qword ptr [rsp+60H]
       mov      qword ptr [rsp+48H], rdx
       mov      rcx, qword ptr [rsp+50H]
       mov      rdx, qword ptr [rsp+48H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+58H], rax
       mov      rsi, qword ptr [rsp+58H]
       mov      rdx, qword ptr [rsp+78H]
       mov      qword ptr [rsp+38H], rdx
       mov      rdx, qword ptr [rsp+68H]
       mov      qword ptr [rsp+30H], rdx
       mov      rcx, qword ptr [rsp+38H]
       mov      rdx, qword ptr [rsp+30H]
       call     [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
       mov      qword ptr [rsp+40H], rax
       mov      rax, qword ptr [rsp+40H]
       mov      qword ptr [rsp+20H], rsi
       mov      qword ptr [rsp+28H], rax
       vinserti128 ymm0, ymm6, xmmword ptr [rsp+20H], 1
       vmovups  ymmword ptr [rbx], ymm0
       mov      rax, rbx
                        ;; size=325 bbWeight=4 PerfScore 309.00

G_M62636_IG03:
       vmovaps  xmm6, xmmword ptr [rsp+E0H]
       vzeroupper 
       add      rsp, 248
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
                        ;; size=24 bbWeight=4 PerfScore 33.00
; Total bytes of code: 381

xoofx commented 1 year ago

but presumably it's AVX-512?

Yep, it is. So it is more likely that you should get better performance on your machine

EgorBo commented 1 year ago

Btw, if you know how to optimize the multiplication of two Vector256<ulong> using pre-avx512 you might want to do it inside the operator * itself? I mean here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256_1.cs#L268

xoofx commented 1 year ago

Btw, if you know how to optimize the multiplication of two Vector256 using pre-avx512 you might want to do it inside the operator * itself? I mean here:

Yep, exactly, that's what I would have tried, thanks for confirming. Let me check If I can come up with something there.

xoofx commented 1 year ago

Cool, by generating a compatible AVX2 64bit * 64bit vectorized multiplication, I'm now getting 10% faster than the C++ version, that was it! 😎 I should have checked more thoroughly the generated ASM in the first place. 😅

Will try to prepare a PR later to add AVX2 + SSE2 and double check if can create also an ARM64 version.

xoofx commented 1 year ago

@EgorBo I don't see any code that is using e.g AVX2, SSE2 in this code path and the operator is marked as an intrinsic. I have 2 questions:

Should I implement a specialized version of operator Vector256<ulong> * or should I if switch on the typeof(ulong) in the generic version?
Should I inline AVX2 implementation in this intrinsics operators? (safe guarded by Avx2.IsSupported)

Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CorLib.csproj...

EgorBo commented 1 year ago

@EgorBo I don't see any code that is using e.g AVX2, SSE2

Right, because it's implemented in JIT, namely, here: https://github.com/dotnet/runtime/blob/da1da02bbd2cb54490b7fc22f43ec32f5f302615/src/coreclr/jit/hwintrinsicxarch.cpp#L2056-L2103 (as you can see it even has a TODO for V256 multiplication when Avx512dq_vl is not presented.

So you either implement it in JIT right there or in C#. When JIT doesn't handle the intrinsic it falls back to C# implementation so if you implement a path

if (typeof(T) == typeof(long) && Avx2.IsSupported)
{
    ...
}

it should be taken. Whatever path you want to take is up to you, I, personally, prefer to do it in C# if it's possible - it's simpler and is ILLink friendly (also, might help mono as well).

Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CorLib.csproj...

I personally rarely do so, but to open the sln I do:

.\build.cmd Clr -c Debug -vs .\src\coreclr\System.Private.CoreLib\System.Private.CoreLib.csproj

EgorBo commented 1 year ago

Btw, https://github.com/dotnet/runtime/pull/86811 touches the same path in JIT for byte as far as I can see (but I still think it'd better to be done on the C# side)

xoofx commented 1 year ago

Fix proposal via PR #87113

(Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)

AndyAyersMS commented 1 year ago

(Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)

I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?

xoofx commented 1 year ago

I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?

The following code from XxHashShared.ScrambleAccumulator256 in .NET 8 is generating a lot more stack spilling (312 bytes - e.g storing and reloading the exact same value from the stack in a row, as captured below) than the .NET 7 equivalent (120 bytes), mainly because of accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256<ulong> ScrambleAccumulator256(Vector256<ulong> accVec, Vector256<ulong> secret)
{
    Vector256<ulong> xorShift = accVec ^ Vector256.ShiftRightLogical(accVec, 47);
    Vector256<ulong> xorWithKey = xorShift ^ secret;
    accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);
    return accVec;
}

My PR was going to optimize Vector256<ulong> * operator, but that doesn't explain the difference of codegen with .NET 8. e.g

       vmovups  ymmword ptr [rsp+100H], ymm0
       vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovups  ymmword ptr [rsp+E0H], ymm0    ; <----- Double store of reloc @RWD00 above
       vmovups  xmm1, xmmword ptr [rsp+100H]   ; <----- reloading from ymm0 above

or:

       mov      rax, qword ptr [rsp+B0H]
       mov      qword ptr [rsp+90H], rax
       mov      rax, qword ptr [rsp+A0H]
       mov      qword ptr [rsp+88H], rax
       mov      rax, qword ptr [rsp+90H] ; <---- Reloading from rax just above
       imul     rax, qword ptr [rsp+88H] ; <---- Reloading from rax just above

tannergooding commented 1 year ago

This won't land in .NET 9 due to timing and other higher priority fixes being needed.

The suggested code fix ended up causing problems and we likely need to use a more standard algorithm for emulating long multiplication. The #87142 PR prototypes this, but has some failures that still needs looking at.

dotnet / runtime

Performance regression of XXHash128 for large buffers with .NET 8 #87107