Open xoofx opened 1 year ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
| Author: | xoofx |
| --- | --- |
| Assignees: | - |
| Labels: | `tenet-performance`, `area-CodeGen-coreclr`, `untriaged` |
| Milestone: | - |
with XXHash128
XXH3
The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?
cc: @EgorBo
> The method name in the table suggests you're measuring XxHash3 and the title suggests XxHash128. Which is this issue referring to?
Sorry, updated; the naming was from an old benchmark, but the implementation is using XXHash128.
@xoofx could you please share your benchmark (which version of System.IO.Hashing are you using, btw)? I tried locally with this:

```csharp
public static IEnumerable<byte[]> TestData()
{
    yield return new byte[1024 * 1024];
}

[Benchmark]
[ArgumentsSource(nameof(TestData))]
public byte[] Test(byte[] data) => System.IO.Hashing.XxHash128.Hash(data);
```
and .net 8.0 is notably faster:
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
ShortRun : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
Job=ShortRun IterationCount=3 LaunchCount=1
WarmupCount=3
| Method | data | Mean | Error | StdDev |
|------- |-------------- |---------:|---------:|---------:|
| Test | Byte[1048576] | 14.30 us | 0.150 us | 0.008 us |
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22621.1702/22H2/2022Update/SunValley2)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.4.23260.5
[Host] : .NET 7.0.5 (7.0.523.17405), X64 RyuJIT AVX2
ShortRun : .NET 7.0.3 (7.0.323.6910), X64 RyuJIT AVX2
Job=ShortRun IterationCount=3 LaunchCount=1
WarmupCount=3
| Method | data | Mean | Error | StdDev |
|------- |-------------- |---------:|---------:|---------:|
| Test | Byte[1048576] | 21.53 us | 1.083 us | 0.059 us |
This is relatively similar, but I'm using XxHash128.HashToUInt128:

```csharp
[Benchmark]
[ArgumentsSource(nameof(Data))]
public UInt128 XXH3(byte[] data)
{
    return XxHash128.HashToUInt128(data);
}

public IEnumerable<byte[]> Data()
{
    yield return Enumerable.Range(0, 1 << 20).Select(x => (byte)x).ToArray();
}
```
Looking at the generated assembly (with Disasmo) between .NET 7 and .NET 8, the code generated for XxHashShared.ScrambleAccumulator256 in .NET 8 is quite weird.
The .NET 7 version is:
; Assembly listing for method System.IO.Hashing.XxHashShared:ScrambleAccumulator256(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 2 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
4883EC78 sub rsp, 120
C5F877 vzeroupper
G_M000_IG02: ;; offset=0007H
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5F573D02F vpsrlq ymm1, ymm0, 47
C5FDEFC1 vpxor ymm0, ymm0, ymm1
C4C17DEF00 vpxor ymm0, ymm0, ymmword ptr[r8]
C5FD11442420 vmovupd ymmword ptr[rsp+20H], ymm0
C5FD100579000000 vmovupd ymm0, ymmword ptr[reloc @RWD00]
C5FD110424 vmovupd ymmword ptr[rsp], ymm0
488B442420 mov rax, qword ptr [rsp+20H]
480FAF0424 imul rax, qword ptr [rsp]
4889442440 mov qword ptr [rsp+40H], rax
488B442428 mov rax, qword ptr [rsp+28H]
480FAF442408 imul rax, qword ptr [rsp+08H]
4889442448 mov qword ptr [rsp+48H], rax
488B442430 mov rax, qword ptr [rsp+30H]
480FAF442410 imul rax, qword ptr [rsp+10H]
4889442450 mov qword ptr [rsp+50H], rax
488B442438 mov rax, qword ptr [rsp+38H]
480FAF442418 imul rax, qword ptr [rsp+18H]
4889442458 mov qword ptr [rsp+58H], rax
C5FD10442440 vmovupd ymm0, ymmword ptr[rsp+40H]
C5FD1102 vmovupd ymmword ptr[rdx], ymm0
C5FD1002 vmovupd ymm0, ymmword ptr[rdx]
C5FD1101 vmovupd ymmword ptr[rcx], ymm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0080H
C5F877 vzeroupper
4883C478 add rsp, 120
C3 ret
RWD00 dq 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h
; Total bytes of code 136
while the .NET 8 version is generating:
; Assembly listing for method System.IO.Hashing.XxHashShared:ScrambleAccumulator256(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 19 single block inlinees; 13 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
sub rsp, 312
vzeroupper
G_M000_IG02: ;; offset=000AH
vmovups ymm0, ymmword ptr [rdx]
vpsrlq ymm1, ymm0, 47
vpxor ymm0, ymm0, ymm1
vpxor ymm0, ymm0, ymmword ptr [r8]
vmovups ymmword ptr [rsp+100H], ymm0
vmovups ymm0, ymmword ptr [reloc @RWD00]
vmovups ymmword ptr [rsp+E0H], ymm0
vmovups xmm1, xmmword ptr [rsp+100H]
vmovaps xmmword ptr [rsp+B0H], xmm1
vmovups xmm1, xmmword ptr [rsp+E0H]
vmovaps xmmword ptr [rsp+A0H], xmm1
mov rax, qword ptr [rsp+B0H]
mov qword ptr [rsp+90H], rax
mov rax, qword ptr [rsp+A0H]
mov qword ptr [rsp+88H], rax
mov rax, qword ptr [rsp+90H]
imul rax, qword ptr [rsp+88H]
mov qword ptr [rsp+98H], rax
mov rax, qword ptr [rsp+98H]
mov r8, qword ptr [rsp+B8H]
mov qword ptr [rsp+78H], r8
mov r8, qword ptr [rsp+A8H]
mov qword ptr [rsp+70H], r8
mov r8, qword ptr [rsp+78H]
imul r8, qword ptr [rsp+70H]
mov qword ptr [rsp+80H], r8
mov r8, qword ptr [rsp+80H]
mov qword ptr [rsp+60H], rax
mov qword ptr [rsp+68H], r8
vmovups ymmword ptr [rsp+C0H], ymm0
vmovaps xmm0, xmmword ptr [rsp+60H]
vmovups xmm1, xmmword ptr [rsp+110H]
vmovaps xmmword ptr [rsp+50H], xmm1
vmovups xmm1, xmmword ptr [rsp+D0H]
vmovaps xmmword ptr [rsp+40H], xmm1
mov rax, qword ptr [rsp+50H]
mov qword ptr [rsp+30H], rax
mov rax, qword ptr [rsp+40H]
mov qword ptr [rsp+28H], rax
mov rax, qword ptr [rsp+30H]
imul rax, qword ptr [rsp+28H]
mov qword ptr [rsp+38H], rax
mov rax, qword ptr [rsp+38H]
mov r8, qword ptr [rsp+58H]
mov qword ptr [rsp+18H], r8
mov r8, qword ptr [rsp+48H]
mov qword ptr [rsp+10H], r8
mov r8, qword ptr [rsp+18H]
imul r8, qword ptr [rsp+10H]
mov qword ptr [rsp+20H], r8
mov r8, qword ptr [rsp+20H]
mov qword ptr [rsp], rax
mov qword ptr [rsp+08H], r8
vinserti128 ymm0, ymm0, xmmword ptr [rsp], 1
vmovups ymmword ptr [rdx], ymm0
vmovups ymm0, ymmword ptr [rdx]
vmovups ymmword ptr [rcx], ymm0
mov rax, rcx
G_M000_IG03: ;; offset=0178H
vzeroupper
add rsp, 312
ret
RWD00 dq 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h, 000000009E3779B1h
; Total bytes of code 387
The .NET 8 version is going a bit crazy: it stores and reloads the same register to/from the stack multiple times and takes an intermediate route through some xmm registers... quite weird...
This code is then called from the loop in XxHashShared.Accumulate, and the impact there is not great.

But even in .NET 7, the Vector256<ulong> * Vector256<ulong> multiplication falls back to scalar code, which is quite unfortunate. It would require a specialized code path using 32-bit multiplies to keep it vectorized. I can make a separate PR maybe?
> I can make a separate PR maybe?

Sounds great!
```csharp
Vector256<ulong> Test(Vector256<ulong> a, Vector256<ulong> b) => a * b;
```

emits on my machine:
; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]
G_M3664_IG01:
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M3664_IG02:
vmovups ymm0, ymmword ptr [rdx]
vpmullq ymm0, ymm0, ymmword ptr [r8]
vmovups ymmword ptr [rcx], ymm0
mov rax, rcx
;; size=17 bbWeight=1 PerfScore 24.25
G_M3664_IG03:
vzeroupper
ret
;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 24
but it needs AVX-512 to be present; if it is not, I get:
; Method Prog:Test(System.Runtime.Intrinsics.Vector256`1[ulong],System.Runtime.Intrinsics.Vector256`1[ulong]):System.Runtime.Intrinsics.Vector256`1[ulong]:this
G_M62636_IG01:
push rdi
push rsi
push rbp
push rbx
sub rsp, 248
vzeroupper
vmovaps xmmword ptr [rsp+E0H], xmm6
mov rbx, rdx
mov rsi, r8
mov rdi, r9
;; size=32 bbWeight=4 PerfScore 32.00
G_M62636_IG02:
vmovups xmm0, xmmword ptr [rsi]
vmovaps xmmword ptr [rsp+D0H], xmm0
vmovups xmm0, xmmword ptr [rdi]
vmovaps xmmword ptr [rsp+C0H], xmm0
mov rdx, qword ptr [rsp+D0H]
mov qword ptr [rsp+B0H], rdx
mov rdx, qword ptr [rsp+C0H]
mov qword ptr [rsp+A8H], rdx
mov rdx, qword ptr [rsp+B0H]
imul rdx, qword ptr [rsp+A8H]
mov qword ptr [rsp+B8H], rdx
mov rbp, qword ptr [rsp+B8H]
mov rdx, qword ptr [rsp+D8H]
mov qword ptr [rsp+98H], rdx
mov rdx, qword ptr [rsp+C8H]
mov qword ptr [rsp+90H], rdx
mov rcx, qword ptr [rsp+98H]
mov rdx, qword ptr [rsp+90H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+A0H], rax
mov rdx, qword ptr [rsp+A0H]
mov qword ptr [rsp+80H], rbp
mov qword ptr [rsp+88H], rdx
vmovaps xmm6, xmmword ptr [rsp+80H]
vmovups xmm0, xmmword ptr [rsi+10H]
vmovaps xmmword ptr [rsp+70H], xmm0
vmovups xmm0, xmmword ptr [rdi+10H]
vmovaps xmmword ptr [rsp+60H], xmm0
mov rdx, qword ptr [rsp+70H]
mov qword ptr [rsp+50H], rdx
mov rdx, qword ptr [rsp+60H]
mov qword ptr [rsp+48H], rdx
mov rcx, qword ptr [rsp+50H]
mov rdx, qword ptr [rsp+48H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+58H], rax
mov rsi, qword ptr [rsp+58H]
mov rdx, qword ptr [rsp+78H]
mov qword ptr [rsp+38H], rdx
mov rdx, qword ptr [rsp+68H]
mov qword ptr [rsp+30H], rdx
mov rcx, qword ptr [rsp+38H]
mov rdx, qword ptr [rsp+30H]
call [System.Runtime.Intrinsics.Scalar`1[ulong]:Multiply(ulong,ulong):ulong]
mov qword ptr [rsp+40H], rax
mov rax, qword ptr [rsp+40H]
mov qword ptr [rsp+20H], rsi
mov qword ptr [rsp+28H], rax
vinserti128 ymm0, ymm6, xmmword ptr [rsp+20H], 1
vmovups ymmword ptr [rbx], ymm0
mov rax, rbx
;; size=325 bbWeight=4 PerfScore 309.00
G_M62636_IG03:
vmovaps xmm6, xmmword ptr [rsp+E0H]
vzeroupper
add rsp, 248
pop rbx
pop rbp
pop rsi
pop rdi
ret
;; size=24 bbWeight=4 PerfScore 33.00
; Total bytes of code: 381
but presumably it's AVX-512?
Yep, it is. So you should most likely get better performance on your machine.
Btw, if you know how to optimize the multiplication of two Vector256<ulong> on pre-AVX512 hardware, you might want to do it inside the operator * itself? I mean here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256_1.cs#L268
> Btw, if you know how to optimize the multiplication of two Vector256<ulong> using pre-AVX512 you might want to do it inside the operator * itself? I mean here:
Yep, exactly, that's what I would have tried, thanks for confirming. Let me check if I can come up with something there.
Cool, by generating a compatible AVX2 64-bit * 64-bit vectorized multiplication, I'm now getting 10% faster than the C++ version; that was it! I should have checked the generated ASM more thoroughly in the first place.
Will try to prepare a PR later to add AVX2 + SSE2, and double-check if I can also create an ARM64 version.
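For reference, the classic way to emulate a 64-bit lane-wise multiply out of AVX2 32-bit multiplies (`vpmuludq`) looks roughly like the sketch below. This is my own illustration of the technique, not the exact code from the PR; the helper name is hypothetical:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class VectorMul
{
    // Sketch: a * b (mod 2^64) per lane, built only from 32-bit multiplies:
    //   a * b = a_lo*b_lo + ((a_hi*b_lo + a_lo*b_hi) << 32)
    public static Vector256<ulong> MultiplyUInt64(Vector256<ulong> a, Vector256<ulong> b)
    {
        Vector256<ulong> aHi = Avx2.ShiftRightLogical(a, 32);
        Vector256<ulong> bHi = Avx2.ShiftRightLogical(b, 32);

        // vpmuludq multiplies the low 32 bits of each 64-bit lane into a 64-bit result.
        Vector256<ulong> lo   = Avx2.Multiply(a.AsUInt32(), b.AsUInt32());
        Vector256<ulong> midA = Avx2.Multiply(aHi.AsUInt32(), b.AsUInt32());
        Vector256<ulong> midB = Avx2.Multiply(a.AsUInt32(), bHi.AsUInt32());

        // Cross terms land in the upper 32 bits; the a_hi*b_hi term overflows out of 64 bits.
        Vector256<ulong> cross = Avx2.ShiftLeftLogical(Avx2.Add(midA, midB), 32);
        return Avx2.Add(lo, cross);
    }
}
```

On machines with AVX-512 (VL+DQ), a single `vpmullq` replaces this whole sequence, as shown in the earlier disassembly.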
@EgorBo I don't see any code that is using e.g. AVX2 or SSE2 in this code path, and the operator is marked as an intrinsic. I have 2 questions:

1. Should I specialize the `Vector256<ulong>` operator `*`, or should I `if`-switch on the `typeof(ulong)` in the generic version (checking `Avx2.IsSupported`)?
2. Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed.) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CoreLib.csproj...
> @EgorBo I don't see any code that is using e.g AVX2, SSE2
Right, because it's implemented in the JIT, namely here: https://github.com/dotnet/runtime/blob/da1da02bbd2cb54490b7fc22f43ec32f5f302615/src/coreclr/jit/hwintrinsicxarch.cpp#L2056-L2103 (as you can see, it even has a TODO for V256).
So you either implement it in the JIT right there, or in C#. When the JIT doesn't handle the intrinsic, it falls back to the C# implementation, so if you implement a path like

```csharp
if (typeof(T) == typeof(long) && Avx2.IsSupported)
{
    ...
}
```

it should be taken. Whichever path you want to take is up to you; I personally prefer to do it in C# when possible: it's simpler and ILLink-friendly (and might help Mono as well).
> Bonus question: How can I open a sln to compile this from VS? (I have only 17.6.0 installed) I'm getting some cryptic errors from ApiCompat MSBuild tasks when just trying to build System.Private.CoreLib.csproj...
I personally rarely do so, but to open the sln I do:

```
.\build.cmd Clr -c Debug -vs .\src\coreclr\System.Private.CoreLib\System.Private.CoreLib.csproj
```
Btw, https://github.com/dotnet/runtime/pull/86811 touches the same path in the JIT for byte, as far as I can see (but I still think it'd be better done on the C# side).
Fix proposal via PR #87113
(Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)
> (Edit: Though, the original issue of the code being worse with .NET 8 compared to .NET 7 with the default operator still stands)
I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?
> I couldn't quite tell what issue might remain after your PR -- can you highlight this so we don't forget about it?
The following code from XxHashShared.ScrambleAccumulator256 in .NET 8 generates a lot more stack spilling (a 312-byte stack frame, e.g. storing and then immediately reloading the exact same value from the stack, as captured below) than the .NET 7 equivalent (a 120-byte frame), mainly because of accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);
```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256<ulong> ScrambleAccumulator256(Vector256<ulong> accVec, Vector256<ulong> secret)
{
    Vector256<ulong> xorShift = accVec ^ Vector256.ShiftRightLogical(accVec, 47);
    Vector256<ulong> xorWithKey = xorShift ^ secret;
    accVec = xorWithKey * Vector256.Create((ulong)Prime32_1);
    return accVec;
}
```
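For clarity, each 64-bit lane of this routine computes the following scalar expression, where Prime32_1 is the 0x9E3779B1 constant visible as RWD00 in the listings above (the helper name is mine, added only for illustration):

```csharp
// Scalar equivalent of one ulong lane of ScrambleAccumulator256:
// shift-xor the accumulator, xor in the secret, multiply by the 32-bit prime.
static ulong ScrambleLane(ulong acc, ulong secret)
    => (acc ^ (acc >> 47) ^ secret) * 0x9E3779B1UL;
```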
My PR was going to optimize the Vector256<ulong> * operator, but that doesn't explain the codegen difference with .NET 8.
e.g.:

```
vmovups ymmword ptr [rsp+100H], ymm0
vmovups ymm0, ymmword ptr [reloc @RWD00]
vmovups ymmword ptr [rsp+E0H], ymm0   ; <----- double store of reloc @RWD00 above
vmovups xmm1, xmmword ptr [rsp+100H]  ; <----- reloading what ymm0 stored above
```
or:

```
mov rax, qword ptr [rsp+B0H]
mov qword ptr [rsp+90H], rax
mov rax, qword ptr [rsp+A0H]
mov qword ptr [rsp+88H], rax
mov rax, qword ptr [rsp+90H]   ; <---- reloading what rax stored just above
imul rax, qword ptr [rsp+88H]  ; <---- reloading what rax stored just above
```
This won't land in .NET 9 due to timing and other higher-priority fixes being needed.
The suggested code fix ended up causing problems, and we likely need to use a more standard algorithm for emulating long multiplication. PR #87142 prototypes this, but has some failures that still need looking at.
Hey, just wanted to double-check the performance of .NET 8 with XXHash128 from PR #77944; running with a large 1 MB buffer seems to be significantly slower than the .NET 7 version now:
I haven't dug into why, but considering the performance drop, my first suspect would be less inlining.