dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET 9 Vector128.Min/Max performance regression #108168

Closed. BoyBaykiller closed this issue 1 month ago

BoyBaykiller commented 1 month ago

Description

For my BVH building I need to find pairs of bounding boxes that, when combined, produce a bounding box with the smallest surface area. After upgrading from .NET 8 to preview 9.0.100-rc.1.24452.12 I noticed a negative performance impact on this code. The benchmarks below should capture the issue: FindSmallestArea shows what I am doing in my actual code, and SimplifiedFindSmallestArea is reduced as far as possible while still reproducing the perf difference.

using System.Runtime.Intrinsics;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

[SimpleJob(RuntimeMoniker.Net80), SimpleJob(RuntimeMoniker.Net90), MemoryDiagnoser, DisassemblyDiagnoser]
public class Benchmark
{
    private struct Box
    {
        public Vector128<float> Min, Max;
    }

    private static readonly Box[] boxes = new Box[1_000_000];

    [Benchmark]
    public float FindSmallestArea()
    {
        float bestArea = float.MaxValue;

        ref readonly Box boxA = ref boxes[0];
        for (int i = 1; i < boxes.Length; i++)
        {
            ref readonly Box boxB = ref boxes[i];

            Box boxC;
            boxC.Min = Vector128.Min(boxA.Min, boxB.Min);
            boxC.Max = Vector128.Max(boxA.Max, boxB.Max);

            Vector128<float> size = boxC.Max - boxC.Min;
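            // Half the surface area of boxC: (x + y) * z + x * y == xy + yz + zx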
            float area = (size[0] + size[1]) * size[2] + size[0] * size[1];
            if (area < bestArea)
            {
                bestArea = area;
            }
        }

        return bestArea;
    }

    [Benchmark]
    public bool SimplifiedFindSmallestArea()
    {
        Vector128<float> minA = Vector128.Create(-10.0f, 20.0f, 0.4f, 1.0f);

        for (int i = 0; i < 1_000_000; i++)
        {
            Vector128<float> minB = Vector128.Create(i, i, i, 0.0f);
            Vector128<float> minC = Vector128.Min(minA, minB);

            if (minC[0] == float.MaxValue)
            {
                return false; // Note: never hit, but needed so the loop isn't optimized away
            }
        }
        return true;
    }
}
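
For completeness, a minimal driver for running the benchmarks above (not part of the original report; it just uses the standard BenchmarkDotNet entry point):

using BenchmarkDotNet.Running;

public class Program
{
    public static void Main() => BenchmarkRunner.Run<Benchmark>();
}

BenchmarkDotNet needs a Release build, and both the .NET 8 and .NET 9 runtimes must be installed for the two SimpleJob targets to run.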

Benchmark Results

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19043.2364/21H1/May2021Update)
AMD Ryzen 5 2600, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]   : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX2
| Method                     | Job      | Runtime  |       Mean |    Error |   StdDev | Code Size | Allocated |
|--------------------------- |--------- |--------- |-----------:|---------:|---------:|----------:|----------:|
| FindSmallestArea           | .NET 8.0 | .NET 8.0 | 1,884.3 μs | 18.80 μs | 16.67 μs |     126 B |         - |
| SimplifiedFindSmallestArea | .NET 8.0 | .NET 8.0 |   795.1 μs |  2.78 μs |  2.60 μs |      75 B |         - |
| FindSmallestArea           | .NET 9.0 | .NET 9.0 | 2,735.6 μs | 17.76 μs | 14.83 μs |     209 B |       2 B |
| SimplifiedFindSmallestArea | .NET 9.0 | .NET 9.0 | 1,263.4 μs |  2.77 μs |  2.45 μs |     102 B |       1 B |

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

huoyaoyuan commented 1 month ago

Codegen diff for FindSmallestArea:

-       vzeroupper
-       vmovss    xmm0,dword ptr [7FF7FC0383E0]
-       mov       rax,295DC801E48
+       sub       rsp,18
+       vmovaps   [rsp],xmm6
+       vmovss    xmm0,dword ptr [7FF8B48F95A8]
+       mov       rax,22FAD000CE8
       mov       rax,[rax]
       mov       rcx,rax
       add       rcx,10
       mov       edx,1
M00_L00:
       mov       r8,rax
+       mov       r10d,edx
-       mov       r10,rdx
       shl       r10,5
       lea       r8,[r8+r10+10]
-       vmovups   xmm1,[rcx+10]
-       vmaxps    xmm1,xmm1,[r8+10]
-       vmovups   xmm2,[rcx]
-       vminps    xmm2,xmm2,[r8]
-       vsubps    xmm1,xmm1,xmm2
+       vmovups   xmm1,[rcx]
+       vmovups   xmm2,[r8]
+       vcmpeqps  xmm3,xmm1,xmm2
+       vxorps    xmm4,xmm4,xmm4
+       vpcmpgtd  xmm4,xmm4,xmm1
+       vcmpneqps xmm5,xmm1,xmm1
+       vpternlogd xmm5,xmm4,xmm3,0F8
+       vcmpltps  xmm3,xmm1,xmm2
+       vorps     xmm3,xmm3,xmm5
+       vpternlogd xmm1,xmm3,xmm2,0E2
+       vmovups   xmm2,[rcx+10]
+       vmovups   xmm3,[r8+10]
+       vcmpeqps  xmm4,xmm2,xmm3
+       vxorps    xmm5,xmm5,xmm5
+       vpcmpgtd  xmm5,xmm5,xmm3
+       vcmpneqps xmm6,xmm2,xmm2
+       vpternlogd xmm6,xmm5,xmm4,0F8
+       vcmpltps  xmm4,xmm3,xmm2
+       vorps     xmm4,xmm4,xmm6
+       vpternlogd xmm2,xmm4,xmm3,0E2
+       vsubps    xmm1,xmm2,xmm1
       vmovaps   xmm2,xmm1
       vmovshdup xmm3,xmm1
       vaddss    xmm4,xmm2,xmm3
       vunpckhps xmm1,xmm1,xmm1
       vmulss    xmm1,xmm4,xmm1
       vmulss    xmm2,xmm2,xmm3
       vaddss    xmm1,xmm1,xmm2
       vucomiss  xmm0,xmm1
       ja        short M00_L02
M00_L01:
       inc       edx
       cmp       edx,0F4240
-       jl        short M00_L00
+       jl        near ptr M00_L00
+       vmovaps   xmm6,[rsp]
+       add       rsp,18
       ret
M00_L02:
       vmovaps   xmm0,xmm1
       jmp       short M00_L01

Vector128.Min and Vector128.Max are similarly pessimized in SimplifiedFindSmallestArea.

huoyaoyuan commented 1 month ago

Minimal repro:

    [Benchmark]
    [Arguments(1.0f)]
    public Vector128<float> VectorMin(float x) => Vector128.Min(Vector128.Create(x), Vector128.CreateScalar(x));

Codegen in 8.0:

; Benchmark.VectorMin(Single)
       vzeroupper
       vbroadcastss xmm0,xmm2
       vinsertps xmm1,xmm1,xmm2,0E
       vminps    xmm0,xmm0,xmm1
       vmovups   [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 26

Codegen in 9.0 rc 1:

; Benchmark.VectorMin(Single)
       vbroadcastss xmm0,xmm2
       vinsertps xmm1,xmm1,xmm2,0E
       vcmpeqps  xmm2,xmm0,xmm1
       vxorps    xmm3,xmm3,xmm3
       vpcmpgtd  xmm3,xmm3,xmm0
       vcmpneqps xmm4,xmm0,xmm0
       vpternlogd xmm4,xmm3,xmm2,0F8
       vcmpltps  xmm2,xmm0,xmm1
       vorps     xmm2,xmm2,xmm4
       vpternlogd xmm0,xmm2,xmm1,0E2
       vmovups   [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 60
BoyBaykiller commented 1 month ago

Thanks for the assembly diff! The minimal repro you've sent shows the difference in the generated code for vector min/max, but note that it does not reproduce the performance difference for me. In fact, the .NET 9 code performs much better in this test:

| Method    | Job      | Runtime  | x |      Mean |     Error |    StdDev |
|---------- |--------- |--------- |-- |----------:|----------:|----------:|
| VectorMin | .NET 8.0 | .NET 8.0 | 1 | 1.2347 ns | 0.0242 ns | 0.0227 ns |
| VectorMin | .NET 9.0 | .NET 9.0 | 1 | 0.2159 ns | 0.0085 ns | 0.0079 ns |

neon-sunset commented 1 month ago

@huoyaoyuan Just wanted to note that the posted disassembly here contains AVX512VL instructions, namely vpternlogd, and is not what the author would observe, since the CPU in question is an AMD Ryzen 5 2600, which is AVX2-only.

Here's the AVX2 output for SimplifiedFindSmallestArea.

.NET 8:

G_M000_IG01:                ;; offset=0x0000
       vzeroupper 

G_M000_IG02:                ;; offset=0x0003
       xor      eax, eax
       vmovss   xmm0, dword ptr [reloc @RWD00]
       align    [3 bytes for IG03]

G_M000_IG03:                ;; offset=0x0010
       vxorps   xmm1, xmm1, xmm1
       vcvtsi2ss xmm1, xmm1, eax
       vmovaps  xmm2, xmm1
       vinsertps xmm2, xmm2, xmm1, 16
       vinsertps xmm1, xmm2, xmm1, 40
       vmovups  xmm2, xmmword ptr [reloc @RWD16]
       vminps   xmm1, xmm2, xmm1
       vucomiss xmm1, xmm0
       jp       SHORT G_M000_IG04
       je       SHORT G_M000_IG07

G_M000_IG04:                ;; offset=0x003C
       inc      eax
       cmp      eax, 0xF4240
       jl       SHORT G_M000_IG03

G_M000_IG05:                ;; offset=0x0045
       mov      eax, 1

G_M000_IG06:                ;; offset=0x004A
       ret      

G_M000_IG07:                ;; offset=0x004B
       xor      eax, eax

G_M000_IG08:                ;; offset=0x004D
       ret      

RWD00   dd  7F7FFFFFh       ; 3.40282e+38
RWD04   dd  00000000h, 00000000h, 00000000h
RWD16   dq  41A00000C1200000h, 3F8000003ECCCCCDh

; Total bytes of code 78

.NET 9 RC 2:

G_M56681_IG01:  ;; offset=0x0000
                        ;; size=0 bbWeight=0.25 PerfScore 0.00
G_M56681_IG02:  ;; offset=0x0000
       vmovups  xmm0, xmmword ptr [reloc @RWD00]
       xor      eax, eax
       vmovss   xmm1, dword ptr [reloc @RWD16]
       align    [0 bytes for IG03]
                        ;; size=18 bbWeight=0.25 PerfScore 1.56
G_M56681_IG03:  ;; offset=0x0012
       vxorps   xmm2, xmm2, xmm2
       vcvtsi2ss xmm2, xmm2, eax
       vmovaps  xmm3, xmm2
       vinsertps xmm3, xmm3, xmm2, 16
       vinsertps xmm2, xmm3, xmm2, 40
       vcmpps   xmm3, xmm0, xmm2, 0
       vandps   xmm3, xmm3, xmmword ptr [reloc @RWD32]
       vcmpps   xmm4, xmm0, xmm2, 1
       vorps    xmm3, xmm4, xmm3
       vandps   xmm4, xmm0, xmm3
       vandnps  xmm2, xmm3, xmm2
       vorps    xmm2, xmm2, xmm4
       vucomiss xmm2, xmm1
       jp       SHORT G_M56681_IG04
       je       SHORT G_M56681_IG07
                        ;; size=66 bbWeight=4 PerfScore 87.67
G_M56681_IG04:  ;; offset=0x0054
       inc      eax
       cmp      eax, 0xF4240
       jl       SHORT G_M56681_IG03
                        ;; size=9 bbWeight=4 PerfScore 6.00
G_M56681_IG05:  ;; offset=0x005D
       mov      eax, 1
                        ;; size=5 bbWeight=0.50 PerfScore 0.12
G_M56681_IG06:  ;; offset=0x0062
       ret      
                        ;; size=1 bbWeight=0.50 PerfScore 0.50
G_M56681_IG07:  ;; offset=0x0063
       xor      eax, eax
                        ;; size=2 bbWeight=0.50 PerfScore 0.12
G_M56681_IG08:  ;; offset=0x0065
       ret      
                        ;; size=1 bbWeight=0.50 PerfScore 0.50
RWD00   dq  41A00000C1200000h, 3F8000003ECCCCCDh
RWD16   dd  7F7FFFFFh       ; 3.40282e+38
RWD20   dd  00000000h, 00000000h, 00000000h
RWD32   dq  00000000FFFFFFFFh, 0000000000000000h
EgorBo commented 1 month ago

Minimal repro:

It's expected. In .NET 8, Vector128.Min/Max were not the same between arm64 and x64, so the x64 version was adjusted to be IEEE 754-compatible, just as arm64's already was (e.g. different behavior for max(-0.0, 0.0) or NaNs).

@BoyBaykiller if you don't care about deterministic results across different platforms, you can use

if (Sse.IsSupported)
    x = Sse.Min(v1, v2);
else
    x = Vector128.Min(v1, v2);

I have confirmed locally that this fixes your perf issue.
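
For example, the call sites can stay uniform behind a small helper along these lines (a sketch, not code from this thread; VectorOps, MinFast, and MaxFast are hypothetical names):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class VectorOps
{
    // Hypothetical helpers: use the raw SSE minps/maxps semantics where available,
    // otherwise fall back to the IEEE 754-compatible cross-platform APIs.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> MinFast(Vector128<float> a, Vector128<float> b)
        => Sse.IsSupported ? Sse.Min(a, b) : Vector128.Min(a, b);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> MaxFast(Vector128<float> a, Vector128<float> b)
        => Sse.IsSupported ? Sse.Max(a, b) : Vector128.Max(a, b);
}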

EgorBo commented 1 month ago

cc @tannergooding

huoyaoyuan commented 1 month ago

if you don't care about deterministic results across different platforms, you can use

Can there be MinUnsafe etc? I had written such code that doesn't need to distinguish +0 and -0.

EgorBo commented 1 month ago

if you don't care about deterministic results across different platforms, you can use

Can there be MinUnsafe etc? I had written such code that doesn't need to distinguish +0 and -0.

Maybe? In theory, in some cases the JIT should be able to figure out that it doesn't have to check the corner cases and can do the fast thing as-is (e.g. if it sees that the inputs are never -0.0 or NaN via assertion propagation, IV analysis, etc.), but it doesn't do that today.

BoyBaykiller commented 1 month ago

@EgorBo Thanks, that does fix the perf issue, although it's a bit unfortunate that I have to use explicit SIMD intrinsics now.

It's expected, in net8.0 Vector128.Min/Max were not the same between arm64 and x64, so x64 version was adjusted to be IEEE754 compatible just like arm64's is (e.g. different behavior for max(-0.0, 0.0) or NaNs)

Do you know which PR that was?

EgorBo commented 1 month ago

Do you know which PR that was?

103837

EgorBo commented 1 month ago

Can there be MinUnsafe etc?

Ah, it turns out it's MinNative, from https://github.com/dotnet/runtime/issues/103845.
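
For reference, a sketch of what the box combining from the original benchmark could look like with those APIs, assuming Vector128.MinNative/MaxNative from that proposal are available in the targeted .NET 9 build:

// Sketch only: relies on Vector128.MinNative/MaxNative from the proposal above.
Box boxC;
boxC.Min = Vector128.MinNative(boxA.Min, boxB.Min);
boxC.Max = Vector128.MaxNative(boxA.Max, boxB.Max);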

BoyBaykiller commented 1 month ago

I guess this issue can be closed then. But do you happen to know if this regression is also expected? If not, I'll open a separate issue.

[SimpleJob(RuntimeMoniker.Net80), SimpleJob(RuntimeMoniker.Net90), MemoryDiagnoser, DisassemblyDiagnoser]
public class MathBenchmark
{
    float x, y, z, w;

    [Benchmark]
    public uint Clamp4()
    {
        const uint max = (1u << 21) - 1;
        uint a = Math.Clamp((uint)(x * max), 0u, max);
        uint b = Math.Clamp((uint)(y * max), 0u, max);
        uint c = Math.Clamp((uint)(z * max), 0u, max);
        uint d = Math.Clamp((uint)(w * max), 0u, max);

        return a + b + c + d;
    }
}
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19043.2364/21H1/May2021Update)
AMD Ryzen 5 2600, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]   : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX2
| Method | Job      | Runtime  |      Mean |     Error |    StdDev | Code Size | Allocated |
|------- |--------- |--------- |----------:|----------:|----------:|----------:|----------:|
| Clamp4 | .NET 8.0 | .NET 8.0 | 2.5879 ns | 0.0181 ns | 0.0169 ns |     115 B |         - |
| Clamp4 | .NET 9.0 | .NET 9.0 | 5.0463 ns | 0.0395 ns | 0.0369 ns |     362 B |         - |

EgorBo commented 1 month ago

But do you happen to know if this regression is also expected? If not, I'll open a separate issue.

Yeah, please do; it looks like something is wrong with its codegen when AVX512 is not present.

EgorBo commented 1 month ago

@BoyBaykiller Actually, I think it is also expected, but we might improve it. In .NET 9.0 we also fixed some float -> unsigned conversion issues to make them more deterministic. Unfortunately, that regresses x64 without AVX512:

uint Clamp4(float x) => (uint)x;

AVX512:

       vmovaps  xmm0, xmm1
       vfixupimmss xmm0, xmm1, dword ptr [reloc @RWD00], 0
       vcvttss2usi eax, xmm0
       ret 

No AVX512:

       vcvtss2sd xmm0, xmm0, xmm1
       vmovddup xmm1, xmm0
       vmovddup xmm2, xmm0
       vmovddup xmm3, xmm0
       vmovddup xmm0, xmm0
       vxorps   xmm4, xmm4, xmm4
       vmovups  xmm5, xmmword ptr [reloc @RWD00]
       vcmppd   xmm1, xmm2, xmm1, 0
       vcmppd   xmm2, xmm3, xmm4, 13
       vandpd   xmm1, xmm2, xmm1
       vandpd   xmm0, xmm1, xmm0
       vcmppd   xmm1, xmm0, xmm5, 13
       vblendvpd xmm0, xmm0, xmm5, xmm1
       vcvttsd2si rax, xmm0
       ret  
BoyBaykiller commented 1 month ago

In .NET 9.0 we also fixed some float -> unsigned conversion issues to make them more deterministic

For signed as well, right? Because I can reproduce the perf issue in Clamp4 with int or long as well.

| Method         | Job      | Runtime  |      Mean |     Error |    StdDev |    Median | Code Size | Allocated |
|--------------- |--------- |--------- |----------:|----------:|----------:|----------:|----------:|----------:|
| Clamp4Int      | .NET 8.0 | .NET 8.0 | 3.2261 ns | 0.0710 ns | 0.0554 ns | 3.2056 ns |     154 B |         - |
| CastFloatToInt | .NET 8.0 | .NET 8.0 | 0.0013 ns | 0.0029 ns | 0.0026 ns | 0.0000 ns |       6 B |         - |
| Clamp4Int      | .NET 9.0 | .NET 9.0 | 5.3244 ns | 0.0456 ns | 0.0404 ns | 5.3207 ns |     369 B |         - |
| CastFloatToInt | .NET 9.0 | .NET 9.0 | 0.4521 ns | 0.0130 ns | 0.0116 ns | 0.4510 ns |      66 B |         - |

.NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX2

; Program.MathBenchmark.CastFloatToInt()
       vmovss    xmm0,dword ptr [rcx+8]
       vbroadcastss xmm1,xmm0
       vbroadcastss xmm2,xmm0
       vbroadcastss xmm0,xmm0
       vcmpeqps  xmm1,xmm2,xmm1
       vandps    xmm0,xmm1,xmm0
       vcmpgeps  xmm1,xmm0,[7FFBBAA48A30]
       vcvttss2si eax,xmm0
       vmovd     xmm0,eax
       vpbroadcastd xmm0,xmm0
       vpblendvb xmm0,xmm0,[7FFBBAA48A40],xmm1
       vmovd     eax,xmm0
       ret
; Total bytes of code 66
BoyBaykiller commented 1 month ago

@EgorBo, just in case you didn't see this.

tannergooding commented 1 month ago

The change here is intentional and should rarely be the bottleneck for end-to-end applications.

In the case that it is a bottleneck and you do not need determinism or commutativity with regard to NaN, +0, or -0, you can use the new Vector128.MinNative and Vector128.MaxNative APIs.

That is, instructions like maxps dst, x, y on x86/x64 return y if both operands are 0 of either sign (so MaxNative(+0, -0) returns -0, while MaxNative(-0, +0) returns +0) and likewise return y if either operand is NaN (so MaxNative(NaN, 1) returns 1, while MaxNative(1, NaN) returns NaN). The behavior may then differ on other platforms, like Arm64 or WASM, which can lead to bugs in your code if you're not careful. There is likewise no guarantee that MaxNative will always emit maxps on x86/x64 hardware; it is free to optimize to other instructions in the future, such as vrangeps on AVX512-capable hardware or some future instruction that exactly follows IEEE 754 semantics. The main consideration for MaxNative/MinNative is that they do what is most efficient and don't guarantee the behavior in these edge cases.
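
To illustrate, a small sketch restating those edge cases (results shown are for x86/x64 where MaxNative currently lowers to maxps; other platforms may differ):

Vector128<float> pz  = Vector128.Create(+0.0f);
Vector128<float> nz  = Vector128.Create(-0.0f);
Vector128<float> nan = Vector128.Create(float.NaN);
Vector128<float> one = Vector128.Create(1.0f);

// maxps returns its second operand for ±0 pairs and for NaN inputs:
var a = Vector128.MaxNative(pz, nz);   // -0.0 in each lane
var b = Vector128.MaxNative(nz, pz);   // +0.0 in each lane
var c = Vector128.MaxNative(nan, one); //  1.0 in each lane
var d = Vector128.MaxNative(one, nan); //  NaN in each lane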

tannergooding commented 1 month ago

I think it is also expected, but we might improve it. In .NET 9.0 we also fixed some float -> unsigned conversion issues to make them more deterministic

Same here for casting. There is Vector128.ConvertToUInt32Native (and related APIs) for doing efficient (but non-deterministic across different CPUs) conversions.
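
A short sketch of the trade-off, assuming the ConvertToUInt32Native API mentioned above is available in the targeted .NET 9 build:

Vector128<float> v = Vector128.Create(0.5f, 1.5f, 2.5f, 3.5f);

// Deterministic: NaN and out-of-range inputs give the same result on every platform.
Vector128<uint> portable = Vector128.ConvertToUInt32(v);

// Native: uses whatever the hardware conversion does, so NaN and out-of-range
// results may differ between CPUs, but it is the more efficient option.
Vector128<uint> native = Vector128.ConvertToUInt32Native(v);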

BoyBaykiller commented 1 month ago

Alright, thanks for the explanation.