Closed: BoyBaykiller closed this issue 1 month ago.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Codegen diff for `FindSmallestArea`:
- vzeroupper
- vmovss xmm0,dword ptr [7FF7FC0383E0]
- mov rax,295DC801E48
+ sub rsp,18
+ vmovaps [rsp],xmm6
+ vmovss xmm0,dword ptr [7FF8B48F95A8]
+ mov rax,22FAD000CE8
mov rax,[rax]
mov rcx,rax
add rcx,10
mov edx,1
M00_L00:
mov r8,rax
+ mov r10d,edx
- mov r10,rdx
shl r10,5
lea r8,[r8+r10+10]
- vmovups xmm1,[rcx+10]
- vmaxps xmm1,xmm1,[r8+10]
- vmovups xmm2,[rcx]
- vminps xmm2,xmm2,[r8]
- vsubps xmm1,xmm1,xmm2
+ vmovups xmm1,[rcx]
+ vmovups xmm2,[r8]
+ vcmpeqps xmm3,xmm1,xmm2
+ vxorps xmm4,xmm4,xmm4
+ vpcmpgtd xmm4,xmm4,xmm1
+ vcmpneqps xmm5,xmm1,xmm1
+ vpternlogd xmm5,xmm4,xmm3,0F8
+ vcmpltps xmm3,xmm1,xmm2
+ vorps xmm3,xmm3,xmm5
+ vpternlogd xmm1,xmm3,xmm2,0E2
+ vmovups xmm2,[rcx+10]
+ vmovups xmm3,[r8+10]
+ vcmpeqps xmm4,xmm2,xmm3
+ vxorps xmm5,xmm5,xmm5
+ vpcmpgtd xmm5,xmm5,xmm3
+ vcmpneqps xmm6,xmm2,xmm2
+ vpternlogd xmm6,xmm5,xmm4,0F8
+ vcmpltps xmm4,xmm3,xmm2
+ vorps xmm4,xmm4,xmm6
+ vpternlogd xmm2,xmm4,xmm3,0E2
+ vsubps xmm1,xmm2,xmm1
vmovaps xmm2,xmm1
vmovshdup xmm3,xmm1
vaddss xmm4,xmm2,xmm3
vunpckhps xmm1,xmm1,xmm1
vmulss xmm1,xmm4,xmm1
vmulss xmm2,xmm2,xmm3
vaddss xmm1,xmm1,xmm2
vucomiss xmm0,xmm1
ja short M00_L02
M00_L01:
inc edx
cmp edx,0F4240
- jl short M00_L00
+ jl near ptr M00_L00
+ vmovaps xmm6,[rsp]
+ add rsp,18
ret
M00_L02:
vmovaps xmm0,xmm1
jmp short M00_L01
`Vector128.Min` and `Vector128.Max` are similarly pessimized in `SimplifiedFindSmallestArea`.
Minimal repro:
[Benchmark]
[Arguments(1.0f)]
public Vector128<float> VectorMin(float x) => Vector128.Min(Vector128.Create(x), Vector128.CreateScalar(x));
Codegen in 8.0:
; Benchmark.VectorMin(Single)
vzeroupper
vbroadcastss xmm0,xmm2
vinsertps xmm1,xmm1,xmm2,0E
vminps xmm0,xmm0,xmm1
vmovups [rdx],xmm0
mov rax,rdx
ret
; Total bytes of code 26
Codegen in 9.0 rc 1:
; Benchmark.VectorMin(Single)
vbroadcastss xmm0,xmm2
vinsertps xmm1,xmm1,xmm2,0E
vcmpeqps xmm2,xmm0,xmm1
vxorps xmm3,xmm3,xmm3
vpcmpgtd xmm3,xmm3,xmm0
vcmpneqps xmm4,xmm0,xmm0
vpternlogd xmm4,xmm3,xmm2,0F8
vcmpltps xmm2,xmm0,xmm1
vorps xmm2,xmm2,xmm4
vpternlogd xmm0,xmm2,xmm1,0E2
vmovups [rdx],xmm0
mov rax,rdx
ret
; Total bytes of code 60
Thanks for the assembly diff! The minimal repro you've sent shows the difference in generated code for vector min/max, but note that it does not reproduce the performance difference for me. In fact, the .NET 9 code performs much better on this test:
| Method | Job | Runtime | x | Mean | Error | StdDev |
|---|---|---|---|---|---|---|
| VectorMin | .NET 8.0 | .NET 8.0 | 1 | 1.2347 ns | 0.0242 ns | 0.0227 ns |
| VectorMin | .NET 9.0 | .NET 9.0 | 1 | 0.2159 ns | 0.0085 ns | 0.0079 ns |
@huoyaoyuan Just wanted to note that the posted disassembly here contains AVX512VL instructions, namely `vpternlogd`, and would not be something the author would observe, since the CPU in question is an AMD 2600X, which is AVX2-only.
Here's the AVX2 output for `SimplifiedFindSmallestArea`:
.NET 8
G_M000_IG01: ;; offset=0x0000
vzeroupper
G_M000_IG02: ;; offset=0x0003
xor eax, eax
vmovss xmm0, dword ptr [reloc @RWD00]
align [3 bytes for IG03]
G_M000_IG03: ;; offset=0x0010
vxorps xmm1, xmm1, xmm1
vcvtsi2ss xmm1, xmm1, eax
vmovaps xmm2, xmm1
vinsertps xmm2, xmm2, xmm1, 16
vinsertps xmm1, xmm2, xmm1, 40
vmovups xmm2, xmmword ptr [reloc @RWD16]
vminps xmm1, xmm2, xmm1
vucomiss xmm1, xmm0
jp SHORT G_M000_IG04
je SHORT G_M000_IG07
G_M000_IG04: ;; offset=0x003C
inc eax
cmp eax, 0xF4240
jl SHORT G_M000_IG03
G_M000_IG05: ;; offset=0x0045
mov eax, 1
G_M000_IG06: ;; offset=0x004A
ret
G_M000_IG07: ;; offset=0x004B
xor eax, eax
G_M000_IG08: ;; offset=0x004D
ret
RWD00 dd 7F7FFFFFh ; 3.40282e+38
RWD04 dd 00000000h, 00000000h, 00000000h
RWD16 dq 41A00000C1200000h, 3F8000003ECCCCCDh
; Total bytes of code 78
.NET 9 RC.2
G_M56681_IG01: ;; offset=0x0000
;; size=0 bbWeight=0.25 PerfScore 0.00
G_M56681_IG02: ;; offset=0x0000
vmovups xmm0, xmmword ptr [reloc @RWD00]
xor eax, eax
vmovss xmm1, dword ptr [reloc @RWD16]
align [0 bytes for IG03]
;; size=18 bbWeight=0.25 PerfScore 1.56
G_M56681_IG03: ;; offset=0x0012
vxorps xmm2, xmm2, xmm2
vcvtsi2ss xmm2, xmm2, eax
vmovaps xmm3, xmm2
vinsertps xmm3, xmm3, xmm2, 16
vinsertps xmm2, xmm3, xmm2, 40
vcmpps xmm3, xmm0, xmm2, 0
vandps xmm3, xmm3, xmmword ptr [reloc @RWD32]
vcmpps xmm4, xmm0, xmm2, 1
vorps xmm3, xmm4, xmm3
vandps xmm4, xmm0, xmm3
vandnps xmm2, xmm3, xmm2
vorps xmm2, xmm2, xmm4
vucomiss xmm2, xmm1
jp SHORT G_M56681_IG04
je SHORT G_M56681_IG07
;; size=66 bbWeight=4 PerfScore 87.67
G_M56681_IG04: ;; offset=0x0054
inc eax
cmp eax, 0xF4240
jl SHORT G_M56681_IG03
;; size=9 bbWeight=4 PerfScore 6.00
G_M56681_IG05: ;; offset=0x005D
mov eax, 1
;; size=5 bbWeight=0.50 PerfScore 0.12
G_M56681_IG06: ;; offset=0x0062
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
G_M56681_IG07: ;; offset=0x0063
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
G_M56681_IG08: ;; offset=0x0065
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
RWD00 dq 41A00000C1200000h, 3F8000003ECCCCCDh
RWD16 dd 7F7FFFFFh ; 3.40282e+38
RWD20 dd 00000000h, 00000000h, 00000000h
RWD32 dq 00000000FFFFFFFFh, 0000000000000000h
It's expected: in net8.0, `Vector128.Min`/`Max` were not the same between arm64 and x64, so the x64 version was adjusted to be IEEE 754 compatible, just like arm64's (e.g. different behavior for `max(-0.0, 0.0)` or NaNs).
@BoyBaykiller if you don't care about deterministic results across different platforms, you can use
if (Sse.IsSupported)
x = Sse.Min(v1, v2);
else
x = Vector128.Min(v1, v2);
I have confirmed locally that this fixes your perf issue.
cc @tannergooding
if you don't care about deterministic results across different platforms, you can use
Can there be `MinUnsafe` etc.? I had written such code that doesn't need to distinguish +0 and -0.
Maybe? In theory, in some cases the JIT should be able to figure out that it doesn't have to check the corner cases and can do the fast thing as is (e.g. if it sees via assertion propagation, IV analysis, etc. that the inputs are never -0.0 or NaN), but it doesn't do that today.
@EgorBo Thanks, that does fix the perf issue. Although it's a bit unfortunate that I have to use explicit SIMD now.
It's expected, in net8.0 Vector128.Min/Max were not the same between arm64 and x64, so x64 version was adjusted to be IEEE754 compatible just like arm64's is (e.g. different behavior for max(-0.0, 0.0) or NaNs)
Do you know which PR that was?
Do you know which PR that was?
Can there be `MinUnsafe` etc.?
Ah, it turns out it's `MinNative` from https://github.com/dotnet/runtime/issues/103845
I guess this issue can be closed then. But do you happen to know if this regression is also expected? If not, I'll open a separate issue.
[SimpleJob(RuntimeMoniker.Net80), SimpleJob(RuntimeMoniker.Net90), MemoryDiagnoser, DisassemblyDiagnoser]
public class MathBenchmark
{
float x, y, z, w;
[Benchmark]
public uint Clamp4()
{
const uint max = (1u << 21) - 1;
uint a = Math.Clamp((uint)(x * max), 0u, max);
uint b = Math.Clamp((uint)(y * max), 0u, max);
uint c = Math.Clamp((uint)(z * max), 0u, max);
uint d = Math.Clamp((uint)(w * max), 0u, max);
return a + b + c + d;
}
}
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19043.2364/21H1/May2021Update)
AMD Ryzen 5 2600, 1 CPU, 12 logical and 6 physical cores
.NET SDK 9.0.100-rc.1.24452.12
[Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2 [AttachedDebugger]
.NET 8.0 : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
.NET 9.0 : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX2
| Method | Job | Runtime | Mean | Error | StdDev | Code Size | Allocated |
|---|---|---|---|---|---|---|---|
| Clamp4 | .NET 8.0 | .NET 8.0 | 2.5879 ns | 0.0181 ns | 0.0169 ns | 115 B | - |
| Clamp4 | .NET 9.0 | .NET 9.0 | 5.0463 ns | 0.0395 ns | 0.0369 ns | 362 B | - |
But do you happen to know if this regression is also expected? If not I'll open a seperate issue.
Yeah, please do; it looks like something is wrong with its codegen when AVX512 is not present.
@BoyBaykiller Actually, I think it is also expected, but we might improve it. In .NET 9.0 we also fixed some float -> unsigned conversion issues to make them more deterministic. Unfortunately, it regresses x64 without AVX512:
uint Clamp4(float x) => (uint)x;
Avx512:
vmovaps xmm0, xmm1
vfixupimmss xmm0, xmm1, dword ptr [reloc @RWD00], 0
vcvttss2usi eax, xmm0
ret
No avx512:
vcvtss2sd xmm0, xmm0, xmm1
vmovddup xmm1, xmm0
vmovddup xmm2, xmm0
vmovddup xmm3, xmm0
vmovddup xmm0, xmm0
vxorps xmm4, xmm4, xmm4
vmovups xmm5, xmmword ptr [reloc @RWD00]
vcmppd xmm1, xmm2, xmm1, 0
vcmppd xmm2, xmm3, xmm4, 13
vandpd xmm1, xmm2, xmm1
vandpd xmm0, xmm1, xmm0
vcmppd xmm1, xmm0, xmm5, 13
vblendvpd xmm0, xmm0, xmm5, xmm1
vcvttsd2si rax, xmm0
ret
in .net 9.0 we also fixed some float -> unsigned issues to make it more deterministic
For signed as well, right? Because I can reproduce the perf issue in `Clamp4` with `int` or `long` as well.
| Method | Job | Runtime | Mean | Error | StdDev | Median | Code Size | Allocated |
|---|---|---|---|---|---|---|---|---|
| Clamp4Int | .NET 8.0 | .NET 8.0 | 3.2261 ns | 0.0710 ns | 0.0554 ns | 3.2056 ns | 154 B | - |
| CastFloatToInt | .NET 8.0 | .NET 8.0 | 0.0013 ns | 0.0029 ns | 0.0026 ns | 0.0000 ns | 6 B | - |
| Clamp4Int | .NET 9.0 | .NET 9.0 | 5.3244 ns | 0.0456 ns | 0.0404 ns | 5.3207 ns | 369 B | - |
| CastFloatToInt | .NET 9.0 | .NET 9.0 | 0.4521 ns | 0.0130 ns | 0.0116 ns | 0.4510 ns | 66 B | - |
; Program.MathBenchmark.CastFloatToInt()
vmovss xmm0,dword ptr [rcx+8]
vbroadcastss xmm1,xmm0
vbroadcastss xmm2,xmm0
vbroadcastss xmm0,xmm0
vcmpeqps xmm1,xmm2,xmm1
vandps xmm0,xmm1,xmm0
vcmpgeps xmm1,xmm0,[7FFBBAA48A30]
vcvttss2si eax,xmm0
vmovd xmm0,eax
vpbroadcastd xmm0,xmm0
vpblendvb xmm0,xmm0,[7FFBBAA48A40],xmm1
vmovd eax,xmm0
ret
; Total bytes of code 66
@EgorBo, just in case you didn't see.
The change here is intentional and should rarely be the bottleneck for end-to-end applications. In the case it is a bottleneck and you do not need determinism or commutativity with regards to `NaN`, `+0`, or `-0`, you can use the new `Vector128.MinNative` and `Vector128.MaxNative` APIs.
-- That is, instructions like `maxps dst, x, y` on x86/x64 return `y` if both operands are `0` of either sign (meaning `MaxNative(+0, -0)` returns `-0`, while `MaxNative(-0, +0)` returns `+0`) and likewise return `y` if either operand is `NaN` (meaning `MaxNative(NaN, 1)` returns `1`, while `MaxNative(1, NaN)` returns `NaN`). The behavior may then differ on other platforms, like Arm64 or WASM, which can lead to bugs in your code if you're not careful. There is likewise no guarantee that `MaxNative` will always emit `maxps` on x86/x64 hardware; it is free to optimize to other alternative instructions in the future, such as `vrangeps` on AVX512-capable hardware, or some future instruction that exactly follows IEEE 754 semantics. The main consideration of `MaxNative`/`MinNative` is that they do what is most efficient and don't guarantee the behavior in said edge cases.
I think it is also expected, but we might improve it. in .net 9.0 we also fixed some float -> unsigned issues to make it more deterministic
Same here for casting. There is `Vector128.ConvertToUInt32Native` (and related APIs) for doing efficient (but non-deterministic across different CPUs) conversions.
Alright, thanks for the explanation.
Description
For my BVH building I need to find pairs of bounding boxes which, when combined, produce a bounding box with the smallest surface area. After upgrading from .NET 8 to preview 9.0.100-rc.1.24452.12, I noticed a negative performance impact on this code. I made these benchmarks, which should capture the issue. `FindSmallestArea` shows what I am doing in my actual code. `SimplifiedFindSmallestArea` is simplified as much as possible while still producing the perf difference.
Benchmark Results