dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.38k stars 4.75k forks source link

JIT: Bad codegen with `Avx512DQ.VL.Range` #106610

Closed jakobbotsch closed 2 months ago

jakobbotsch commented 3 months ago
// Generated by Fuzzlyn v2.2 on 2024-08-17 17:40:06
// Run on X86 Windows
// Seed: 1343518557351353159-vectort,vector128,vector256,vector512,x86aes,x86avx,x86avx2,x86avx512bw,x86avx512bwvl,x86avx512cd,x86avx512cdvl,x86avx512dq,x86avx512dqvl,x86avx512f,x86avx512fvl,x86avx512vbmi,x86avx512vbmivl,x86bmi1,x86bmi2,x86fma,x86lzcnt,x86pclmulqdq,x86popcnt,x86sse,x86sse2,x86sse3,x86sse41,x86sse42,x86ssse3,x86x86base
// Reduced from 171.2 KiB to 0.6 KiB in 00:06:37
// Debug: Outputs <4292870144, 0, 0, 0, 0, 0, 0, 0>
// Release: Outputs <0, 0, 0, 0, 0, 0, 0, 0>
using System;
using System.Runtime.CompilerServices;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public class C1
{
    public Vector256<float> F5;
    public C1(Vector256<float> f5)
    {
        F5 = f5;
    }
}

public class Program
{
    public static void Main()
    {
        var vr4 = Vector256.Create<float>(0);
        var vr5 = Vector256.CreateScalar(1f);
        var vr6 = Vector256.CreateScalar(-10f);
        var vr7 = Avx.Or(vr5, vr6);
        C1 vr8 = new C1(Avx512DQ.VL.Range(vr4, vr7, 0));
        System.Console.WriteLine(vr8.F5.AsUInt32());
    }
}

cc @dotnet/jit-contrib

dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

AndyAyersMS commented 3 months ago

Preview 7 is ok

G_M000_IG01:                ;; offset=0x0000
       sub      esp, 32

G_M000_IG02:                ;; offset=0x0003
       vxorps   ymm0, ymm0, ymm0
       vrangeps ymm0, ymm0, ymmword ptr [@RWD00], 0
       vmovups  ymmword ptr [esp], ymm0
       mov      ecx, 0x8B3C8AC
       call     CORINFO_HELP_NEWSFAST
       vmovups  ymm0, ymmword ptr [esp]
       vmovups  ymmword ptr [eax+0x04], ymm0
       mov      ecx, eax
       call     [System.Console:WriteLine(System.Object)]

G_M000_IG03:                ;; offset=0x0033
       vzeroupper
       add      esp, 32
       ret

RWD00   dq      00000000FFA00000h, 0000000000000000h, 0000000000000000h, 0000000000000000h

but recent builds are not

G_M27646_IG01:  ;; offset=0x0000
       sub      esp, 32
                                                ;; size=3 bbWeight=1 PerfScore 0.25
G_M27646_IG02:  ;; offset=0x0003
       vxorps   ymm0, ymm0, ymm0
       vrangeps ymm0, ymm0, ymmword ptr [@RWD00], 0
       vmovups  ymmword ptr [esp], ymm0
       mov      ecx, 0xB6632E0      ; System.Runtime.Intrinsics.Vector256`1[uint]
       call     CORINFO_HELP_NEWSFAST
       vmovups  ymm0, ymmword ptr [esp]
       vmovups  ymmword ptr [eax+0x04], ymm0
       mov      ecx, eax
       call     [System.Console:WriteLine(System.Object)]
                                                ;; size=48 bbWeight=1 PerfScore 16.83
G_M27646_IG03:  ;; offset=0x0033
       vzeroupper
       add      esp, 32
       ret
                                                ;; size=7 bbWeight=1 PerfScore 2.25
RWD00   dq      00000000FFE00000h, 0000000000000000h, 0000000000000000h, 0000000000000000h

Only diff is the value in RWD00.

AndyAyersMS commented 3 months ago
***** BB01 [0000]
STMT00001 ( ??? ... 0x025 )
               [000009] DA---------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V01 loc1         
               [000008] -----------                         \--*  HWINTRINSIC simd32 float Or
               [000003] -----------                            +--*  CNS_VEC   simd32<0x000000003f800000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>
               [000005] -----------                            \--*  CNS_VEC   simd32<0x00000000c1200000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>

becomes

               [000009] DA---------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V01 loc1         
               [000005] -----+-----                         \--*  CNS_VEC   simd32<0x00000000ffe00000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>

so somehow an extra bit gets set in the vector constant (FFE instead of FFA).

AndyAyersMS commented 3 months ago

Looks like this is nan normalization? 0xFFA00000 is a nan float, and we normalize it to 0xFFE00000.

tannergooding commented 3 months ago

Not quite NaN normalization, but rather 0xFFA0_0000 is an sNaN (signalling NaN) because the most significant bit of the significand is zero and sNaN when encountered by the CPU raise the "signal" (throw an exception if IEEE 754 floating-point exception handling is enabled, which it never is for .NET), and then suppress the signal from being raised again in the future by setting the bit that converts it to a qNaN (quiet NaN).

[0xFFC0_0000, 0xFFFF_FFFF] is -qNaN [0xFF80_0001, 0xFFBF_FFFF] is -sNaN 0xFF80_0000 is -inf

0x7F80_0000 is +inf [0x7F80_0001, 0x7FBF_FFFF] is +sNaN [0x7FC0_0000, 0x7FFF_FFFF] is +qNaN

In general we (RyuJIT) try and avoid any explicit normalization and instead try to explicitly propagate wherever applicable (that is, if a NaN is passed in, we return precisely that NaN back). The only places we explicitly produce canonical NaN is when neither input is a NaN (so -inf + inf produces NaN, this requires a net new NaN to be produced so we choose the canonical NaN of 0xFFC0_0000 on xarch and 0x7FC0_0000 on arm/arm64/loongarch64/riscv64).

tannergooding commented 3 months ago

So this is somewhat expected behavior, but it would still be interesting to find out what is causing the CPU to see it as an sNaN for this particular case. I expect there's some place its being seen as a scalar float and we could instead pass it through as a scalar uint32_t to ensure that operations like and, not, or, xor, left shift, and right shift don't exhibit such behavior.

AndyAyersMS commented 3 months ago

The path we take is:

gtFoldExprHWIntrinsic

               [000008] ----------- >>>                     *  HWINTRINSIC simd32 float Or
               [000005] -----+-----                         +--*  CNS_VEC   simd32<0x00000000c1200000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>
               [000003] -----+-----                         \--*  CNS_VEC   simd32<0x000000003f800000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000>

GenTreeVecCon::EvaluateBinaryInPlace
EvaluateBinarySimd<simd32_t>
EvaluateBinarySimd<simd32_t,float>
EvaluateBinaryScalar<float>(genTreeOps oper, float arg0, float arg1)
EvaluateBinaryScalarSpecialized<float>(genTreeOps oper, float arg0, float arg1)

uint32_t resultBits is 0xFFA0_0000

UInt32BitsToSingle produces float 0xFFE0_0000
tannergooding commented 3 months ago

👍, so we'd need to intercept this prir to extracting the scalar value out of the SIMD input (that is prior to EvaluateBinaryScalar

I think the simplest place to do that is EvaluateBinarySimd<simd32_t>, basically check for oper in case TYP_FLOAT/TYP_DOUBLE and dispatch instead to EvaluateBinarySimd<TSimd, int32_t/int64_t>: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simd.h#L1189-L1199

I can get a fix up for this.

AndyAyersMS commented 3 months ago

Ok, I will reassign this one to you.