Difference in code quality: inline vs. local vector

am11 commented 2 years ago

Slightly refactoring vector code can degrade CQ in some cases:

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class C
{
    static float M1(float v)
    {
        return Sse2.Or(Vector128.CreateScalarUnsafe(v), Vector128.Create(2).AsSingle()).ToScalar();
    }

    [SkipLocalsInit] // ".locals init" -> ".locals" in IL (it had no effect on codegen)
    static float M1_afterRefactoring(float v)
    {
        Vector128<float> op2 = Vector128.Create(2).AsSingle();
        return Sse2.Or(Vector128.CreateScalarUnsafe(v), op2).ToScalar();
    }
}

.NET 6 x64 codegen:

C.M1(Single)
    vzeroupper
    vorps xmm0, xmm0, [0x7ffb275b0450]
    ret

C.M1_afterRefactoring(Single)
    vzeroupper
    vmovupd xmm1, [0x7ffb275b04a0]
    vorps xmm0, xmm0, xmm1
    ret

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.

Issue Details

Slightly refactoring vector code can degrade CQ in some cases: ```c# using System.Runtime.CompilerServices; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86; class C { static float M1(float v) { return Sse2.Or(Vector128.CreateScalarUnsafe(v), Vector128.Create(2).AsSingle()).ToScalar(); } [SkipLocalsInit] // ".locals init" -> ".locals" in IL (it had no effect on codegen) static float M1_afterRefactoring(float v) { Vector128 op2 = Vector128.Create(2).AsSingle(); return Sse2.Or(Vector128.CreateScalarUnsafe(v), op2).ToScalar(); } } ``` .NET 6 x64 codegen: ```asm C.M1(Single) vzeroupper vorps xmm0, xmm0, [0x7ffb275b0450] ret C.M1_afterRefactoring(Single) vzeroupper vmovupd xmm1, [0x7ffb275b04a0] vorps xmm0, xmm0, xmm1 ret ```

Author:	am11
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `untriaged`
Milestone:	-

AndyAyersMS commented 2 years ago

Likely fixed in .NET 7 by #63720.

am11 commented 2 years ago

Tested with main, looks to be same issue:

$ git rev-parse HEAD
207f2b66678138d18a44339fa5326ad627efec22

$ ./build.sh clr+libs+host -c release -rc checked
$ DOTNET_JitDisasm=M1* artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun \
       /issue-64879/bin/release/net6.0/issue-64879.dll

; Assembly listing for method C:M1(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  1,  1   )   float  ->  [rbp-04H]   do-not-enreg[]
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   do-not-enreg[] "OutgoingArgSpace"
;
; Lcl frame size = 16

G_M24324_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC10             sub      rsp, 16
       C5F877               vzeroupper 
       488D6C2410           lea      rbp, [rsp+10H]
       C5FA1145FC           vmovss   dword ptr [rbp-04H], xmm0
                                                ;; bbWeight=1    PerfScore 3.75
G_M24324_IG02:              ;; offset=0012H
       C5FA1045FC           vmovss   xmm0, dword ptr [rbp-04H]
       C5F8560511000000     vorps    xmm0, xmm0, xmmword ptr [reloc @RWD00]
                                                ;; bbWeight=1    PerfScore 5.00
G_M24324_IG03:              ;; offset=001FH
       4883C410             add      rsp, 16
       5D                   pop      rbp
       C3                   ret      
                                                ;; bbWeight=1    PerfScore 1.75
RWD00   dq      0000000200000002h, 0000000200000002h

; Total bytes of code 37, prolog size 13, PerfScore 14.50, instruction count 10, allocated bytes for code 40 (MethodHash=3760a0fb) for method C:M1(float):float
; ============================================================

; Assembly listing for method C:M1_afterRefactoring(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  1,  1   )   float  ->  [rbp-04H]   do-not-enreg[]
;  V01 loc0         [V01    ] (  1,  1   )  simd16  ->  [rbp-20H]   do-not-enreg[S]
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   do-not-enreg[] "OutgoingArgSpace"
;
; Lcl frame size = 32

G_M50981_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488D6C2420           lea      rbp, [rsp+20H]
       C5FA1145FC           vmovss   dword ptr [rbp-04H], xmm0
                                                ;; bbWeight=1    PerfScore 3.75
G_M50981_IG02:              ;; offset=0012H
       C5F9100526000000     vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       C5F92945E0           vmovapd  xmmword ptr [rbp-20H], xmm0
       C5FA1045FC           vmovss   xmm0, dword ptr [rbp-04H]
       C5F85645E0           vorps    xmm0, xmm0, xmmword ptr [rbp-20H]
                                                ;; bbWeight=1    PerfScore 9.00
G_M50981_IG03:              ;; offset=0029H
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret      
                                                ;; bbWeight=1    PerfScore 1.75
RWD00   dq      0000000200000002h, 0000000200000002h

; Total bytes of code 47, prolog size 13, PerfScore 19.70, instruction count 12, allocated bytes for code 52 (MethodHash=ea0038da) for method C:M1_afterRefactoring(float):float
; ============================================================

AndyAyersMS commented 2 years ago

; Tier-0 compilation

Take a look at optimized code (via COMPlus_TieredCompilation=0 or equivalent).

EgorBo commented 2 years ago

@AndyAyersMS not substituted even in tier1 unfortunately, Fsub rejects it with

    [000004]:  would change struct handle (assignment)


*************** In fgMarkAddressExposedLocals()
LocalAddressVisitor visiting statement:
STMT00000 ( 0x000[E-] ... 0x00B )
               [000004] -A----------              *  ASG       simd16 (copy)
               [000002] D------N----              +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[Single]> V01 loc0         
               [000001] ------------              \--*  HWINTRINSIC simd16 int Create
               [000000] ------------                 \--*  CNS_INT   int    2

LocalAddressVisitor visiting statement:
STMT00001 ( 0x00C[E-] ... 0x01D )
               [000010] ------------              *  RETURN    float 
               [000009] ------------              \--*  HWINTRINSIC float  float ToScalar
               [000008] ------------                 \--*  HWINTRINSIC simd16 float Or
               [000006] ------------                    +--*  HWINTRINSIC simd16 float CreateScalarUnsafe
               [000005] ------------                    |  \--*  LCL_VAR   float  V00 arg0         
               [000007] ------------                    \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[Single]> V01 loc0         

*************** Finishing PHASE Morph - Structs/AddrExp

*************** Starting PHASE Forward Substitution

===> BB01
    [000004]:  would change struct handle (assignment)

*************** Finishing PHASE Forward Substitution [no changes]

EgorBo commented 2 years ago

it's also the reason why I in https://github.com/dotnet/runtime/pull/64821 substituted vectors by hands:

am11 commented 2 years ago

; Tier-0 compilation
Take a look at optimized code (via COMPlus_TieredCompilation=0 or equivalent).

Ah, right, had to use

DOTNET_TieredCompilation=0 DOTNET_JitDisasm=M1* artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun \
     /issue-64879/bin/release/net6.0/issue-64879.dll

and decorated methods with [MethodImpl(MethodImplOptions.NoInlining)] (since the inliner works very well :+1:) Result is aligned with @EgorBo's findings:

; Assembly listing for method C:M1(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M24324_IG01:              ;; offset=0000H
       C5F877               vzeroupper 
                                                ;; bbWeight=1    PerfScore 1.00
G_M24324_IG02:              ;; offset=0003H
       C5F8560505000000     vorps    xmm0, xmm0, xmmword ptr [reloc @RWD00]
                                                ;; bbWeight=1    PerfScore 2.00
G_M24324_IG03:              ;; offset=000BH
       C3                   ret      
                                                ;; bbWeight=1    PerfScore 1.00
RWD00   dq      0000000200000002h, 0000000200000002h

; Total bytes of code 12, prolog size 3, PerfScore 5.30, instruction count 3, allocated bytes for code 13 (MethodHash=3760a0fb) for method C:M1(float):float
; ============================================================

; Assembly listing for method C:M1_afterRefactoring(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0         single-def
;  V01 loc0         [V01,T01] (  2,  2   )  simd16  ->  mm1        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M50981_IG01:              ;; offset=0000H
       C5F877               vzeroupper 
                                                ;; bbWeight=1    PerfScore 1.00
G_M50981_IG02:              ;; offset=0003H
       C5F9100D15000000     vmovupd  xmm1, xmmword ptr [reloc @RWD00]
       C5F856C1             vorps    xmm0, xmm0, xmm1
                                                ;; bbWeight=1    PerfScore 3.33
G_M50981_IG03:              ;; offset=000FH
       C3                   ret      
                                                ;; bbWeight=1    PerfScore 1.00
RWD00   dq      0000000200000002h, 0000000200000002h

; Total bytes of code 16, prolog size 3, PerfScore 7.13, instruction count 4, allocated bytes for code 18 (MethodHash=ea0038da) for method C:M1_afterRefactoring(float):float
; ============================================================

AndyAyersMS commented 2 years ago

Ok, thanks. Probably something we can relax in fwd sub for suitable pairs of struct handles.

AndyAyersMS commented 2 years ago

Forward sub still not handling this, but the initial example is fixed in current main.

Not clear if that fix is fully general.

AndyAyersMS commented 2 years ago

@tannergooding do you know the right compatibility check here? Given this tree:

               [000004] -A----------              *  ASG       simd16 (copy)
               [000002] D------N----              +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[Single]> V01 loc0         
               [000001] ------------              \--*  HWINTRINSIC simd16 int Create
               [000000] ------------                 \--*  CNS_INT   int    2

when / how should we substitute the simd16 int node for uses of the differently typed loc0?

If the downstream code is already able to handle the class mismatch in places other than assignment, then perhaps we can just remove the check, though I'm pretty sure I added it because not blocking this substitution caused problems. It might be better to retype the constant?

tannergooding commented 2 years ago

@AndyAyersMS, for HWIntrinsic ops the type of the input doesn't matter, its effectively just a raw set of 16-bytes. Its the actual operation that really matters and so HWINTRINSIC simd16 int Add will always treat it as a Vector128<int> regardless of what the type of the input node is.

There is no difference for any platforms at the ABI level either. Vector128<T> is just the singular __m128, regardless of T.

If the downstream code is already able to handle the class mismatch in places other than assignment, then perhaps we can just remove the check, though I'm pretty sure I added it because not blocking this substitution caused problems. It might be better to retype the constant?

I'm surprised this one is being created, the new GT_CNS_VEC support should mean this gets converted into a CNS_VEC simd16 <2, 2, 2, 2> node (this can happen at import or in morph) and the type information should already be getting "stripped".

The Create call itself is important to keep as int (since creating an int vs a float is different) but the downstream consumer of such a node shouldn't matter. If there is such a place, then I'd expect it to be a bug especially since we magic away vector.AsInt32() casts (and similar) in importation.

AndyAyersMS commented 2 years ago

I'm surprised this one is being created, the new GT_CNS_VEC support should mean this gets converted into a CNS_VEC simd16 <2, 2, 2, 2> node (this can happen at import or in morph) and the type information should already be getting "stripped".

For that particular case we likely fix it up in morph (at least its still there in fwd sub). I have the jitdump lying around, will take a closer look.

So it seems like maybe we can just remove this fwd sub restriction for simd types? Will try that too.

tannergooding commented 2 years ago

So it seems like maybe we can just remove this fwd sub restriction for simd types?

That's my understanding. I'd be interested if any issues crop up here as I'd expect they are subtle bugs we've not encountered yet.

AndyAyersMS commented 2 years ago

Yeah, we now create

[000004] -A---------                         *  ASG       simd16 (copy)
[000002] D------N---                         +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[System.Single]> V01 loc0         
[000001] -----------                         \--*  CNS_VEC   simd16<0x00000002, 0x00000002, 0x00000002, 0x00000002>

but fwd sub is still wary as the class handles differ.

dotnet / runtime

Difference in code quality: inline vs. local vector #64879