Closed am11 closed 2 years ago
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
Author: | am11 |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr`, `untriaged` |
Milestone: | - |
Likely fixed in .NET 7 by #63720.
Tested with main
, looks to be same issue:
$ git rev-parse HEAD
207f2b66678138d18a44339fa5326ad627efec22
$ ./build.sh clr+libs+host -c release -rc checked
$ DOTNET_JitDisasm=M1* artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun \
/issue-64879/bin/release/net6.0/issue-64879.dll
; Assembly listing for method C:M1(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 1, 1 ) float -> [rbp-04H] do-not-enreg[]
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] do-not-enreg[] "OutgoingArgSpace"
;
; Lcl frame size = 16
G_M24324_IG01: ;; offset=0000H
55 push rbp
4883EC10 sub rsp, 16
C5F877 vzeroupper
488D6C2410 lea rbp, [rsp+10H]
C5FA1145FC vmovss dword ptr [rbp-04H], xmm0
;; bbWeight=1 PerfScore 3.75
G_M24324_IG02: ;; offset=0012H
C5FA1045FC vmovss xmm0, dword ptr [rbp-04H]
C5F8560511000000 vorps xmm0, xmm0, xmmword ptr [reloc @RWD00]
;; bbWeight=1 PerfScore 5.00
G_M24324_IG03: ;; offset=001FH
4883C410 add rsp, 16
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 1.75
RWD00 dq 0000000200000002h, 0000000200000002h
; Total bytes of code 37, prolog size 13, PerfScore 14.50, instruction count 10, allocated bytes for code 40 (MethodHash=3760a0fb) for method C:M1(float):float
; ============================================================
; Assembly listing for method C:M1_afterRefactoring(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 1, 1 ) float -> [rbp-04H] do-not-enreg[]
; V01 loc0 [V01 ] ( 1, 1 ) simd16 -> [rbp-20H] do-not-enreg[S]
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] do-not-enreg[] "OutgoingArgSpace"
;
; Lcl frame size = 32
G_M50981_IG01: ;; offset=0000H
55 push rbp
4883EC20 sub rsp, 32
C5F877 vzeroupper
488D6C2420 lea rbp, [rsp+20H]
C5FA1145FC vmovss dword ptr [rbp-04H], xmm0
;; bbWeight=1 PerfScore 3.75
G_M50981_IG02: ;; offset=0012H
C5F9100526000000 vmovupd xmm0, xmmword ptr [reloc @RWD00]
C5F92945E0 vmovapd xmmword ptr [rbp-20H], xmm0
C5FA1045FC vmovss xmm0, dword ptr [rbp-04H]
C5F85645E0 vorps xmm0, xmm0, xmmword ptr [rbp-20H]
;; bbWeight=1 PerfScore 9.00
G_M50981_IG03: ;; offset=0029H
4883C420 add rsp, 32
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 1.75
RWD00 dq 0000000200000002h, 0000000200000002h
; Total bytes of code 47, prolog size 13, PerfScore 19.70, instruction count 12, allocated bytes for code 52 (MethodHash=ea0038da) for method C:M1_afterRefactoring(float):float
; ============================================================
; Tier-0 compilation
Take a look at optimized code (via COMPlus_TieredCompilation=0
or equivalent).
@AndyAyersMS not substituted even in tier1 unfortunately, Fsub rejects it with
[000004]: would change struct handle (assignment)
*************** In fgMarkAddressExposedLocals()
LocalAddressVisitor visiting statement:
STMT00000 ( 0x000[E-] ... 0x00B )
[000004] -A---------- * ASG simd16 (copy)
[000002] D------N---- +--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1[Single]> V01 loc0
[000001] ------------ \--* HWINTRINSIC simd16 int Create
[000000] ------------ \--* CNS_INT int 2
LocalAddressVisitor visiting statement:
STMT00001 ( 0x00C[E-] ... 0x01D )
[000010] ------------ * RETURN float
[000009] ------------ \--* HWINTRINSIC float float ToScalar
[000008] ------------ \--* HWINTRINSIC simd16 float Or
[000006] ------------ +--* HWINTRINSIC simd16 float CreateScalarUnsafe
[000005] ------------ | \--* LCL_VAR float V00 arg0
[000007] ------------ \--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1[Single]> V01 loc0
*************** Finishing PHASE Morph - Structs/AddrExp
*************** Starting PHASE Forward Substitution
===> BB01
[000004]: would change struct handle (assignment)
*************** Finishing PHASE Forward Substitution [no changes]
it's also the reason why I in https://github.com/dotnet/runtime/pull/64821 substituted vectors by hands:
; Tier-0 compilation
Take a look at optimized code (via
COMPlus_TieredCompilation=0
or equivalent).
Ah, right, had to use
DOTNET_TieredCompilation=0 DOTNET_JitDisasm=M1* artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun \
/issue-64879/bin/release/net6.0/issue-64879.dll
and decorated methods with [MethodImpl(MethodImplOptions.NoInlining)]
(since the inliner works very well :+1:)
Result is aligned with @EgorBo's findings:
; Assembly listing for method C:M1(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) float -> mm0 single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M24324_IG01: ;; offset=0000H
C5F877 vzeroupper
;; bbWeight=1 PerfScore 1.00
G_M24324_IG02: ;; offset=0003H
C5F8560505000000 vorps xmm0, xmm0, xmmword ptr [reloc @RWD00]
;; bbWeight=1 PerfScore 2.00
G_M24324_IG03: ;; offset=000BH
C3 ret
;; bbWeight=1 PerfScore 1.00
RWD00 dq 0000000200000002h, 0000000200000002h
; Total bytes of code 12, prolog size 3, PerfScore 5.30, instruction count 3, allocated bytes for code 13 (MethodHash=3760a0fb) for method C:M1(float):float
; ============================================================
; Assembly listing for method C:M1_afterRefactoring(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) float -> mm0 single-def
; V01 loc0 [V01,T01] ( 2, 2 ) simd16 -> mm1
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M50981_IG01: ;; offset=0000H
C5F877 vzeroupper
;; bbWeight=1 PerfScore 1.00
G_M50981_IG02: ;; offset=0003H
C5F9100D15000000 vmovupd xmm1, xmmword ptr [reloc @RWD00]
C5F856C1 vorps xmm0, xmm0, xmm1
;; bbWeight=1 PerfScore 3.33
G_M50981_IG03: ;; offset=000FH
C3 ret
;; bbWeight=1 PerfScore 1.00
RWD00 dq 0000000200000002h, 0000000200000002h
; Total bytes of code 16, prolog size 3, PerfScore 7.13, instruction count 4, allocated bytes for code 18 (MethodHash=ea0038da) for method C:M1_afterRefactoring(float):float
; ============================================================
Ok, thanks. Probably something we can relax in fwd sub for suitable pairs of struct handles.
Forward sub still not handling this, but the initial example is fixed in current main.
Not clear if that fix is fully general.
@tannergooding do you know the right compatibility check here? Given this tree:
[000004] -A---------- * ASG simd16 (copy)
[000002] D------N---- +--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1[Single]> V01 loc0
[000001] ------------ \--* HWINTRINSIC simd16 int Create
[000000] ------------ \--* CNS_INT int 2
when / how should we substitute the simd16 int
node for uses of the differently typed loc0
?
If the downstream code is already able to handle the class mismatch in places other than assignment, then perhaps we can just remove the check, though I'm pretty sure I added it because not blocking this substitution caused problems. It might be better to retype the constant?
@AndyAyersMS, for HWIntrinsic ops the type of the input doesn't matter, its effectively just a raw set of 16-bytes. Its the actual operation that really matters and so HWINTRINSIC simd16 int Add
will always treat it as a Vector128<int>
regardless of what the type of the input node is.
There is no difference for any platforms at the ABI level either. Vector128<T>
is just the singular __m128
, regardless of T
.
If the downstream code is already able to handle the class mismatch in places other than assignment, then perhaps we can just remove the check, though I'm pretty sure I added it because not blocking this substitution caused problems. It might be better to retype the constant?
I'm surprised this one is being created, the new GT_CNS_VEC
support should mean this gets converted into a CNS_VEC simd16 <2, 2, 2, 2>
node (this can happen at import or in morph) and the type information should already be getting "stripped".
The Create
call itself is important to keep as int
(since creating an int
vs a float
is different) but the downstream consumer of such a node shouldn't matter. If there is such a place, then I'd expect it to be a bug especially since we magic away vector.AsInt32()
casts (and similar) in importation.
I'm surprised this one is being created, the new
GT_CNS_VEC
support should mean this gets converted into aCNS_VEC simd16 <2, 2, 2, 2>
node (this can happen at import or in morph) and the type information should already be getting "stripped".
For that particular case we likely fix it up in morph (at least its still there in fwd sub). I have the jitdump lying around, will take a closer look.
So it seems like maybe we can just remove this fwd sub restriction for simd types? Will try that too.
So it seems like maybe we can just remove this fwd sub restriction for simd types?
That's my understanding. I'd be interested if any issues crop up here as I'd expect they are subtle bugs we've not encountered yet.
Yeah, we now create
[000004] -A--------- * ASG simd16 (copy)
[000002] D------N--- +--* LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1[System.Single]> V01 loc0
[000001] ----------- \--* CNS_VEC simd16<0x00000002, 0x00000002, 0x00000002, 0x00000002>
but fwd sub is still wary as the class handles differ.
Slightly refactoring vector code can degrade CQ in some cases:
.NET 6 x64 codegen: