Closed sejongoh closed 4 years ago
static void callBack_RPInvoke_Vector3Arg_Unix2(
Vector3 v3f32_xmm0,
float f32_xmm2,
float f32_xmm3,
float f32_xmm4,
float f32_xmm5,
float f32_xmm6,
float f32_xmm7,
float f32_mem0,
Vector3 v3f32_mem0,
float f32_mem1,
float f32_mem2,
Vector3 v3f32_mem3,
float f32_mem4)
The upper bits are not initialized to zero for v3f32_mem0 and v3f32_mem3.
We were lucky with v3f32_mem0.z because it holds a very small number, 4.59163468e-41, so v3f32_mem0.z x v3f32_mem0.z becomes zero.
v3f32_mem3.z holds a bigger value, so v3f32_mem3.z x v3f32_mem3.z is not zero.
v3f32_xmm0, which is mapped to registers, is okay since the JIT generates mov instructions that fill the upper bits with zeros.
Lowering converts simd12 to simd16 because it expects the upper bits to be already initialized to zero.
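To make the failure mode concrete, here is a minimal sketch of what happens when a 12-byte Vector3 is read as 16 bytes and the fourth lane holds whatever the caller left there. Slot16, makeSlot, dot16 and dot12 are hypothetical names for illustration only, not JIT code, and the (1, 2, 3) inputs are arbitrary.

```cpp
// Hypothetical stand-in for a 16-byte stack slot that holds a 12-byte Vector3.
// lane[3] models the padding that a native caller never initialized.
struct Slot16 {
    float lane[4];
};

// Builds a slot with x, y, z set and the padding lane set to an arbitrary value.
Slot16 makeSlot(float x, float y, float z, float garbage) {
    Slot16 s;
    s.lane[0] = x;
    s.lane[1] = y;
    s.lane[2] = z;
    s.lane[3] = garbage;
    return s;
}

// Dot product over all four lanes, i.e. what a Dot on a simd16 value computes.
float dot16(const Slot16& a, const Slot16& b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        sum += a.lane[i] * b.lane[i];
    }
    return sum;
}

// Dot product over the three real lanes only, i.e. the intended Vector3 result.
float dot12(const Slot16& a, const Slot16& b) {
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i) {
        sum += a.lane[i] * b.lane[i];
    }
    return sum;
}
```

With a denormal such as 4.59163468e-41 in the padding lane, the extra product underflows to zero in single precision and dot16 agrees with dot12. With any ordinary value in that lane, the two results differ.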
Before lowering
***** BB05, stmt 5 (top level)
( 19, 15) [000014] ------------ * stmtExpr void (top level) (IL 0x001...0x00A)
N001 ( 3, 2) [000005] -------N---- | /--* lclVar simd12 V11 arg11
N004 ( 3, 2) [000006] -------N---- | +--* lclVar simd12 V11 arg11
N007 ( 19, 15) [000011] ---XG------- | /--* simd float float Dot
N009 ( 19, 15) [000013] DA-XG------- \--* st.lclVar float V14 tmp0
After lowering
( 19, 15) [000014] ------------ * stmtExpr void (top level) (IL 0x001...0x00A)
N027 ( 3, 2) [000005] -------N---- | /--* lclVar simd16 V11 arg11 REG mm0
N029 ( 3, 2) [000006] -------N---- | +--* lclVar simd16 V11 arg11 REG mm1
N031 ( 19, 15) [000011] ---XG------- | /--* simd float float Dot REG mm0
N033 ( 19, 15) [000013] DA-XG------- \--* st.lclVar float V14 tmp0 REG NA
The conversion from simd12 to simd16 is done here:
[lower.cpp]
        case GT_LCL_VAR:
        case GT_STORE_LCL_VAR:
            if ((*ppTree)->TypeGet() == TYP_SIMD12)
            {
#ifdef _TARGET_64BIT_
                // Assumption 1:
                // RyuJit backend depends on the assumption that on 64-Bit targets Vector3 size is rounded off
                // to TARGET_POINTER_SIZE and hence Vector3 locals on stack can be treated as TYP_SIMD16 for
                // reading and writing purposes.
                //
                // Assumption 2:
                // RyuJit backend is making another implicit assumption that Vector3 type args when passed in
                // registers or on stack, the upper most 4-bytes will be zero.
                //
                // TODO-64bit: assumptions 1 and 2 hold within RyuJIT generated code. It is not clear whether
                // these assumptions hold when a Vector3 type arg is passed by native code. Example: PInvoke
                // returning Vector3 type value or RPInvoke passing Vector3 type args.
                (*ppTree)->gtType = TYP_SIMD16;
#else
                NYI("Lowering of TYP_SIMD12 locals");
#endif // _TARGET_64BIT_
            }
For Windows, the JIT generates an indir and ends up using different instructions.
***** BB05, stmt 5 (top level)
( 15, 11) [000014] ------------ * stmtExpr void (top level) (IL 0x001...0x00A)
N001 ( 1, 1) [000005] ------------ | /--* lclVar byref V08 arg8
N002 ( 7, 5) [000010] ---XG------- | /--* indir simd12
N003 ( 1, 1) [000006] ------------ | | /--* lclVar byref V08 arg8
N004 ( 7, 5) [000008] ---XG------- | +--* indir simd12
N005 ( 15, 11) [000011] ---XG------- | /--* simd float float Dot
N007 ( 15, 11) [000013] DA-XG------- \--* st.lclVar float V14 tmp0
Also, lowering doesn't convert simd12 to simd16 in this case.
N002 ( 7, 5) [000010] ---XG------- * indir simd12
No addressing mode
vmovss clears the upper bits.
Generating: N033 ( 7, 5) [000008] ---XG------- * indir simd12 REG mm1
Byref regs: 00000002 {rcx} => 00000000 {}
IN000a: vmovss ymm2, dword ptr [rcx+8]
IN000b: vmovsd ymm1, qword ptr [rcx]
IN000c: vshufps ymm1, ymm2, 68
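The three-instruction sequence in the dump can be approximated with SSE intrinsics. This is only a sketch of the same idea (an 8-byte load, a 4-byte load, then a shuffle with immediate 68), not the code the JIT emits; loadVector3ZeroUpper and lane are hypothetical helper names, and the block assumes an x86/x64 target.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Loads a 12-byte Vector3 from memory into a 16-byte register and leaves
// lane 3 as zero, no matter what follows the Vector3 in memory.
__m128 loadVector3ZeroUpper(const float* p) {
    // 8-byte load of x and y into the low half; the high half comes from the zero vector.
    __m128 xy = _mm_loadl_pi(_mm_setzero_ps(), reinterpret_cast<const __m64*>(p)); // [x, y, 0, 0]
    // 4-byte load of z; the scalar load zeroes the remaining lanes.
    __m128 z = _mm_load_ss(p + 2);                                                 // [z, 0, 0, 0]
    // Immediate 68 (0x44) selects lanes 0,1 of the first operand and lanes 0,1 of the second.
    return _mm_shuffle_ps(xy, z, 68);                                              // [x, y, z, 0]
}

// Inspection helper: returns lane i (0..3) of a 128-bit vector.
float lane(__m128 v, int i) {
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```

Because only 12 bytes are ever read, the value in the fourth float of the source memory never reaches the register, which is why this path does not depend on the caller zeroing the padding.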
Since the native compilers don't initialize the upper bits to zero, JIT has to handle this. In this Linux case, it seems to be missing.
@CarolEidt Where should we fix the problem?
cc @dotnet/jit-contrib
From discussion with @CarolEidt, Windows doesn't trigger the following transform since the parameter is treated as pass-by-reference, and it passes.
[Before]
RewriteSimpleTransforms, with statement:
( 19, 15) [000014] ------------ * stmtExpr void (top level) (IL 0x001...0x00A)
N001 ( 3, 2) [000005] -------N---- | /--* lclVar simd12 V08 arg8
N002 ( 3, 3) [000009] L----------- | /--* addr byref
N003 ( 9, 7) [000010] ---XG------- | /--* obj simd12
N004 ( 3, 2) [000006] -------N---- | | /--* lclVar simd12 V08 arg8
N005 ( 3, 3) [000007] L----------- | | /--* addr byref
N006 ( 9, 7) [000008] ---XG------- | +--* obj simd12
N007 ( 19, 15) [000011] ---XG------- | /--* simd float float Dot
N008 ( 1, 2) [000012] D------N---- | +--* lclVar float V14 tmp0
N009 ( 19, 15) [000013] -A-XG---R--- \--* = float
[After]
***** BB05, stmt 5 (top level)
( 19, 15) [000014] ------------ * stmtExpr void (top level) (IL 0x001...0x00A)
N001 ( 3, 2) [000005] -------N---- | /--* lclVar simd12 V11 arg11
N004 ( 3, 2) [000006] -------N---- | +--* lclVar simd12 V11 arg11
N007 ( 19, 15) [000011] ---XG------- | /--* simd float float Dot
N009 ( 19, 15) [000013] DA-XG------- \--* st.lclVar float V14 tmp0
One solution is to fix the prolog generator to clear the upper bits for arguments mapped on stack.
I think that making the change in the prolog is the right solution. The other near-term option would be to recognize when you are loading true pass-by-value SIMD12 arguments (i.e. not the copy and pass by reference args as for x64 Windows), and always do the clearing of the upper bits then. But then you'd pay the price on each reference. Of course, another option would be to change the model to always assume that the upper bits may be garbage (in registers or on stack), and generate code accordingly. But that is a bigger change, and some view it as more risky (see dotnet/runtime#4543).
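The trade-off between the first two options can be sketched as follows. This is an illustrative model only, not JIT code: ArgSlot, prologClearUpper, dotTrusting and dotMasking are hypothetical names, and the 16-byte home slot layout is an assumption taken from the discussion above.

```cpp
// Hypothetical model of a stack-passed Vector3 argument in a 16-byte home slot.
struct ArgSlot {
    float lane[4];  // lane[3] is padding that a native caller may leave uninitialized
};

// Option 1 (prolog fix): zero the padding once on method entry.
// Every later full-width read of the slot can then trust lane[3].
void prologClearUpper(ArgSlot& arg) {
    arg.lane[3] = 0.0f;
}

// Full-width dot product that trusts lane[3]; only correct after prologClearUpper.
float dotTrusting(const ArgSlot& a, const ArgSlot& b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        sum += a.lane[i] * b.lane[i];
    }
    return sum;
}

// Option 2 (per-reference fix): ignore the padding on every read instead.
// Correct without a prolog change, but the extra work is paid at each use.
float dotMasking(const ArgSlot& a, const ArgSlot& b) {
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i) {
        sum += a.lane[i] * b.lane[i];
    }
    return sum;
}
```

In this model the prolog approach pays one store per argument per call, while the per-reference approach pays at every use of the argument, which is the cost mentioned above.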
As discussed in dotnet/runtime#5181, we need to clear the upper bits for register arguments.
@briansull @CarolEidt @kyulee1 How about ARM64? Do we need to clear the upper bits? If so, do we need a fix now?
Currently we don't have support in CodegenArm64.cpp for SIMD types on ARM64. We do have support for encoding the SIMD instructions for Arm64 in emitarm64.cpp.
So no, you don't have to do anything here for Arm64.
@briansull Thanks!
@sejongoh - this can be closed now?
Fixed by dotnet/coreclr#4963
Vector3Arg_Unix2 case fails.
The expected sum results are 14, 77, 194, 32, 122, 1045.
The actual results: