Vector3Interop_r fails on Linux

sejongoh commented 8 years ago

Vector3Arg_Unix2 case fails.

The expected sum results are 14, 77, 194, 32, 122, 1045.

The actual results:

callBack_RPInvoke_Vector3Arg_Unix2:
    1, 2, 3
    4, 5, 6
    7, 8, 9
    Sum0,1,2,3,4,5 = 14, 77, 1.152921E+16, 32, 122, 1045

sejongoh commented 8 years ago

   static void callBack_RPInvoke_Vector3Arg_Unix2(
        Vector3 v3f32_xmm0, 
        float   f32_xmm2,
        float   f32_xmm3,
        float   f32_xmm4,
        float   f32_xmm5,
        float   f32_xmm6,
        float   f32_xmm7,
        float   f32_mem0,
        Vector3 v3f32_mem0,
        float   f32_mem1,
        float   f32_mem2,
        Vector3 v3f32_mem3,
        float   f32_mem4)

The upper bits are not initialized to zero for v3f32_mem0 and v3f32_mem3.

We were lucky with v3f32_mem0.z because it has a very small number, 4.59163468e-41 and v3f32_mem0.z x v3f32_mem0.z becomes zero.

v32f32_mem3.z has a bigger value and v3f32_mem0.z x v3f32_mem0.z is not zero.

v3f32_xmm0 which is mapped on registers is okay since JIT generates mov instructions filling in the upper bits with zeros.

Lowering converts simd12 to simd16 since it expects the upper bits are already initialized to zeros.

Before lowering

***** BB05, stmt 5 (top level)
     ( 19, 15) [000014] ------------             *  stmtExpr  void  (top level) (IL 0x001...0x00A)
N001 (  3,  2) [000005] -------N----             |     /--*  lclVar    simd12 V11 arg11        
N004 (  3,  2) [000006] -------N----             |     +--*  lclVar    simd12 V11 arg11        
N007 ( 19, 15) [000011] ---XG-------             |  /--*  simd      float  float Dot
N009 ( 19, 15) [000013] DA-XG-------             \--*  st.lclVar float  V14 tmp0

After lowering

     ( 19, 15) [000014] ------------             *  stmtExpr  void  (top level) (IL 0x001...0x00A)
N027 (  3,  2) [000005] -------N----             |     /--*  lclVar    simd16 V11 arg11         REG mm0
N029 (  3,  2) [000006] -------N----             |     +--*  lclVar    simd16 V11 arg11         REG mm1
N031 ( 19, 15) [000011] ---XG-------             |  /--*  simd      float  float Dot REG mm0
N033 ( 19, 15) [000013] DA-XG-------             \--*  st.lclVar float  V14 tmp0          REG NA

The conversion from simd12 to simd 16 is done here:

[lower.cpp]

 652     case GT_LCL_VAR:
 653     case GT_STORE_LCL_VAR:
 654         if ((*ppTree)->TypeGet() == TYP_SIMD12)
 655         {
 656 #ifdef _TARGET_64BIT_
 657             // Assumption 1:
 658             // RyuJit backend depends on the assumption that on 64-Bit targets Vector3 size is rounded off
 659             // to TARGET_POINTER_SIZE and hence Vector3 locals on stack can be treated as TYP_SIMD16 for
 660             // reading and writing purposes.
 661             //
 662             // Assumption 2:
 663             // RyuJit backend is making another implicit assumption that Vector3 type args when passed in
 664             // registers or on stack, the upper most 4-bytes will be zero.
 665             //
 666             // TODO-64bit: assumptions 1 and 2 hold within RyuJIT generated code. It is not clear whether
 667             // these assumptions hold when a Vector3 type arg is passed by native code. Example: PInvoke
 668             // returning Vector3 type value or RPInvoke passing Vector3 type args.
 669             (*ppTree)->gtType = TYP_SIMD16;
 670 #else
 671             NYI("Lowering of TYP_SIMD12 locals");
 672 #endif // _TARGET_64BIT_
 673         }

sejongoh commented 8 years ago

For Windows, JIT generates indir and it ends with up using different instructions.

***** BB05, stmt 5 (top level)
     ( 15, 11) [000014] ------------             *  stmtExpr  void  (top level) (IL 0x001...0x00A)
N001 (  1,  1) [000005] ------------             |        /--*  lclVar    byref  V08 arg8         
N002 (  7,  5) [000010] ---XG-------             |     /--*  indir     simd12
N003 (  1,  1) [000006] ------------             |     |  /--*  lclVar    byref  V08 arg8         
N004 (  7,  5) [000008] ---XG-------             |     +--*  indir     simd12
N005 ( 15, 11) [000011] ---XG-------             |  /--*  simd      float  float Dot
N007 ( 15, 11) [000013] DA-XG-------             \--*  st.lclVar float  V14 tmp0

Also, lowering doesn't convert simd12 to simd16 in this case.

N002 (  7,  5) [000010] ---XG-------             *  indir     simd12
  No addressing mode

vmovss clears the upper bits.

Generating: N033 (  7,  5) [000008] ---XG-------             *  indir     simd12 REG mm1
                            Byref regs: 00000002 {rcx} => 00000000 {}
IN000a:        vmovss   ymm2, dword ptr [rcx+8]
IN000b:        vmovsd   ymm1, qword ptr [rcx]
IN000c:        vshufps  ymm1, ymm2, 68

sejongoh commented 8 years ago

Since the native compilers don't initialize the upper bits to zero, JIT has to handle this. In this Linux case, it seems to be missing.

@CarolEidt Where should we fix the problem?

sejongoh commented 8 years ago

cc @dotnet/jit-contrib

From discussion with @CarolEidt, Windows doesn't trigger the following transform since the parameter is treated as pass-by-reference, and it passes.

[Before]

RewriteSimpleTransforms, with statement:
     ( 19, 15) [000014] ------------             *  stmtExpr  void  (top level) (IL 0x001...0x00A)
N001 (  3,  2) [000005] -------N----             |           /--*  lclVar    simd12 V08 arg8         
N002 (  3,  3) [000009] L-----------             |        /--*  addr      byref 
N003 (  9,  7) [000010] ---XG-------             |     /--*  obj       simd12
N004 (  3,  2) [000006] -------N----             |     |     /--*  lclVar    simd12 V08 arg8         
N005 (  3,  3) [000007] L-----------             |     |  /--*  addr      byref 
N006 (  9,  7) [000008] ---XG-------             |     +--*  obj       simd12
N007 ( 19, 15) [000011] ---XG-------             |  /--*  simd      float  float Dot
N008 (  1,  2) [000012] D------N----             |  +--*  lclVar    float  V14 tmp0         
N009 ( 19, 15) [000013] -A-XG---R---             \--*  =         float

[After]

***** BB05, stmt 5 (top level)
     ( 19, 15) [000014] ------------             *  stmtExpr  void  (top level) (IL 0x001...0x00A)
N001 (  3,  2) [000005] -------N----             |     /--*  lclVar    simd12 V11 arg11        
N004 (  3,  2) [000006] -------N----             |     +--*  lclVar    simd12 V11 arg11        
N007 ( 19, 15) [000011] ---XG-------             |  /--*  simd      float  float Dot
N009 ( 19, 15) [000013] DA-XG-------             \--*  st.lclVar float  V14 tmp0

One solutions is to fix the prolog generator to clear the upper bits for arguments mapped on stack.

CarolEidt commented 8 years ago

I think that making the change in the prolog is the right solution. The other near-term option would be to recognize when you are loading true pass-by-value SIMD12 arguments (i.e. not the copy and pass by reference args as for x64 windows), and always do the clearing of the upper bits then. But then you'd pay the price on each reference. Of course, another option would be to change the model to always assume that the upper bits may be garbage (in registers or on stack), and generate code accordingly. But that is a bigger change, and some view it as more risky (see dotnet/runtime#4543)

sejongoh commented 8 years ago

As discussed in dotnet/runtime#5181, We need to clear up the upper bits for register arguments.

sejongoh commented 8 years ago

@briansull @CarolEidt @kyulee1 How about ARM64? Do we need to clear the upper bits? If so, do we need a fix now?

briansull commented 8 years ago

Currently we don't have support in CodegenArm64.cpp for SIMD types on ARM64 We do have support for encoding the SIMD instructions for Arm64 in emitarm64.cpp

So, no you don't have to do anything here for Arm64

sejongoh commented 8 years ago

@briansull Thanks!

sivarv commented 8 years ago

@sejongoh - this can be closed now?

sejongoh commented 8 years ago

Fixed by dotnet/coreclr#4963

dotnet / runtime

Vector3Interop_r fails on Linux #5830