dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Struct type overhead higher than expected for small numerical vectors #12277

Closed: mrange closed this issue 4 years ago

mrange commented 5 years ago

Struct type overhead higher than expected for small numerical vectors

General

I am running dotnet 3.0 preview on Windows 10:

$ dotnet --version
3.0.100-preview-010184

My CPU is: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz

(so it doesn't have FMA support but does support AVX).

I have been messing around with the new intrinsics support in dotnet core 3.0 quite a lot and have had some success with it. I want to write a raymarcher using SIMD AVX.

To do so I declare a SIMD V3 struct type:

  // Snippet of a bigger program
  using VF = System.Runtime.Intrinsics.Vector256<float>;
  struct V3
  {
    public VF X;
    public VF Y;
    public VF Z;
  }

I know there are SIMD enabled types in System.Numerics.Vector but I want to do my own custom struct SIMD types.
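
For comparison, a hedged sketch of what the same struct-of-vectors shape would look like on top of System.Numerics (the V3N name is mine, not from the thread; Vector<float> is hardware-sized, typically 8 floats on machines with AVX2):

    using System.Numerics;

    // Same layout as V3, but built on the variable-width Vector<T>
    // type instead of the fixed 256-bit intrinsic vector type.
    struct V3N
    {
      public Vector<float> X;
      public Vector<float> Y;
      public Vector<float> Z;
    }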

When I inspect the disassembly I find code that seems to do nothing except slow down performance. I can work around these issues by inlining all code, but naturally I don't want to do that as it would significantly complicate my raymarchers.

I have the full code attached below.

I declared a few operators like so:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static V3 operator+(in V3 l, in V3 r)
    {
      return new V3(Avx.Add(l.X, r.X), Avx.Add(l.Y, r.Y), Avx.Add(l.Z, r.Z));
    }
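
(Side note, mine, not from the thread: each Vector256<float> is 32 bytes, so V3 is a 96-byte struct, and the in parameters pass it by reference rather than copying 96 bytes per argument.)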

My inner loop looks like this:

        var qre   = re*re;  // re and qre are V3 type
        var qim   = im*im;
        var reim  = re*im;
        re = qre - qim + cre;
        im = reim + reim + cim;
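
(For context, my reading rather than anything stated in the thread: this is the lane-wise complex-square step of a Mandelbrot-style iteration, z ← z² + c, i.e. re' = re*re - im*im + cre and im' = 2*re*im + cim, evaluated on three Vector256 lanes at once.)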

When looking into the disassembly I find code that is odd to me:

; qre = re*re
00007ffa`edfa8373 c54c59e6        vmulps  ymm12,ymm6,ymm6
00007ffa`edfa8377 c54459ef        vmulps  ymm13,ymm7,ymm7
00007ffa`edfa837b c4413c59f0      vmulps  ymm14,ymm8,ymm8
; Saving qre (Not really needed)
00007ffa`edfa8380 c57d11a42480020000 vmovupd ymmword ptr [rsp+280h],ymm12
00007ffa`edfa8389 c57d11ac2460020000 vmovupd ymmword ptr [rsp+260h],ymm13
00007ffa`edfa8392 c57d11b42440020000 vmovupd ymmword ptr [rsp+240h],ymm14
; Reloading qre? (Shouldn't be needed)
00007ffa`edfa839b c57d10a42480020000 vmovupd ymm12,ymmword ptr [rsp+280h]
00007ffa`edfa83a4 c57d10ac2460020000 vmovupd ymm13,ymmword ptr [rsp+260h]
00007ffa`edfa83ad c57d10b42440020000 vmovupd ymm14,ymmword ptr [rsp+240h]
; qim = im*im
00007ffa`edfa83b6 c4413459f9      vmulps  ymm15,ymm9,ymm9
00007ffa`edfa83bb c4c12c59ea      vmulps  ymm5,ymm10,ymm10
00007ffa`edfa83c0 c4c12459e3      vmulps  ymm4,ymm11,ymm11
; Saving qre (Not really needed)
00007ffa`edfa83c5 c57d11bc2420020000 vmovupd ymmword ptr [rsp+220h],ymm15
00007ffa`edfa83ce c5fd11ac2400020000 vmovupd ymmword ptr [rsp+200h],ymm5
00007ffa`edfa83d7 c5fd11a424e0010000 vmovupd ymmword ptr [rsp+1E0h],ymm4
; Reloading qim? (Shouldn't be needed)
00007ffa`edfa83e0 c5fd10a42420020000 vmovupd ymm4,ymmword ptr [rsp+220h]
00007ffa`edfa83e9 c5fd10ac2400020000 vmovupd ymm5,ymmword ptr [rsp+200h]
00007ffa`edfa83f2 c57d10bc24e0010000 vmovupd ymm15,ymmword ptr [rsp+1E0h]
; reim = re*im
00007ffa`edfa83fb c4c14c59f1      vmulps  ymm6,ymm6,ymm9
00007ffa`edfa8400 c4c14459fa      vmulps  ymm7,ymm7,ymm10
00007ffa`edfa8405 c4413c59c3      vmulps  ymm8,ymm8,ymm11
; Saving reim (Not really needed)
00007ffa`edfa840a c5fd11b424c0010000 vmovupd ymmword ptr [rsp+1C0h],ymm6
00007ffa`edfa8413 c5fd11bc24a0010000 vmovupd ymmword ptr [rsp+1A0h],ymm7
00007ffa`edfa841c c57d11842480010000 vmovupd ymmword ptr [rsp+180h],ymm8
; Reloading reim? (Shouldn't be needed)
00007ffa`edfa8425 c57d108c24c0010000 vmovupd ymm9,ymmword ptr [rsp+1C0h]
00007ffa`edfa842e c57d109424a0010000 vmovupd ymm10,ymmword ptr [rsp+1A0h]
00007ffa`edfa8437 c57d109c2480010000 vmovupd ymm11,ymmword ptr [rsp+180h]
; (qre - qim)
00007ffa`edfa8440 c59c5ce4        vsubps  ymm4,ymm12,ymm4
00007ffa`edfa8444 c5945ced        vsubps  ymm5,ymm13,ymm5
00007ffa`edfa8448 c4c10c5cf7      vsubps  ymm6,ymm14,ymm15
; Saving intermediate results (Not really needed)
00007ffa`edfa844d c5fd11a42460010000 vmovupd ymmword ptr [rsp+160h],ymm4 ss:000000b7`64d7d1a0=00
00007ffa`edfa8456 c5fd11ac2440010000 vmovupd ymmword ptr [rsp+140h],ymm5
00007ffa`edfa845f c5fd11b42420010000 vmovupd ymmword ptr [rsp+120h],ymm6
; Loading intermediate results (Shouldn't be needed)
00007ffa`edfa8468 c5fd10a42460010000 vmovupd ymm4,ymmword ptr [rsp+160h]
00007ffa`edfa8471 c5fd10ac2440010000 vmovupd ymm5,ymmword ptr [rsp+140h]
00007ffa`edfa847a c5fd10b42420010000 vmovupd ymm6,ymmword ptr [rsp+120h]
; re = intermediate + cre
00007ffa`edfa8483 c5dc58e0        vaddps  ymm4,ymm4,ymm0
00007ffa`edfa8487 c5d458e9        vaddps  ymm5,ymm5,ymm1
00007ffa`edfa848b c5cc58f2        vaddps  ymm6,ymm6,ymm2
; Saving re (Needed?)
00007ffa`edfa848f c5fd11a42400010000 vmovupd ymmword ptr [rsp+100h],ymm4
00007ffa`edfa8498 c5fd11ac24e0000000 vmovupd ymmword ptr [rsp+0E0h],ymm5
00007ffa`edfa84a1 c5fd11b424c0000000 vmovupd ymmword ptr [rsp+0C0h],ymm6
; Reloading re (Different registers)
00007ffa`edfa84aa c5fd10b42400010000 vmovupd ymm6,ymmword ptr [rsp+100h]
00007ffa`edfa84b3 c5fd10bc24e0000000 vmovupd ymm7,ymmword ptr [rsp+0E0h]
00007ffa`edfa84bc c57d108424c0000000 vmovupd ymm8,ymmword ptr [rsp+0C0h]
; (reim + reim)
00007ffa`edfa84c5 c4c13458e1      vaddps  ymm4,ymm9,ymm9
00007ffa`edfa84ca c4c12c58ea      vaddps  ymm5,ymm10,ymm10
00007ffa`edfa84cf c4412458cb      vaddps  ymm9,ymm11,ymm11
; Saving intermediate results (Not really needed)
00007ffa`edfa84d4 c5fd11a424a0000000 vmovupd ymmword ptr [rsp+0A0h],ymm4
00007ffa`edfa84dd c5fd11ac2480000000 vmovupd ymmword ptr [rsp+80h],ymm5
00007ffa`edfa84e6 c57d114c2460    vmovupd ymmword ptr [rsp+60h],ymm9
; Loading intermediate results (Shouldn't be needed)
00007ffa`edfa84ec c5fd10a424a0000000 vmovupd ymm4,ymmword ptr [rsp+0A0h]
00007ffa`edfa84f5 c5fd10ac2480000000 vmovupd ymm5,ymmword ptr [rsp+80h]
00007ffa`edfa84fe c57d104c2460    vmovupd ymm9,ymmword ptr [rsp+60h]
; im = intermediate + cim
00007ffa`edfa8504 c5dc58e3        vaddps  ymm4,ymm4,ymm3
00007ffa`edfa8508 c57d10a424e0020000 vmovupd ymm12,ymmword ptr [rsp+2E0h]
00007ffa`edfa8511 c4c15458ec      vaddps  ymm5,ymm5,ymm12
00007ffa`edfa8516 c57d10ac24c0020000 vmovupd ymm13,ymmword ptr [rsp+2C0h]
00007ffa`edfa851f c4413458cd      vaddps  ymm9,ymm9,ymm13
; Saving im (Needed?)
00007ffa`edfa8524 c5fd11642440    vmovupd ymmword ptr [rsp+40h],ymm4
00007ffa`edfa852a c5fd116c2420    vmovupd ymmword ptr [rsp+20h],ymm5
00007ffa`edfa8530 c57d110c24      vmovupd ymmword ptr [rsp],ymm9
; Reloading im (Different registers)
00007ffa`edfa8535 c57d104c2440    vmovupd ymm9,ymmword ptr [rsp+40h]
00007ffa`edfa853b c57d10542420    vmovupd ymm10,ymmword ptr [rsp+20h]
00007ffa`edfa8541 c57d101c24      vmovupd ymm11,ymmword ptr [rsp]
; What is this?
00007ffa`edfa8546 c57d119c24a0020000 vmovupd ymmword ptr [rsp+2A0h],ymm11 ss:000000b7`64d7d2e0=00
; Loop
00007ffa`edfa854f ffc2            inc     edx
00007ffa`edfa8551 81fa80969800    cmp     edx,989680h
; What is this?
00007ffa`edfa8557 c57d11a424e0020000 vmovupd ymmword ptr [rsp+2E0h],ymm12
00007ffa`edfa8560 c57d11ac24c0020000 vmovupd ymmword ptr [rsp+2C0h],ymm13
00007ffa`edfa8569 0f8c1f010000    jl      00007ffa`edfa868e

So what seems odd to me is saving state to the stack and then immediately reloading it, never looking at the saved state again. Perhaps one could argue that qre and qim need visibility because I named the variables (lvalue expressions, to borrow a term from C++), but it also seems intermediate results are stored on the stack (rvalue expressions).

I was helped somewhat by adding `in` parameters, which did eliminate some code but not the unnecessary writes to the stack (unnecessary as they seem to me).

If I inline all operations so that my inner loop looks like this:

        var qrex  = Avx.Multiply(rex,rex);
        var qrey  = Avx.Multiply(rey,rey);
        var qrez  = Avx.Multiply(rez,rez);
        var qimx  = Avx.Multiply(imx,imx);
        var qimy  = Avx.Multiply(imy,imy);
        var qimz  = Avx.Multiply(imz,imz);
        var reimx = Avx.Multiply(rex,imx);
        var reimy = Avx.Multiply(rey,imy);
        var reimz = Avx.Multiply(rez,imz);
        rex = Avx.Add(Avx.Subtract(qrex,qimx),crex);
        rey = Avx.Add(Avx.Subtract(qrey,qimy),crey);
        rez = Avx.Add(Avx.Subtract(qrez,qimz),crez);
        imx = Avx.Add(Avx.Add(reimx,reimx),cimx);
        imy = Avx.Add(Avx.Add(reimy,reimy),cimy);
        imz = Avx.Add(Avx.Add(reimz,reimz),cimz);

Then the disassembly looks more appealing and the code performs 3x faster.

; qre = re*re
00007ffa`edf88373 c54c59e6        vmulps  ymm12,ymm6,ymm6
00007ffa`edf88377 c54459ef        vmulps  ymm13,ymm7,ymm7
00007ffa`edf8837b c4413c59f0      vmulps  ymm14,ymm8,ymm8
; qim = im*im
00007ffa`edf88380 c4413459f9      vmulps  ymm15,ymm9,ymm9
00007ffa`edf88385 c4c12c59ea      vmulps  ymm5,ymm10,ymm10
00007ffa`edf8838a c4c12459e3      vmulps  ymm4,ymm11,ymm11
; reim = re*im
00007ffa`edf8838f c4414459d2      vmulps  ymm10,ymm7,ymm10
00007ffa`edf88394 c4413c59db      vmulps  ymm11,ymm8,ymm11
00007ffa`edf88399 c4414c59c9      vmulps  ymm9,ymm6,ymm9
; (qre - qim) + cre
00007ffa`edf8839e c4c11c5cf7      vsubps  ymm6,ymm12,ymm15
00007ffa`edf883a3 c5cc58f0        vaddps  ymm6,ymm6,ymm0
00007ffa`edf883a7 c5945ced        vsubps  ymm5,ymm13,ymm5
00007ffa`edf883ab c5d458f9        vaddps  ymm7,ymm5,ymm1
00007ffa`edf883af c58c5ce4        vsubps  ymm4,ymm14,ymm4
00007ffa`edf883b3 c55c58c2        vaddps  ymm8,ymm4,ymm2
; (reim + reim) + cim
00007ffa`edf883b7 c4c13458e1      vaddps  ymm4,ymm9,ymm9
00007ffa`edf883bc c55c58cb        vaddps  ymm9,ymm4,ymm3
00007ffa`edf883c0 c4c12c58e2      vaddps  ymm4,ymm10,ymm10
00007ffa`edf883c5 c5fd10ac24a0010000 vmovupd ymm5,ymmword ptr [rsp+1A0h]
00007ffa`edf883ce c55c58d5        vaddps  ymm10,ymm4,ymm5
00007ffa`edf883d2 c4c12458e3      vaddps  ymm4,ymm11,ymm11
00007ffa`edf883d7 c57d10a42480010000 vmovupd ymm12,ymmword ptr [rsp+180h]
00007ffa`edf883e0 c4415c58dc      vaddps  ymm11,ymm4,ymm12
; Loop
00007ffa`edf883e5 ffc2            inc     edx
00007ffa`edf883e7 81fa80969800    cmp     edx,989680h
; What is this?
00007ffa`edf883ed c5fd11ac24a0010000 vmovupd ymmword ptr [rsp+1A0h],ymm5
00007ffa`edf883f6 c57d11a42480010000 vmovupd ymmword ptr [rsp+180h],ymm12
00007ffa`edf883ff 0f8c6effffff    jl      00007ffa`edf88373

For F# the disassembly looks even worse.

Perhaps there are some obvious flags I don't know of that I should have set on my struct types to enable the jitter to eliminate the intermediate results. I would be happy with such a solution.

Full C# example:

project file:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp3.0</TargetFramework>
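    <!-- Tiered compilation is disabled so methods are jitted with full
         optimization on first call, which keeps the inspected disassembly
         stable. -->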
    <TieredCompilation>false</TieredCompilation>
    <LangVersion>8.0</LangVersion>
  </PropertyGroup>

</Project>

source file:

namespace csperftest
{
  using System;
  using System.Runtime.CompilerServices;
  using System.Diagnostics;
  using System.Runtime.InteropServices;
  using System.Runtime.Intrinsics;
  using System.Runtime.Intrinsics.X86;

  using VF = System.Runtime.Intrinsics.Vector256<float>;

  struct V3
  {
    public VF X;
    public VF Y;
    public VF Z;

    public static readonly V3 Zero = new V3 ();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public V3(VF x, VF y, VF z)
    {
      X = x;
      Y = y;
      Z = z;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static V3 operator+(in V3 l, in V3 r)
    {
      return new V3(Avx.Add(l.X, r.X), Avx.Add(l.Y, r.Y), Avx.Add(l.Z, r.Z));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static V3 operator-(in V3 l, in V3 r)
    {
      return new V3(Avx.Subtract(l.X, r.X), Avx.Subtract(l.Y, r.Y), Avx.Subtract(l.Z, r.Z));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static V3 operator*(in V3 l, in V3 r)
    {
      return new V3(Avx.Multiply(l.X, r.X), Avx.Multiply(l.Y, r.Y), Avx.Multiply(l.Z, r.Z));
    }

    public override string ToString() => $"(X: {X}, Y: {Y}, Z: {Z})";

  }

  class Program
  {
    static (V3 re, V3 im) TestSlow(V3 cre_, V3 cim_)
    {
      var cre = cre_;
      var cim = cim_;

      var re = cre;
      var im = cim;
      for (var iter = 0; iter < 10000000; ++iter)
      {
        var qre   = re*re;
        var qim   = im*im;
        var reim  = re*im;
        re = qre - qim + cre;
        im = reim + reim + cim;
      }

      return (re, im);
    }

    static (V3 re, V3 im) TestFast(V3 cre_, V3 cim_)
    {
      var crex = cre_.X;
      var crey = cre_.Y;
      var crez = cre_.Z;
      var cimx = cim_.X;
      var cimy = cim_.Y;
      var cimz = cim_.Z;

      var rex = crex;
      var rey = crey;
      var rez = crez;
      var imx = cimx;
      var imy = cimy;
      var imz = cimz;
      for (var iter = 0; iter < 10000000; ++iter)
      {
        var qrex  = Avx.Multiply(rex,rex);
        var qrey  = Avx.Multiply(rey,rey);
        var qrez  = Avx.Multiply(rez,rez);
        var qimx  = Avx.Multiply(imx,imx);
        var qimy  = Avx.Multiply(imy,imy);
        var qimz  = Avx.Multiply(imz,imz);
        var reimx = Avx.Multiply(rex,imx);
        var reimy = Avx.Multiply(rey,imy);
        var reimz = Avx.Multiply(rez,imz);
        rex = Avx.Add(Avx.Subtract(qrex,qimx),crex);
        rey = Avx.Add(Avx.Subtract(qrey,qimy),crey);
        rez = Avx.Add(Avx.Subtract(qrez,qimz),crez);
        imx = Avx.Add(Avx.Add(reimx,reimx),cimx);
        imy = Avx.Add(Avx.Add(reimy,reimy),cimy);
        imz = Avx.Add(Avx.Add(reimz,reimz),cimz);
      }

      return (new V3(rex, rey, rez), new V3(imx, imy, imz));
    }

    static void Main(string[] args)
    {
      // To make it simpler attaching the debugger
      for (var iter = 0; iter < 1000; ++iter)
      {
        var sw = new Stopwatch();
        sw.Start();
        TestSlow(V3.Zero, V3.Zero);
        //TestFast(V3.Zero, V3.Zero);
        sw.Stop();
        Console.WriteLine($"Took: {sw.ElapsedMilliseconds}");

      }
    }
  }
}
benaadams commented 5 years ago

/cc @tannergooding

tannergooding commented 5 years ago

CC. @CarolEidt, @AndyAyersMS

Is this already covered under the "First Class Struct" work? This basically looks like a case where the JIT isn't realizing that the V3 struct is just a convenience wrapper and that the fields could be treated as locals for the lifetime of the function.

AndyAyersMS commented 5 years ago

Hmm, I would have expected all the V3 locals to be promoted and that should have eliminated all the copying. Let me take a closer look.

AndyAyersMS commented 5 years ago

We do fully promote, but certain promoted temps are blocked from enregistration, and this causes the odd-looking store/reload blocks:

fgMorphCopyBlock:
The assignment [000199] using V74 removes: Constant Assertion: V74 == 0
block assignment to morph:
               [000195] -----+------              /--*  LCL_VAR   simd32 V15 tmp3         
               [000199] -A----------              *  ASG       simd32 (copy)
               [000198] n----+-N----              \--*  BLK(32)   simd32
               [000197] -----+------                 \--*  ADDR      byref 
               [000196] D----+-N----                    \--*  LCL_VAR   simd32 V74 tmp62        
 this requires a CopyBlock.
Local V74 should not be enregistered because: written in a block op
...
;  V74 tmp62        [V74,T44] (  2,  8   )  simd32  ->  [rsp+0x280]   do-not-enreg[SB] V14.X(offs=0x00) P-INDEP "field V14.X (fldOffset=0x0)"
;  V75 tmp63        [V75,T45] (  2,  8   )  simd32  ->  [rsp+0x260]   do-not-enreg[SB] V14.Y(offs=0x20) P-INDEP "field V14.Y (fldOffset=0x20)"
;  V76 tmp64        [V76,T46] (  2,  8   )  simd32  ->  [rsp+0x240]   do-not-enreg[SB] V14.Z(offs=0x40) P-INDEP "field V14.Z (fldOffset=0x40)"
...
       C57D11A42480020000   vmovupd  ymmword ptr[rsp+280H], ymm12
       C57D11AC2460020000   vmovupd  ymmword ptr[rsp+260H], ymm13
       C57D11B42440020000   vmovupd  ymmword ptr[rsp+240H], ymm14
       C57D10A42480020000   vmovupd  ymm12, ymmword ptr[rsp+280H]
       C57D10AC2460020000   vmovupd  ymm13, ymmword ptr[rsp+260H]
       C57D10B42440020000   vmovupd  ymm14, ymmword ptr[rsp+240H]

Not sure yet why these temps are blocked and not others.

AndyAyersMS commented 5 years ago

Looks like all the blocked cases are from the various `new`s that get inlined. Perhaps the local address visitor should more aggressively simplify...

LocalAddressVisitor visiting statement:
               [000200] ------------              *  STMT      void  (IL 0x00D...  ???)
               [000195] ------------              |  /--*  LCL_VAR   simd32 V15 tmp3         
               [000199] -A----------              \--*  ASG       simd32 (copy)
               [000198] ------------                 \--*  BLK(32)   simd32
               [000197] ------------                    \--*  ADDR      byref 
               [000196] ------------                       \--*  FIELD     struct X
               [000193] ------------                          \--*  ADDR      byref 
               [000194] ------------                             \--*  LCL_VAR   struct(P) V14 tmp2         
                                                                 \--*    simd32 V14.X (offs=0x00) -> V74 tmp62        
                                                                 \--*    simd32 V14.Y (offs=0x20) -> V75 tmp63        
                                                                 \--*    simd32 V14.Z (offs=0x40) -> V76 tmp64        
Replacing the field in promoted struct with local var V74
LocalAddressVisitor modified statement:
               [000200] ------------              *  STMT      void  (IL 0x00D...  ???)
               [000195] ------------              |  /--*  LCL_VAR   simd32 V15 tmp3         
               [000199] -A----------              \--*  ASG       simd32 (copy)
               [000198] ------------                 \--*  BLK(32)   simd32
               [000197] ------------                    \--*  ADDR      byref 
               [000196] ------------                       \--*  LCL_VAR   simd32 V74 tmp62        

would like to see that last tree just collapse to an assignment of locals.

cc @mikedn who I think has looked at something like this.

mikedn commented 5 years ago

This looks like an fgMorphCopyBlock issue: it managed to produce a non-block copy, but not before marking the variable as DNER. I'll take a look later today.

Yes, it can also be done in LocalAddressVisitor, but so far I've been hesitant about adding this kind of stuff to it, pending further investigation and a decision regarding moving struct promotion out of it.

And then there's probably the main root cause of this issue - the presence of a block op from the beginning. That may be caused by the fact that it's a struct-typed struct field, and the JIT's LCL_FLD cannot be properly used in such cases because it doesn't maintain struct type information. I'm working on adding such support.

AndyAyersMS commented 5 years ago

Seems roughly like we just need to call fgMorphBlockOperand earlier in fgMorphCopyBlock, then we'll fall into the reg struct exemption from DNER.

mikedn commented 5 years ago

Not sure yet, it looks more like an fgMorphOneAsgBlockOp issue to me. It doesn't quite recognize SIMD assignments.

mikedn commented 5 years ago

Yes, it can also be done in LocalAddressVisitor, but so far I've been hesitant about adding this kind of stuff to it, pending further investigation and a decision regarding moving struct promotion out of it.

And it turns out that it's not enough to do this in LocalAddressVisitor (which is trivial). Even if it generates

               [000195] ------------              |  /--*  LCL_VAR   simd32 V15 tmp3         
               [000199] -A----------              \--*  ASG       simd32 (copy)
               [000198] ------------                 \--*  LCL_VAR   simd32 V74 tmp62        

fgMorphCopyBlock still makes V74 DNER. So we need to fix it anyway. Oh well.

mrange commented 5 years ago

I am not sure from the conversation whether this kind of optimization is something you wish the jitter to handle, but is there some way to change my code to get more optimized code without needing to manually inline every SIMD call?

mikedn commented 5 years ago

@mrange I'm not aware of any other way that doesn't require manual inlining. The JIT should definitely learn to handle structs better; it has a rather long history of failing in this area and it doesn't have to be this way.

@AndyAyersMS I have a tentative 3-line fix in https://github.com/mikedn/coreclr/commit/2f79dfdd7af2858d52cbdd9d73d54771af5b5a05. But I need to read that code more carefully to be sure. What would be the deadline for getting such a fix into .NET 3?

AndyAyersMS commented 5 years ago

There is a fair amount of overhead in passing and returning structs. You can minimize some of this in sources by using in and ref, but you may not like how your code looks that way either.

If you don't want to inline the basic operators then you might find it works better to use arrays (or perhaps spans) as your top level aggregates, though you will lose the convenience of being able to refer to things by name (you'd have 0, 1, 2 instead of X, Y, Z).

mrange commented 5 years ago

`in` did seem to help. I was considering trying ref returns, but then I think I have to switch from + to += semantics?

Part of the reason to use structs is to eliminate the need for heap objects, as + etc. will create new instances. If I use arrays, wouldn't that imply more heap objects? Of course, if I need to go for mutating operators like += maybe the heap overhead can be mitigated...

But if I use your suggestion of spans, perhaps the spans can reference into a big array. Hmm. Worth some experimenting.
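
For what it's worth, a minimal sketch of that experiment (the V3Span name, signatures, and layout are assumptions for illustration, not from the thread): the three lanes become slices of one flat backing array, and operations write through spans instead of returning new structs.

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class V3Span
    {
      // dst[i] = l[i] + r[i] for the three lanes (0, 1, 2 instead of X, Y, Z).
      // All three spans can slice one large Vector256<float>[] backing array,
      // so no per-operation heap allocation is needed.
      public static void Add(
        Span<Vector256<float>> dst,
        ReadOnlySpan<Vector256<float>> l,
        ReadOnlySpan<Vector256<float>> r)
      {
        for (var i = 0; i < 3; ++i)
          dst[i] = Avx.Add(l[i], r[i]);
      }
    }

Usage would be something like V3Span.Add(pool.AsSpan(0, 3), pool.AsSpan(3, 3), pool.AsSpan(6, 3)) over a shared Vector256<float>[] pool. Whether this enregisters better than the struct version would need measuring; it removes the struct copies but keeps loads/stores through the spans.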

mikedn commented 5 years ago

I think I have to switch from + to += semantics?

Be careful with += operators. The C# compiler tends to generate code that the JIT can't handle well and leaves variables address exposed. Sometimes that's a lot worse than a few extra copies.
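
For illustration, my reading of that warning (C# has no standalone += operator, so compound assignment reuses operator+, and the concern lies in the lowering):

    // `im += reim` lowers to `im = im + reim`, i.e. roughly
    //   im = V3.op_Addition(in im, in reim);
    // With `in` parameters the compiler passes lvalue locals by address
    // (ldloca). An address-taken local may be marked address-exposed by
    // the JIT, which blocks enregistration entirely - often costlier than
    // a few extra copies.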

mrange commented 5 years ago

Tried a variant with methods like this:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ref V3 Add(ref V3 l, in V3 r)
    {
      l.X = Avx.Add(l.X, r.X);
      l.Y = Avx.Add(l.Y, r.Y);
      l.Z = Avx.Add(l.Z, r.Z);
      return ref l;
    }

The outcome was worse than the OP version. The disassembly indicated that operations were never register-to-register but rather register-to-memory. So that doesn't seem like a fruitful alternative.

(Perhaps obvious, but it wasn't to me)

AndyAyersMS commented 5 years ago

@mikedn we're ok with fixes going into 3.0 for the next few weeks, so put up a PR when you have something you feel good about.

karelz commented 5 years ago

@AndyAyersMS should we close the discussion here and let new PRs be created in CoreCLR, or is it worth moving this issue to the CoreCLR repo?

mikedn commented 5 years ago

@CarolEidt has a pending PR in the same area; not sure if that fixes this issue or not.

IMO this should be moved to coreclr; it's a typical JIT issue.

AndyAyersMS commented 5 years ago

Went ahead and moved it over here. I would have expected some kind of explicit forwarding pointer over in dotnet/core, but the old issue just vanishes there. Hopefully nobody's too confused.

Marking as future but if we end up with a simple fix soon we can probably get it into 3.0.

AndyAyersMS commented 5 years ago

cc @dotnet/jit-contrib

mikedn commented 5 years ago

I verified that @CarolEidt's work in dotnet/coreclr#22255 does indeed fix this issue:

G_M27407_IG03:
       C54C59E6             vmulps   ymm12, ymm6, ymm6
       C54459EF             vmulps   ymm13, ymm7, ymm7
       C4413C59F0           vmulps   ymm14, ymm8, ymm8
       C4413459F9           vmulps   ymm15, ymm9, ymm9
       C4C12C59EA           vmulps   ymm5, ymm10, ymm10
       C4C12459E3           vmulps   ymm4, ymm11, ymm11
       C4414C59C9           vmulps   ymm9, ymm6, ymm9
       C4414459D2           vmulps   ymm10, ymm7, ymm10
       C4413C59DB           vmulps   ymm11, ymm8, ymm11
       C4C11C5CF7           vsubps   ymm6, ymm12, ymm15
       C5945CED             vsubps   ymm5, ymm13, ymm5
       C58C5CE4             vsubps   ymm4, ymm14, ymm4
       C5CC58F0             vaddps   ymm6, ymm6, ymm0
       C5D458F9             vaddps   ymm7, ymm5, ymm1
       C55C58C2             vaddps   ymm8, ymm4, ymm2
       C4C13458E1           vaddps   ymm4, ymm9, ymm9
       C4C12C58EA           vaddps   ymm5, ymm10, ymm10
       C4412458CB           vaddps   ymm9, ymm11, ymm11
       C5DC58E3             vaddps   ymm4, ymm4, ymm3
       C57D10642440         vmovupd  ymm12, ymmword ptr[rsp+40H]
       C4415458D4           vaddps   ymm10, ymm5, ymm12
       C5FD106C2420         vmovupd  ymm5, ymmword ptr[rsp+20H]
       C53458DD             vaddps   ymm11, ymm9, ymm5
       C57C28CC             vmovaps  ymm9, ymm4
       C57D111C24           vmovupd  ymmword ptr[rsp], ymm11
       FFC2                 inc      edx
       81FA80969800         cmp      edx, 0x989680
       C57D11642440         vmovupd  ymmword ptr[rsp+40H], ymm12
       C5FD116C2420         vmovupd  ymmword ptr[rsp+20H], ymm5
       0F8C31010000         jl       G_M27407_IG06
mrange commented 5 years ago

That's quite exciting news. When/how can I test it?

fiigii commented 5 years ago

@CarolEidt's work also makes the codegen of PacketTracer much better. Diff on GetPoints (collected on OSX):

--- masterGetPoints.asm 2019-03-18 21:22:47.000000000 -0700
+++ CarolGetPoints.asm  2019-03-18 21:23:26.000000000 -0700
@@ -1,14 +1,13 @@
 push     rbp
-sub      rsp, 496
 vzeroupper 
-lea      rbp, [rsp+1F0H]
+mov      rbp, rsp
 vxorps   xmm0, xmm0
 vcvtsi2ss xmm0, dword ptr [rdi+16]
 vbroadcastss ymm0, ymm0
 vxorps   xmm1, xmm1
 vcvtsi2ss xmm1, dword ptr [rdi+20]
 vbroadcastss ymm1, ymm1
-mov      rax, 0x191370B58
+mov      rax, 0x193608B58
 mov      rax, gword ptr [rax]
 add      rax, 8
 vdivps   ymm2, ymm0, ymmword ptr[rax]
@@ -34,21 +33,9 @@
 vmulps   ymm5, ymm0, ymm5
 vmulps   ymm6, ymm0, ymm6
 vmulps   ymm0, ymm0, ymm7
-vmovupd  ymmword ptr[V77 rbp-30H], ymm5
-vmovupd  ymmword ptr[V78 rbp-50H], ymm6
-vmovupd  ymmword ptr[V79 rbp-70H], ymm0
-vmovupd  ymm0, ymmword ptr[V77 rbp-30H]
-vmovupd  ymm5, ymmword ptr[V78 rbp-50H]
-vmovupd  ymm6, ymmword ptr[V79 rbp-70H]
-vaddps   ymm0, ymm2, ymm0
-vaddps   ymm2, ymm3, ymm5
-vaddps   ymm3, ymm4, ymm6
-vmovupd  ymmword ptr[V86 rbp-90H], ymm0
-vmovupd  ymmword ptr[V87 rbp-B0H], ymm2
-vmovupd  ymmword ptr[V88 rbp-D0H], ymm3
-vmovupd  ymm0, ymmword ptr[V86 rbp-90H]
-vmovupd  ymm2, ymmword ptr[V87 rbp-B0H]
-vmovupd  ymm3, ymmword ptr[V88 rbp-D0H]
+vaddps   ymm2, ymm2, ymm5
+vaddps   ymm3, ymm3, ymm6
+vaddps   ymm0, ymm4, ymm0
 add      rdx, 200
 vmovupd  ymm4, ymmword ptr[rdx]
 vmovupd  ymm5, ymmword ptr[rdx+32]
@@ -56,44 +43,22 @@
 vmulps   ymm4, ymm1, ymm4
 vmulps   ymm5, ymm1, ymm5
 vmulps   ymm1, ymm1, ymm6
-vmovupd  ymmword ptr[V92 rbp-F0H], ymm4
-vmovupd  ymmword ptr[V93 rbp-110H], ymm5
-vmovupd  ymmword ptr[V94 rbp-130H], ymm1
-vmovupd  ymm1, ymmword ptr[V92 rbp-F0H]
-vmovupd  ymm4, ymmword ptr[V93 rbp-110H]
-vmovupd  ymm5, ymmword ptr[V94 rbp-130H]
-vaddps   ymm0, ymm0, ymm1
-vaddps   ymm1, ymm2, ymm4
-vaddps   ymm2, ymm3, ymm5
-vmovupd  ymmword ptr[V101 rbp-150H], ymm0
-vmovupd  ymmword ptr[V102 rbp-170H], ymm1
-vmovupd  ymmword ptr[V103 rbp-190H], ymm2
-vmovupd  ymm0, ymmword ptr[V101 rbp-150H]
-vmovupd  ymm1, ymmword ptr[V102 rbp-170H]
-vmovupd  ymm2, ymmword ptr[V103 rbp-190H]
-vmovaps  ymm3, ymm0
-vmovaps  ymm4, ymm1
-vmovaps  ymm5, ymm2
-vmulps   ymm4, ymm4, ymm1
-vmulps   ymm5, ymm5, ymm2
-vmulps   ymm3, ymm3, ymm0
-vaddps   ymm3, ymm3, ymm4
+vaddps   ymm2, ymm2, ymm4
 vaddps   ymm3, ymm3, ymm5
-vsqrtps  ymm3, ymm3
-vdivps   ymm0, ymm0, ymm3
-vdivps   ymm1, ymm1, ymm3
-vdivps   ymm2, ymm2, ymm3
-vmovupd  ymmword ptr[V104 rbp-1B0H], ymm0
-vmovupd  ymmword ptr[V105 rbp-1D0H], ymm1
-vmovupd  ymmword ptr[V106 rbp-1F0H], ymm2
-vmovupd  ymm0, ymmword ptr[V104 rbp-1B0H]
-vmovupd  ymmword ptr[rsi], ymm0
-vmovupd  ymm0, ymmword ptr[V105 rbp-1D0H]
-vmovupd  ymmword ptr[rsi+32], ymm0
-vmovupd  ymm0, ymmword ptr[V106 rbp-1F0H]
+vaddps   ymm0, ymm0, ymm1
+vmulps   ymm1, ymm3, ymm3
+vmulps   ymm4, ymm0, ymm0
+vmulps   ymm5, ymm2, ymm2
+vaddps   ymm1, ymm5, ymm1
+vaddps   ymm1, ymm1, ymm4
+vsqrtps  ymm1, ymm1
+vdivps   ymm2, ymm2, ymm1
+vdivps   ymm3, ymm3, ymm1
+vdivps   ymm0, ymm0, ymm1
+vmovupd  ymmword ptr[rsi], ymm2
+vmovupd  ymmword ptr[rsi+32], ymm3
 vmovupd  ymmword ptr[rsi+64], ymm0
 mov      rax, rsi
 vzeroupper 
-lea      rsp, [rbp]
 pop      rbp
 ret      

I believe larger functions like GetNaturalColor would benefit more.

CarolEidt commented 5 years ago

I've re-synced dotnet/coreclr#22255 and am working through some new issues. I hope to get this in for 3.0, but will have to see what complexities arise.

mrange commented 5 years ago

The removed `lea rsp, [rbp]` I suppose is because there's no data in the stack frame and no need to restore the stack pointer?

fiigii commented 5 years ago

because there's no data in the stack frame and no need to restore the stack pointer?

Yes, because the program no longer needs to allocate stack space via `sub rsp, 496` for spilled local variables, so rsp is unchanged in this function body.

CarolEidt commented 5 years ago

Fixed with dotnet/coreclr#22255