RyuJIT: Vector.Dot not being inlined / converted to SIMD instructions?

redknightlois commented 7 years ago

We are trying SIMD alternatives to our very fast locality cache (as it is starting to shows up on our profiling runs ) and got very bad results.

Given that I couldnt find any way to achieve the codegen to write _mm_hadd_epi32 I thought I could use Vector.Dot which essentially achieves the same on a premium even if dpps is not available (SSE 4.1).

While we weren't really expecting to beat the current code, there was a chance though. Surprisingly our benchmark code results were plain awful.

This is a dry run, but a proper one doesnt change the results by much.

CLR=CORE, Arch=64-bit ? [RyuJIT]
GC=Concurrent Workstation
dotnet cli version: 1.0.0-preview2-003131

Type=PageLocatorBenchmark  Mode=SingleRun  Platform=X64
Jit=RyuJit  Toolchain=Core  Runtime=Core

Method	CacheSize	RandomSeed	Median	Mean	Op/s	Min	Q1	Median	Q3	Max
Basic_PageLocatorV1	4	5	22.5852 ns	22.5852 ns	44276720.29	22.5852 ns	22.5852 ns	22.5852 ns	22.5852 ns	22.5852 ns
Basic_PageLocatorV4	4	5	72.6100 ns	72.6100 ns	13772209.77	72.6100 ns	72.6100 ns	72.6100 ns	72.6100 ns	72.6100 ns
Basic_PageLocatorV1	8	5	31.0675 ns	31.0675 ns	32188009.91	31.0675 ns	31.0675 ns	31.0675 ns	31.0675 ns	31.0675 ns
Basic_PageLocatorV4	8	5	48.1086 ns	48.1086 ns	20786309.07	48.1086 ns	48.1086 ns	48.1086 ns	48.1086 ns	48.1086 ns
Basic_PageLocatorV1	16	5	59.6822 ns	59.6822 ns	16755402.44	59.6822 ns	59.6822 ns	59.6822 ns	59.6822 ns	59.6822 ns
Basic_PageLocatorV4	16	5	79.3038 ns	79.3038 ns	12609735.75	79.3038 ns	79.3038 ns	79.3038 ns	79.3038 ns	79.3038 ns
Basic_PageLocatorV1	32	5	100.1517 ns	100.1517 ns	9984852.08	100.1517 ns	100.1517 ns	100.1517 ns	100.1517 ns	100.1517 ns
Basic_PageLocatorV4	32	5	154.9541 ns	154.9541 ns	6453523.48	154.9541 ns	154.9541 ns	154.9541 ns	154.9541 ns	154.9541 ns

But then trying to understand why the results looked like this (they shouldnt be that bad). I found this:

Eventually I got around to fire-up the profile to double-check that.

And just for the purpose of completeness

Tomorrow I will probably play a bit with a native profiler to have a better idea of where the microarchitecture costs are, but this call doesnt look good. Any idea?

PS: And not to beat a dead tree here, but this is yet another case of missing instructions.

mikedn commented 7 years ago

I thought I could use Vector.Dot which essentially achieves the same on a premium even if dpps is not available (SSE 4.1).

I'm not sure what dpps has to do with this. You seem to be using integer vectors and Vector.Dot is treated as an intrinsic only for floating point vectors. And anyway, an AVX implementation for integer dot product would require one multiplication and likely 3 horizontal adds and some permutes, I don't see how it could be a reasonable substitute for a single horizontal add or whatever is what you need.

redknightlois commented 7 years ago

Its not a substitute. Horizontal add is needed, but I was timing it just to see if I could still offset the cost for big sizes even at the expense of the multiplication and shuffles. Was definitely not expecting how bad it behaves in this case for all cases.

RussKeldorph commented 7 years ago

@dotnet/jit-contrib

sivarv commented 7 years ago

@redknightlois - We definitely can improve Vector.Dot() for AVX case using phaddd.

In the above screen sheet there is another label that reads "Also this generates worse code". I am assuming that you are referring to the commented out line:

var result = Vector.ConditionalSelect(comparison, Vector<int>.One, Vector<Int>.Zero)

Are you referring to ConditionalSelect() or Vector<int>.One? Please let us know which specific C# code leads to worst performing code along with a snippet of jitted code.

redknightlois commented 7 years ago

@sivarv Using Vector<int>.One in the ConditionalSelect generate a bit worse code than copying that into an static variable (for whatever reason).

The actual code is here: https://github.com/Corvalius/ravendb/blob/v4.0/bench/Micro.Benchmark/PageLocatorImpl/PageLocatorV2.cs I doubt it can beat this though: https://github.com/Corvalius/ravendb/blob/v4.0/bench/Micro.Benchmark/PageLocatorImpl/PageLocatorV7.cs

But looking forward to be proven wrong :D

sivarv commented 7 years ago

@redknightlois - the code pointer that you have given is using Vector instead of Vector. AFAIK, AVX supports instruction to perform horizontal add of 16-bit and 32-bit type vectors. I am assuming that you are open to using Vector<int> instead of Vector<long>.

I have created the below PR to recognize Vector<int>.Dot() as intrinsic on AVX. Please try these changes and let me know whether you are seeing the expected perf difference compared to scalar version. If not, please provide me a standalone micro-benchmark that I can run to see whats happening. https://github.com/dotnet/coreclr/pull/7749

Please elaborate what you meant by "a bit worse code" when using Vector.One. Here is what i am seeing:

ConditionalSelect(comp, Vector<int>.One, Vector<int>.Zero)

IN0008: 000046 mov      ecx, 1
IN0009: 00004B vmovd    xmm2, ecx
IN000a: 000050 vpbroadcastd ymm2, ymm2
IN000b: 000055 vpand    ymm2, ymm1
IN000c: 00005A vpcmpeqd ymm0, ymm6
IN000d: 00005F vpxor    ymm1, ymm1
IN000e: 000064 vpandn   ymm0, ymm1
IN000f: 000069 vpor     ymm7, ymm2, ymm0

ConditionalSelect(comp, one, zero)  

IN0008: 000020 mov      rdx, 0xA3B392780
IN0009: 00002A mov      rdx, gword ptr [rdx]
IN000a: 00002D vmovupd  ymm3, (null)[rdx+8]
IN000b: 000033 vpand    ymm3, ymm2
IN000c: 000038 vpcmpeqd ymm1, ymm0
IN000d: 00003D mov      rdx, 0xA3B392788
IN000e: 000047 mov      rdx, gword ptr [rdx]
IN000f: 00004A vmovupd  ymm2, (null)[rdx+8]
IN0010: 000050 vpandn   ymm1, ymm2
IN0011: 000055 vpor     ymm1, ymm3, ymm1

As you can see using static fields 'one' and 'zero' will read 32-bytes from memory in each iteration through the loop, which would be inefficient than using Vector.One directly.

My guess is that using _indexes static field in Vector.Dot() is resulting in a call to helper CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE and a null check inside the loop. Since the call would trash upper 128-bits of YMM regs that are live, LSRA would add vextractf/vinsert to save/restore upper 128-bits of those YMM registers around the call. You can workaround this by assigning _indexes static field to a local outside the loop and using that local in the loop instead.

redknightlois commented 7 years ago

@sivarv Thanks for taking a look into it. Yes, whatever is faster is fine by me :) ... Just let me know what version of the JIT I have to select when a build is up on nuget and I will take it for a spin using Vector<int>

Looking at the code again, I may have messed up when investigating the issue and mix the binaries for ConditionalSelect(comp, Vector<int>.One, Vector<int>.Zero). You are right, the code is tighter when using them.

// * Summary *

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Windows
Processor=?, ProcessorCount=8
Frequency=3914057 ticks, Resolution=255.4894 ns, Timer=TSC
CLR=CORE, Arch=64-bit ? [RyuJIT]
GC=Concurrent Workstation
dotnet cli version: 1.0.0-preview2-003131

Type=PlRandomRead  Mode=Throughput  Platform=X64
Jit=RyuJit  Toolchain=Core  Runtime=Core

  Method | CacheSize | RandomSeed |        Median |     StdDev |          Mean |  StdError |     StdDev |        Op/s |           Min |            Q1 |        Median |            Q3 |           Max |
-------- |---------- |----------- |-------------- |----------- |-------------- |---------- |----------- |------------ |-------------- |-------------- |-------------- |-------------- |-------------- |
 BasicV2 |         8 |          5 |    58.5424 ns |  0.8495 ns |    58.7051 ns | 0.1900 ns |  0.8495 ns | 17034295.16 |    57.5253 ns |    58.0319 ns |    58.5424 ns |    59.4348 ns |    60.4696 ns |
 BasicV7 |         8 |          5 |    16.2503 ns |  0.2681 ns |    16.3522 ns | 0.0599 ns |  0.2681 ns | 61154012.29 |    16.0277 ns |    16.1489 ns |    16.2503 ns |    16.5388 ns |    17.0693 ns |
 BasicV2 |        16 |          5 |   110.9019 ns |  1.8697 ns |   110.9448 ns | 0.4181 ns |  1.8697 ns |  9013491.26 |   108.7304 ns |   109.1775 ns |   110.9019 ns |   112.1921 ns |   114.8121 ns |
 BasicV7 |        16 |          5 |    19.3517 ns |  0.6482 ns |    19.4715 ns | 0.1382 ns |  0.6482 ns | 51357205.04 |    18.5425 ns |    18.9458 ns |    19.3517 ns |    19.8667 ns |    20.8962 ns |
 BasicV2 |        32 |          5 |   195.6714 ns |  4.5037 ns |   196.4968 ns | 1.0071 ns |  4.5037 ns |  5089141.06 |   191.9787 ns |   194.2846 ns |   195.6714 ns |   196.9989 ns |   214.2859 ns |
 BasicV7 |        32 |          5 |    23.8621 ns |  0.4823 ns |    23.9680 ns | 0.1078 ns |  0.4823 ns | 41722221.05 |    23.3302 ns |    23.6218 ns |    23.8621 ns |    24.2481 ns |    25.3740 ns |
 BasicV2 |        64 |          5 |   374.2629 ns | 10.0164 ns |   376.0139 ns | 2.2397 ns | 10.0164 ns |  2659476.35 |   364.2047 ns |   367.3912 ns |   374.2629 ns |   381.6700 ns |   396.0265 ns |
 BasicV7 |        64 |          5 |    33.6090 ns |  0.4028 ns |    33.6553 ns | 0.0901 ns |  0.4028 ns | 29712987.32 |    33.0859 ns |    33.3698 ns |    33.6090 ns |    33.8331 ns |    34.7496 ns |
 BasicV2 |       128 |          5 |   709.8784 ns | 11.1069 ns |   712.7788 ns | 2.4836 ns | 11.1069 ns |  1402959.86 |   695.8916 ns |   705.6194 ns |   709.8784 ns |   723.7533 ns |   738.9168 ns |
 BasicV7 |       128 |          5 |    53.2756 ns |  0.8599 ns |    53.1868 ns | 0.1923 ns |  0.8599 ns | 18801645.63 |    51.6082 ns |    52.7016 ns |    53.2756 ns |    53.5450 ns |    55.3726 ns |
 BasicV2 |       256 |          5 | 1,389.6383 ns | 11.4753 ns | 1,389.7817 ns | 2.5660 ns | 11.4753 ns |   719537.48 | 1,369.1476 ns | 1,381.8399 ns | 1,389.6383 ns | 1,395.5928 ns | 1,411.9613 ns |
 BasicV7 |       256 |          5 |   100.2056 ns |  1.0423 ns |   100.1082 ns | 0.2331 ns |  1.0423 ns |     9989188 |    98.1135 ns |    99.1694 ns |   100.2056 ns |   100.9620 ns |   102.1478 ns |

// ***** BenchmarkRunner: End *****

These are the results using the indexes as a local instead. Note though that V2 is what V4 was used to be and V7 is our current version (not V1 which is the original used in this post).

sivarv commented 7 years ago

@redknightlois - the PR meant to recognize Vector.Dot as JIT intrinsic is merged.

Regarding Vector.One and Vector.Zero - to further optimize the loop, you can declare two locals outside while-loop

Vector<int> one = Vector<int>.One
Vector<int> zero = Vector<int>.Zero

and use 'one' and 'zero' within the loop. This avoids the overhead of constructing these constant vectors in each iteration through the loop. Hoisting of constant SIMD vectors like this out of loops is something we are tracking as an issue https://github.com/dotnet/coreclr/issues/7422. Till that is fixed, you would have to hand optimize it.

sivarv commented 7 years ago

@redknightlois - did you get a chance to try out new JIT and measure perf?

redknightlois commented 7 years ago

@sivarv not yet. I need https://github.com/PerfDotNet/BenchmarkDotNet/issues/292 to be resolved to get the proper benchmark data out. For long values it doesnt seem to generate different code though. Have to try it out with int values, but wont be able until probably mid next week.

sivarv commented 7 years ago

@redknightlois - did you get a chance to measure perf of your benchmark after modifying it to use Vector<int>.Dot()?

redknightlois commented 7 years ago

@sivarv Benchmarking tool not working yet with new builds. As soon as I make it work, will post results.

dotnet / runtime

RyuJIT: Vector.Dot not being inlined / converted to SIMD instructions? #6797