jackmott opened this issue 8 years ago
Maybe playing fast and loose on the stack could kick it up a notch? (was based on this)
We could see if it can combine well with NativePtr.stackalloc<'a>
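A rough sketch of how stack scratch space could be used for the leftover elements (hypothetical `sumLeftoversWithStackScratch`, with `count` standing in for `Vector<'T>.Count`; not code from the library):

```fsharp
#nowarn "9" // using NativePtr produces a verifiability warning
open Microsoft.FSharp.NativeInterop

// Hypothetical sketch: stack-allocate a small scratch buffer for the
// leftover elements instead of heap-allocating a temporary array.
// The memory from NativePtr.stackalloc lives only for this call.
let sumLeftoversWithStackScratch (array: int []) =
    let count = 4 // stand-in for Vector<int>.Count
    let scratch = NativePtr.stackalloc<int> count
    for j in 0 .. count - 1 do
        NativePtr.set scratch j 0 // zero-fill the scratch buffer
    // copy the trailing elements that don't fill a full vector
    let leftoverStart = array.Length - (array.Length % count)
    for j in leftoverStart .. array.Length - 1 do
        NativePtr.set scratch (j - leftoverStart) array.[j]
    // sum the scratch buffer (a real version would load it into a Vector)
    let mutable acc = 0
    for j in 0 .. count - 1 do
        acc <- acc + NativePtr.get scratch j
    acc
```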
Some experiments with loop unrolling on map show no change in runtime at all, doing it like so:
let mutable i, ri = 0, 0
while i <= len - (inCount <<< 2) do
    let mutable v = Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
    v <- Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
    v <- Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
    v <- Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
Two at a time or four at a time, no difference. Interestingly, it also did not seem to matter if you constrain map to return the same type it started with, which lets you omit the separate ri counter. So maybe memory throughput is the limiter when the mapping function is a single add instruction.
Have you been inspecting the differences in the generated IL?
Not yet. But I will have to go down that path soon.
Another note: filling an array with random values nets very different benchmark results than filling them with a constant value.
Can you post the results? There should be a GitHub-flavored .md file in the /bin
How do I run the benchmarks to produce the .md output file? I'm just running it from the command line and only get output to the terminal.
Ok, I seem to have convinced myself that the random values vs constant values difference wasn't really happening anyway.
@cloudRoutine am I interpreting that BlitParser stuff correctly? They are outputting IL by hand there? That is rad.
Yea it was originally in C#, but I took a look at it and knew I could rewrite it in F# :grinning:
So it looks like removing the call to getLeftovers and actually inlining that code is a big win. I will experiment with that more to be sure though.
So the primary loop for map is:
while i <= lenLessCount do
    (f (Vector< ^T>(array, i))).CopyTo(result, i)
    i <- i + count
and the IL for that loop is:
IL_0038: stloc.s i
// loop start (head: IL_003a)
IL_003a: ldloc.s i
IL_003c: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_0041: ldloc.3
IL_0042: bgt.s IL_007b
IL_0044: ldarg.0
IL_0045: ldarg.1
IL_0046: ldloc.s i
IL_0048: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_004d: newobj instance void valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>::.ctor(!0[], int32)
IL_0052: callvirt instance !1 class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>>::Invoke(!0)
IL_0057: stloc.s v
IL_0059: ldloca.s v
IL_005b: ldloc.1
IL_005c: ldloc.s i
IL_005e: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_0063: call instance void valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>::CopyTo(!0[], int32)
IL_0068: ldloc.s i
IL_006a: ldloc.s i
IL_006c: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_0071: ldloc.2
IL_0072: add
IL_0073: call instance void class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::set_contents(!0)
IL_0078: nop
IL_0079: br.s IL_003a
// end loop
My IL knowledge is near zero, so I'm not sure if there is any low-hanging fruit there. I do know that the SIMD stuff creates unnecessary bounds checks, but I don't know whether those can be elided at the IL level.
Possible to roll our own CopyTo without the checks? https://github.com/dotnet/corefx/blob/future/src/System.Numerics.Vectors/src/System/Numerics/Vector.cs#L777
Doesn't seem like JitIntrinsic is an attribute we get to use :( It is possible the JitIntrinsic magic elides all of that somehow anyway, as well.
Oh wow, using a tail-recursive main loop instead of a while loop is consistently faster. This test was done with an array of doubles (so each vector holds half as many elements as it would for int32s).
Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0
Type=SIMDBenchmark Mode=Throughput Platform=X64
Jit=RyuJit GarbageCollection=Concurrent Workstation
Method | Length | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
---|---|---|---|---|---|---|---|
SIMDMap | 10000 | 13.9526 us | 0.2316 us | 7,357.00 | - | - | 35,302.57 |
SIMDMapRec | 10000 | 9.5063 us | 0.1882 us | 7,207.81 | - | - | 34,532.16 |
Map | 10000 | 11.9088 us | 0.1615 us | 7,934.99 | - | - | 34,738.85 |
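For context, a minimal sketch of what the tail-recursive main loop might look like (specialized to int for brevity, with leftover elements past the last full vector not handled; not the exact library code):

```fsharp
open System.Numerics

// Hypothetical sketch of the tail-recursive main loop. The tail call
// compiles down to a branch, so there is no stack growth.
// Leftover elements past the last full vector are not handled here.
let mapRec (f: Vector<int> -> Vector<int>) (array: int []) =
    let result : int [] = Array.zeroCreate array.Length
    let count = Vector<int>.Count
    let rec loop i =
        if i <= array.Length - count then
            (f (Vector<int>(array, i))).CopyTo(result, i)
            loop (i + count)
    loop 0
    result
```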
So this trick doesn't work with sum, map2, or map3 unfortunately. Works great with mapInPlace though!
Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0
Type=SIMDBenchmark Mode=Throughput Platform=X64
Jit=RyuJit GarbageCollection=Concurrent Workstation
Method | Length | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
---|---|---|---|---|---|---|---|
SIMDMapInPlace | 100000 | 31.8363 us | 0.5257 us | - | - | - | 63.30 |
SIMDMapInPlaceRec | 100000 | 19.5686 us | 0.4455 us | - | - | - | 36.54 |
Map | 100000 | 166.5645 us | 4.6652 us | - | - | 4,582.00 | 156,742.26 |
Did you see a big difference in the decompiled source of sum, map2, or map3 using the recursive implementation and the while loop? Is there something happening for the map and map in place that isn't happening for the others?
Yeah, so I got some insight into what was going on from r/fsharp. I was using the loop counter variable in a lambda below, when dealing with the leftover bits of the array. This causes the compiler to turn the mutable i into a reference cell. So when I fix that and go back to using a while loop, it is faster still.
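To illustrate the pitfall, here is a sketch with the ref cell written explicitly, mirroring what the compiler generates when a mutable counter is used inside the leftover sequence expression, along with one way to avoid it (hypothetical `leftoverViaRefCell` / `leftoverViaIndex` helpers, not library code):

```fsharp
// What the compiler effectively generated: the counter becomes a
// heap-allocated FSharpRef<int> because it is used inside the
// [| ... yield ... |] expression; here the ref cell is written out by hand.
let leftoverViaRefCell (array: int []) (start: int) (count: int) =
    let vi = ref start // shows up as FSharpRef<int>::get_contents in the IL
    [| for _ in 1 .. count do
           let v = if vi.Value < array.Length then array.[vi.Value] else 0
           vi.Value <- vi.Value + 1
           yield v |]

// The fix: keep the counter out of any closure and index directly,
// so the hot loop's counter stays a plain int32 local.
let leftoverViaIndex (array: int []) (start: int) (count: int) =
    Array.init count (fun j ->
        if start + j < array.Length then array.[start + j] else 0)
```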
Results with the loop stuff all sorted out. Big improvements for small arrays which is nice.
Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0
Type=SIMDBenchmark Mode=Throughput Platform=X64
Jit=RyuJit GarbageCollection=Concurrent Workstation
Method | Length | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
---|---|---|---|---|---|---|---|
SIMDMap | 10 | 36.7623 ns | 0.5923 ns | 0.09 | - | - | 51.75 |
Map | 10 | 15.7268 ns | 0.1792 ns | 0.05 | - | - | 27.00 |
SIMDMap | 100 | 72.0358 ns | 0.7518 ns | 0.35 | - | - | 194.39 |
Map | 100 | 120.2898 ns | 1.3775 ns | 0.33 | - | - | 188.29 |
SIMDMap | 1000 | 394.5545 ns | 9.1914 ns | 2.78 | - | - | 1,891.49 |
Map | 1000 | 1,072.4757 ns | 19.5905 ns | 3.27 | - | - | 2,225.31 |
SIMDMap | 10000 | 4,641.6221 ns | 121.5923 ns | 29.11 | - | - | 16,394.50 |
Map | 10000 | 10,056.5177 ns | 162.3417 ns | 28.07 | - | - | 15,836.59 |
SIMDMap | 1000000 | 1,789,189.0427 ns | 24,558.2193 ns | - | - | 785.00 | 1,717,301.72 |
Map | 1000000 | 2,070,883.0174 ns | 24,421.1725 ns | - | - | 527.23 | 1,154,018.44 |
I think short of exotic hacks things are as fast as they are going to get now. If anyone has exotic hacks feel free to chime in.
let inline private getLeftovers (array: ^T []) (curIndex: int) : ^T Vector =
    let mutable vi = curIndex
    let d = Unchecked.defaultof< ^T>
    let leftOverArray =
        [| for i = 1 to Vector< ^T>.Count do
               if vi < array.Length then
                   yield array.[vi]
                   vi <- vi + 1
               else
                   yield d |]
    Vector< ^T> leftOverArray
If we know the leftOverArray is going to be the size of the vector, we can just create an array of that size and fill it; we don't have to use the yield construct, which involves going through
internal static Vector<T> getLeftovers<T>(T[] array, int curIndex) where T : struct
{
    FSharpRef<int> vi = new FSharpRef<int>(curIndex);
    T[] leftOverArray = SeqModule.ToArray<T>(new SIMD<T>.leftOverArray@23(array, vi, default(T), 0, null, 0, default(T)));
    return new Vector<T>(leftOverArray);
}
// --------------------------------------
// Microsoft.FSharp.Collections.SeqModule
[CompilationSourceName("toArray")]
public static T[] ToArray<T>(IEnumerable<T> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    T[] array = source as T[];
    if (array != null)
    {
        T[] array2 = array;
        return (T[])array2.Clone();
    }
    FSharpList<T> fSharpList = source as FSharpList<T>;
    if (fSharpList != null)
    {
        FSharpList<T> l = fSharpList;
        return List.toArray<T>(l);
    }
    ICollection<T> collection = source as ICollection<T>;
    if (collection != null)
    {
        ICollection<T> collection2 = collection;
        T[] array2 = new T[collection2.Count];
        collection2.CopyTo(array2, 0);
        return array2;
    }
    return new List<T>(source).ToArray();
}
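A sketch of the yield-free version being suggested, assuming the buffer is always exactly `Vector<'T>.Count` wide (hypothetical name `getLeftoversNoYield`; not the actual library code):

```fsharp
open System.Numerics

// Yield-free sketch of getLeftovers: the buffer is always exactly
// Vector<'T>.Count wide, so preallocate it and fill with a plain loop,
// avoiding the seq state machine and SeqModule.ToArray entirely.
let inline getLeftoversNoYield (array: ^T []) (curIndex: int) : Vector< ^T> =
    let count = Vector< ^T>.Count
    let leftOver = Array.zeroCreate count // already zeroed, so no else branch
    for j in 0 .. count - 1 do
        if curIndex + j < array.Length then
            leftOver.[j] <- array.[curIndex + j]
    Vector< ^T>(leftOver)
```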
Good point. I'll change that.
For leftovers, instead of
if i < len then
    let leftOver = len - i
    let leftOverArray1 = Array.zeroCreate count
    let leftOverArray2 = Array.zeroCreate count
    for j in 0 .. leftOverArray1.Length - 1 do
        if j < leftOver then
            leftOverArray1.[j] <- array1.[j+i]
            leftOverArray2.[j] <- array2.[j+i]
    let v = f (Vector< ^T>(leftOverArray1, 0)) (Vector< ^U>(leftOverArray2, 0))
    for j in 0 .. leftOver - 1 do
        result.[i] <- v.[j]
        i <- i + 1
we could use
if i < len then
    let lastVector1 = Vector< ^T>(array1, len - count)
    let lastVector2 = Vector< ^U>(array2, len - count)
    (f lastVector1 lastVector2).CopyTo(result, len - count)
We create a vector at the very end; it overlaps with the last one from the main while loop, but there are no allocations and fewer instructions overall. We must check for the edge case when len < count, but for an input size of 1001 in the map2 benchmark the difference is almost 8%.
I think this is probably worthwhile to apply to all of the 2 and 3 array functions
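A sketch of that trick with the len < count edge case guarded, specialized to int, with a hypothetical scalar fallback `scalarF` for short arrays (not the actual library code):

```fsharp
open System.Numerics

// Sketch of the overlapping-tail trick: when the array is shorter than
// one vector, fall back to a scalar loop; otherwise reprocess the last
// full vector's worth of elements. Some elements overlap with the main
// loop's work, but there are no allocations.
let map2Tail (f: Vector<int> -> Vector<int> -> Vector<int>)
             (scalarF: int -> int -> int)
             (array1: int []) (array2: int []) (result: int []) (i: int) =
    let len = array1.Length
    let count = Vector<int>.Count
    if len >= count then
        if i < len then
            let lastVector1 = Vector<int>(array1, len - count)
            let lastVector2 = Vector<int>(array2, len - count)
            (f lastVector1 lastVector2).CopyTo(result, len - count)
    else
        // edge case: fewer elements than one vector, process them scalar
        for j in i .. len - 1 do
            result.[j] <- scalarF array1.[j] array2.[j]
```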
Yeah, that had a big payoff for map3. I'm going to apply it to mapi2 as well.
Thanks to @cloudRoutine we have a solid benchmarking framework in place. There are some cases where SIMD versions of the operations don't win, or only win with very large arrays. We should investigate ways to improve that.
Loop Unrolling
Unrolling the main SIMD loop could be viable, as the JIT tends not to do that, while C++ compilers tend to put two SIMD operations per loop. EDIT - I've not managed to get any meaningful performance gains by trying this, but maybe there is a more clever way.