jackmott opened this issue 8 years ago
Maybe playing fast and loose on the stack could kick it up a notch? (was based on this)
We could see if it can combine well with NativePtr.stackalloc<'a>
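A rough sketch of how stack scratch space could be used for the leftover elements (hypothetical `sumLeftoversWithStackScratch`, with `count` standing in for `Vector<'T>.Count`; not code from the library):

```fsharp
#nowarn "9" // using NativePtr produces a verifiability warning
open Microsoft.FSharp.NativeInterop

// Hypothetical sketch: stack-allocate a small scratch buffer for the
// leftover elements instead of heap-allocating a temporary array.
// The memory from NativePtr.stackalloc lives only for this call.
let sumLeftoversWithStackScratch (array: int []) =
    let count = 4 // stand-in for Vector<int>.Count
    let scratch = NativePtr.stackalloc<int> count
    for j in 0 .. count - 1 do
        NativePtr.set scratch j 0 // zero-fill the scratch buffer
    // copy the trailing elements that don't fill a full vector
    let leftoverStart = array.Length - (array.Length % count)
    for j in leftoverStart .. array.Length - 1 do
        NativePtr.set scratch (j - leftoverStart) array.[j]
    // sum the scratch buffer (a real version would load it into a Vector)
    let mutable acc = 0
    for j in 0 .. count - 1 do
        acc <- acc + NativePtr.get scratch j
    acc
```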
Some experiments with loop unrolling on map show no change in runtime at all, doing it like so:
let mutable i, ri = 0, 0
while i <= len - (inCount <<< 2) do
    let mutable v = Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
    v <- Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
    v <- Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
    v <- Vector< ^T>(array, i)
    (f v).CopyTo(result, ri)
    i <- i + inCount
    ri <- ri + outCount
Two at a time or four at a time, no difference. Interestingly, it also did not seem to matter if you constrain map to return the same type it started with, which lets you omit the separate ri counter. So maybe memory throughput is the limiter when the mapping function is a single add instruction.
Have you been inspecting the differences in the generated IL?
Not yet. But I will have to go down that path soon.
Another note: filling an array with random values nets very different benchmark results than filling them with a constant value.
Can you post the results? There should be a GitHub-flavored .md file in the /bin
How do I run the benchmarks to produce the .md output file? I'm just running it from the command line and only get output to the terminal.
Ok, I seem to have convinced myself that the random values vs constant values difference wasn't really happening anyway.
@cloudRoutine am I interpreting that BlitParser stuff correctly? They are outputting IL by hand there? That is rad.
Yea it was originally in C#, but I took a look at it and knew I could rewrite it in F# :grinning:
So it looks like removing the call to getLeftovers and actually inlining that code is a big win. I will experiment with that more to be sure though.
So the primary loop for map is:
while i <= lenLessCount do
    (f (Vector< ^T>(array, i))).CopyTo(result, i)
    i <- i + count
and the IL for that loop is:
IL_0038: stloc.s i
// loop start (head: IL_003a)
IL_003a: ldloc.s i
IL_003c: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_0041: ldloc.3
IL_0042: bgt.s IL_007b
IL_0044: ldarg.0
IL_0045: ldarg.1
IL_0046: ldloc.s i
IL_0048: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_004d: newobj instance void valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>::.ctor(!0[], int32)
IL_0052: callvirt instance !1 class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>, valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>>::Invoke(!0)
IL_0057: stloc.s v
IL_0059: ldloca.s v
IL_005b: ldloc.1
IL_005c: ldloc.s i
IL_005e: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_0063: call instance void valuetype [System.Numerics.Vectors]System.Numerics.Vector`1<!!T>::CopyTo(!0[], int32)
IL_0068: ldloc.s i
IL_006a: ldloc.s i
IL_006c: call instance !0 class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::get_contents()
IL_0071: ldloc.2
IL_0072: add
IL_0073: call instance void class [FSharp.Core]Microsoft.FSharp.Core.FSharpRef`1<int32>::set_contents(!0)
IL_0078: nop
IL_0079: br.s IL_003a
// end loop
My IL knowledge is near zero, so I'm not sure if there is any low-hanging fruit there. I do know that the SIMD stuff creates unnecessary bounds checks, but I don't know whether those can be elided at the IL level.
Possible to roll our own CopyTo without the checks? https://github.com/dotnet/corefx/blob/future/src/System.Numerics.Vectors/src/System/Numerics/Vector.cs#L777
Doesn't seem like JitIntrinsic is an attribute we get to use :( It is possible the JitIntrinsic magic elides all of that somehow anyway, as well.
Oh wow, using a tail-recursive main loop instead of a while loop is consistently faster. This test was done with an array of doubles (so each vector holds half as many elements as it would for int32s).
Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0
Type=SIMDBenchmark Mode=Throughput Platform=X64
Jit=RyuJit GarbageCollection=Concurrent Workstation
Method | Length | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
---|---|---|---|---|---|---|---|
SIMDMap | 10000 | 13.9526 us | 0.2316 us | 7,357.00 | - | - | 35,302.57 |
SIMDMapRec | 10000 | 9.5063 us | 0.1882 us | 7,207.81 | - | - | 34,532.16 |
Map | 10000 | 11.9088 us | 0.1615 us | 7,934.99 | - | - | 34,738.85 |
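For context, a minimal sketch of what the tail-recursive main loop might look like (specialized to int for brevity, with leftover elements past the last full vector not handled; not the exact library code):

```fsharp
open System.Numerics

// Hypothetical sketch of the tail-recursive main loop. The tail call
// compiles down to a branch, so there is no stack growth.
// Leftover elements past the last full vector are not handled here.
let mapRec (f: Vector<int> -> Vector<int>) (array: int []) =
    let result : int [] = Array.zeroCreate array.Length
    let count = Vector<int>.Count
    let rec loop i =
        if i <= array.Length - count then
            (f (Vector<int>(array, i))).CopyTo(result, i)
            loop (i + count)
    loop 0
    result
```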
So this trick doesn't work with sum, map2, or map3 unfortunately. Works great with mapInPlace though!
Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0
Type=SIMDBenchmark Mode=Throughput Platform=X64
Jit=RyuJit GarbageCollection=Concurrent Workstation
Method | Length | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
---|---|---|---|---|---|---|---|
SIMDMapInPlace | 100000 | 31.8363 us | 0.5257 us | - | - | - | 63.30 |
SIMDMapInPlaceRec | 100000 | 19.5686 us | 0.4455 us | - | - | - | 36.54 |
Map | 100000 | 166.5645 us | 4.6652 us | - | - | 4,582.00 | 156,742.26 |
Did you see a big difference in the decompiled source of sum, map2, or map3 using the recursive implementation and the while loop? Is there something happening for the map and map in place that isn't happening for the others?
Yeah, so I got some insight into what was going on from r/fsharp. I was using the loop counter variable in a lambda below, when dealing with the leftover bits of the array. This causes the compiler to turn the mutable i into a reference cell. So when I fix that and go back to using a while loop, it is faster still.
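To illustrate the pitfall, here is a sketch with the ref cell written explicitly, mirroring what the compiler generates when a mutable counter is used inside the leftover sequence expression, along with one way to avoid it (hypothetical `leftoverViaRefCell` / `leftoverViaIndex` helpers, not library code):

```fsharp
// What the compiler effectively generated: the counter becomes a
// heap-allocated FSharpRef<int> because it is used inside the
// [| ... yield ... |] expression; here the ref cell is written out by hand.
let leftoverViaRefCell (array: int []) (start: int) (count: int) =
    let vi = ref start // shows up as FSharpRef<int>::get_contents in the IL
    [| for _ in 1 .. count do
           let v = if vi.Value < array.Length then array.[vi.Value] else 0
           vi.Value <- vi.Value + 1
           yield v |]

// The fix: keep the counter out of any closure and index directly,
// so the hot loop's counter stays a plain int32 local.
let leftoverViaIndex (array: int []) (start: int) (count: int) =
    Array.init count (fun j ->
        if start + j < array.Length then array.[start + j] else 0)
```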
Results with the loop stuff all sorted out. Big improvements for small arrays which is nice.
Host Process Environment Information:
BenchmarkDotNet=v0.9.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240907 ticks, Resolution=446.2479 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1590.0
Type=SIMDBenchmark Mode=Throughput Platform=X64
Jit=RyuJit GarbageCollection=Concurrent Workstation
Method | Length | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
---|---|---|---|---|---|---|---|
SIMDMap | 10 | 36.7623 ns | 0.5923 ns | 0.09 | - | - | 51.75 |
Map | 10 | 15.7268 ns | 0.1792 ns | 0.05 | - | - | 27.00 |
SIMDMap | 100 | 72.0358 ns | 0.7518 ns | 0.35 | - | - | 194.39 |
Map | 100 | 120.2898 ns | 1.3775 ns | 0.33 | - | - | 188.29 |
SIMDMap | 1000 | 394.5545 ns | 9.1914 ns | 2.78 | - | - | 1,891.49 |
Map | 1000 | 1,072.4757 ns | 19.5905 ns | 3.27 | - | - | 2,225.31 |
SIMDMap | 10000 | 4,641.6221 ns | 121.5923 ns | 29.11 | - | - | 16,394.50 |
Map | 10000 | 10,056.5177 ns | 162.3417 ns | 28.07 | - | - | 15,836.59 |
SIMDMap | 1000000 | 1,789,189.0427 ns | 24,558.2193 ns | - | - | 785.00 | 1,717,301.72 |
Map | 1000000 | 2,070,883.0174 ns | 24,421.1725 ns | - | - | 527.23 | 1,154,018.44 |
I think short of exotic hacks things are as fast as they are going to get now. If anyone has exotic hacks feel free to chime in.
let inline private getLeftovers (array: ^T []) (curIndex: int) : ^T Vector =
    let mutable vi = curIndex
    let d = Unchecked.defaultof< ^T>
    let leftOverArray =
        [| for i = 1 to Vector< ^T>.Count do
               if vi < array.Length then
                   yield array.[vi]
                   vi <- vi + 1
               else
                   yield d |]
    Vector< ^T> leftOverArray
If we know the leftOverArray is going to be the size of the vector, we can just create an array of that size and fill it; we don't have to use the yield construct, which involves going through
internal static Vector<T> getLeftovers<T>(T[] array, int curIndex) where T : struct
{
    FSharpRef<int> vi = new FSharpRef<int>(curIndex);
    T[] leftOverArray = SeqModule.ToArray<T>(new SIMD<T>.leftOverArray@23(array, vi, default(T), 0, null, 0, default(T)));
    return new Vector<T>(leftOverArray);
}
// --------------------------------------
// Microsoft.FSharp.Collections.SeqModule
[CompilationSourceName("toArray")]
public static T[] ToArray<T>(IEnumerable<T> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    T[] array = source as T[];
    if (array != null)
    {
        T[] array2 = array;
        return (T[])array2.Clone();
    }
    FSharpList<T> fSharpList = source as FSharpList<T>;
    if (fSharpList != null)
    {
        FSharpList<T> l = fSharpList;
        return List.toArray<T>(l);
    }
    ICollection<T> collection = source as ICollection<T>;
    if (collection != null)
    {
        ICollection<T> collection2 = collection;
        T[] array2 = new T[collection2.Count];
        collection2.CopyTo(array2, 0);
        return array2;
    }
    return new List<T>(source).ToArray();
}
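A sketch of the yield-free version being suggested, assuming the buffer is always exactly `Vector<'T>.Count` wide (hypothetical name `getLeftoversNoYield`; not the actual library code):

```fsharp
open System.Numerics

// Yield-free sketch of getLeftovers: the buffer is always exactly
// Vector<'T>.Count wide, so preallocate it and fill with a plain loop,
// avoiding the seq state machine and SeqModule.ToArray entirely.
let inline getLeftoversNoYield (array: ^T []) (curIndex: int) : Vector< ^T> =
    let count = Vector< ^T>.Count
    let leftOver = Array.zeroCreate count // already zeroed, so no else branch
    for j in 0 .. count - 1 do
        if curIndex + j < array.Length then
            leftOver.[j] <- array.[curIndex + j]
    Vector< ^T>(leftOver)
```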
Good point. I'll change that.
For leftovers, instead of
if i < len then
    let leftOver = len - i
    let leftOverArray1 = Array.zeroCreate count
    let leftOverArray2 = Array.zeroCreate count
    for j in 0 .. leftOverArray1.Length - 1 do
        if j < leftOver then
            leftOverArray1.[j] <- array1.[j+i]
            leftOverArray2.[j] <- array2.[j+i]
    let v = f (Vector< ^T>(leftOverArray1, 0)) (Vector< ^U>(leftOverArray2, 0))
    for j in 0 .. leftOver - 1 do
        result.[i] <- v.[j]
        i <- i + 1
we could use
if i < len then
    let lastVector1 = Vector< ^T>(array1, len - count)
    let lastVector2 = Vector< ^U>(array2, len - count)
    (f lastVector1 lastVector2).CopyTo(result, len - count)
We create a vector at the very end; it overlaps with the last one from the main while loop, but there are no allocations and fewer instructions overall. We must check for the edge case when len < count, but for an input size of 1001 in the map2 benchmark the difference is almost 8%.
I think this is probably worthwhile to apply to all of the 2 and 3 array functions
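A sketch of that trick with the len < count edge case guarded, specialized to int, with a hypothetical scalar fallback `scalarF` for short arrays (not the actual library code):

```fsharp
open System.Numerics

// Sketch of the overlapping-tail trick: when the array is shorter than
// one vector, fall back to a scalar loop; otherwise reprocess the last
// full vector's worth of elements. Some elements overlap with the main
// loop's work, but there are no allocations.
let map2Tail (f: Vector<int> -> Vector<int> -> Vector<int>)
             (scalarF: int -> int -> int)
             (array1: int []) (array2: int []) (result: int []) (i: int) =
    let len = array1.Length
    let count = Vector<int>.Count
    if len >= count then
        if i < len then
            let lastVector1 = Vector<int>(array1, len - count)
            let lastVector2 = Vector<int>(array2, len - count)
            (f lastVector1 lastVector2).CopyTo(result, len - count)
    else
        // edge case: fewer elements than one vector, process them scalar
        for j in i .. len - 1 do
            result.[j] <- scalarF array1.[j] array2.[j]
```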
Yeah, that had a big payoff for map3. I'm going to apply it to mapi2 as well.
Thanks to @cloudRoutine we have a solid benchmarking framework in place. There are some cases where SIMD versions of the operations don't win, or only win with very large arrays. We should investigate ways to improve that.
Loop Unrolling
Unrolling the main SIMD loop could be viable, as the JIT tends not to do that, while C++ compilers tend to put two SIMD operations per loop. EDIT - I've not managed to get any meaningful performance gains by trying this, but maybe there is a more clever way.