SciSharp / NumSharp

High Performance Computation for N-D Tensors in .NET, similar API to NumPy.
https://github.com/SciSharp
Apache License 2.0
1.38k stars 192 forks source link

Import benchmark tool #63

Closed Oceania2018 closed 6 years ago

Oceania2018 commented 6 years ago

Performance is critical for numerical computation. Import the benchmark tool BenchmarkDotNet to measure and visualize the performance.

fdncred commented 6 years ago

I've been thinking the same thing. I've used BenchmarkDotNet before but I'm not fluent in how it's used. I'll explore it when I get time.

Oceania2018 commented 6 years ago

@fdncred Appriciate, @dotChris90 Any idea on it?

dotChris90 commented 6 years ago

@Oceania2018 never used it before but you are right. we need some benchmark framework since we need to make critical decisions like your idea with the 1D array. If I did not see with my own eyes that matrix multiplication with 2 1D arrays is faster then 2 2D arrays I would not believe it.

fdncred commented 6 years ago

@Oceania2018, @dotChris90, BenchmarkDotNet is definitely one of the best to use. We'll just have to figure out the details on how to use it properly.

fdncred commented 6 years ago

Just pushed first benchmark changes with 343c1bba2422c272e2fc87d2c35492aec5131cd4. I suggest starting here for writing further benchmarks.


BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17763.55 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v2 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.1.403
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-BSWZJZ : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

IterationCount=5  RunStrategy=ColdStart  
Method N Mean Error StdDev Median Min Max
Access1 10 117.4 us 998.6 us 259.3 us 0.7000 us 0.7000 us 581.3 us
Access2 10 120.5 us 1,024.8 us 266.1 us 1.0000 us 0.8000 us 596.6 us
Access3 10 244.3 us 2,053.4 us 533.3 us 5.0000 us 4.1000 us 1,198.2 us
Access4 10 264.6 us 2,231.2 us 579.4 us 4.3000 us 4.1000 us 1,301.1 us
Access5 10 842.4 us 7,153.5 us 1,857.7 us 6.6000 us 6.1000 us 4,165.6 us
CheckPlusOperation1 10 1,542.2 us 13,039.4 us 3,386.3 us 26.9000 us 22.8000 us 7,599.8 us
CheckPlusOperation2 10 1,184.6 us 10,102.3 us 2,623.5 us 10.9000 us 10.5000 us 5,877.7 us
CheckMatrixMultiplication1 10 1,766.1 us 15,008.1 us 3,897.6 us 24.1000 us 20.4000 us 8,738.3 us
CheckMatrixMultiplication2 10 1,661.3 us 14,079.7 us 3,656.5 us 19.1000 us 18.7000 us 8,202.1 us
Access1 100 166.2 us 956.2 us 248.3 us 53.2000 us 53.1000 us 610.4 us
Access2 100 191.6 us 1,043.2 us 270.9 us 66.2000 us 64.3000 us 675.9 us
Access3 100 289.7 us 2,013.1 us 522.8 us 56.9000 us 54.1000 us 1,224.9 us
Access4 100 313.7 us 2,166.1 us 562.5 us 62.4000 us 57.1000 us 1,319.9 us
Access5 100 1,063.6 us 7,178.6 us 1,864.3 us 226.4000 us 208.3000 us 4,398.1 us
CheckPlusOperation1 100 2,257.8 us 13,340.9 us 3,464.6 us 698.9000 us 680.7000 us 8,455.3 us
CheckPlusOperation2 100 1,798.1 us 10,819.1 us 2,809.7 us 551.7000 us 530.2000 us 6,824.2 us
CheckMatrixMultiplication1 100 4,851.9 us 14,712.4 us 3,820.8 us 3,145.9000 us 3,112.1000 us 11,686.6 us
CheckMatrixMultiplication2 100 4,815.8 us 14,219.3 us 3,692.7 us 3,169.2000 us 3,125.2000 us 11,421.3 us
dotChris90 commented 6 years ago

@fdncred thanks for sharing. The link is a good "evening book" to read. ;) It is great to see that Numsharp now has testing and benchmark - feels slowly more and more professional.

fdncred commented 6 years ago

Here's a test that took more than an hour to run. It ran the tests with 100, 1000, and 3000 items in the arrays.


BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17763.55 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v2 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.1.403
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-JWZVYD : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

IterationCount=5  RunStrategy=ColdStart  
Method N Mean Error StdDev Median Min Max Rank
Access1 100 165.4 us 943.3 us 245.0 us 55.10 us 51.70 us 603.5 us 1
Access2 100 192.3 us 1,014.5 us 263.4 us 66.30 us 64.30 us 662.6 us 1
Access3 100 296.4 us 1,989.8 us 516.7 us 71.30 us 54.20 us 1,220.6 us 1
Access4 100 317.4 us 2,180.8 us 566.3 us 62.40 us 57.30 us 1,330.4 us 1
Access5 100 1,092.0 us 7,449.5 us 1,934.6 us 224.50 us 218.80 us 4,552.7 us 2
CheckPlusOperation1 100 2,450.4 us 13,141.8 us 3,412.9 us 1,077.80 us 676.80 us 8,540.6 us 4
CheckPlusOperation2 100 1,736.3 us 10,195.6 us 2,647.8 us 543.50 us 540.30 us 6,472.7 us 3
CheckMatrixMultiplication1 100 4,787.4 us 14,548.1 us 3,778.1 us 3,102.60 us 3,070.40 us 11,545.7 us 5
CheckMatrixMultiplication2 100 4,816.5 us 14,327.0 us 3,720.7 us 3,108.20 us 3,097.70 us 11,470.4 us 5
Access1 1000 5,816.7 us 3,411.3 us 885.9 us 5,440.70 us 5,316.90 us 7,394.9 us 5
Access2 1000 7,012.3 us 3,665.4 us 951.9 us 6,638.50 us 6,493.40 us 8,711.7 us 6
Access3 1000 5,681.9 us 5,532.3 us 1,436.7 us 5,064.90 us 4,980.90 us 8,250.9 us 5
Access4 1000 6,797.0 us 4,760.9 us 1,236.4 us 6,615.30 us 5,683.30 us 8,640.0 us 6
Access5 1000 29,696.4 us 15,619.8 us 4,056.4 us 28,195.90 us 26,114.70 us 35,100.1 us 7
CheckPlusOperation1 1000 77,846.7 us 16,120.4 us 4,186.4 us 77,195.10 us 72,908.80 us 84,288.6 us 10
CheckPlusOperation2 1000 81,742.8 us 27,940.8 us 7,256.1 us 79,597.10 us 75,809.10 us 93,948.7 us 10
CheckMatrixMultiplication1 1000 21,272,555.0 us 160,448.1 us 41,667.9 us 21,269,672.10 us 21,216,600.90 us 21,326,757.3 us 15
CheckMatrixMultiplication2 1000 12,194,916.3 us 331,263.2 us 86,028.0 us 12,165,985.10 us 12,109,752.30 us 12,307,438.7 us 14
Access1 3000 51,640.1 us 23,709.5 us 6,157.3 us 48,995.50 us 48,641.20 us 62,649.6 us 8
Access2 3000 63,069.5 us 20,244.8 us 5,257.5 us 61,673.40 us 59,025.70 us 71,955.2 us 9
Access3 3000 63,213.8 us 16,639.7 us 4,321.3 us 60,809.80 us 60,278.50 us 70,328.9 us 9
Access4 3000 63,362.4 us 10,016.1 us 2,601.2 us 62,831.70 us 61,477.60 us 67,848.0 us 9
Access5 3000 227,584.3 us 5,332.8 us 1,384.9 us 226,867.30 us 226,233.20 us 229,426.1 us 11
CheckPlusOperation1 3000 635,782.2 us 40,001.7 us 10,388.3 us 632,562.90 us 628,092.30 us 653,942.5 us 12
CheckPlusOperation2 3000 696,463.0 us 19,601.5 us 5,090.4 us 695,063.70 us 692,161.50 us 705,305.3 us 13
CheckMatrixMultiplication1 3000 792,708,661.9 us 14,478,874.2 us 3,760,117.6 us 795,124,884.80 us 786,975,421.80 us 795,414,398.9 us 17
CheckMatrixMultiplication2 3000 425,625,221.2 us 5,820,745.9 us 1,511,629.2 us 425,790,985.10 us 423,823,215.60 us 427,567,933.3 us 16
dotChris90 commented 6 years ago

@fdncred just watch the benchmark test. Extremely nice. Looks easy to use, clean and still very mature. Nice suggestion.

fdncred commented 6 years ago

@dotChris90, Thanks. We've just scratched the surface. There are sooooo many things it can do. We're not using it exactly right but it's a start. I forced the coldstart instead of warm ups because it just took so long.

Oceania2018 commented 6 years ago

@dotChris90 I just found this code extremly slow down NDArray.ARange:

public NDArray<T> ARange(int stop, int start = 0, int step = 1)
        {
            dynamic npDyn = this;

            dynamic npResult = NumSharp.Extensions.NDArrayExtensions.ARange(npDyn, stop,start,step);

            return npResult;
        }

20x slower. We need to fix it. I'll push benchmark code np.arrange.cs.

image

Acutall, ndarray doesn't need arange extension, only NumPy.arange make sense. I'll delete NDArray<T>.ARange.

dotChris90 commented 6 years ago

@Oceania2018 I was expecting this - but not that huge.

please rise an issue "Remove Extension Linking". This methods tries to make extension methods accessible from F#, Php, Powershell and even Python.NET - see #4 .

But I will remove them. And open maybe 1 Side Project in src folder "NumSharp.NonCSharp" for making it better accessible or "NumSharp.Powershell", "NumSharp.FSharp", and so on. The original Core C# must have the best performance. The other "language accessibility" is just a nice to have.

But! If you do not mind - I would like to have a look closely to the extension methods again and check which methods are better as extension methods and which can used as normal. For a better accessibility from other languages I would like to try to use as few extension methods as possible.

Oceania2018 commented 6 years ago

@dotChris90 I've pushed the code. The thing is ndarray doesn't need arange extension, there is no ndarray.arange in python numpy neither. I think you have to remove it and all the other extensions which is not consist with python numpy. I added the NumPy class to adapt the high level API which is mocking numpy. We should use like

var np = new NumPy<int>();
var n = np.arange(3).reshape(3, 1);

User doesn't need to interact with NDArray<T>

@fdncred BenchmarkDoNet really helps us to find performance issue.

dotChris90 commented 6 years ago

@Oceania2018 ah ok got it. I will remove the extension linking class tomorrow - or you do. ;)

Oceania2018 commented 6 years ago

@dotChris90 It looks more sense now, ahh. After I remove the dynamic type. like:

dynamic npDyn = this;
return NumSharp.Extensions.NDArrayExtensions.ArgMax(npDyn);

I think we should not use dynamic type anymore. Performance killer.

image

dotChris90 commented 6 years ago

@Oceania2018 I was expecting this - same performance killer like DynamicObject but honestly I had no idea what it is such critical. 10 or 20 times ... wow --> thats why compilers are better than interpreter lol

I will rethink about a strategy for operators. Operators in C# can not! be! extensions - they must be static nonextensions method. example The input and output for a NDArray + operation is clear. NDArray always. but inside the code I use dynamic. So need to rewrite them again.

Oceania2018 commented 6 years ago

@dotChris90 before you rewrite it, give them a benchmark first. remove dynamic that will be fine.

Oceania2018 commented 6 years ago

@dotChris90 @fdncred Continue to speed up 3x more.

image

Can we improve further?

NEW GOAL: 150 us.

fdncred commented 6 years ago

Nice! Great job.

fdncred commented 6 years ago

Is there any way to use Span because it's super fast?

fdncred commented 6 years ago

Here's another idea. I have not tested this yet but any Enumerable has .AsParallel() method. I wonder if some of the methods would benefit from that?

So it'd be something like this:

public static NDArray<double> ARange(this NDArray<double> np,int stop, int start = 0, int step = 1)
        {
            int index = 0;

            np.Data = Enumerable.Range(start,stop - start)
                                .Where(x => index++ % step == 0)
                                .Select(x => (double)x)
                                .ToArray();
            np.Shape = new Shape(new int[] { stop });
            return np;
        }

Changed to this:

public static NDArray<double> ARange(this NDArray<double> np,int stop, int start = 0, int step = 1)
        {
            int index = 0;

            np.Data = Enumerable.Range(start,stop - start)
                                .Where(x => index++ % step == 0)
                                .Select(x => (double)x)
                                .AsParallel()
                                .ToArray();
            np.Shape = new Shape(new int[] { stop });
            return np;
        }
Oceania2018 commented 6 years ago

Try to avoid using Linq. I used the raw for.

        public static NDArray<int> ARange(this NDArray<int> np,int stop, int start = 0, int step = 1)
        {
            var list = new int[(int)Math.Ceiling((stop - start + 0.0) / step)];
            int index = 0;

            for (int i = start; i < stop; i += step)
                list[index++] = i;

            np.Data = list;
            np.Shape = new Shape(list.Length);

            return np;
        }

@fdncred Nothing to do with Span<T>.

fdncred commented 6 years ago

Yes, I realized that is what you did. I'm suggesting that Linq performance could be improved with .AsParallel(). And of course, just about any for loop could be improved with Parallel.For.

dotChris90 commented 6 years ago

So I suggest we should consider the following benchmark tests. I write them down no matter if already exists.

This are at least the 4 performance checks I would like to see. Please feel free to add more and later can rise issues. 😉

dotChris90 commented 6 years ago

@Oceania2018 @fdncred by the way you 2 ;) - I did just some experiments with casts. I was really surprised that the generic T[] Data property (so a generic array) can not be directly cast to double[] or sth else. An alternative strategy is the boxing - so array.select(x => (double)x) and so on. But I am not a fan of boxing. The most interesting and maybe promising solution is :

double[] A = np.Data as double[];

With this we can avoid using dynamic keyword and cast the whole array without boxing (hope at least - I do not know what "as" keyword is doing in background).

We could also maybe try to bring this into some extension methods since the only alternative is to write every extension method for "NDArray< double >,NDArray< Complex >, ...." so for every possible type (except we have no numerical operation and so maybe can use NDArray< T >).

fdncred commented 6 years ago

Boxing kills performance. I've seen articles that talk about performance difference between "As casting" compared to "direct casting" or "is casting".

https://www.danielcrabtree.com/blog/164/c-sharp-7-micro-benchmarking-the-three-ways-to-cast-safely https://stackoverflow.com/questions/496096/casting-vs-using-the-as-keyword-in-the-clr https://buildstarted.com/2014/06/27/casting-in-csharp/

dotChris90 commented 6 years ago

@fdncred thanks for sharing. The articles explains a lot. I was often wondering about the different casting options.

Most import is that pattern matching is really the strategy to go with for us. :)

fdncred commented 6 years ago

it's probably safe to close this.