Closed Oceania2018 closed 6 years ago
I've been thinking the same thing. I've used BenchmarkDotNet before but I'm not fluent in how it's used. I'll explore it when I get time.
@fdncred Appriciate, @dotChris90 Any idea on it?
@Oceania2018 never used it before but you are right. we need some benchmark framework since we need to make critical decisions like your idea with the 1D array. If I did not see with my own eyes that matrix multiplication with 2 1D arrays is faster then 2 2D arrays I would not believe it.
@Oceania2018, @dotChris90, BenchmarkDotNet is definitely one of the best to use. We'll just have to figure out the details on how to use it properly.
Just pushed first benchmark changes with 343c1bba2422c272e2fc87d2c35492aec5131cd4. I suggest starting here for writing further benchmarks.
BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17763.55 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v2 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.1.403
[Host] : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
Job-BSWZJZ : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
IterationCount=5 RunStrategy=ColdStart
Method | N | Mean | Error | StdDev | Median | Min | Max |
---|---|---|---|---|---|---|---|
Access1 | 10 | 117.4 us | 998.6 us | 259.3 us | 0.7000 us | 0.7000 us | 581.3 us |
Access2 | 10 | 120.5 us | 1,024.8 us | 266.1 us | 1.0000 us | 0.8000 us | 596.6 us |
Access3 | 10 | 244.3 us | 2,053.4 us | 533.3 us | 5.0000 us | 4.1000 us | 1,198.2 us |
Access4 | 10 | 264.6 us | 2,231.2 us | 579.4 us | 4.3000 us | 4.1000 us | 1,301.1 us |
Access5 | 10 | 842.4 us | 7,153.5 us | 1,857.7 us | 6.6000 us | 6.1000 us | 4,165.6 us |
CheckPlusOperation1 | 10 | 1,542.2 us | 13,039.4 us | 3,386.3 us | 26.9000 us | 22.8000 us | 7,599.8 us |
CheckPlusOperation2 | 10 | 1,184.6 us | 10,102.3 us | 2,623.5 us | 10.9000 us | 10.5000 us | 5,877.7 us |
CheckMatrixMultiplication1 | 10 | 1,766.1 us | 15,008.1 us | 3,897.6 us | 24.1000 us | 20.4000 us | 8,738.3 us |
CheckMatrixMultiplication2 | 10 | 1,661.3 us | 14,079.7 us | 3,656.5 us | 19.1000 us | 18.7000 us | 8,202.1 us |
Access1 | 100 | 166.2 us | 956.2 us | 248.3 us | 53.2000 us | 53.1000 us | 610.4 us |
Access2 | 100 | 191.6 us | 1,043.2 us | 270.9 us | 66.2000 us | 64.3000 us | 675.9 us |
Access3 | 100 | 289.7 us | 2,013.1 us | 522.8 us | 56.9000 us | 54.1000 us | 1,224.9 us |
Access4 | 100 | 313.7 us | 2,166.1 us | 562.5 us | 62.4000 us | 57.1000 us | 1,319.9 us |
Access5 | 100 | 1,063.6 us | 7,178.6 us | 1,864.3 us | 226.4000 us | 208.3000 us | 4,398.1 us |
CheckPlusOperation1 | 100 | 2,257.8 us | 13,340.9 us | 3,464.6 us | 698.9000 us | 680.7000 us | 8,455.3 us |
CheckPlusOperation2 | 100 | 1,798.1 us | 10,819.1 us | 2,809.7 us | 551.7000 us | 530.2000 us | 6,824.2 us |
CheckMatrixMultiplication1 | 100 | 4,851.9 us | 14,712.4 us | 3,820.8 us | 3,145.9000 us | 3,112.1000 us | 11,686.6 us |
CheckMatrixMultiplication2 | 100 | 4,815.8 us | 14,219.3 us | 3,692.7 us | 3,169.2000 us | 3,125.2000 us | 11,421.3 us |
@fdncred thanks for sharing. The link is a good "evening book" to read. ;) It is great to see that Numsharp now has testing and benchmark - feels slowly more and more professional.
Here's a test that took more than an hour to run. It ran the tests with 100, 1000, and 3000 items in the arrays.
BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17763.55 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v2 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.1.403
[Host] : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
Job-JWZVYD : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
IterationCount=5 RunStrategy=ColdStart
Method | N | Mean | Error | StdDev | Median | Min | Max | Rank |
---|---|---|---|---|---|---|---|---|
Access1 | 100 | 165.4 us | 943.3 us | 245.0 us | 55.10 us | 51.70 us | 603.5 us | 1 |
Access2 | 100 | 192.3 us | 1,014.5 us | 263.4 us | 66.30 us | 64.30 us | 662.6 us | 1 |
Access3 | 100 | 296.4 us | 1,989.8 us | 516.7 us | 71.30 us | 54.20 us | 1,220.6 us | 1 |
Access4 | 100 | 317.4 us | 2,180.8 us | 566.3 us | 62.40 us | 57.30 us | 1,330.4 us | 1 |
Access5 | 100 | 1,092.0 us | 7,449.5 us | 1,934.6 us | 224.50 us | 218.80 us | 4,552.7 us | 2 |
CheckPlusOperation1 | 100 | 2,450.4 us | 13,141.8 us | 3,412.9 us | 1,077.80 us | 676.80 us | 8,540.6 us | 4 |
CheckPlusOperation2 | 100 | 1,736.3 us | 10,195.6 us | 2,647.8 us | 543.50 us | 540.30 us | 6,472.7 us | 3 |
CheckMatrixMultiplication1 | 100 | 4,787.4 us | 14,548.1 us | 3,778.1 us | 3,102.60 us | 3,070.40 us | 11,545.7 us | 5 |
CheckMatrixMultiplication2 | 100 | 4,816.5 us | 14,327.0 us | 3,720.7 us | 3,108.20 us | 3,097.70 us | 11,470.4 us | 5 |
Access1 | 1000 | 5,816.7 us | 3,411.3 us | 885.9 us | 5,440.70 us | 5,316.90 us | 7,394.9 us | 5 |
Access2 | 1000 | 7,012.3 us | 3,665.4 us | 951.9 us | 6,638.50 us | 6,493.40 us | 8,711.7 us | 6 |
Access3 | 1000 | 5,681.9 us | 5,532.3 us | 1,436.7 us | 5,064.90 us | 4,980.90 us | 8,250.9 us | 5 |
Access4 | 1000 | 6,797.0 us | 4,760.9 us | 1,236.4 us | 6,615.30 us | 5,683.30 us | 8,640.0 us | 6 |
Access5 | 1000 | 29,696.4 us | 15,619.8 us | 4,056.4 us | 28,195.90 us | 26,114.70 us | 35,100.1 us | 7 |
CheckPlusOperation1 | 1000 | 77,846.7 us | 16,120.4 us | 4,186.4 us | 77,195.10 us | 72,908.80 us | 84,288.6 us | 10 |
CheckPlusOperation2 | 1000 | 81,742.8 us | 27,940.8 us | 7,256.1 us | 79,597.10 us | 75,809.10 us | 93,948.7 us | 10 |
CheckMatrixMultiplication1 | 1000 | 21,272,555.0 us | 160,448.1 us | 41,667.9 us | 21,269,672.10 us | 21,216,600.90 us | 21,326,757.3 us | 15 |
CheckMatrixMultiplication2 | 1000 | 12,194,916.3 us | 331,263.2 us | 86,028.0 us | 12,165,985.10 us | 12,109,752.30 us | 12,307,438.7 us | 14 |
Access1 | 3000 | 51,640.1 us | 23,709.5 us | 6,157.3 us | 48,995.50 us | 48,641.20 us | 62,649.6 us | 8 |
Access2 | 3000 | 63,069.5 us | 20,244.8 us | 5,257.5 us | 61,673.40 us | 59,025.70 us | 71,955.2 us | 9 |
Access3 | 3000 | 63,213.8 us | 16,639.7 us | 4,321.3 us | 60,809.80 us | 60,278.50 us | 70,328.9 us | 9 |
Access4 | 3000 | 63,362.4 us | 10,016.1 us | 2,601.2 us | 62,831.70 us | 61,477.60 us | 67,848.0 us | 9 |
Access5 | 3000 | 227,584.3 us | 5,332.8 us | 1,384.9 us | 226,867.30 us | 226,233.20 us | 229,426.1 us | 11 |
CheckPlusOperation1 | 3000 | 635,782.2 us | 40,001.7 us | 10,388.3 us | 632,562.90 us | 628,092.30 us | 653,942.5 us | 12 |
CheckPlusOperation2 | 3000 | 696,463.0 us | 19,601.5 us | 5,090.4 us | 695,063.70 us | 692,161.50 us | 705,305.3 us | 13 |
CheckMatrixMultiplication1 | 3000 | 792,708,661.9 us | 14,478,874.2 us | 3,760,117.6 us | 795,124,884.80 us | 786,975,421.80 us | 795,414,398.9 us | 17 |
CheckMatrixMultiplication2 | 3000 | 425,625,221.2 us | 5,820,745.9 us | 1,511,629.2 us | 425,790,985.10 us | 423,823,215.60 us | 427,567,933.3 us | 16 |
@fdncred just watch the benchmark test. Extremely nice. Looks easy to use, clean and still very mature. Nice suggestion.
@dotChris90, Thanks. We've just scratched the surface. There are sooooo many things it can do. We're not using it exactly right but it's a start. I forced the coldstart instead of warm ups because it just took so long.
@dotChris90 I just found this code extremly slow down NDArray.ARange:
public NDArray<T> ARange(int stop, int start = 0, int step = 1)
{
dynamic npDyn = this;
dynamic npResult = NumSharp.Extensions.NDArrayExtensions.ARange(npDyn, stop,start,step);
return npResult;
}
20x slower. We need to fix it. I'll push benchmark code np.arrange.cs
.
Acutall, ndarray doesn't need arange extension, only NumPy.arange make sense. I'll delete NDArray<T>.ARange
.
@Oceania2018 I was expecting this - but not that huge.
please rise an issue "Remove Extension Linking". This methods tries to make extension methods accessible from F#, Php, Powershell and even Python.NET - see #4 .
But I will remove them. And open maybe 1 Side Project in src folder "NumSharp.NonCSharp" for making it better accessible or "NumSharp.Powershell", "NumSharp.FSharp", and so on. The original Core C# must have the best performance. The other "language accessibility" is just a nice to have.
But! If you do not mind - I would like to have a look closely to the extension methods again and check which methods are better as extension methods and which can used as normal. For a better accessibility from other languages I would like to try to use as few extension methods as possible.
@dotChris90 I've pushed the code.
The thing is ndarray
doesn't need arange extension, there is no ndarray.arange in python numpy neither.
I think you have to remove it and all the other extensions which is not consist with python numpy.
I added the NumPy
class to adapt the high level API which is mocking numpy.
We should use like
var np = new NumPy<int>();
var n = np.arange(3).reshape(3, 1);
User doesn't need to interact with NDArray<T>
@fdncred BenchmarkDoNet really helps us to find performance issue.
@Oceania2018 ah ok got it. I will remove the extension linking class tomorrow - or you do. ;)
@dotChris90 It looks more sense now, ahh.
After I remove the dynamic
type. like:
dynamic npDyn = this;
return NumSharp.Extensions.NDArrayExtensions.ArgMax(npDyn);
I think we should not use dynamic type anymore. Performance killer.
@Oceania2018 I was expecting this - same performance killer like DynamicObject but honestly I had no idea what it is such critical. 10 or 20 times ... wow --> thats why compilers are better than interpreter lol
I will rethink about a strategy for operators. Operators in C# can not! be! extensions - they must be static nonextensions method. example The input and output for a NDArray + operation is clear. NDArray
@dotChris90 before you rewrite it, give them a benchmark first.
remove dynamic
that will be fine.
@dotChris90 @fdncred Continue to speed up 3x more.
Can we improve further?
150 us
.Nice! Great job.
Is there any way to use Span because it's super fast?
Here's another idea. I have not tested this yet but any Enumerable has .AsParallel() method. I wonder if some of the methods would benefit from that?
So it'd be something like this:
public static NDArray<double> ARange(this NDArray<double> np,int stop, int start = 0, int step = 1)
{
int index = 0;
np.Data = Enumerable.Range(start,stop - start)
.Where(x => index++ % step == 0)
.Select(x => (double)x)
.ToArray();
np.Shape = new Shape(new int[] { stop });
return np;
}
Changed to this:
public static NDArray<double> ARange(this NDArray<double> np,int stop, int start = 0, int step = 1)
{
int index = 0;
np.Data = Enumerable.Range(start,stop - start)
.Where(x => index++ % step == 0)
.Select(x => (double)x)
.AsParallel()
.ToArray();
np.Shape = new Shape(new int[] { stop });
return np;
}
Try to avoid using Linq. I used the raw for
.
public static NDArray<int> ARange(this NDArray<int> np,int stop, int start = 0, int step = 1)
{
var list = new int[(int)Math.Ceiling((stop - start + 0.0) / step)];
int index = 0;
for (int i = start; i < stop; i += step)
list[index++] = i;
np.Data = list;
np.Shape = new Shape(list.Length);
return np;
}
@fdncred Nothing to do with Span<T>
.
Yes, I realized that is what you did. I'm suggesting that Linq performance could be improved with .AsParallel()
. And of course, just about any for
loop could be improved with Parallel.For
.
So I suggest we should consider the following benchmark tests. I write them down no matter if already exists.
This are at least the 4 performance checks I would like to see. Please feel free to add more and later can rise issues. 😉
@Oceania2018 @fdncred by the way you 2 ;) - I did just some experiments with casts. I was really surprised that the generic T[] Data property (so a generic array) can not be directly cast to double[] or sth else. An alternative strategy is the boxing - so array.select(x => (double)x) and so on. But I am not a fan of boxing. The most interesting and maybe promising solution is :
double[] A = np.Data as double[];
With this we can avoid using dynamic keyword and cast the whole array without boxing (hope at least - I do not know what "as" keyword is doing in background).
We could also maybe try to bring this into some extension methods since the only alternative is to write every extension method for "NDArray< double >,NDArray< Complex >, ...." so for every possible type (except we have no numerical operation and so maybe can use NDArray< T >).
Boxing kills performance. I've seen articles that talk about performance difference between "As casting" compared to "direct casting" or "is casting".
https://www.danielcrabtree.com/blog/164/c-sharp-7-micro-benchmarking-the-three-ways-to-cast-safely https://stackoverflow.com/questions/496096/casting-vs-using-the-as-keyword-in-the-clr https://buildstarted.com/2014/06/27/casting-in-csharp/
@fdncred thanks for sharing. The articles explains a lot. I was often wondering about the different casting options.
Most import is that pattern matching is really the strategy to go with for us. :)
it's probably safe to close this.
Performance is critical for numerical computation. Import the benchmark tool BenchmarkDotNet to measure and visualize the performance.