Closed: thild closed this issue 2 years ago.
Hi Tony,
At the moment I don't have any plans to implement linalg. I work on this library when I have a bit of free time, of which I have none at the moment. (I can make time to fix bugs in numpy.net if any are reported.)
Linear algebra is not an area of expertise for me, but I believe there are plenty of existing .NET libraries. I don't know how well those would interoperate with numpy.net.
Internally, all implementations of Numpy operate strictly on single-dimension arrays. Various data structures are layered on top to make it look like there are multiple dimensions. That means any attempt to convert to or from a multi-dimensional C# array will involve allocating a new buffer and copying the data; I don't think it is possible to simply cast to an array with different dimensions.
If you pass a 2D array to np.array(), we convert it to a single-dimension array internally but set it up as a 2D ndarray:
ndarray m = np.array(new Int32[,] { { 1, 2 }, { 3, 4 } }, np.Int32);
I think you will need to manually copy the data back into a 2D array. You can get the raw data like this:
Int32[] rawdata = m.AsInt32Array();
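For illustration, a minimal round trip might look like the sketch below. It assumes the buffer comes back in row-major order and that you track the dimensions yourself; NumpyDotNet is used as the namespace name here.

using System;
using NumpyDotNet;   // assumed namespace for this library

class RoundTrip
{
    static void Main()
    {
        // A 2D C# array goes in; internally it is flattened to a 1D buffer.
        ndarray m = np.array(new Int32[,] { { 1, 2 }, { 3, 4 } }, np.Int32);

        // Pull the raw 1D data back out.
        Int32[] rawdata = m.AsInt32Array();

        // Manually copy it back into a 2D array, assuming row-major layout.
        int rows = 2, cols = 2;
        var copy = new Int32[rows, cols];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                copy[r, c] = rawdata[r * cols + c];

        Console.WriteLine(copy[1, 1]);   // 4
    }
}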
Have you investigated the https://github.com/SciSharp/Numpy.NET package? Their web site suggests they support the np.linalg package.
I have experimented with that project and it is not bad. It is not pure .NET, which may be a problem for some applications. I believe they dynamically build a wrapper around the Python API. That gets them a lot of functionality, although I think there are some things they can't handle. It prevents them from being multithreaded, and there is a performance hit going from .NET through Python to the lower C layer and back. Like any software approach, there are tradeoffs.
Hi Kevin,
I was playing with a very simple Python library for reservoir computing (https://github.com/cknd/pyESN) and wondering how simple it would be to port it to C#. I started with https://github.com/SciSharp/NumSharp and https://github.com/SciSharp/Numpy.NET, but both have some caveats. The first seems to have been inactive for months, is incomplete, and does not strictly follow the numpy API. The second, although almost complete, depends on third-party libraries that still don't work on Linux. So, digging through NuGet, I found your project. I managed to port almost all of pyESN.py to C# with minor modifications, but pyESN uses numpy.linalg and I was not able to complete the port.
Porting numpy to C# is a huge task. You are brave. Perhaps in the future it will be possible to compile C/C++ directly to CIL, and that would help with projects like yours. See https://ericsink.com/entries/dotnet_rust.html
If I have any free time, I would like to collaborate with your project.
I really appreciate your time. Thank you.
If you had a good knowledge of linear algebra perhaps together we could replicate the functionality of np.linalg. It has been many years since I was in a math classroom and I don't think I ever studied linear algebra so that is a huge hurdle for me.
Here is a link to what I think is the Python source for np.linalg:
https://github.com/numpy/numpy/blob/master/numpy/linalg/linalg.py. That in itself does not look too bad to port.
I think the SciPy version is quite a bit larger.
However, I think the heavy lifting is done by an Intel C-based library. There may already be some .NET libraries that are equivalent to it. We could also try to interop with the Intel library, but that might make it Windows-only. I have never tried to interop with anything but a Windows DLL, so I don't know what it takes to do that on Linux or Mac.
I think ultimately an expert in linear algebra would need to get involved and at least offer guidance. It may actually be very simple for an expert. I think there is great value in having a pure .NET solution if that can be done.
> If you had a good knowledge of linear algebra perhaps together we could replicate the functionality of np.linalg. It has been many years since I was in a math classroom and I don't think I ever studied linear algebra so that is a huge hurdle for me.
My knowledge of linear algebra is also not the best, but nothing that unit tests cannot work around :)
> Here is a link to what I think is the Python source for np.linalg: https://github.com/numpy/numpy/blob/master/numpy/linalg/linalg.py. That in itself does not look too bad to port. I think the SciPy version is quite a bit larger.
> However, I think the heavy lifting is done by an Intel C-based library. There may already be some .NET libraries that are equivalent to it. We could also try to interop with the Intel library, but that might make it Windows-only. I have never tried to interop with anything but a Windows DLL, so I don't know what it takes to do that on Linux or Mac.
They do interop with a C version of LAPACK lite. MathNet.Numerics has multi-platform bindings to Intel MKL, but they don't have a license for redistribution. Fortunately, Intel publishes NuGet packages: https://www.nuget.org/packages?q=intelmkl. Perhaps that is a starting point.
> I think ultimately an expert in linear algebra would need to get involved and at least offer guidance. It may actually be very simple for an expert. I think there is great value in having a pure .NET solution if that can be done.
It would be absolutely great to have a pure .NET solution, but I think it would be very difficult to match the performance of LAPACK, BLAS, etc. Perhaps the bindings are a better solution.
Regarding performance, my library now exceeds that of the python/C-based version in several key areas (my original releases were noticeably slower). Of course, in a straight single-threaded comparison C# is never going to be as fast as highly optimized C code. However, I was able to employ parallel operations in key areas, which means 2 or more CPU cores are used to run calculations and copy chunks of data. I was not able to parallelize every operation, so some are still slower.
All of which means that if we can parallelize the linear algebra calculations, it may be as fast or faster than the single-threaded C.
Do you think the MathNet.Numerics library covers most of the calculations of the Intel library?
> Regarding performance, my library now exceeds that of the python/C-based version in several key areas (my original releases were noticeably slower). Of course, in a straight single-threaded comparison C# is never going to be as fast as highly optimized C code. However, I was able to employ parallel operations in key areas, which means 2 or more CPU cores are used to run calculations and copy chunks of data. I was not able to parallelize every operation, so some are still slower.
Do you have any benchmark comparison with numpy?
> All of which means that if we can parallelize the linear algebra calculations, it may be as fast or faster than the single-threaded C.
> Do you think the MathNet.Numerics library covers most of the calculations of the Intel library?
As far as I know, MathNet.Numerics has pure .NET implementations of the linear algebra algorithms and wraps MKL as a native back end for its high-performance provider.
A benchmark comparison? Not officially. What I do is write unit tests against large arrays (> 1 GB) in both Python and my library on the same PC. The unit tests use diagnostic timing to record start and finish times. My library often ends up faster.
Is there an official Python benchmark suite? If there is, I might try to port it and check it against mine.
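For reference, the C# side of that comparison is nothing more elaborate than a Stopwatch around the operation under test. A minimal sketch of such a timing harness (not the actual unit tests):

using System;
using System.Diagnostics;

class TimingSketch
{
    // Runs an action once and reports wall-clock time, the same
    // start/finish approach the unit tests use.
    static void Measure(string label, Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
    }

    static void Main()
    {
        // Placeholder work; in the real tests this would be an ndarray
        // operation such as np.dot on a > 1 GB array.
        Measure("sum of 100M doubles", () =>
        {
            double sum = 0;
            for (int i = 0; i < 100_000_000; i++) sum += i;
            Console.WriteLine(sum);
        });
    }
}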
Also, my library is thread safe, meaning multiple operations can be running at the same time in the same application. If your algorithm requires multiple calculations on different ndarray objects, then running those in parallel could be a real boost. Python/Numpy has a major limitation that prevents multiple simultaneous numpy calls: the GIL (global interpreter lock). I forget exactly why they have it, but they have been trying to get rid of that limitation for years without success. It has something to do with Python/numpy reference counting. It is not an issue in .NET because of the garbage collector.
If I manage to port pyESN I will make a comparison between the two implementations. For now my port is 1:1 with the original Python code, without any parallelization.
Hi Kevin,
I put together a repository to compare pyESN vs csESN using your library; see https://github.com/thild/csESN. The results are identical for both implementations, but the C# version is much slower, around 27x. I then ran BenchmarkDotNet with MemoryDiagnoser to create a memory allocation profile. Each run allocated 28.75 GB on average. If I did everything right, I think the allocation is the culprit. I must say that this is the first time I am using BenchmarkDotNet, so I may be misinterpreting the results.
> time python mackey.py
test error:
0.139603909889077
python mackey.py 6,62s user 4,40s system 623% cpu 1,766 total
> time dotnet run -c Release
STRING
{ test error:
0,13960391000736766 }
dotnet run -c Release 181,87s user 38,63s system 303% cpu 1:12,68 total
> sudo dotnet run -c Release
// Validating benchmarks:
// ***** BenchmarkRunner: Start *****
// ***** Found 1 benchmark(s) in total *****
// ***** Building 1 exe(s) in Parallel: Start *****
// start dotnet restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 in /mnt/docs/dev/playground/csESN/csESN/bin/Release/netcoreapp3.1/2c0b91d9-a9ab-4231-8210-05613bed0b26
// command took 2,48s and exited with 0
// start dotnet build -c Release --no-restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 in /mnt/docs/dev/playground/csESN/csESN/bin/Release/netcoreapp3.1/2c0b91d9-a9ab-4231-8210-05613bed0b26
// command took 3,66s and exited with 0
// ***** Done, took 00:00:06 (6.26 sec) *****
// Found 1 benchmarks:
// ESNBenchmark.Run: DefaultJob
// **************************
// Benchmark: ESNBenchmark.Run: DefaultJob
// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "2c0b91d9-a9ab-4231-8210-05613bed0b26.dll" --benchmarkName "csESN.ESNBenchmark.Run" --job "Default" --benchmarkId 0 in /mnt/docs/dev/playground/csESN/csESN/bin/Release/netcoreapp3.1/2c0b91d9-a9ab-4231-8210-05613bed0b26/bin/Release/netcoreapp3.1
// BeforeAnythingElse
// Benchmark Process Environment Information:
// Runtime=.NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
// GC=Concurrent Workstation
// Job: DefaultJob
OverheadJitting 1: 1 op, 309091.00 ns, 309.0910 us/op
STRING
{ test error:
0,13960391000736766 }
WorkloadJitting 1: 1 op, 69972007171.00 ns, 69.9720 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadWarmup 1: 1 op, 68903214677.00 ns, 68.9032 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadWarmup 2: 1 op, 69065764255.00 ns, 69.0658 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadWarmup 3: 1 op, 69012457131.00 ns, 69.0125 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadWarmup 4: 1 op, 69167306448.00 ns, 69.1673 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadWarmup 5: 1 op, 69074680326.00 ns, 69.0747 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadWarmup 6: 1 op, 69097535779.00 ns, 69.0975 s/op
// BeforeActualRun
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 1: 1 op, 83252535854.00 ns, 83.2525 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 2: 1 op, 70054739045.00 ns, 70.0547 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 3: 1 op, 76028864017.00 ns, 76.0289 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 4: 1 op, 77578582333.00 ns, 77.5786 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 5: 1 op, 82379405893.00 ns, 82.3794 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 6: 1 op, 71258021296.00 ns, 71.2580 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 7: 1 op, 82281872768.00 ns, 82.2819 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 8: 1 op, 89614463920.00 ns, 89.6145 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 9: 1 op, 73694150080.00 ns, 73.6942 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 10: 1 op, 79296846165.00 ns, 79.2968 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 11: 1 op, 80639681482.00 ns, 80.6397 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 12: 1 op, 85159746440.00 ns, 85.1597 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 13: 1 op, 80343625558.00 ns, 80.3436 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 14: 1 op, 88973886969.00 ns, 88.9739 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 15: 1 op, 75638625886.00 ns, 75.6386 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 16: 1 op, 90347629038.00 ns, 90.3476 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 17: 1 op, 82542657428.00 ns, 82.5427 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 18: 1 op, 88240114769.00 ns, 88.2401 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 19: 1 op, 78254100026.00 ns, 78.2541 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 20: 1 op, 80485850510.00 ns, 80.4859 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 21: 1 op, 81220587422.00 ns, 81.2206 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 22: 1 op, 99042498969.00 ns, 99.0425 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 23: 1 op, 78255720088.00 ns, 78.2557 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 24: 1 op, 83473585580.00 ns, 83.4736 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 25: 1 op, 90375833126.00 ns, 90.3758 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 26: 1 op, 81609486288.00 ns, 81.6095 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 27: 1 op, 81584916424.00 ns, 81.5849 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 28: 1 op, 94760079642.00 ns, 94.7601 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 29: 1 op, 78077500102.00 ns, 78.0775 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 30: 1 op, 69366160013.00 ns, 69.3662 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 31: 1 op, 82065480738.00 ns, 82.0655 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 32: 1 op, 76367593154.00 ns, 76.3676 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 33: 1 op, 75870250383.00 ns, 75.8703 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 34: 1 op, 86935023053.00 ns, 86.9350 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 35: 1 op, 105256100142.00 ns, 105.2561 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 36: 1 op, 73198904418.00 ns, 73.1989 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 37: 1 op, 88738336488.00 ns, 88.7383 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 38: 1 op, 78501978072.00 ns, 78.5020 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 39: 1 op, 85945894610.00 ns, 85.9459 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 40: 1 op, 101822870577.00 ns, 101.8229 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 41: 1 op, 85503128641.00 ns, 85.5031 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 42: 1 op, 87070674983.00 ns, 87.0707 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 43: 1 op, 88411811037.00 ns, 88.4118 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 44: 1 op, 90806273272.00 ns, 90.8063 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 45: 1 op, 82742509687.00 ns, 82.7425 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 46: 1 op, 74781588362.00 ns, 74.7816 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 47: 1 op, 71830353024.00 ns, 71.8304 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 48: 1 op, 92980971256.00 ns, 92.9810 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 49: 1 op, 81619856567.00 ns, 81.6199 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 50: 1 op, 73959015250.00 ns, 73.9590 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 51: 1 op, 77715044354.00 ns, 77.7150 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 52: 1 op, 78363413862.00 ns, 78.3634 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 53: 1 op, 91221867878.00 ns, 91.2219 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 54: 1 op, 81627798262.00 ns, 81.6278 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 55: 1 op, 79491699473.00 ns, 79.4917 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 56: 1 op, 81558240905.00 ns, 81.5582 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 57: 1 op, 83182206436.00 ns, 83.1822 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 58: 1 op, 84858053509.00 ns, 84.8581 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 59: 1 op, 73621861014.00 ns, 73.6219 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 60: 1 op, 85223562070.00 ns, 85.2236 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 61: 1 op, 84956313958.00 ns, 84.9563 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 62: 1 op, 71229230574.00 ns, 71.2292 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 63: 1 op, 76506494633.00 ns, 76.5065 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 64: 1 op, 77915554130.00 ns, 77.9156 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 65: 1 op, 88787871615.00 ns, 88.7879 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 66: 1 op, 79374401963.00 ns, 79.3744 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 67: 1 op, 73816006873.00 ns, 73.8160 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 68: 1 op, 85156861744.00 ns, 85.1569 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 69: 1 op, 77265495579.00 ns, 77.2655 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 70: 1 op, 77661107035.00 ns, 77.6611 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 71: 1 op, 87645060322.00 ns, 87.6451 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 72: 1 op, 72009081106.00 ns, 72.0091 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 73: 1 op, 80773937371.00 ns, 80.7739 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 74: 1 op, 84717968833.00 ns, 84.7180 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 75: 1 op, 68795626813.00 ns, 68.7956 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 76: 1 op, 79849592722.00 ns, 79.8496 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 77: 1 op, 80392011852.00 ns, 80.3920 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 78: 1 op, 72938952521.00 ns, 72.9390 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 79: 1 op, 82612507066.00 ns, 82.6125 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 80: 1 op, 84044721969.00 ns, 84.0447 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 81: 1 op, 79932930173.00 ns, 79.9329 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 82: 1 op, 82266084800.00 ns, 82.2661 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 83: 1 op, 77732191869.00 ns, 77.7322 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 84: 1 op, 79970313960.00 ns, 79.9703 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 85: 1 op, 87631082026.00 ns, 87.6311 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 86: 1 op, 73769568904.00 ns, 73.7696 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 87: 1 op, 76062322705.00 ns, 76.0623 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 88: 1 op, 78102747395.00 ns, 78.1027 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 89: 1 op, 71616543760.00 ns, 71.6165 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 90: 1 op, 81849260110.00 ns, 81.8493 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 91: 1 op, 82435768438.00 ns, 82.4358 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 92: 1 op, 79614189905.00 ns, 79.6142 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 93: 1 op, 78789572423.00 ns, 78.7896 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 94: 1 op, 76119056157.00 ns, 76.1191 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 95: 1 op, 82825273297.00 ns, 82.8253 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 96: 1 op, 80882977550.00 ns, 80.8830 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 97: 1 op, 82922992524.00 ns, 82.9230 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 98: 1 op, 75901275186.00 ns, 75.9013 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 99: 1 op, 76898547925.00 ns, 76.8985 s/op
STRING
{ test error:
0,13960391000736766 }
WorkloadActual 100: 1 op, 85087254372.00 ns, 85.0873 s/op
// AfterActualRun
STRING
{ test error:
0,13960391000736766 }
WorkloadResult 1: 1 op, 83252535854.00 ns, 83.2525 s/op
WorkloadResult 2: 1 op, 70054739045.00 ns, 70.0547 s/op
WorkloadResult 3: 1 op, 76028864017.00 ns, 76.0289 s/op
WorkloadResult 4: 1 op, 77578582333.00 ns, 77.5786 s/op
WorkloadResult 5: 1 op, 82379405893.00 ns, 82.3794 s/op
WorkloadResult 6: 1 op, 71258021296.00 ns, 71.2580 s/op
WorkloadResult 7: 1 op, 82281872768.00 ns, 82.2819 s/op
WorkloadResult 8: 1 op, 89614463920.00 ns, 89.6145 s/op
WorkloadResult 9: 1 op, 73694150080.00 ns, 73.6942 s/op
WorkloadResult 10: 1 op, 79296846165.00 ns, 79.2968 s/op
WorkloadResult 11: 1 op, 80639681482.00 ns, 80.6397 s/op
WorkloadResult 12: 1 op, 85159746440.00 ns, 85.1597 s/op
WorkloadResult 13: 1 op, 80343625558.00 ns, 80.3436 s/op
WorkloadResult 14: 1 op, 88973886969.00 ns, 88.9739 s/op
WorkloadResult 15: 1 op, 75638625886.00 ns, 75.6386 s/op
WorkloadResult 16: 1 op, 90347629038.00 ns, 90.3476 s/op
WorkloadResult 17: 1 op, 82542657428.00 ns, 82.5427 s/op
WorkloadResult 18: 1 op, 88240114769.00 ns, 88.2401 s/op
WorkloadResult 19: 1 op, 78254100026.00 ns, 78.2541 s/op
WorkloadResult 20: 1 op, 80485850510.00 ns, 80.4859 s/op
WorkloadResult 21: 1 op, 81220587422.00 ns, 81.2206 s/op
WorkloadResult 22: 1 op, 78255720088.00 ns, 78.2557 s/op
WorkloadResult 23: 1 op, 83473585580.00 ns, 83.4736 s/op
WorkloadResult 24: 1 op, 90375833126.00 ns, 90.3758 s/op
WorkloadResult 25: 1 op, 81609486288.00 ns, 81.6095 s/op
WorkloadResult 26: 1 op, 81584916424.00 ns, 81.5849 s/op
WorkloadResult 27: 1 op, 94760079642.00 ns, 94.7601 s/op
WorkloadResult 28: 1 op, 78077500102.00 ns, 78.0775 s/op
WorkloadResult 29: 1 op, 69366160013.00 ns, 69.3662 s/op
WorkloadResult 30: 1 op, 82065480738.00 ns, 82.0655 s/op
WorkloadResult 31: 1 op, 76367593154.00 ns, 76.3676 s/op
WorkloadResult 32: 1 op, 75870250383.00 ns, 75.8703 s/op
WorkloadResult 33: 1 op, 86935023053.00 ns, 86.9350 s/op
WorkloadResult 34: 1 op, 73198904418.00 ns, 73.1989 s/op
WorkloadResult 35: 1 op, 88738336488.00 ns, 88.7383 s/op
WorkloadResult 36: 1 op, 78501978072.00 ns, 78.5020 s/op
WorkloadResult 37: 1 op, 85945894610.00 ns, 85.9459 s/op
WorkloadResult 38: 1 op, 85503128641.00 ns, 85.5031 s/op
WorkloadResult 39: 1 op, 87070674983.00 ns, 87.0707 s/op
WorkloadResult 40: 1 op, 88411811037.00 ns, 88.4118 s/op
WorkloadResult 41: 1 op, 90806273272.00 ns, 90.8063 s/op
WorkloadResult 42: 1 op, 82742509687.00 ns, 82.7425 s/op
WorkloadResult 43: 1 op, 74781588362.00 ns, 74.7816 s/op
WorkloadResult 44: 1 op, 71830353024.00 ns, 71.8304 s/op
WorkloadResult 45: 1 op, 92980971256.00 ns, 92.9810 s/op
WorkloadResult 46: 1 op, 81619856567.00 ns, 81.6199 s/op
WorkloadResult 47: 1 op, 73959015250.00 ns, 73.9590 s/op
WorkloadResult 48: 1 op, 77715044354.00 ns, 77.7150 s/op
WorkloadResult 49: 1 op, 78363413862.00 ns, 78.3634 s/op
WorkloadResult 50: 1 op, 91221867878.00 ns, 91.2219 s/op
WorkloadResult 51: 1 op, 81627798262.00 ns, 81.6278 s/op
WorkloadResult 52: 1 op, 79491699473.00 ns, 79.4917 s/op
WorkloadResult 53: 1 op, 81558240905.00 ns, 81.5582 s/op
WorkloadResult 54: 1 op, 83182206436.00 ns, 83.1822 s/op
WorkloadResult 55: 1 op, 84858053509.00 ns, 84.8581 s/op
WorkloadResult 56: 1 op, 73621861014.00 ns, 73.6219 s/op
WorkloadResult 57: 1 op, 85223562070.00 ns, 85.2236 s/op
WorkloadResult 58: 1 op, 84956313958.00 ns, 84.9563 s/op
WorkloadResult 59: 1 op, 71229230574.00 ns, 71.2292 s/op
WorkloadResult 60: 1 op, 76506494633.00 ns, 76.5065 s/op
WorkloadResult 61: 1 op, 77915554130.00 ns, 77.9156 s/op
WorkloadResult 62: 1 op, 88787871615.00 ns, 88.7879 s/op
WorkloadResult 63: 1 op, 79374401963.00 ns, 79.3744 s/op
WorkloadResult 64: 1 op, 73816006873.00 ns, 73.8160 s/op
WorkloadResult 65: 1 op, 85156861744.00 ns, 85.1569 s/op
WorkloadResult 66: 1 op, 77265495579.00 ns, 77.2655 s/op
WorkloadResult 67: 1 op, 77661107035.00 ns, 77.6611 s/op
WorkloadResult 68: 1 op, 87645060322.00 ns, 87.6451 s/op
WorkloadResult 69: 1 op, 72009081106.00 ns, 72.0091 s/op
WorkloadResult 70: 1 op, 80773937371.00 ns, 80.7739 s/op
WorkloadResult 71: 1 op, 84717968833.00 ns, 84.7180 s/op
WorkloadResult 72: 1 op, 68795626813.00 ns, 68.7956 s/op
WorkloadResult 73: 1 op, 79849592722.00 ns, 79.8496 s/op
WorkloadResult 74: 1 op, 80392011852.00 ns, 80.3920 s/op
WorkloadResult 75: 1 op, 72938952521.00 ns, 72.9390 s/op
WorkloadResult 76: 1 op, 82612507066.00 ns, 82.6125 s/op
WorkloadResult 77: 1 op, 84044721969.00 ns, 84.0447 s/op
WorkloadResult 78: 1 op, 79932930173.00 ns, 79.9329 s/op
WorkloadResult 79: 1 op, 82266084800.00 ns, 82.2661 s/op
WorkloadResult 80: 1 op, 77732191869.00 ns, 77.7322 s/op
WorkloadResult 81: 1 op, 79970313960.00 ns, 79.9703 s/op
WorkloadResult 82: 1 op, 87631082026.00 ns, 87.6311 s/op
WorkloadResult 83: 1 op, 73769568904.00 ns, 73.7696 s/op
WorkloadResult 84: 1 op, 76062322705.00 ns, 76.0623 s/op
WorkloadResult 85: 1 op, 78102747395.00 ns, 78.1027 s/op
WorkloadResult 86: 1 op, 71616543760.00 ns, 71.6165 s/op
WorkloadResult 87: 1 op, 81849260110.00 ns, 81.8493 s/op
WorkloadResult 88: 1 op, 82435768438.00 ns, 82.4358 s/op
WorkloadResult 89: 1 op, 79614189905.00 ns, 79.6142 s/op
WorkloadResult 90: 1 op, 78789572423.00 ns, 78.7896 s/op
WorkloadResult 91: 1 op, 76119056157.00 ns, 76.1191 s/op
WorkloadResult 92: 1 op, 82825273297.00 ns, 82.8253 s/op
WorkloadResult 93: 1 op, 80882977550.00 ns, 80.8830 s/op
WorkloadResult 94: 1 op, 82922992524.00 ns, 82.9230 s/op
WorkloadResult 95: 1 op, 75901275186.00 ns, 75.9013 s/op
WorkloadResult 96: 1 op, 76898547925.00 ns, 76.8985 s/op
WorkloadResult 97: 1 op, 85087254372.00 ns, 85.0873 s/op
GC: 9852 686 2 30868712552 1
Threading: 8409049 347 1
// AfterAll
// Benchmark Process 32642 has exited with code 0
Mean = 80.679 s, StdErr = 0.573 s (0.71%), N = 97, StdDev = 5.639 s
Min = 68.796 s, Q1 = 76.899 s, Median = 80.640 s, Q3 = 84.718 s, Max = 94.760 s
IQR = 7.819 s, LowerFence = 65.169 s, UpperFence = 96.447 s
ConfidenceInterval = [78.735 s; 82.622 s] (CI 99.9%), Margin = 1.944 s (2.41% of Mean)
Skewness = 0.12, Kurtosis = 2.54, MValue = 2
// ***** BenchmarkRunner: Finish *****
// * Export *
BenchmarkDotNet.Artifacts/results/csESN.ESNBenchmark-report.csv
BenchmarkDotNet.Artifacts/results/csESN.ESNBenchmark-report-github.md
BenchmarkDotNet.Artifacts/results/csESN.ESNBenchmark-report.html
// * Detailed results *
ESNBenchmark.Run: DefaultJob
Runtime = .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT; GC = Concurrent Workstation
Mean = 80.679 s, StdErr = 0.573 s (0.71%), N = 97, StdDev = 5.639 s
Min = 68.796 s, Q1 = 76.899 s, Median = 80.640 s, Q3 = 84.718 s, Max = 94.760 s
IQR = 7.819 s, LowerFence = 65.169 s, UpperFence = 96.447 s
ConfidenceInterval = [78.735 s; 82.622 s] (CI 99.9%), Margin = 1.944 s (2.41% of Mean)
Skewness = 0.12, Kurtosis = 2.54, MValue = 2
-------------------- Histogram --------------------
[67.185 s ; 70.983 s) | @@@
[70.983 s ; 74.205 s) | @@@@@@@@@@@@
[74.205 s ; 78.825 s) | @@@@@@@@@@@@@@@@@@@@@@@
[78.825 s ; 82.997 s) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[82.997 s ; 87.608 s) | @@@@@@@@@@@@@@@
[87.608 s ; 90.829 s) | @@@@@@@@@@@
[90.829 s ; 93.712 s) | @@
[93.712 s ; 96.371 s) | @
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.12.1, OS=opensuse-tumbleweed 20200714
Intel Core i7-3630QM CPU 2.40GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.401
[Host] : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
DefaultJob : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |--------:|--------:|--------:|-------------:|------------:|----------:|----------:|
| Run | 80.68 s | 1.944 s | 5.639 s | 9852000.0000 | 686000.0000 | 2000.0000 | 28.75 GB |
// * Hints *
Outliers
ESNBenchmark.Run: Default -> 3 outliers were removed (99.04 s..105.26 s)
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Gen 0 : GC Generation 0 collects per 1000 operations
Gen 1 : GC Generation 1 collects per 1000 operations
Gen 2 : GC Generation 2 collects per 1000 operations
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 s : 1 Second (1 sec)
// * Diagnostic Output - MemoryDiagnoser *
// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 02:24:59 (8699 sec), executed benchmarks: 1
Global total time: 02:25:05 (8705.27 sec), executed benchmarks: 1
// * Artifacts cleanup *
27x doesn't seem very good at all.
I'll take a look at it and see if I can determine where we are using up time.
I am very happy to hear that you are getting the same results. You have a fairly complex application there.
The quick analysis is that you are calling np.dot many thousands of times on arrays with dimensions of 2000x1. The parallelization technique I use for np.dot operates on the second dimension, which in the case of a dimension of 1 provides no boost.
In a quick unit test I wrote to compare Python to C#, Python executes np.dot in 23 ms and C# in 414 ms. I believe this accounts for the vast majority of the speed difference.
If this is a real project and you need to make it as fast as possible, we can probably catch up a little bit by parallelizing some of the application layer. For example, in this code, a, b and c can be calculated in separate threads and then combined together when all are completed.
if (this.teacher_forcing)
{
var a = np.dot(this.W, state);
var b = np.dot(this.W_in, input_pattern);
var c = np.dot(this.W_feedb, output_pattern);
preactivation = a + b + c;
}
Also, this loop could maybe be done in parallel; I think the value of n keeps the data separate (a rough sketch follows the link below).
foreach (var n in Enumerable.Range(0, (int)n_samples))
{
states[n + 1, ":"] = this._update((ndarray)states[n, ":"],
(ndarray)inputs[n + 1, ":"], (ndarray)outputs[n, ":"]);
outputs[n + 1, ":"] =
this.out_activation(np.dot(this.W_out, np.concatenate((states[n + 1, ":"], inputs[n + 1, ":"]))));
}
https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=netcore-3.1
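A minimal, runnable illustration of that Parallel.ForEach shape (the body here is placeholder work; whether the real loop's states[n] to states[n + 1] dependency actually allows this is exactly what would need checking):

using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelLoopSketch
{
    static void Main()
    {
        int n_samples = 1000;
        var results = new double[n_samples];

        // Same foreach-over-Enumerable.Range shape as the loop above,
        // but parallel. Only valid when the iterations are independent.
        Parallel.ForEach(Enumerable.Range(0, n_samples), n =>
        {
            results[n] = Math.Sqrt(n);   // placeholder independent work
        });

        Console.WriteLine(results[n_samples - 1]);
    }
}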
If this library becomes a heavily used resource and performance becomes critical, it could be possible to break certain parts of the computation engine off into a separate library. Applications that need pure .NET would load that library, while applications that need maximum performance could load an optimized C implementation.
By running this function in parallel as below, I was able to cut the processing time by almost 50%.
private ndarray _update(ndarray state, ndarray input_pattern, ndarray output_pattern)
{
// """performs one update step.
// i.e., computes the next network state by applying the recurrent weights
// to the last state & and feeding in the current input and output patterns
// """
ndarray preactivation;
if (this.teacher_forcing)
{
ndarray a = null;
ndarray b = null;
ndarray c = null;
if (Program.UseParallel == false)
{
a = np.dot(this.W, state);
b = np.dot(this.W_in, input_pattern);
c = np.dot(this.W_feedb, output_pattern);
}
else
{
var TaskA = Task.Run(() => a = np.dot(this.W, state));
var TaskB = Task.Run(() => b = np.dot(this.W_in, input_pattern));
var TaskC = Task.Run(() => c = np.dot(this.W_feedb, output_pattern));
Task.WaitAll(TaskA, TaskB, TaskC);
}
preactivation = a + b + c;
}
else
{
preactivation = (np.dot(this.W, state)
+ np.dot(this.W_in, input_pattern));
}
return (np.tanh(preactivation)
+ this.noise * (random_state_.rand(new shape(this.n_reservoir)) - 0.5));
}
You can get good performance even in pure CIL; the new RyuJIT generates excellent native code. Are the elements of ndarray stored as object? Perhaps it is the boxing/unboxing from object that is hurting performance; that would explain all the heap allocations. Would generics help here? While parallelization can increase overall performance, I think it goes against the numpy philosophy of providing the best performance out of the box. Many numpy users are not skilled programmers and will find it difficult to parallelize their code.
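To make the boxing question concrete, here is a small sketch of the difference (nothing here is the library's actual storage, just general .NET behavior): assigning a value type through object boxes it, one heap allocation per element, whereas a typed array, or a generic T[] with a struct constraint, stores the values inline.

class BoxingSketch
{
    static void Main()
    {
        // Each assignment boxes the double: a small heap allocation
        // per element that the GC later has to collect.
        object[] boxed = new object[1_000_000];
        for (int i = 0; i < boxed.Length; i++)
            boxed[i] = (double)i;

        // A typed array keeps the values inline in one contiguous
        // allocation, with no per-element boxing at all.
        double[] unboxed = new double[1_000_000];
        for (int i = 0; i < unboxed.Length; i++)
            unboxed[i] = i;

        System.Console.WriteLine(boxed.Length + unboxed.Length);
    }
}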
Which numpy version did you use for the port? Do you stay in sync with current versions? Could you briefly explain the internals of the library: how it works, how objects are stored, what data structures are used? That might encourage more people to collaborate.
I think I ported off of numpy v 1.16 which might be one behind now. There is no performance difference in the latest version.
The real performance issues are:
1) The numpy C code uses pointers and pointer arithmetic everywhere. Internally, everything is allocated as a malloc'd block of memory and then cast to the desired data type, so they can have very tight routines that run the calculations, and there is no range checking in C. The C# code allocates an array of the explicit type and then has to index into it to get each value. My code is also very tight.
2) The numpy C code uses a "strided" array system, where a data structure keeps track of the offsets into the next column/row of the dimensioned array. They have some elaborate iterator systems that walk the code through the array. These iterators are implemented as complex macros. C# does not have macros, so I had to implement those as functions. The functions are marked for inlining, but I am not sure the compiler is granting that because they are large.
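As a rough illustration of the strided idea (not the library's actual data structures): element (i, j) of a row-major array lives at offset i*strideRow + j*strideCol in the flat buffer, and an iterator walks the array by repeatedly adding strides instead of recomputing indices.

using System;

class StridedSketch
{
    static void Main()
    {
        // A 3x4 "2D" array stored as a flat row-major buffer.
        int rows = 3, cols = 4;
        var data = new double[rows * cols];

        // Strides: how many elements to skip to advance one step
        // along each dimension (numpy itself stores strides in bytes).
        int strideRow = cols;   // move down one row
        int strideCol = 1;      // move right one column

        // Element (2, 3) lives at offset 2*strideRow + 3*strideCol.
        data[2 * strideRow + 3 * strideCol] = 42.0;

        // A strided iterator walks row 2 by adding strideCol each step.
        double sum = 0;
        int offset = 2 * strideRow;
        for (int c = 0; c < cols; c++, offset += strideCol)
            sum += data[offset];

        Console.WriteLine(sum);   // 42
    }
}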
> You can get good performance even in pure CIL; the new RyuJIT generates excellent native code. Are the elements of ndarray stored as object? Perhaps it is the boxing/unboxing from object that is hurting performance; that would explain all the heap allocations. Would generics help here? While parallelization can increase overall performance, I think it goes against the numpy philosophy of providing the best performance out of the box. Many numpy users are not skilled programmers and will find it difficult to parallelize their code.
ndarray is a class object. It contains a reference to an allocated System.Array object. I also use a VoidPtr class to manage what in the C-based code are pointer operations. I went to great lengths to minimize the number of these that get allocated.
Can your tool identify where/what the heap allocations are? It is very possible that I missed a place that can be optimized. Do you need to build with the source code to get that information?
You are calling np.dot thousands of times in your calculations. If we allocate (rough guess) 10 total objects for each call to np.dot then you will have a massive number of allocations.
This is a test app that demonstrates where the difference in performance comes from. I have C++ and C# apps, each derived from the real-world implementation of np.dot. I think you will see that the C++ code is about 4 times faster than the C# code. If you can make the C# code faster, that would be amazing.
Hi Kevin,
I monitored the GC using dotnet-counters and I think the problem is allocations. For small arrays the GC pressure is small, but when the arrays get bigger, Gen 2 kicks in and things get pretty ugly.
I didn't dig into your code. Can you tell me how slices work in your code? In numpy, all slices are just views, without copying.
I'll take a look at the test app.
> Can your tool identify where/what the heap allocations are? It is very possible that I missed a place that can be optimized. Do you need to build with the source code to get that information?
Not really. You can use some of the diagnostic tools for dotnet: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/
> You are calling np.dot thousands of times in your calculations. If we allocate (rough guess) 10 total objects for each call to np.dot then you will have a massive number of allocations.
It depends on how you are allocating and slicing. I noticed that you are not using Span. Have you taken a look at this? Perhaps it can help reduce allocations: https://adamsitnik.com/Array-Pool/ https://adamsitnik.com/Span/
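A minimal sketch of the ArrayPool/Span idea (independent of the library's internals): rent scratch buffers instead of allocating a fresh array for every intermediate result, and use Span<T> to work on a slice of the buffer without copying.

using System;
using System.Buffers;

class PoolSketch
{
    static void Main()
    {
        // Rent a scratch buffer instead of new double[n] on every call;
        // repeated temporaries then stop adding GC pressure.
        double[] buffer = ArrayPool<double>.Shared.Rent(4096);
        try
        {
            // Span is a view over part of the buffer; no copy is made.
            Span<double> slice = buffer.AsSpan(0, 1000);
            for (int i = 0; i < slice.Length; i++)
                slice[i] = i * 0.5;

            Console.WriteLine(slice[999]);
        }
        finally
        {
            // Return the buffer so the next caller can reuse it.
            ArrayPool<double>.Shared.Return(buffer);
        }
    }
}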
> Hi Kevin,
> I monitored the GC using dotnet-counters and I think the problem is allocations. For small arrays the GC pressure is small, but when the arrays get bigger, Gen 2 kicks in and things get pretty ugly.
> I didn't dig into your code. Can you tell me how slices work in your code? In numpy, all slices are just views, without copying.
> I'll take a look at the test app.
Yes, slices are always just views. The library builds a series of data structures that maintain access to the views. My code works exactly like the python/C code; I ported that code base.
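For example, using the string-slice indexer that appears earlier in this thread (the scalar write through the view is an assumption about the indexer, so treat this as a sketch): because the slice is a view over the same buffer, a write through it should be visible in the parent array.

// Sketch only; assumes the indexer accepts a scalar write.
ndarray m = np.array(new Int32[,] { { 1, 2 }, { 3, 4 } }, np.Int32);
ndarray row0 = (ndarray)m[0, ":"];   // a view of the first row, not a copy
row0[0] = 99;                        // should now also be visible as m[0, 0] == 99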
Nice. I will try to inspect where the allocations are taking place. When I instantiate a new ndarray, in which class is the underlying container allocated?
\src\NumpyDotNet\NumpyDotNet\ndarray.cs
\src\NumpyDotNet\NumpyLib\npy_arrayobject.cs
\src\NumpyDotNet\NumpyLib\npy_api.cs (VoidPtr)
Using dotnet-counters for csESN:
About 330 MB of allocations per second. (In Visual Studio you can see the allocations broken down by object.) Gen 0 is collected 111 times per second. I think the first thing to do, before parallelizing, is to optimize the memory usage.
(P.S. assuming I interpreted these numbers correctly)
[System.Runtime]
% Time in GC since last GC (%) 3
Allocation Rate / 1 sec (B) 346.075.760
CPU Usage (%) 35
Exception Count / 1 sec 0
GC Heap Size (MB) 14
Gen 0 GC Count / 1 sec 111
Gen 0 Size (B) 142.296
Gen 1 GC Count / 1 sec 0
Gen 1 Size (B) 309.032
Gen 2 GC Count / 1 sec 0
Gen 2 Size (B) 1.318.128
LOH Size (B) 32.323.480
Monitor Lock Contention Count / 1 sec 0
Number of Active Timers 0
Hi,
Great project.
Do you have plans to implement linalg?
I'm trying to port some Python code to C# and I need a pseudo-inverse matrix.
https://numerics.mathdotnet.com/ has a linear algebra package, but I can't figure out how to interop ndarray with its Matrix type.
How can I export an ndarray to a vanilla C# 2D array?
Cheers
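One possible bridge to MathNet.Numerics for the pseudo-inverse, sketched under two assumptions: that the flat buffer can be read out of the ndarray (an AsDoubleArray() accessor analogous to the AsInt32Array() mentioned above) and that the buffer is row-major.

using MathNet.Numerics.LinearAlgebra;
using NumpyDotNet;   // assumed namespace

class PinvSketch
{
    static void Main()
    {
        // A 2x3 ndarray whose pseudo-inverse we want (the exact
        // np.array overload for doubles is assumed here).
        ndarray a = np.array(new double[,] { { 1, 2, 3 }, { 4, 5, 6 } });

        // Assumption: AsDoubleArray() returns the flat row-major buffer.
        double[] flat = a.AsDoubleArray();
        int rows = 2, cols = 3;

        // MathNet builds a dense matrix from row-major data and can
        // compute the Moore-Penrose pseudo-inverse directly.
        Matrix<double> m = Matrix<double>.Build.DenseOfRowMajor(rows, cols, flat);
        Matrix<double> pinv = m.PseudoInverse();

        // pinv.ToRowMajorArray() gives a flat buffer that can be copied
        // back into an ndarray or a vanilla C# 2D array as needed.
        System.Console.WriteLine(pinv);
    }
}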