Quansight-Labs / numpy.net

A port of NumPy to .Net
BSD 3-Clause "New" or "Revised" License

Core #6

Open thild opened 3 years ago

thild commented 3 years ago

Hi Kevin,

To test on Linux, I had to update the projects to .NET Core 3.1. I had to disable some tests, and others are still breaking. If this update does not break your dev environment, you may want to consider applying it.

KevinBaselinesw commented 3 years ago

I apologize for the delay; I had a very busy week.

I spent time today analyzing the performance of your app and the call to np.dot. I see that we are allocating huge amounts of memory, but I think it is all legitimate. Your code calls np.dot in a loop on 2000×2000×8-byte buffers (32 MB each). Each call to np.dot allocates one or more buffers of that size to get its work done. It adds up fast.
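For illustration, here is a minimal sketch of that pattern, assuming NumpyDotNet's np.zeros/np.dot signatures; the array sizes are from your test case, but the loop itself is hypothetical:

```csharp
// Sketch only: the allocation pattern described above. Each np.dot call on
// 2000x2000 float64 arrays produces a fresh ~32 MB result buffer, so a tight
// loop churns through gigabytes of short-lived allocations.
using NumpyDotNet;

var a = np.zeros(new shape(2000, 2000), dtype: np.Float64);
var b = np.zeros(new shape(2000, 2000), dtype: np.Float64);
for (int i = 0; i < 100; i++)
{
    var c = np.dot(a, b);   // allocates a new 2000*2000*8-byte output each call
}
```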

I also ran performance analysis software on the code to see where we are taking the most time. It turns out that my old friend NpyArray_ITER_NEXT, plus the other code I shared with you earlier, is taking most of the time. As a "strided" system, numpy creates data structures that map "views" into the allocated arrays. When operations are performed on these arrays, each element needs to run through NpyArray_ITER_NEXT so that the correct offset into the buffer is calculated based on the mapped views.

In the C code, this is a macro, which means the compiler inserts the code directly into the calling code, allowing optimal performance. As you may know, C# does not support macros, so I had to port those macros to C# functions, which makes them much slower than C. I did mark these functions for aggressive inlining, but I am not sure the compiler is agreeing to do that because the functions are quite large.
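To make that concrete, here is a rough, hypothetical sketch of a strided "advance to the next element" helper as a C# method with the inlining hint; this is not the library's actual NpyArray_ITER_NEXT, just the shape of the problem:

```csharp
// Hypothetical sketch of a strided-iterator "next" step as a C# method.
// In C this logic is a macro expanded in place; here it is a call, and the
// AggressiveInlining hint is only a request the JIT may ignore for large bodies.
using System.Runtime.CompilerServices;

static class IterSketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void IterNext(int[] coords, int[] dims, long[] strides, ref long offset)
    {
        for (int axis = coords.Length - 1; axis >= 0; axis--)
        {
            if (++coords[axis] < dims[axis])
            {
                offset += strides[axis];    // step within this axis
                return;
            }
            coords[axis] = 0;               // wrap this axis, carry to the next
            offset -= strides[axis] * (dims[axis] - 1);
        }
    }
}
```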

I did find a way to run these iteration loops in parallel for some of the operations, which lets me beat the C code in some situations. When I am not able to use parallel processing, this code path is slower, which is what we are seeing with your np.dot calls.
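As a sketch of the technique (not the actual code path), the element-wise form looks like this; with only one usable thread, the partitioning overhead makes it a net loss:

```csharp
// Minimal sketch: parallelizing an element-wise operation with Parallel.For.
// With one usable core the task-scheduling overhead makes this slower than a
// plain loop, which matches the np.dot behavior described above.
using System.Threading.Tasks;

double[] src = new double[2000 * 2000];
double[] dst = new double[src.Length];
Parallel.For(0, src.Length, i =>
{
    dst[i] = src[i] * 2.0;  // stand-in for the real per-element work
});
```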

At some point, I/we/someone should look at calling into a C DLL to perform some of these heavy processing functions. This may be necessary to make this tool really competitive. If we can process in parallel AND use C to perform the calculations, it may end up being much faster than the original Python code.
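A hedged sketch of what that binding could look like via P/Invoke; "npdotnative" and native_dot are made-up names here, no such DLL exists in the repo:

```csharp
// Hypothetical P/Invoke binding: push the hot inner loop into a native C
// library. "npdotnative" and native_dot are illustrative names only.
using System.Runtime.InteropServices;

static class NativeSketch
{
    [DllImport("npdotnative", CallingConvention = CallingConvention.Cdecl)]
    public static extern void native_dot(
        double[] a, double[] b, double[] result, int rows, int cols);
}
```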

Next up I will try to convert the sample apps and unit tests to .NET Core so they can be run on Linux.

thild commented 3 years ago

Hi Kevin,

> I apologize for the delay; I had a very busy week.

No problem.

> I spent time today analyzing the performance of your app and the call to np.dot. I see that we are allocating huge amounts of memory, but I think it is all legitimate. Your code calls np.dot in a loop on 2000×2000×8-byte buffers (32 MB each). Each call to np.dot allocates one or more buffers of that size to get its work done. It adds up fast.

The profiler shows that Gen 0 is filling up and being collected more than 100 times per second. I think there is real overhead in these heap allocations.
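A cheap way to confirm that outside the profiler is the standard GC.CollectionCount API; RunWorkload here is a hypothetical stand-in for the np.dot loop under test:

```csharp
// Measurement sketch: count Gen 0 collections around the hot loop.
using System;

int gen0Before = GC.CollectionCount(0);
RunWorkload();
int gen0Delta = GC.CollectionCount(0) - gen0Before;
Console.WriteLine($"Gen 0 collections during workload: {gen0Delta}");

static void RunWorkload() { /* the np.dot loop goes here */ }
```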

> I also ran performance analysis software on the code to see where we are taking the most time. It turns out that my old friend NpyArray_ITER_NEXT, plus the other code I shared with you earlier, is taking most of the time. As a "strided" system, numpy creates data structures that map "views" into the allocated arrays. When operations are performed on these arrays, each element needs to run through NpyArray_ITER_NEXT so that the correct offset into the buffer is calculated based on the mapped views.

> In the C code, this is a macro, which means the compiler inserts the code directly into the calling code, allowing optimal performance. As you may know, C# does not support macros, so I had to port those macros to C# functions, which makes them much slower than C. I did mark these functions for aggressive inlining, but I am not sure the compiler is agreeing to do that because the functions are quite large.

> I did find a way to run these iteration loops in parallel for some of the operations, which lets me beat the C code in some situations. When I am not able to use parallel processing, this code path is slower, which is what we are seeing with your np.dot calls.

NpyArray_ITER_NEXT has many branches indeed, but I copied an older serial version of your MatrixProduct and got interesting results: the serial version is twice as fast.

Parallel:

Running Mackey...
Loading...
    Elapsed: 91ms
Constructing ESN...
    Elapsed: 2588ms
Fit...
    Elapsed: 32129ms
Predict...
    Elapsed: 20501ms
Error...
    STRING { test error: 0,13960390995923377 }
    Elapsed: 43ms

Total time: 55364ms

Serial:

Running Mackey...
Loading...
    Elapsed: 96ms
Constructing ESN...
    Elapsed: 2584ms
Fit...
    Elapsed: 15478ms
Predict...
    Elapsed: 3931ms
Error...
    STRING { test error: 0,13960390995923377 }
    Elapsed: 31ms

Total time: 22128ms

[profiler screenshot (parallel_dot): https://user-images.githubusercontent.com/338795/94589292-7bf70d80-025b-11eb-9e05-3bedab2cd815.jpeg]

As you can see in the image above, _update takes up 71% of the time and TaskReplication takes up 67% of the time. In the serial version, on the other hand, _update takes up only 31% of the time. The bottleneck there is MathNet.Numerics.

[profiler screenshot (serial_dot): https://user-images.githubusercontent.com/338795/94589293-7d283a80-025b-11eb-90eb-72234585838c.jpeg]

> At some point, I/we/someone should look at calling into a C DLL to perform some of these heavy processing functions. This may be necessary to make this tool really competitive. If we can process in parallel AND use C to perform the calculations, it may end up being much faster than the original Python code.

My curiosity was to see whether it would be possible to beat NumPy's numbers with a pure C# implementation. I think the JIT can get close enough to be competitive. For the past few days, I have been digging through your code, trying to understand how it is architected. What are your plans for the library's architecture? Will you stay close to the NumPy architecture or move to a more object-oriented one? What were your criteria when you ported the NumPy code? Is all the C code in the numpyinternal class? What is the role of the numpyAPI class?

> Next up I will try to convert the sample apps and unit tests to .NET Core so they can be run on Linux.

Only the WPF examples do not run on Linux, and only a few tests that involve dynamic assembly emitting had to be disabled, because .NET Core 3.1 does not support some of the APIs they use.

thild commented 3 years ago

Did you consider a version of ndarray using generics?
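For the sake of discussion, here is a bare-bones sketch of what a generics-based ndarray core could look like; this is not a proposal for the library's API, just an illustration of the question:

```csharp
// Discussion sketch: a strided, generic ndarray core. A T[] backing store
// avoids per-element boxing and type switches, at the cost of restructuring
// the ported C code around generic methods.
public sealed class NdArray<T> where T : struct
{
    private readonly T[] _data;
    private readonly int[] _shape;
    private readonly int[] _strides;   // strides in elements, not bytes

    public NdArray(T[] data, int[] shape, int[] strides)
    {
        _data = data; _shape = shape; _strides = strides;
    }

    public T this[params int[] index]
    {
        get
        {
            long offset = 0;
            for (int i = 0; i < index.Length; i++)
                offset += (long)index[i] * _strides[i];
            return _data[offset];
        }
    }
}
```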

KevinBaselinesw commented 3 years ago

Can we communicate directly by email rather than through GitHub? I think we can share information better that way than this tool allows. kmckenna@baselinesw.com

KevinBaselinesw commented 3 years ago

I think it is possible that when I converted MatrixProduct to parallel processing, I slowed it down for cases where only one thread is available. I do add some overhead to manage the parallelism. It is definitely faster when more than one thread can be employed. Your test case only uses one thread, however.

I can probably change the code to check for multiple threads and branch to the old code if not. That would be a nice optimization.
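A sketch of that check; the threshold and the inner loops here are illustrative stand-ins, not the library's actual MatrixProduct:

```csharp
// Sketch of the proposed optimization: fall back to the serial path when
// parallelism cannot help. The threshold is a made-up cut-off needing tuning.
using System;
using System.Threading.Tasks;

static class MatrixSketch
{
    const int ParallelThreshold = 256;

    public static void MatrixProduct(double[,] a, double[,] b, double[,] c)
    {
        int n = a.GetLength(0);
        if (Environment.ProcessorCount == 1 || n < ParallelThreshold)
            for (int i = 0; i < n; i++) Row(a, b, c, i, n);   // old serial path
        else
            Parallel.For(0, n, i => Row(a, b, c, i, n));      // parallel path
    }

    static void Row(double[,] a, double[,] b, double[,] c, int i, int n)
    {
        for (int j = 0; j < n; j++)
        {
            double sum = 0;
            for (int k = 0; k < n; k++) sum += a[i, k] * b[k, j];
            c[i, j] = sum;
        }
    }
}
```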


thild commented 3 years ago

The code was running with 8 threads on average.