OpenPIV / openpiv-python

OpenPIV is an open source Particle Image Velocimetry analysis software written in Python and Cython
http://www.openpiv.net
GNU General Public License v3.0

Processing with numba #246

Open hcwinsemius opened 2 years ago

hcwinsemius commented 2 years ago

Is your feature request related to a problem? Please describe.

The fact that PIV works over independent interrogation windows, which can be processed in parallel, seems to make OpenPIV very suitable for multi-core processing. I propose implementing numba as an orchestrator with JIT compilation, so that users are flexible in the platforms they use. This may make OpenPIV much more scalable to larger problems: such problems would run faster on platforms with high CPU/memory availability, and still work (albeit more slowly) without memory problems on smaller systems.

Describe the solution you'd like

Make the functions that perform analysis over interrogation windows numba-compatible. We should then analyse at what level the parallelisation should be organised to really speed up the code. Numba works on a numpy/scipy code basis, so this should be quite feasible to do. A minimal sketch of the idea is shown below.
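As a rough illustration (the function and variable names here are made up, not existing OpenPIV API), a per-window kernel could be JIT-compiled and parallelised along these lines:

import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def process_windows(windows_a, windows_b):
    # windows_a, windows_b: (n_windows, win_size, win_size) stacks of
    # interrogation windows; every pair is independent, so the loop
    # can be distributed over all cores with prange
    n = windows_a.shape[0]
    out = np.empty(n)
    for i in prange(n):
        # placeholder statistic; a real kernel would cross-correlate
        out[i] = np.mean(windows_b[i] - windows_a[i])
    return out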

Describe alternatives you've considered

An alternative is to process several pairs of images in parallel (e.g. through dask arrays), but this will not allow solving problems where a single image pair requires a lot of memory.

Additional context

The use case for this request is scalability of #pyopenrivercam (see https://localdevices.github.io/pyorc). We noticed that users with relatively large areas of interest run into memory problems. Furthermore, we would like to improve the ability to parallelise on clusters.

alexlib commented 2 years ago

Hi @hcwinsemius - I believe it's a great idea. We'd only like to ensure that people can still install openpiv-python without numba, i.e. make the import and the algorithm optional, so that the package remains installable on unusual computing units or on systems without compiler support, etc.

@ErichZimmer - could you please get in touch with @hcwinsemius and show him the numba branch you've been working on?

thanks

alexlib commented 2 years ago

@hcwinsemius if clusters are an option, please take a look at https://github.com/OpenPIV/openpiv-python-gpu

ErichZimmer commented 2 years ago

Hi all,

One issue I ran into when using numba is the lack of support for two-dimensional FFTs. One way around this is to use numba.objmode, but that resulted in a slower cross-correlation algorithm. Other than that, numba provided a decent performance boost on a single thread, especially for the validation algorithms.
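For reference, the workaround looks roughly like this (a sketch, not the exact code from that branch); the round trip into the interpreter on every call is what eats the gain:

import numpy as np
from numba import njit, objmode

@njit
def fft2_via_objmode(window):
    # numba has no native 2D FFT, so temporarily drop back to object
    # mode and call numpy's implementation there
    with objmode(out='complex128[:, :]'):
        out = np.fft.fft2(window)
    return out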

hcwinsemius commented 2 years ago

@ErichZimmer really good point. I started some research on this today, and indeed the cross-correlation estimation may be the bottleneck. Did you also try implementing the convolution with two for loops, to see if that yields additional performance? I would love to hear about your experience so far. I was also curious whether you know which part of the chain is most expensive (my guess was cross-correlation, but I haven't profiled yet).

@alexlib your point on dependencies is also very good. Numba is not easy on edge systems like the Raspberry Pi. If numba seems promising, I could also write wrappers around the existing functions and put these in a separate library such as openpiv-numba, which could then be installed on top of openpiv-python. That would entirely separate the core functionality from any numba compilation. How do you feel about this? A sketch of the idea follows.
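The wrapper could be as simple as a no-op fallback decorator (illustrative only; openpiv-numba does not exist yet and these names are made up):

try:
    from numba import njit
except ImportError:
    # numba unavailable (e.g. on a Raspberry Pi): fall back to a
    # decorator that returns the wrapped function unchanged
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]):
            return args[0]            # used as @njit
        return lambda func: func      # used as @njit(...)

@njit(cache=True)
def displacement_kernel(a, b):
    return b - a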

ErichZimmer commented 2 years ago

@hcwinsemius I performed a quick performance test on a normalized spatial correlation algorithm based on the literature on direct PIV correlation. For window sizes of 32x32 pixels, the direct method is as fast as the FFT method for search radii of up to 5 pixels (allowing velocities of up to +/- 4 pixels). Beyond a 5-pixel search radius, there is a very steep performance hit.
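For context, a direct correlation restricted to a small search radius looks something like this (an unnormalised sketch of my own; the cost grows with (2*radius + 1)^2, which is why it stops paying off against the FFT beyond a few pixels):

import numpy as np

def direct_correlation(win_a, win_b, radius=4):
    # correlate two n-by-n windows over all shifts within +/- radius,
    # returning a (2*radius+1, 2*radius+1) correlation matrix
    n = win_a.shape[0]
    size = 2 * radius + 1
    corr = np.zeros((size, size))
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            # overlapping parts of the two windows for shift (di, dj)
            a = win_a[max(0, di):n + min(0, di), max(0, dj):n + min(0, dj)]
            b = win_b[max(0, -di):n + min(0, -di), max(0, -dj):n + min(0, -dj)]
            corr[di + radius, dj + radius] = np.sum(a * b)
    return corr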

ErichZimmer commented 2 years ago

@hcwinsemius I'll create a branch for my numba version tomorrow (hopefully) so you can get a good idea of what I did and how to keep compatibility with or without numba.

hcwinsemius commented 2 years ago

@ErichZimmer you mean performance increases steeply when the search radius increases? That would be excellent news for river observations where the search radius usually is about the size of the interrogation window. Eager to give this a try.

alexlib commented 2 years ago

Please take a look at what the fluiddyn project is doing (they are on Bitbucket).

alexlib commented 2 years ago

hi, please take a look at the work done for the project https://github.com/fluiddyn/fluidimage - they claim some CPU/GPU support using Pythran and other libraries.

timdewhirst commented 2 years ago

If I could chip in at this point: this is perhaps an area where the openpiv-python and openpiv-c++ projects could merge. The original observation is correct: the calculation of displacement by evaluating independent interrogation areas is "embarrassingly parallel" and well suited to multi-core processing, and by extension to GPUs.

We have support for multicore processing in openpiv-c++, and running on a CPU (M1 Pro - 8 cores, of which 6 are performance cores) we get something like ~10us per interrogation area - it's slower on an i7, probably due to non-unified memory, but still quite fast.

The key to high-performance code, really, is to ensure that data is transferred/transformed as little as possible, and that operations are done in a predictable, non-random fashion - this lets the branch predictors and caches give maximal benefit.

We have some work ongoing at the moment to write python bindings for openpiv-c++ which would allow python to be used for high-level structure and c++ for high performance.

Thoughts?

alexlib commented 2 years ago

this is a very good discussion. I'm already dreaming of openpiv2 :) that merges all your ideas and effort into a solid package.

ErichZimmer commented 2 years ago

Pythran (one of the tools fluidimage uses) allows 1D FFTs from numpy (64-bit floating-point arithmetic, so it may still be slower than our current implementation using scipy). Since FFTs are separable, we can do a 1D FFT along each dimension to replicate a 2D FFT, as in the sketch below. On using OpenPIV with a c++ backend, I am slightly worried about building enough wheels for distribution. One thing I love about OpenPIV-Python is that all platform-dependent conflicts are handled by third-party packages. This means there is no compiling of code and, therefore, a good deal of cross-platform compatibility.
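The separability trick, for the record (a minimal demonstration, not OpenPIV code):

import numpy as np

a = np.random.rand(32, 32)

# the 2D transform is separable: a 1D FFT along each axis in turn
# reproduces fft2, so a backend offering only 1D FFTs still works
via_1d = np.fft.fft(np.fft.fft(a, axis=0), axis=1)
assert np.allclose(via_1d, np.fft.fft2(a))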

alexlib commented 2 years ago

> Pythran (one of the tools fluidimage uses) allows 1D FFTs from numpy (64-bit floating-point arithmetic, so it may still be slower than our current implementation using scipy). Since FFTs are separable, we can do a 1D FFT along each dimension to replicate a 2D FFT. On using OpenPIV with a c++ backend, I am slightly worried about building enough wheels for distribution. One thing I love about OpenPIV-Python is that all platform-dependent conflicts are handled by third-party packages. This means there is no compiling of code and, therefore, a good deal of cross-platform compatibility.

I have reasonable experience with the cibuildwheel system - it automatically generates binary wheels for all the major platforms and uploads them to PyPI or elsewhere. If we manage to build it on GitHub Actions, we might be fine for the desktop users. We won't then be able to create something for drones, Raspberry Pi, etc., so we still need a system that prepares a fallback version in pure Python with installable libraries, plus optional use of the C++ high-speed computation if a binary wheel is available.

ErichZimmer commented 2 years ago

@hcwinsemius FYI, for the direct correlation method, I meant a very steep performance loss, compared to FFTs, when the search radius is greater than 5 pixels.

On my numba-accelerated version: I am not able to upload it at this time due to the beginnings of a natural disaster (I won't have access to my laptop for approximately 2 weeks). However, I'll try to explain what I did. Each numba-accelerated module has a large try/except statement: one branch with numba functions if numba is present, and one with a "pure" Python implementation. The only accelerated modules were validation and process (except the correlation algorithms), where the functions were decomposed into for-loops so that numba can take advantage of SIMD intrinsics. A sketch of such a decomposed function is shown below.
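Something in this spirit (illustrative; not the actual branch code, and the function name is made up):

import numpy as np
from numba import njit

@njit(cache=True)
def sig2noise_validation(s2n, threshold=1.05):
    # the vectorised one-liner (s2n < threshold) rewritten as explicit
    # loops, which numba compiles to tight, SIMD-friendly machine code
    rows, cols = s2n.shape
    flags = np.zeros((rows, cols), dtype=np.bool_)
    for i in range(rows):
        for j in range(cols):
            if s2n[i, j] < threshold:
                flags[i, j] = True
    return flags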

On using a c++ backend: I looked at a few packages using scikit-build, and most had somewhat complex packaging systems. One thing we have to realise when using c++ as a backend (via openpiv-c--qt) is that five dynamic libraries have to be created for each platform (jpeg62, liblzma, openpivcore, tiff, and zlib1). That is where I get confused when attempting a cross-platform compilation (unless we were to precompile openpiv-c--qt for each platform and somehow link the libraries to the Python frontend).

timdewhirst commented 2 years ago

On C++ (or rather, non-pure-Python) modules: dependencies are always a possibility, but this can be worked around - any moderately complex python module or package that loads images will probably have the same set of dependencies somewhere - just take a look at the guts of a conda install!

However, one solution is to link statically - dynamically linked libraries don't have to be used (unless there is a licence requirement) - and e.g. the default build on OS X does just this (it's probably the same on Linux, but I don't have a box at hand to check!):

% otool -L libopenpivcore.dylib 
libopenpivcore.dylib:
    @rpath/libopenpivcore.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.23.0)

i.e. the only dynamic libraries are system libraries.

The same could be done for the module as a whole, i.e. statically link against libopenpivcore, which in turn is statically linked, thereby removing any (including transitive) dependencies. This is one of the reasons that I wanted to make openpiv-c++ as dependency-light as possible.

ErichZimmer commented 2 years ago

Implementing a normalized direct correlation algorithm based on Lewis, J.P. (2001), "Fast Normalized Cross-Correlation", the direct correlation algorithm is as fast as or faster than the FFT method until the search radius reaches 6 pixels (returning a 12x12 px correlation matrix). I think this is a reasonable size for a correlation matrix. I may as well put together a numba/pythran version of OpenPIV, as it seems pretty easy to do from experience.
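The key trick from Lewis is that running-sum tables give the mean and energy of every candidate patch in O(1), so only the numerator needs actual correlation. A sketch of the denominator computation (my own illustration, not the tested code):

import numpy as np

def patch_std_denominator(image, template_shape):
    # cumulative sum tables of the image and of its square; the sum
    # and energy of any h-by-w patch then follow from 4-corner
    # differences, at constant cost per patch
    h, w = template_shape
    s = np.cumsum(np.cumsum(np.pad(image, ((1, 0), (1, 0))), axis=0), axis=1)
    s2 = np.cumsum(np.cumsum(np.pad(image ** 2, ((1, 0), (1, 0))), axis=0), axis=1)
    patch_sum = s[h:, w:] - s[:-h, w:] - s[h:, :-w] + s[:-h, :-w]
    patch_sq = s2[h:, w:] - s2[:-h, w:] - s2[h:, :-w] + s2[:-h, :-w]
    var = patch_sq - patch_sum ** 2 / (h * w)
    return np.sqrt(np.maximum(var, 0.0))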

ErichZimmer commented 2 years ago

@hcwinsemius should we use pythran as it supports 1D r2c and c2r FFTs?

hcwinsemius commented 2 years ago

I am currently experimenting a bit with JAX, which would be both CPU and GPU compatible and easy to install. Some of the processing steps were quite simple to port and sped up by a factor of 5 to 10. I haven't optimised the cross-correlation yet, but I am also considering replacing it (for larger search windows) with an optimisation-based search algorithm.
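To give a flavour of the experiment (a sketch, not my actual code):

import jax
import jax.numpy as jnp

@jax.jit  # compiles for CPU or GPU, whichever backend jax finds
def fft_correlate(win_a, win_b):
    # circular cross-correlation of one window pair via real FFTs
    f = jnp.fft.rfft2(win_a) * jnp.conj(jnp.fft.rfft2(win_b))
    return jnp.fft.irfft2(f)

# vmap maps the same kernel over a whole stack of window pairs at once
correlate_stack = jax.vmap(fft_correlate)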

It is really only experimentation, so that we can have a proper discussion on whether this would be an interesting alternative to pure numpy/scipy or not.

I don't know pythran. Will have a look.

ErichZimmer commented 2 years ago

I performed some more tests with direct non-normalised correlation (the fastest variant) with numba and pythran and found it quite sluggish compared to FFTs, unless the window size is less than 32x32 px and the search radius is less than 5 px. Additionally, using single-precision FFTW in a c++ test script completely blows everything away, due to the reuse of plans and the multi-threading mechanism of openpiv-c--qt (in fact, it was quite a bit faster than PIVview 3.9 when not performing repeated correlations). So, I would focus more on c/c++, as it seems reasonably easy to program while keeping the Python dependencies, and the resulting docker image size, small. However, if we want a pythran/numba framework like FluidImage, then I would not mind allocating off-time, as PIV and its associated programs are part of my hobby.

timdewhirst commented 2 years ago

Very interesting! On the topic of performance, GPUs aren't really a magic bullet that will automatically make everything fast, rather they are great if you have an "embarrassingly parallel" problem to solve and you can structure your code so that everything can be done in parallel on the GPU: the reason for this is that, unless you're using an M1 chip or a new Grace Hopper chip, there is a non-negligible cost to transferring data from host memory to the GPU for processing, and this will be especially bad if you're transferring lots of very small things instead of one large thing.

I've not looked at the numba or pythran code, but I would guess that the rough architecture to follow should be:

Probably an area for more research. I'd be interested in seeing the results for FFTW - I've long toyed with the idea of incorporating it within openpiv-c--qt to see what the performance is like. I would imagine it might be about 2x as fast as my very basic recursive implementation, as it extensively uses SIMD as well as the tricks I've used, i.e. plans, thread-local data, etc.

ErichZimmer commented 2 years ago

@hcwinsemius any progress with your testing with numba?

hcwinsemius commented 2 years ago

Hi Erich,

I don't have any bandwidth at the moment. I'd probably do some testing on jax first, as I haven't really tried that before and it is very close to numpy in its API. But time is not my friend at the moment.


hcwinsemius commented 1 month ago

Hi all,

I noticed this issue is still open. I finally have an update on using numba (instead of numpy or pythran). I have a working version of a preliminary library that resolves the cross-correlation with FFT, using the FFT implementation made by @styfenschaer in the rocket-fft package and the OpenPIV implementation of most of the functionality. I of course acknowledge this project for that in the README.md.

The FFT functions used are not yet part of the core numba package, which makes this a little risky, but if they are adopted at some point and maintained properly, this can lead to a 10-fold increase in speed, provided that a large enough problem is sent to the cross-correlation solver at once. I therefore implemented options to resolve the cross-correlation for many image pairs at once; typically that is the case with videos several seconds long. In outline, the approach looks like the sketch below.
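(A sketch under the assumption that rocket-fft is installed, which registers numpy.fft support inside numba's JIT; this is not the actual library code.)

import numpy as np
from numba import njit, prange
# with the rocket-fft package installed, np.fft.* becomes usable
# inside @njit-compiled functions

@njit(parallel=True, cache=True)
def correlate_stack(windows_a, windows_b):
    # FFT cross-correlation of an (n, h, w) stack of window pairs
    n, h, w = windows_a.shape
    out = np.empty((n, h, w))
    for k in prange(n):
        fa = np.fft.fft2(windows_a[k])
        fb = np.fft.fft2(windows_b[k])
        out[k] = np.real(np.fft.ifft2(np.conj(fa) * fb))
    return out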

If you want to stay updated, I can post a link once I have a first release.

alexlib commented 1 month ago

Yes please, @hcwinsemius, keep us posted. Numba is a promising direction because it keeps the simplicity of Python code and its clarity for new users. OpenPIV in Python was never planned to be super fast; that's why we have the C++ packages. The Python version is for new features, new ideas, and rapid coding.

ErichZimmer commented 1 month ago

@hcwinsemius This is great news. I struggled to implement a numba-friendly FFT library or to find one that offered promising results and flexibility; this was the only real technical difficulty I had when playing around with numba for JIT. Other than that, transcribing OpenPIV-Python into a numba extension with a fallback to the standard OpenPIV library is trivial and may only take an afternoon to code once the proper engineering is complete. If needed, we can collaborate sometime in late January/early February and get things rolling, since I have some previous experience using numba (e.g., prototype Tomo-PIV and marker detection implementations).

ErichZimmer commented 1 month ago

Also, chunking the 3D correlation stack (if stacked interrogation windows are used) allows for an embarrassingly parallel setup and for further performance increases - something I think you touched on in your previous comment. I actually implemented a serial version of a chunked correlation algorithm so that I can process 4512x4512 px PIV images using little memory and with some decent performance increases. I had forgotten about that feature and will make a pull request for it at a future date. Anyhow, numba provides several means of parallelising workflows, and this seems a promising avenue to investigate. A sketch of the chunking idea follows.
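(Names are illustrative, not the code from my branch; the point is that peak memory is bounded by the chunk size rather than the full stack.)

import numpy as np

def correlate_in_chunks(windows_a, windows_b, correlate, chunk_size=512):
    # run any batched correlator chunk by chunk over an (n, h, w)
    # stack; serial here, but chunks could also be dispatched in
    # parallel since they are independent
    n = windows_a.shape[0]
    out = np.empty_like(windows_a)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        out[start:stop] = correlate(windows_a[start:stop],
                                    windows_b[start:stop])
    return out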

hcwinsemius commented 1 week ago

I just published a first version of FF-PIV, a small and lean code base for FFT-based PIV using the principles of OpenPIV. It can flexibly push many image pairs to the core function at once, making it quite fast with JIT-compiled code.

I have implemented it with both numpy and numba so that you can easily cross-compare the performance. I got speed-ups of a factor of 5 with relatively small problems, and larger speed-ups with high resolution, many windows, and lots of memory used in one go.

It would be nice to think about whether this could also be integrated back into OpenPIV at some point. Please have a look and let me know if you have comments or ideas. The examples in the README give a reasonably quick idea of how everything works.

https://github.com/localdevices/ffpiv

alexlib commented 1 week ago

> I just published a first version of FF-PIV, a small and lean code base for FFT-based PIV using the principles of OpenPIV. It can flexibly push many image pairs to the core function at once, making it quite fast with JIT-compiled code.
>
> I have implemented it with both numpy and numba so that you can easily cross-compare the performance. I got speed-ups of a factor of 5 with relatively small problems, and larger speed-ups with high resolution, many windows, and lots of memory used in one go.
>
> It would be nice to think about whether this could also be integrated back into OpenPIV at some point. Please have a look and let me know if you have comments or ideas. The examples in the README give a reasonably quick idea of how everything works.
>
> https://github.com/localdevices/ffpiv

Great stuff.

We have started working on OpenPIV 2.0:

https://github.com/OpenPIV/openpiv_2.0_governance

Maybe you could join us on that one? Ali has already implemented another numba version. We want to choose the one that is more straightforward and more accessible. It could also be used under the hood of a user-friendly piv_run that a) uses one of the existing libraries, b) makes it more straightforward to compare between different ones (new ones will show up soon, with NN, ML, and optical-flow types), and c) improves the overall experience.