Add optional fftw for correlation

ErichZimmer commented 2 years ago

@timdewhirst Should we make an option to use FFTW if the wrapper locates it in vcpkg when compiling?

timdewhirst commented 2 years ago

It's something I've thought about - definitely a possibility if it's faster (which it should be, given that it supports SIMD).

ErichZimmer commented 2 years ago

@timdewhirst What about the pocketFFT cpp branch? It supports SSE and AVX 256/512 and is at times just as fast as fftw3. It uses __attribute__((vector_size(N))), so it is quite platform independent. However, it requires a GCC or Clang compiler. I personally used TDM-GCC since it works automatically with vcpkg.

timdewhirst commented 2 years ago

Sure, can take a look! It looks like it also supports NEON, which is useful for Apple silicon people: added https://github.com/OpenPIV/openpiv-c--qt/issues/34

ErichZimmer commented 2 years ago

If I remember right, the non-vectorized version of the pocketFFT cpp branch was ~35% faster than the current implementation of the FFT cross-correlation algorithm in openpiv/algos. However, I may be off by some margin, as I ran the performance test ~10 days ago and don't have access to my laptop until Sunday.

timdewhirst commented 2 years ago

I've added in pocketfft support in openpiv-c--qt along with results:

Test run on Intel i5-2520M dual core with hyper-threading, Ubuntu 22.04.1 LTS:

correlation	average time (s)	improvement
complex	3.254	0.0%
real	2.958	9.1%
pocket	2.375	27.1%
pocket_real	2.305	29.1%

Test run on MacBook Pro M1 PRO, OS X 13.0.1

correlation	average time (s)	improvement
complex	0.537	0.0%
real	0.455	15.2%
pocket	0.305	43.2%
pocket_real	0.281	47.7%

ErichZimmer commented 2 years ago

Nice! I played around with your implementation earlier and found it to be a little slower than my sloppy implementation. Perhaps I'll profile it and see what is going on.

timdewhirst commented 2 years ago

Please do, and raise a PR if you find anything interesting!

ErichZimmer commented 2 years ago

I did my testing on a single thread. As soon as I utilized all threads, your version was over 2x faster, since I was not using thread-local storage due to my lack of experience with the thread_local keyword. I'll do a follow-up if anything of note occurs.
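
The thread-local scratch storage mentioned above can be illustrated with a small sketch. It is written in Python, using threading.local as a stand-in for the C++ thread_local keyword; the names get_buffer and process are hypothetical, and the "work" is a placeholder rather than a real correlation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One scratch area per thread: attributes set on this object are private
# to the thread that set them (roughly what thread_local gives in C++).
_scratch = threading.local()

def get_buffer(size):
    """Return this thread's scratch buffer, allocating it on first use."""
    buf = getattr(_scratch, "buf", None)
    if buf is None or len(buf) != size:
        buf = [0.0] * size
        _scratch.buf = buf
    return buf

def process(window_id):
    """Placeholder for per-window work that needs a temporary buffer."""
    buf = get_buffer(32 * 32)          # reused across calls on the same thread
    for i in range(len(buf)):
        buf[i] = float(window_id)
    return sum(buf)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, range(16)))
```

Because each worker thread owns its buffer, no locking is needed and the buffer is not reallocated for every window.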

ErichZimmer commented 2 years ago

On correlation algorithms, how does the correlate_real function work? How are both real images combined into one complex image and then processed? I'm very interested.

timdewhirst commented 2 years ago

There are various resources covering this, but I found http://www.robinscheibler.org/2013/02/13/real-fft.html to be quite clear. In short, due to the symmetry properties of an FFT of real only data, it's possible to combine two real images and perform one forward FFT which can then be decomposed.

Given the small size of the images typically used for interrogation areas, I suspect that using two real to complex forward FFTs from pocketFFT may be faster - I used the above method as my initial FFT implementation was harder to change to support real to complex.
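
A minimal 1D sketch of the symmetry trick described above: two real sequences are packed into one complex sequence, transformed once, and then separated. This is not the openpiv code; the names dft and two_real_dfts are hypothetical, and a naive O(N^2) DFT stands in for a real FFT library.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def two_real_dfts(a, b):
    """Return (A, B), the DFTs of real sequences a and b, from one complex DFT."""
    n = len(a)
    z = [complex(p, q) for p, q in zip(a, b)]    # pack: z = a + i*b
    Z = dft(z)
    A, B = [], []
    for k in range(n):
        zc = Z[(-k) % n].conjugate()             # conj(Z[N - k])
        A.append((Z[k] + zc) / 2)                # part contributed by a
        B.append((Z[k] - zc) / 2j)               # part contributed by b
    return A, B

a = [1.0, 2.0, 0.0, -1.0]
b = [0.5, 0.0, 3.0, 1.0]
A, B = two_real_dfts(a, b)
```

The separation works because the DFT of real data satisfies X[N - k] = conj(X[k]), so the two spectra can be told apart inside Z. The same idea extends to 2D images by applying the index reversal along both axes.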

ErichZimmer commented 2 years ago

Okay. My version used real to complex and complex to real FFTs in single precision floats for additional speed. I recompiled my basic cross correlation implementation in double precision floats and my implementation manages to somehow be 26% faster on a single core than the PocketFFT class on openpiv/algos. I suppose it is because I am directly reading the extracted interrogation windows with the FFTs and theoretically avoided casting (e.g., core::cf --> core::gf). I'll push a branch for you to see.
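
For readers following along, the FFT cross-correlation being benchmarked can be sketched in 1D as below. This is an illustrative sketch only, not either implementation from this thread: the helper names are hypothetical, a naive DFT replaces the r2c/c2r transforms, and the correlation is circular (no padding).

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT or inverse DFT (stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def circular_cross_correlation(a, b):
    """corr[m] = sum_j a[j] * b[(j + m) % n] for real sequences a and b."""
    A, B = dft(a), dft(b)
    product = [x.conjugate() * y for x, y in zip(A, B)]
    return [v.real for v in dft(product, inverse=True)]

a = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]
b = a[-3:] + a[:-3]                    # a shifted right by 3 samples
corr = circular_cross_correlation(a, b)
shift = corr.index(max(corr))          # location of the correlation peak
```

The position of the peak gives the displacement between the two signals, which is what a PIV correlator extracts per interrogation window.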

ErichZimmer commented 2 years ago

@timdewhirst here is my first implementation of pocketFFT. https://github.com/ErichZimmer/OpenPIV-Python-cxx/blob/PocketFFT-test/src/process/src/openpiv_correlation.cpp

ErichZimmer commented 2 years ago

Yikes! Seems like that's the wrong one. I'll go looking for the correct folder again soon.

ErichZimmer commented 2 years ago

It looks like I lost the folder when I was transferring computers (I don't know why I did not back it up on GitHub). However, it contained only very simple functions following what OpenPIV-Python does.

ErichZimmer commented 2 years ago

Okay, I found it and pushed my basic implementation to the pocketfft-test branch.

timdewhirst commented 2 years ago

I've taken a look and don't see much difference, except that the complex image in the r2c and c2r transforms is sized per the pocketfft notes. I've pushed another change which uses the r2c and c2r transforms, but not much difference in performance is visible (at least on M1/OS X). Having said that, time per interrogation area is now down to under 5 μs on the M1.

What's your testing methodology?
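
As a rough illustration of the r2c/c2r sizing mentioned above: because the spectrum of real data is Hermitian-symmetric, a real-to-complex transform only needs to keep n // 2 + 1 complex values along the halved axis. The helper name r2c_shape is hypothetical, and the sketch assumes the last axis is the one that is halved; a naive DFT stands in for the library call.

```python
import cmath

def r2c_shape(rows, cols):
    """Complex output shape of a 2D real-to-complex FFT (last axis halved)."""
    return rows, cols // 2 + 1

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

row = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
n = len(row)
X = dft(row)

# Keep only the first n // 2 + 1 bins, as an r2c transform would.
half = X[: n // 2 + 1]

# The dropped bins are conjugates of the mirrored kept bins.
rebuilt = half + [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
```

For a 32x32 interrogation window this gives a 32x17 complex array instead of 32x32, which is where the memory and compute savings of r2c come from.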

ErichZimmer commented 2 years ago

I am using the Python wrapper, so that may be the issue right there. When performing the benchmarks on the Python side, I do 25 runs on the PIV Challenge 2003 case A image pairs with 32x32 px interrogation windows and 75% overlap. Then the average and the deviation are calculated, giving N seconds +/- M seconds.
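
A minimal sketch of that kind of measurement, assuming a plain timing loop rather than the exact script used here. The names benchmark and window_step are hypothetical, and the workload is a placeholder for the actual PIV call.

```python
import statistics
import time

def benchmark(func, runs=25):
    """Call func() `runs` times; return (mean, sample std dev) in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

def window_step(window, overlap):
    """Grid spacing in pixels for a given window size and fractional overlap."""
    return int(round(window * (1.0 - overlap)))

# Placeholder workload standing in for one full PIV evaluation.
mean_s, dev_s = benchmark(lambda: sum(i * i for i in range(10_000)), runs=25)
print(f"{mean_s:.6f} s +/- {dev_s:.6f} s")

step = window_step(32, 0.75)   # 32x32 px windows with 75% overlap
```

With 32x32 px windows and 75% overlap the grid step is 8 px, so the number of windows (and the total runtime) grows quickly with image size.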

ErichZimmer commented 2 years ago

I do not know what is causing the change in performance, but my main branch and my PocketFFT implementation branch differ considerably in single-core performance. Here is a quick benchmark on the wrapper, performing 25 runs using Python's %%timeit.

Correlation	Average Time (s)
My impl.	3.89
Your impl.	5.72

There must be unnecessary copying going on somewhere when I use your implementation, but I could not find where.

ErichZimmer commented 2 years ago

I found the issue by accident; it was related to passing by reference. Your implementation + my wrapper now runs faster than my implementation and wrapper. :D

ErichZimmer commented 1 year ago

@timdewhirst I may have run into a bug with the real pocketfft correlator. Since the real-to-complex transform requires a real input and a complex output halved in one axis, I feel like there is a strides issue with the current implementation using r2c transforms. This yields some unexpected results and is why my branch in this repository with the updated correlators did not pass CI testing. Perhaps we should go back to a c2c transform with decomposition, like your recursive implementation, when using pocketfft?

ErichZimmer commented 1 year ago

I found out it was an image issue that caused correlation artifacts as OpenPIV-Python had the same results. Now all I need to do is debug the Linux CI build to see why the correlation algorithm is failing.

timdewhirst commented 1 year ago

Do you have an example that shows the issue?

ErichZimmer commented 1 year ago

I did some experimenting to see what caused the artifacts. I found that the cause of the issue is not due to our implementations, but how the cross correlation algorithm works in the presence of ill-conditioned images. For example, this is what I saw that caused my assumption that there could have been a strides issue (which was incorrect as synthetic tests show that our implementations perform the same as OpenPIV-Python).

[Image: correlation artifact]

The data contained in the interrogation window is low and blends in with noise; thus artifacts form as a result.

ErichZimmer commented 1 year ago

It is also interesting to note that MSVC 2022 compiles pocketfft with SIMD, giving an average performance increase of 120% (from 22 μs/window for the complex correlator to 10 μs/window for the pocket_real correlator with 32x32 interrogation windows).
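
Two different percentage conventions appear in this thread, so a short worked sketch may help. The function names are hypothetical; the numbers are the ones quoted in the thread.

```python
def time_reduction(t_old, t_new):
    """Percent of the original time saved: convention of the 'improvement' column."""
    return 100.0 * (t_old - t_new) / t_old

def throughput_gain(t_old, t_new):
    """Percent increase in windows processed per unit time."""
    return 100.0 * (t_old / t_new - 1.0)

# 22 us/window -> 10 us/window (MSVC 2022 figures above)
gain = throughput_gain(22.0, 10.0)       # the "120%" figure
saved = time_reduction(22.0, 10.0)       # same change expressed as time saved

# M1 table row: complex 0.537 s -> pocket_real 0.281 s
m1_saved = time_reduction(0.537, 0.281)

print(f"throughput gain: {gain:.1f}%, time saved: {saved:.1f}%")
print(f"M1 pocket_real time saved: {m1_saved:.1f}%")
```

So the 120% figure is a throughput gain (2.2x faster), while the "improvement" columns in the earlier tables report the fraction of run time saved.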

timdewhirst commented 1 year ago

> It is also interesting to note that MSVC 2022 compiles pocketfft with simd, giving an on average performance increase of 120% (from 22 μs/window for complex correlator to 10 μs/window for pocket_real correlator with 32x32 interrogation windows).

Nice! That's a good improvement and nearly as fast as the M1PRO :)

timdewhirst commented 1 year ago

> I did some experimenting to see what caused the artifacts. I found that the cause of the issue is not due to our implementations, but how the cross correlation algorithm works in the presence of ill-conditioned images. For example, this is what I saw that caused my assumption that there could have been a strides issue (which was incorrect as synthetic tests show that our implementations perform the same as OpenPIV-Python).
>
> The data contained in the interrogation window is low and blends in with noise; thus artifacts form as a result.

Ah, that's lovely - it's an issue I've not looked at in 20+ years - the issue is that the image is captured by a camera with dual output channels, one for odd rows, one for even, and they have slightly different characteristics. The Kodak ES1.0 used to suffer from this quite badly IIRC. There are two ways to solve it:

perform a low pass on the image data in real space
mask out the strong 2px period component in Fourier space

The second was my preferred option as it's computationally much more efficient! IIRC filtering in Fourier space when applied to PIV is not well researched in general...
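
The second option can be sketched in 1D along a single image column. This is an illustration under simplifying assumptions, not code from openpiv: the odd/even row mismatch is modelled as a pure additive offset, the column data is made up, the helper names are hypothetical, and a naive DFT stands in for an FFT.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT or inverse DFT (stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def remove_two_pixel_period(column):
    """Zero the Nyquist bin (period of exactly 2 rows) of an even-length column."""
    n = len(column)
    if n % 2:
        raise ValueError("expected an even number of rows")
    spectrum = dft(column)
    spectrum[n // 2] = 0               # bin n/2 is the (+1, -1, +1, ...) pattern
    return [v.real for v in dft(spectrum, inverse=True)]

# Made-up column: smooth intensity plus a -3/+3 offset on even/odd rows.
smooth = [10.0, 12.0, 15.0, 12.0, 10.0, 9.0, 8.0, 9.0]
striped = [s + (3.0 if i % 2 else -3.0) for i, s in enumerate(smooth)]
cleaned = remove_two_pixel_period(striped)

error_before = max(abs(p - q) for p, q in zip(striped, smooth))
error_after = max(abs(p - q) for p, q in zip(cleaned, smooth))
```

Note that the mask also removes whatever genuine signal sits at the 2 px period, so the result is close to, but not identical to, the underlying smooth column. In a correlator that already works in Fourier space, the masking adds almost no cost because the forward transform has been computed anyway.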

ErichZimmer commented 1 year ago

After further observations on pocketfft, I do not know where the performance is coming from: when POCKETFFT_NO_VECTORS is defined, the performance remains the same. Something weird is going on, but it is the good kind of weird.

ErichZimmer commented 1 year ago

When using MinGW with -mavx2, the time per interrogation window is now on average 8.72 μs/window for the pocket_real correlator with 32x32 interrogation windows. This means it has the same speed as fftw3, in addition to being ~33% faster than PIVview 3.9 (without SIMD, since I am using the demo version) with no padding or windowing (tested using 32x32 interrogation windows with a step of 2 pixels).

ErichZimmer / OpenPIV-Python-cxx

Add optional fftw for correlation #4