JetBrains-Research / viktor

Efficient f64-only ndarray in Kotlin
MIT License

Support (I)FFT calculations #42

Open Jolanrensen opened 3 years ago

Jolanrensen commented 3 years ago

For my specific data engineering application, a fast Fast Fourier Transform implementation would be great to have. Beyond that, there are many other applications that could benefit from FFT, like audio analysis.

Most math/vector libraries have some form of (inverse) FFT built in, like NumPy for Python, Apache Commons for Java (which is slow), or JTransforms (which is quite fast and multi-threaded, but doesn't use SIMD).

I think FFT could benefit a lot from the SIMD capabilities of Viktor.

This is confirmed by several papers; this is just the first example I could find.

Intel MKL also has a form of FFT built in which uses SIMD (and maybe other Intel stuff) as well.

I don't know if boost SIMD already has FFT built in, but I wouldn't be surprised if it does.

dievsky commented 3 years ago

Thank you for your interest! I will look into it. I didn't find any FFT methods in boost.SIMD, unfortunately. Do you have some specific FFT variety in mind? AFAIR, there are some special cases for different array sizes, like for powers of two, for other composite numbers, for arbitrary numbers etc.

daphil19 commented 3 years ago

I'm coming at this fairly naively, but would wrapping fftw with JNI calls be an effective way to implement this? The fallback for unsupported architectures could be JTransforms.

dievsky commented 3 years ago

It might be effective, but it would also require either re-licensing viktor or purchasing a dedicated fftw license, since fftw is apparently GPL.

Jolanrensen commented 3 years ago

> Thank you for your interest! I will look into it. I didn't find any FFT methods in boost.SIMD, unfortunately. Do you have some specific FFT variety in mind? AFAIR, there are some special cases for different array sizes, like for powers of two, for other composite numbers, for arbitrary numbers etc.

Powers of 2 are fine. I understand FFT is more efficient that way (and Apache Commons also has this as a hard limitation). Other than that, here is what I am specifically using right now:

1. A forward Fast Fourier Transform using a real double array as input, which outputs a complex array in interleaved form ([real0, imag0, real1, imag1, ...]).

2. An element-wise multiplication of two of these "complex" arrays.

3. An inverse Fast Fourier Transform with scaling on the product; from that result I take just the reals.
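For illustration, the three steps above can be sketched in plain Kotlin, with a naive O(n²) DFT standing in for the real FFT call (the names `dft`, `idftReal`, and `complexMulInPlace` are made up for this sketch; a real implementation would delegate the transforms to an FFT library):

```kotlin
import kotlin.math.PI
import kotlin.math.cos
import kotlin.math.round
import kotlin.math.sin

// Naive O(n^2) DFT of a real input, returning the interleaved spectrum
// [re0, im0, re1, im1, ...]. An FFT library call would replace this.
fun dft(input: DoubleArray): DoubleArray {
    val n = input.size
    val out = DoubleArray(2 * n)
    for (k in 0 until n) {
        var re = 0.0
        var im = 0.0
        for (t in 0 until n) {
            val angle = -2.0 * PI * k * t / n
            re += input[t] * cos(angle)
            im += input[t] * sin(angle)
        }
        out[2 * k] = re
        out[2 * k + 1] = im
    }
    return out
}

// Inverse transform with 1/n scaling; keeps just the real parts.
fun idftReal(spec: DoubleArray): DoubleArray {
    val n = spec.size / 2
    val out = DoubleArray(n)
    for (t in 0 until n) {
        var re = 0.0
        for (k in 0 until n) {
            val angle = 2.0 * PI * k * t / n
            re += spec[2 * k] * cos(angle) - spec[2 * k + 1] * sin(angle)
        }
        out[t] = re / n
    }
    return out
}

// Element-wise complex multiplication of interleaved arrays, in place in `a`.
fun complexMulInPlace(a: DoubleArray, b: DoubleArray) {
    for (k in 0 until a.size / 2) {
        val aRe = a[2 * k]
        val aIm = a[2 * k + 1]
        a[2 * k] = aRe * b[2 * k] - aIm * b[2 * k + 1]
        a[2 * k + 1] = aRe * b[2 * k + 1] + aIm * b[2 * k]
    }
}

fun main() {
    // Multiplying spectra is circular convolution: convolving with a unit
    // impulse shifted by one position rotates the signal by one.
    val x = doubleArrayOf(1.0, 2.0, 3.0, 4.0)
    val h = doubleArrayOf(0.0, 1.0, 0.0, 0.0)
    val xf = dft(x)
    complexMulInPlace(xf, dft(h))
    println(idftReal(xf).map { round(it) })  // [4.0, 1.0, 2.0, 3.0]
}
```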

It might be possible to still wrap another library that uses SIMD. I'm currently trying to get Intel MKL to work for my project, but I can't get the linking to work in Gradle. There are, however, many libraries: https://github.com/project-gemmi/benchmarking-fft

dievsky commented 3 years ago

A couple insights:

  • MKL seems to be the most versatile, but it's also big, Intel-oriented, and closed-source;
  • PocketFFT is lightweight, but only does SIMD for multidimensional arrays.

We could also probably implement a SIMD-friendly decimation-in-frequency reduction ourselves, delegating the actual FFT for small arrays to Apache Commons Math (we have it as a dependency anyway).
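The decimation-in-frequency idea can be sketched in plain Kotlin (a scalar illustration, not viktor code). The twiddle multiplication in the inner loop is the part that would be SIMDized; note that the output bins come out in bit-reversed order:

```kotlin
import kotlin.math.PI
import kotlin.math.cos
import kotlin.math.sin

// In-place radix-2 decimation-in-frequency FFT, applied recursively.
// `a` holds `n` interleaved complex values [re, im, ...] starting at
// `offset` (an index in doubles); `n` must be a power of two. The output
// bins land in bit-reversed order. A real implementation would stop the
// recursion at some small size and delegate that base case to a scalar
// FFT library, as suggested above.
fun difFft(a: DoubleArray, offset: Int, n: Int) {
    if (n <= 1) return
    val half = n / 2
    for (k in 0 until half) {
        val i = offset + 2 * k
        val j = offset + 2 * (k + half)
        val wRe = cos(-2.0 * PI * k / n)  // twiddle factor W^k
        val wIm = sin(-2.0 * PI * k / n)
        val sumRe = a[i] + a[j]
        val sumIm = a[i + 1] + a[j + 1]
        val difRe = a[i] - a[j]
        val difIm = a[i + 1] - a[j + 1]
        a[i] = sumRe
        a[i + 1] = sumIm
        a[j] = difRe * wRe - difIm * wIm      // complex multiply by the
        a[j + 1] = difRe * wIm + difIm * wRe  // twiddle: the SIMDizable part
    }
    difFft(a, offset, half)      // first half yields the even output bins
    difFft(a, offset + n, half)  // second half (n doubles later): odd bins
}

fun main() {
    // The spectrum of a unit impulse is flat: every bin equals 1 + 0i.
    val a = doubleArrayOf(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    difFft(a, 0, 4)
    println(a.joinToString())  // 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0
}
```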

Jolanrensen commented 3 years ago

> A couple insights:
>
>   • MKL seems to be the most versatile, but it's also big, Intel-oriented, and closed-source;
>   • PocketFFT is lightweight, but only does SIMD for multidimensional arrays.
>
> We could also probably implement a SIMD-friendly decimation-in-frequency reduction ourselves, delegating the actual FFT for small arrays to Apache Commons Math (we have it as a dependency anyway).

I'd skip Apache Commons, actually. It converts its results to Complex class instances, and even when I rewrote it to use F64FlatArrays, it was still very slow compared to JTransforms.

I did find another list of libraries: https://community.vcvrack.com/t/complete-list-of-c-c-fft-libraries/9153

dievsky commented 3 years ago

OK, thanks for the info! It's surprisingly hard to find a permissively licensed library on this list that does 1-D SIMD FFT on doubles, so the option of combining viktor's existing SIMD capabilities with a Java FFT implementation is still viable.

dievsky commented 3 years ago

After some research, it looks doubtful that we'll be able to perform the decimation using our existing SIMD capabilities. The problem is that we require complex multiplication for this, and all the libraries I've seen expect the complex data in interleaved format (RIRI), while SIMD only works well with the split format (RRII).
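For comparison, here is what complex multiplication looks like in the split (RRII) format, where real and imaginary parts live in separate arrays. Each loop is then a plain element-wise pass over doubles, the access pattern SIMD handles well. This is an illustrative sketch, not an actual viktor API:

```kotlin
// Element-wise complex multiplication in split format. Each loop body is a
// straight element-wise operation over doubles, unlike the interleaved RIRI
// layout where real and imaginary parts alternate within one register lane.
// The output arrays must not alias the inputs, since the second loop still
// reads aRe and bRe after outRe has been written.
fun complexMulSplit(
    aRe: DoubleArray, aIm: DoubleArray,
    bRe: DoubleArray, bIm: DoubleArray,
    outRe: DoubleArray, outIm: DoubleArray
) {
    for (k in aRe.indices) outRe[k] = aRe[k] * bRe[k] - aIm[k] * bIm[k]
    for (k in aRe.indices) outIm[k] = aRe[k] * bIm[k] + aIm[k] * bRe[k]
}

fun main() {
    // (1 + 2i) * (3 + 4i) = -5 + 10i
    val outRe = DoubleArray(1)
    val outIm = DoubleArray(1)
    complexMulSplit(
        doubleArrayOf(1.0), doubleArrayOf(2.0),
        doubleArrayOf(3.0), doubleArrayOf(4.0),
        outRe, outIm
    )
    println("${outRe[0]} ${outIm[0]}")  // -5.0 10.0
}
```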

Jolanrensen commented 3 years ago

> After some research, it looks doubtful that we'll be able to perform the decimation using our existing SIMD capabilities. The problem is that we require complex multiplication for this, and all the libraries I've seen expect the complex data in interleaved format (RIRI), while SIMD only works well with the split format (RRII).

However, many of the libraries, like pffft, already use SIMD, so that should be plug-and-play, right? It's only "pretty" fast, though, but it does use SIMD, so I guess that would make it at least "quite" fast, haha.

dievsky commented 3 years ago

Yes, the previous comment was related to the option where we would decimate in Kotlin using viktor's existing SIMD capabilities and delegate to JTransforms once the array size became small enough.

dievsky commented 3 years ago

pffft only works with floats, unfortunately. PocketFFT only uses SIMD for 2+ dimensional arrays. FFTW is under GPL.

Jolanrensen commented 3 years ago

I see, you're right. Maybe we could convert pffft to doubles? Not sure how this would affect the program (but I don't consider myself an active low-level programmer anyway, so I might be wrong).

Jolanrensen commented 3 years ago

Oh here we go, with double support: https://github.com/marton78/pffft

dievsky commented 3 years ago

Hmm, might work. I'll have to look into it in more detail later.

dievsky commented 3 years ago

Played a bit with the library. It seems that it doesn't support unaligned arrays, which is a problem, since JVM's double arrays are not aligned to 32 bytes (the array object normally is, but its header takes 16 bytes, so the data are only aligned to 16 bytes, and there's no ironclad guarantee even for that). And by "doesn't support" I mean it crashes hard with a segfault. I'll try to ask the author about it.

dievsky commented 3 years ago

Created an issue: https://github.com/marton78/pffft/issues/22

dievsky commented 3 years ago

The maintainers closed the issue as "too much work to do".

dievsky commented 3 years ago

OK, I think I've actually managed to fix the unaligned access problems.

Preliminary (really, really preliminary) benchmarks show very promising results. Like, our wrapper outperforms JTransforms by an order of magnitude. This is in all likelihood an unfair comparison, but still.

@Jolanrensen by the way, how do you multiply the complex arrays? And what is the typical size of the array?

Jolanrensen commented 3 years ago

That's very promising! Typical sizes I'm working with at the moment are on the order of hundreds (currently around 250), but this might change in the future; I'm not sure yet. The calculations do happen often, though.

To multiply the results from JTransforms (which are ordered [real, imag, real, imag]), I wrote the following to multiply a in place with b:

// In-place element-wise complex multiplication: a[k] *= b[k],
// both arrays in interleaved [re, im, re, im, ...] layout.
for (k in 0 until a.size / 2) {
    val aRe = a[2 * k]
    val aIm = a[2 * k + 1]
    val bRe = b[2 * k]
    val bIm = b[2 * k + 1]

    a[2 * k] = aRe * bRe - aIm * bIm
    a[2 * k + 1] = aRe * bIm + aIm * bRe
}

(this might be able to be done more efficiently, not sure)

dievsky commented 3 years ago

Hmm, 250 is a really small array, I'd say. Not sure we'll see much benefit from SIMD there. Doing a fairer comparison, I start to see the difference at about 1024 complex elements (2048 doubles). The reason I'm asking about complex multiplication is that it looks easily SIMDizable as well.

daphil19 commented 3 years ago

I have a use case that includes complex multiplication of up to 4M (2^22) complex values (so 8M doubles), so a speedup from SIMD would be welcome. With that said, our FFT code is the critical path, so any speedup there would trump optimizing the multiplication.

dievsky commented 3 years ago

Mostly a side note for future me: I have done quite a few benchmarks, and they generally show that the SIMD performance gain is most pronounced at mid-range array sizes (10K, 100K). I have no ironclad explanation for this, but my guess is that at the lower sizes (100, 1000) the JNI overhead is too expensive, while at the higher sizes (1M, 10M) the long JNI call somehow interferes with normal JVM work. This leads me to consider splitting large arrays into palatable chunks and SIMDizing each chunk with a separate call. It would be really easy to implement for element-wise operations and most reductions, but with FFT, we'd need to employ a decimation technique which, in turn, relies on complex multiplication. So we need efficient complex multiplication anyway.
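The chunking idea for element-wise operations can be sketched as follows. Here `nativeTimesAssign` is a hypothetical stand-in for a native SIMD kernel (implemented as a plain scalar loop so the sketch is self-contained), and the default chunk size is a guess, not a measured sweet spot:

```kotlin
import kotlin.math.min

// Hypothetical stand-in for a native SIMD kernel operating on one chunk;
// in reality this would be a single JNI call into vectorized code.
fun nativeTimesAssign(a: DoubleArray, b: DoubleArray, from: Int, to: Int) {
    for (i in from until to) a[i] *= b[i]
}

// Element-wise multiplication split into fixed-size chunks, one (imagined)
// native call per chunk, to keep each JNI call short.
fun chunkedTimesAssign(a: DoubleArray, b: DoubleArray, chunk: Int = 1 shl 16) {
    var from = 0
    while (from < a.size) {
        val to = min(from + chunk, a.size)
        nativeTimesAssign(a, b, from, to)
        from = to
    }
}

fun main() {
    val a = doubleArrayOf(1.0, 2.0, 3.0, 4.0, 5.0)
    chunkedTimesAssign(a, doubleArrayOf(2.0, 2.0, 2.0, 2.0, 2.0), chunk = 2)
    println(a.joinToString())  // 2.0, 4.0, 6.0, 8.0, 10.0
}
```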

dievsky commented 3 years ago

So, my solution seems to be working on Linux, both for me locally and on TeamCity. The other OSes, well... The macOS build is acting weird: it fails to assemble with the unaligned access workaround enabled (citing an undefined __m256d_u), but it also doesn't produce the expected SIGSEGV when assembled without the workaround. The Windows build currently fails with multiple C syntax errors when compiling pffft. Investigation ongoing.

dievsky commented 3 years ago

I must confess I'm a bit stuck with the Windows build. I can't even link against the original version (marton78/pffft) due to "unresolved external symbol" errors. If anyone participating in this issue has some experience with the Visual C toolchain, I'd appreciate the help.