JuliaMath / Hadamard.jl

Fast Walsh-Hadamard transforms for the Julia language
MIT License

use FFHT code #5

Open stevengj opened 6 years ago

stevengj commented 6 years ago

The https://github.com/FALCONN-LIB/FFHT repository implements a specialized SIMD FHT, which it says is faster than FFTW for this problem. (Totally possible, since 2×2×2×⋯ multidimensional transforms are not that optimized in FFTW — they are using size-2 DFT kernels, whereas to be fast they should really use larger multidimensional kernels.) Might be interesting to support using FFHT (which is MIT-licensed) in cases where it is applicable.
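
For concreteness, the 2×2×⋯×2 equivalence is easy to check: the natural-ordered WHT of a length-2^m vector is exactly the m-dimensional size-2×⋯×2 DFT. A minimal sketch, using FFTW directly rather than any Hadamard.jl internals:

```julia
using FFTW, LinearAlgebra

x = randn(8)                               # length 2^3
H8 = reduce(kron, fill([1 1; 1 -1], 3))    # natural-ordered WHT matrix
y_wht = H8 * x
y_fft = real(fft(reshape(x, 2, 2, 2)))[:]  # size-2 DFT along every dimension
@assert y_wht ≈ y_fft                      # the 2×2×2 DFT is the WHT
```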

ghost commented 6 years ago

I'll compare that with my SSE2 SIMD code, thanks for the link.

There are a bunch of things you can do with the WHT that are not well known, mainly stemming from fast random-projection algorithms based on the WHT: https://github.com/S6Regen?tab=repositories

It also seems the very simple out-of-place WHT is hardly known: you go through the input vector pairwise and sequentially, putting the sum of each pair sequentially into the lower half of a new array and the difference sequentially into the upper half, then repeat for a total of log2(n) passes (see the sketch below). It may be the simplest least-known algorithm in computer science. http://randomprojectionai.blogspot.com/
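
A minimal Julia sketch of the out-of-place algorithm described above (illustrative, not the commenter's SSE code; it produces the unnormalized natural-ordered WHT):

```julia
using LinearAlgebra  # for kron in the check below

# Each pass writes pairwise sums into the lower half of a fresh buffer and
# pairwise differences into the upper half; log2(n) passes in total.
function wht_outofplace(x::AbstractVector{<:Real})
    n = length(x)
    @assert ispow2(n)
    a = float.(collect(x))          # working copy
    b = similar(a)                  # the "new array" of each pass
    h = n ÷ 2
    for _ in 1:trailing_zeros(n)    # trailing_zeros(n) == log2(n) here
        for i in 1:h
            b[i]     = a[2i - 1] + a[2i]   # sums -> lower half
            b[h + i] = a[2i - 1] - a[2i]   # differences -> upper half
        end
        a, b = b, a                 # the new array becomes the next input
    end
    return a
end

x = randn(8)
H = reduce(kron, fill([1 1; 1 -1], 3))    # Sylvester (natural) ordering
@assert wht_outofplace(x) ≈ H * x         # unnormalized natural-ordered WHT
```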

ghost commented 6 years ago

When I ran the FFHT benchmarking code on my cheap laptop I got 5200 65536-point WHTs per second per core. With my code I get 5000 per core.

My legacy SSE code: https://drive.google.com/open?id=1bhQi4BybXPkvCENIYQcdw9PHAwS9Ocam

I see all the links I have given are broken if you just click on them; it seems you have to copy and paste them into the browser address bar for them to work!

Anyway, to get GPU-comparable results you would definitely need, say, one of the new 32-core AMD CPUs. From the FFHT result tables that would get you to around 1,000,000 65536-point WHTs per second. A big advantage of CPUs is that they are much easier to program than GPUs. I don't know what the situation is with the new GPU tensor cores; if you could use them for the WHT you might reach 10 million+ 65536-point WHTs per second, presuming the memory bandwidth could handle it. That would be 1000 times better than I can do on my 2-core laptop (no hard drive, broken keyboard, CPU heatsink problems, runs off a cheap USB stick, etc.). I could then evolve a neural network in 86.4 seconds that currently takes me a day.

Djoop commented 5 years ago

I did a quick benchmark for the 1D case; it's probably not perfect, but it suggests that FFHT is one order of magnitude faster, even a bit more in low dimensions. Here are the results (Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz).

[benchmark results attached as an image]

Each time I create a matrix A = randn(d, 1000) and compare @belapsed fwht_natural!($(A), dims=1) against running fht_double on each column of A (with a simple loop; one could probably parallelize it). FFHT also provides an out-of-place version, which I guess is better than copying first. For reference, there is also the Spiral version (http://www.spiral.net/software/wht.html) and this other SIMD C++ implementation: https://github.com/curto2/mckernel/, but I didn't have time to compare them. So I don't know what the best option is: using FFHT only, using FFHT plus FFTW for unsupported cases (multidimensional cases, maybe?), or even creating a separate package for FFHT. A sketch of the setup is below.
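
A rough sketch of the setup described above, assuming FFHT has been compiled into a shared library (called libfht.so here) exposing its documented C entry point `int fht_double(double *buf, int log_n)`; the wrapper name fht_columns! is just illustrative:

```julia
using BenchmarkTools, Hadamard

const libfht = "libfht.so"   # assumption: FFHT built as a shared library

# In-place FFHT on each column of A with a simple (serial) loop.
function fht_columns!(A::Matrix{Float64})
    n = size(A, 1)               # must be a power of two
    log_n = trailing_zeros(n)
    GC.@preserve A begin
        for j in 1:size(A, 2)
            ccall((:fht_double, libfht), Cint,
                  (Ptr{Float64}, Cint), pointer(A, (j - 1) * n + 1), log_n)
        end
    end
    return A
end

d = 4096
A = randn(d, 1000)
# Timing comparison only; the two transforms may differ by normalization.
t_fftw = @belapsed fwht_natural!(B, dims=1) setup = (B = copy($A))
t_ffht = @belapsed fht_columns!(B) setup = (B = copy($A))
println("FFTW-based: $t_fftw s   FFHT: $t_ffht s")
```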

stevengj commented 5 years ago

I would continue to use FFTW (since it probably handles more cases), and add FFHT for the cases it supports, along the lines of the sketch below. We would first need to make a BinaryBuilder package for FFHT.
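
Something along these lines, as a purely illustrative sketch: fwht_natural_auto! and the libfht handle are hypothetical names (the real handle would come from the proposed BinaryBuilder package), and only the 1D power-of-two Float64 case is routed to FFHT:

```julia
using Hadamard

const libfht = "libfht.so"   # hypothetical library from a BinaryBuilder package

function fwht_natural_auto!(x::AbstractVector)
    if x isa Vector{Float64} && ispow2(length(x))
        # FFHT's in-place transform: int fht_double(double *buf, int log_n).
        ccall((:fht_double, libfht), Cint,
              (Ptr{Float64}, Cint), x, trailing_zeros(length(x)))
        # FFHT computes an unnormalized transform; rescale here if matching
        # Hadamard.jl's normalization convention is required.
    else
        fwht_natural!(x)     # general FFTW-backed path handles everything else
    end
    return x
end
```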