Add non-AVX2 specific BDGL bucketing

joerowell commented 2 years ago

This PR adds a non-AVX2 specific BDGL bucketer to G6K.

It turns out that GCC ships with some (minimal) extensions for vector programming. This means that it's possible to write vectorised code in C++ without referring directly to any platform-specific intrinsics.
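
As a rough illustration of the extensions mentioned above, here is a minimal sketch. It is not code from this PR: the names `Vec4i`, `mul_add` and `horizontal_sum` are made up for the example. The only thing it assumes is the `vector_size` attribute, which GCC (and Clang) provide.

```cpp
#include <cstdint>

// Four 32-bit lanes packed into one 16-byte vector.
// `vector_size` is a GCC/Clang extension, not standard C++.
typedef int32_t Vec4i __attribute__((vector_size(16)));

// Element-wise a + b * c. Ordinary operators work lane by lane and the
// compiler chooses whatever instructions the target offers.
inline Vec4i mul_add(Vec4i a, Vec4i b, Vec4i c) { return a + b * c; }

// Individual lanes can be read with the subscript operator.
inline int32_t horizontal_sum(Vec4i v) { return v[0] + v[1] + v[2] + v[3]; }
```

Calling `mul_add` with `{1, 2, 3, 4}`, `{2, 2, 2, 2}` and `{10, 20, 30, 40}` gives `{21, 42, 63, 84}`, with no intrinsics header involved.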

This PR adds two new files:

Simd.h. This file contains a new namespace Simd and a bunch of declarations for vector instructions. These are essentially just emulations of what the Intel intrinsics offer, plus some extra helper functions. Where applicable (i.e. if HAVE_AVX2 is set) this wrapper uses the AVX2 instructions as before; otherwise we let GCC generate vectorised code for us.
Simd.inl. This file contains all of the implementations for Simd.h.
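
The dispatch described above could look roughly like the following. This is a hedged sketch, not the contents of Simd.h: the namespace `SimdSketch`, the type `Vec16s` and the functions `load`, `store` and `add_epi16` are hypothetical names chosen for illustration. Only the `HAVE_AVX2` macro comes from the PR description; the AVX2 intrinsics used are the standard ones from `<immintrin.h>`.

```cpp
#include <cstdint>
#include <cstring>
#ifdef HAVE_AVX2
#include <immintrin.h>
#endif

namespace SimdSketch {  // hypothetical name, not the PR's Simd namespace

#ifdef HAVE_AVX2
// Native path: forward straight to the AVX2 intrinsics.
using Vec16s = __m256i;
inline Vec16s load(const uint16_t* p) {
  return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
}
inline void store(uint16_t* p, Vec16s v) {
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(p), v);
}
inline Vec16s add_epi16(Vec16s a, Vec16s b) { return _mm256_add_epi16(a, b); }
#else
// Generic path: a 32-byte GCC vector of sixteen 16-bit lanes.
// Unsigned lanes are used so that overflow wraps, as the intrinsic does.
typedef uint16_t Vec16s __attribute__((vector_size(32)));
inline Vec16s load(const uint16_t* p) {
  Vec16s v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
inline void store(uint16_t* p, Vec16s v) { std::memcpy(p, &v, sizeof(v)); }
inline Vec16s add_epi16(Vec16s a, Vec16s b) { return a + b; }
#endif

}  // namespace SimdSketch
```

Callers only ever see `SimdSketch::add_epi16`, so the same calling code builds whether or not `HAVE_AVX2` is defined. Without AVX enabled, GCC may print a `-Wpsabi` note about passing 32-byte vectors by value; that is a warning, not an error.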

The PR also modifies fht_lsh.h and fht_lsh.cpp: the only change there is that all vector calls now go through the Simd wrapper.

There does not appear to be much of a change speed-wise: the GCC-generated versions appear to be roughly as fast as the AVX2 intrinsics (with some exceptions: m128_broadcastsi128_si256, for example, is much slower when using the GCC extensions vs native AVX2 code). However, it's possible there's a machine somewhere where the drop-off is more noticeable than on my laptop.
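
To make the broadcast case concrete, here is one plausible way a generic fallback for a 128-bit-to-256-bit broadcast could be written. This is an assumption for illustration only: the PR's actual implementation is not shown in this thread, and `Vec8s`, `Vec16s` and `broadcast128_sketch` are hypothetical names. Copying through memory like this gives the compiler less to work with than a single native broadcast instruction, which is consistent with (but does not prove the cause of) the slowdown reported above.

```cpp
#include <cstdint>
#include <cstring>

// 128-bit and 256-bit vectors of 16-bit lanes (GCC/Clang extension).
typedef uint16_t Vec8s __attribute__((vector_size(16)));
typedef uint16_t Vec16s __attribute__((vector_size(32)));

// Copy the 128-bit input into both halves of a 256-bit vector.
inline Vec16s broadcast128_sketch(Vec8s lane) {
  Vec16s out;
  char* dst = reinterpret_cast<char*>(&out);
  std::memcpy(dst, &lane, sizeof(lane));                 // low half
  std::memcpy(dst + sizeof(lane), &lane, sizeof(lane));  // high half
  return out;
}
```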

Note that you can find this with full unit tests etc at https://github.com/joerowell/gcc-bucketer: I used GoogleTest to write these, but that doesn't seem to play well with AutoTools and for now I haven't included them.

malb commented 2 years ago

Looks good to me, but maybe @cr-marcstevens and @lducas can take a look too?

lducas commented 2 years ago

Neat trick, happy to learn about the Simd lib. Experiments look good to me, no regression noted. I tried adding an option to control the flag in rebuild.sh.

Add non-AVX2 specific BDGL bucketing #101