Optimize vector library logic functions on bits

bradunov commented 9 years ago

Currently, most of the vector functions on bits (v_and, v_xor) are not as efficient as they could be. They are efficient for 128 bits and above, but everything under 128 bits is processed in a loop over 8-bit chunks. Clearly, this could be improved by working on 64-, 32- or 16-bit chunks, depending on the input size. v_or is already written that way, but it no longer uses 128-bit SSE instructions.

What needs to be done is to write all these functions optimally: apply the widest available instruction first, starting at 128 bits and working down. We also need to write tests for all these cases (which is easy: compare against a non-vectorized version).
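
A minimal sketch of the widest-first idea, together with the kind of test against a non-vectorized version mentioned above. This is not the actual Ziria code: the names `and_bytes_wide`, `and_bytes_ref` and `and_matches_reference` are hypothetical, and lengths are assumed to be given in whole bytes rather than bits. Unaligned loads are used so that the sketch is safe for any pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical reference: byte-at-a-time AND, used only as the test oracle. */
static void and_bytes_ref(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] & b[i];
}

/* Hypothetical widest-first AND over n bytes (assumption: byte lengths). */
static void and_bytes_wide(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* 128-bit chunks; loadu/storeu tolerate unaligned pointers. */
    for (; n - i >= 16; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_and_si128(x, y));
    }
#endif
    /* 64-bit chunks; memcpy avoids alignment and aliasing problems. */
    for (; n - i >= 8; i += 8) {
        uint64_t x, y;
        memcpy(&x, a + i, 8);
        memcpy(&y, b + i, 8);
        x &= y;
        memcpy(dst + i, &x, 8);
    }
    /* 32-bit chunk. */
    for (; n - i >= 4; i += 4) {
        uint32_t x, y;
        memcpy(&x, a + i, 4);
        memcpy(&y, b + i, 4);
        x &= y;
        memcpy(dst + i, &x, 4);
    }
    /* 16-bit chunk. */
    for (; n - i >= 2; i += 2) {
        uint16_t x, y;
        memcpy(&x, a + i, 2);
        memcpy(&y, b + i, 2);
        x &= y;
        memcpy(dst + i, &x, 2);
    }
    /* Remaining single byte, if any. */
    for (; i < n; i++)
        dst[i] = a[i] & b[i];
}

/* Returns 1 if the wide version matches the reference for every length 0..max_n. */
static int and_matches_reference(size_t max_n)
{
    uint8_t a[64], b[64], want[64], got[64];
    assert(max_n <= sizeof a);
    for (size_t i = 0; i < sizeof a; i++) {
        a[i] = (uint8_t)(i * 37 + 11);
        b[i] = (uint8_t)(i * 101 + 7);
    }
    for (size_t n = 0; n <= max_n; n++) {
        and_bytes_ref(want, a, b, n);
        and_bytes_wide(got, a, b, n);
        if (memcmp(want, got, n) != 0)
            return 0;
    }
    return 1;
}
```

Sweeping every length from 0 to 64 bytes exercises each chunk width and every combination of leftovers, which is the point of testing "all these cases". The same shape would apply to v_xor and v_or by swapping the operator.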

Also, check whether an array is aligned before using SSE.

dimitriv commented 9 years ago

Yep, agreed! I explicitly changed v_or to accommodate the needs of permutation, but I was well aware that it does not work efficiently for > 128 bits. A related complication that we need to investigate is whether external functions sometimes expect aligned arguments (if they pass them directly to SIMD instructions). If they do, then we have to do some work in the compiler to ensure this, because although we declare all arrays as aligned, arbitrary array slices that we pass on to those functions may not be ...

bradunov commented 9 years ago

To start with, we can check in the external function whether an array is aligned and, if not, skip over to 64 bits and below.
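
A rough sketch of that check, assuming 16-byte alignment is what the aligned SSE loads and stores require. The names `is_aligned_16`, `xor_bytes_checked` and `xor_matches_reference` are hypothetical, lengths are assumed to be in bytes, and the tail is simplified to 64-bit and 8-bit steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: true if p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical XOR over n bytes: aligned SSE only when all pointers allow it. */
static void xor_bytes_checked(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    if (is_aligned_16(dst) && is_aligned_16(a) && is_aligned_16(b)) {
        /* i advances by 16, so the addresses stay 16-byte aligned. */
        for (; n - i >= 16; i += 16) {
            __m128i x = _mm_load_si128((const __m128i *)(a + i));
            __m128i y = _mm_load_si128((const __m128i *)(b + i));
            _mm_store_si128((__m128i *)(dst + i), _mm_xor_si128(x, y));
        }
    }
#endif
    /* Fallback and tail: 64-bit chunks via memcpy, safe for any alignment. */
    for (; n - i >= 8; i += 8) {
        uint64_t x, y;
        memcpy(&x, a + i, 8);
        memcpy(&y, b + i, 8);
        x ^= y;
        memcpy(dst + i, &x, 8);
    }
    for (; i < n; i++)
        dst[i] = a[i] ^ b[i];
}

/* Returns 1 if the result is correct when every pointer starts `off` bytes
   past a 16-byte boundary (off = 0 exercises the aligned path). */
static int xor_matches_reference(size_t off, size_t n)
{
    static _Alignas(16) uint8_t a[96], b[96], got[96];
    assert(off + n <= sizeof a);
    for (size_t i = 0; i < sizeof a; i++) {
        a[i] = (uint8_t)(i * 29 + 3);
        b[i] = (uint8_t)(i * 53 + 17);
    }
    xor_bytes_checked(got + off, a + off, b + off, n);
    for (size_t i = 0; i < n; i++)
        if (got[off + i] != (uint8_t)(a[off + i] ^ b[off + i]))
            return 0;
    return 1;
}
```

Passing a nonzero `off` imitates an array slice that starts in the middle of an aligned array, which is the case raised above. An alternative to skipping SSE entirely would be unaligned loads (`_mm_loadu_si128`), at some possible cost on older hardware.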

dimitriv / Ziria

Optimize vector library logic functions on bits #95