Open konsumlamm opened 1 year ago
I did some experiments on SIMD implementations for the other operations. However, I wasn't able to find any implementation that can be vectorized by OpenMP, so I resorted to using SIMD intrinsics (unfortunately, this is a lot harder to get right). For most operations, I wrote SSE and AVX versions. Here are my findings (the factors are speedups over the standard implementation):
- `bitIndex`: about the same up to 1024 bits, then up to 0.6x (SSE) and 0.4x (AVX)
- `nthBitIndex` should be similar
- `reverseBits`: up to 0.25x (SSE) and 0.15x (AVX)
- `selectBits`: up to 0.36x (when using `-mbmi2`) and 0.06x (AVX)

Are you fine with using intrinsics? If so, should there be separate flags for SSE and AVX implementations (both are x86-specific, but quite common)?
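For readers unfamiliar with what `selectBits` has to compute, here is a minimal single-word sketch in C. It is not the code that was benchmarked above. The function names (`select_bits_portable`, `select_bits_bmi2`, `select_bits`, `exclude_bits`) are hypothetical, and the sketch assumes GCC or Clang. It uses a per-function `target("bmi2")` attribute plus `__builtin_cpu_supports` instead of a global `-mbmi2` flag, which is one possible way to use the BMI2 `_pext_u64` intrinsic without an extra build flag.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_INTRINSICS 1
#endif

/* Portable reference: collect the bits of `word` at the positions where
   `mask` is set and pack them, in order, into the low bits of the result. */
static uint64_t select_bits_portable(uint64_t word, uint64_t mask) {
    uint64_t result = 0;
    unsigned out = 0;                          /* next output bit position */
    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set mask bit */
        if (word & lowest) {
            result |= (uint64_t)1 << out;
        }
        out++;
        mask &= mask - 1;                      /* clear lowest set mask bit */
    }
    return result;
}

#ifdef HAVE_X86_INTRINSICS
/* Same operation via the BMI2 parallel-bit-extract instruction.
   The target attribute enables BMI2 for this function only. */
__attribute__((target("bmi2")))
static uint64_t select_bits_bmi2(uint64_t word, uint64_t mask) {
    return _pext_u64(word, mask);
}
#endif

/* Hypothetical dispatcher: use BMI2 when the CPU reports it, else fall back. */
static uint64_t select_bits(uint64_t word, uint64_t mask) {
#ifdef HAVE_X86_INTRINSICS
    if (__builtin_cpu_supports("bmi2")) {
        return select_bits_bmi2(word, mask);
    }
#endif
    return select_bits_portable(word, mask);
}

/* Excluding the masked bits is selecting with the complemented mask. */
static uint64_t exclude_bits(uint64_t word, uint64_t mask) {
    return select_bits(word, ~mask);
}
```

For example, `select_bits(0xB2, 0xF0)` keeps the upper nibble of the low byte and yields `0xB`, while `exclude_bits(0xB2, 0xF0)` yields `0x2`. A real vector implementation would additionally have to carry the packed output across word boundaries, which this sketch does not attempt.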
I'm fine with intrinsics, but not with additional flags.
[…] `x86_64` CPU. […] `__get_cpuid_count` […] flags. Here are some examples:
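The examples linked in the original comment are not reproduced here. As a separate illustration of the runtime check being discussed, the following C sketch queries the CPU feature flags with `__get_cpuid_count` from `<cpuid.h>` (available in GCC and Clang). The `simd_features` struct and `detect_simd_features` function are hypothetical names. The sketch only reads the CPUID bits; a complete AVX check would also verify that the operating system has enabled the extended register state (OSXSAVE/XGETBV), which is omitted here.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical container for the features a SIMD code path might need. */
typedef struct {
    bool avx2;
    bool bmi2;
} simd_features;

static simd_features detect_simd_features(void) {
    simd_features f = { false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* CPUID leaf 7, subleaf 0 holds the structured extended feature flags.
       __get_cpuid_count returns 0 if the leaf is not supported. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.avx2 = (ebx & (1u << 5)) != 0;   /* EBX bit 5: AVX2 */
        f.bmi2 = (ebx & (1u << 8)) != 0;   /* EBX bit 8: BMI2 */
    }
#endif
    return f;   /* all false on non-x86 targets */
}
```

With a check like this, the library could pick the intrinsic or the portable code path at runtime, so no additional build flags would be needed. On non-x86 targets the function reports no features and the portable path would be used.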
@konsumlamm I'd like to make a release soon so that consumers could benefit from your work here. Shall I go ahead as is or do you have plans to work on `excludeBits` / `selectBits` soon?
I have an implementation of `selectBits`/`excludeBits` lying around that I could make a PR for, but only for the immutable versions. I don't have much time currently, so I can't work on the other things rn.
I'll gladly take immutable versions only.
See also #64.
- `bitIndex` (#81)
- `nthBitIndex` (#81)
- `selectBits` (#82)
- `excludeBits` (#82)
- `reverseBits` (#71)