To right-shift each 64-bit word of a 256-bit vector a (and of 512-bit vectors emulated as two 256-bit halves) by the same amount b, the current code uses
_mm256_srl_epi64(a, _mm_set_epi32(0,0,0,b));
It should be slightly faster to directly use
_mm256_srli_epi64(a, b);
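as this avoids the "set", and the srli variant (shift count encoded as an immediate) is more efficient than srl (shift count read from an XMM register). This holds when b is a compile-time constant, which the immediate form expects.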
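To make the comparison concrete, here is a minimal sketch of both forms side by side, with a check that they produce the same result. The helper names (shift_srl, shift_srli, shifts_agree) and the constant SHIFT are hypothetical and not taken from the code base. The target attribute assumes GCC or Clang, so the file compiles without -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical shift amount; the immediate form needs a compile-time constant. */
enum { SHIFT = 7 };

/* Current form: the count is passed in the low 64 bits of an XMM register. */
__attribute__((target("avx2")))
static void shift_srl(const uint64_t in[4], uint64_t out[4], int b) {
    __m256i a = _mm256_loadu_si256((const __m256i *)in);
    __m256i r = _mm256_srl_epi64(a, _mm_set_epi32(0, 0, 0, b));
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Proposed form: the count is an immediate, so no extra register is set up. */
__attribute__((target("avx2")))
static void shift_srli(const uint64_t in[4], uint64_t out[4]) {
    __m256i a = _mm256_loadu_si256((const __m256i *)in);
    __m256i r = _mm256_srli_epi64(a, SHIFT);
    _mm256_storeu_si256((__m256i *)out, r);
}

/* Returns 1 if both forms match a scalar logical right shift on sample data. */
int shifts_agree(void) {
    const uint64_t in[4] = { 0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL,
                             128ULL, 1ULL };
    uint64_t via_srl[4], via_srli[4];
    shift_srl(in, via_srl, SHIFT);
    shift_srli(in, via_srli);
    for (int i = 0; i < 4; i++) {
        uint64_t expected = in[i] >> SHIFT;
        if (via_srl[i] != expected || via_srli[i] != expected)
            return 0;
    }
    return 1;
}
```

Calling shifts_agree() on an AVX2-capable CPU should confirm that the two forms are interchangeable for a constant count. It says nothing about speed, which would need a separate benchmark.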