Functionality:
Add new shl / shr to u32x8 and u32x4 that shift each lane by the count held in the corresponding lane of the right-hand SIMD operand. Both are implemented efficiently on AVX2 and Neon. This is useful for dividing by constants via libdivide-style algorithms. Added an example implementation of a branch-free divide in t_usefulness.
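For reviewers who want the intended semantics spelled out, here is a minimal scalar sketch of the per-lane shift and of a libdivide-style branch-free divide built on it. It uses plain `[u32; 4]` arrays rather than the real SIMD types, and the names `shr_lanes`, `shl_lanes`, `BranchFreeDivisor`, and `div_lanes` are hypothetical, not the identifiers used in this PR. The sketch assumes every shift count is below 32 and every divisor is at least 2; how the actual implementation treats out-of-range counts is not covered here.

```rust
/// Scalar model of a per-lane shift right: lane i of `a` is shifted by
/// lane i of `counts`. Assumes every count is below 32.
fn shr_lanes(a: [u32; 4], counts: [u32; 4]) -> [u32; 4] {
    core::array::from_fn(|i| a[i] >> counts[i])
}

/// Scalar model of a per-lane shift left, same assumption on the counts.
fn shl_lanes(a: [u32; 4], counts: [u32; 4]) -> [u32; 4] {
    core::array::from_fn(|i| a[i] << counts[i])
}

/// Hypothetical precomputed divisor for the branch-free scheme:
/// n / d == (((n - q) >> 1) + q) >> shift, with q = high 32 bits of n * magic.
#[derive(Clone, Copy, Debug)]
struct BranchFreeDivisor {
    magic: u32,
    shift: u32,
}

impl BranchFreeDivisor {
    fn new(d: u32) -> Self {
        // d == 1 would need a shift of -1 in this scheme, so it is rejected.
        assert!(d >= 2, "divisor must be at least 2");
        let log2 = 31 - d.leading_zeros();
        if d.is_power_of_two() {
            // q is always 0, so the result is (n >> 1) >> (log2 - 1).
            return Self { magic: 0, shift: log2 - 1 };
        }
        // 33-bit magic m = floor(2^(33 + log2) / d) + 1; only the low 32 bits
        // are stored because the top bit is always set.
        let m = (1u128 << (33 + log2)) / d as u128 + 1;
        Self { magic: (m - (1u128 << 32)) as u32, shift: log2 }
    }
}

/// Divide each lane of `n` by its own divisor, with no branch on the divisor.
fn div_lanes(n: [u32; 4], d: [BranchFreeDivisor; 4]) -> [u32; 4] {
    // High half of the 32x32 -> 64 bit product, per lane.
    let q: [u32; 4] =
        core::array::from_fn(|i| ((n[i] as u64 * d[i].magic as u64) >> 32) as u32);
    // q <= n, so neither the subtraction nor the addition below can overflow.
    let half = shr_lanes(core::array::from_fn(|i| n[i] - q[i]), [1; 4]);
    let t: [u32; 4] = core::array::from_fn(|i| half[i] + q[i]);
    // The final shift differs per lane, which is where a per-lane shr is needed.
    shr_lanes(t, core::array::from_fn(|i| d[i].shift))
}
```

The last step is the reason a variable shift matters: each lane can carry a different divisor, so each lane needs its own shift count.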
Bug fixes:
Better testing exposed a bug in u32x8::max and u32x8::min on AVX2: both were calling the signed versions instead of the unsigned ones.
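To illustrate why this only shows up with larger inputs, here is a small scalar sketch of the failure mode. The helper names are hypothetical; the point is that reinterpreting `u32` lanes as `i32` gives the wrong answer as soon as a value has its top bit set, which is what happens when a signed max/min instruction is applied to unsigned data.

```rust
/// Models the bug: the u32 bit patterns are compared as if they were i32.
fn max_compared_as_signed(a: u32, b: u32) -> u32 {
    (a as i32).max(b as i32) as u32
}

/// Models the fix: a genuine unsigned comparison.
fn max_compared_as_unsigned(a: u32, b: u32) -> u32 {
    a.max(b)
}

/// Returns true when the signed comparison happens to agree with the
/// unsigned one, which is the case whenever both inputs are below 2^31.
fn signed_max_agrees(a: u32, b: u32) -> bool {
    max_compared_as_signed(a, b) == max_compared_as_unsigned(a, b)
}
```

Test vectors that stay below `0x8000_0000` cannot distinguish the two, so a test for the unsigned types should include at least one lane with the high bit set.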
Testing improvements:
Rather than manually calculating the correct scalar output to verify SIMD operations, add a trait, similar to the portable Simd library, that implements the basic to/from conversions so that the test code can be written generically instead of copy/pasted.
I can add this to the other types if you think it is a useful enhancement, but I didn't want to do too much before getting your feedback.
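As a rough sketch of the idea (not the trait actually added in this PR), a test helper along these lines lets one generic function compare any lane-wise operation against its scalar counterpart. `TestSimd`, `MockU32x4`, and `check_binary` are hypothetical names, and `MockU32x4` is a stand-in for a real SIMD type so the example is self-contained.

```rust
/// Hypothetical conversion trait: enough to move data in and out of a
/// SIMD-like type so a test can be written once for every lane layout.
trait TestSimd<T, const N: usize>: Copy {
    fn from_array(lanes: [T; N]) -> Self;
    fn to_array(self) -> [T; N];
}

/// Stand-in for a real 4-lane u32 vector type.
#[derive(Clone, Copy, Debug)]
struct MockU32x4([u32; 4]);

impl TestSimd<u32, 4> for MockU32x4 {
    fn from_array(lanes: [u32; 4]) -> Self {
        MockU32x4(lanes)
    }
    fn to_array(self) -> [u32; 4] {
        self.0
    }
}

impl MockU32x4 {
    /// Lane-wise unsigned maximum.
    fn max(self, rhs: Self) -> Self {
        MockU32x4(core::array::from_fn(|i| self.0[i].max(rhs.0[i])))
    }
}

/// Generic check: run the vector op once, then compare every lane against
/// the scalar op applied to the same inputs.
fn check_binary<V, T, const N: usize, F, G>(a: [T; N], b: [T; N], simd_op: F, scalar_op: G) -> bool
where
    V: TestSimd<T, N>,
    T: Copy + PartialEq,
    F: Fn(V, V) -> V,
    G: Fn(T, T) -> T,
{
    let got = simd_op(V::from_array(a), V::from_array(b)).to_array();
    (0..N).all(|i| got[i] == scalar_op(a[i], b[i]))
}
```

With this shape, covering a new operation is one call that pairs the vector method with the scalar closure, rather than a hand-computed expected array per test.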