Wunkolo / blog-comments

This repo exists purely to sustain the 'utterances' comment-system issues for my blog posts!

gf2p8affineqb: int8 shifting – Wunk #2

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

https://wunkolo.github.io/post/2020/11/gf2p8affineqb-int8-shifting/

rosbif commented 3 years ago

Thank you for these very useful implementations. For `_mm_srai_epi8` I think something like the following (in C) is simpler and more understandable:

    inline __m128i _mm_srai_epi8(const __m128i a, const unsigned int imm8)
    {
        const unsigned int shift = (imm8 <= 7) ? imm8 : 7;
        const uint64_t matrix = (UINT64_C(0x8182848890A0C000) << (shift * 8))
                              ^ UINT64_C(0x8080808080808080);
        return _mm_gf2p8affine_epi64_epi8(a, _mm_set1_epi64x(matrix), 0);
    }

Untested in this form but adapted from working code. I also think the test for `imm8 <= 7` is necessary because shifts >= word size are undefined (at least in C).
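For readers without GFNI hardware, the matrix constant can be sanity-checked with a scalar model. The sketch below is not from the post: `affine_byte`, `srai8_matrix`, `srai8_reference` and `srai8_matrix_matches_reference` are hypothetical names. It assumes the documented bit layout of `GF2P8AFFINEQB` with an immediate of 0, where result bit `i` is the parity of matrix byte `7 - i` ANDed with the source byte.

```c
#include <stdint.h>

/* Scalar model of one byte lane of GF2P8AFFINEQB with imm8 = 0:
   result bit i is the parity of (matrix byte [7 - i] AND x). */
static inline uint8_t affine_byte(uint64_t matrix, uint8_t x)
{
    uint8_t result = 0;
    for (int i = 0; i < 8; ++i) {
        uint8_t row = (uint8_t)((uint8_t)(matrix >> (8 * (7 - i))) & x);
        /* Fold the masked row down to a single parity bit. */
        row ^= (uint8_t)(row >> 4);
        row ^= (uint8_t)(row >> 2);
        row ^= (uint8_t)(row >> 1);
        result |= (uint8_t)((row & 1u) << i);
    }
    return result;
}

/* Matrix construction from the comment above, count clamped to 7. */
static inline uint64_t srai8_matrix(unsigned int imm8)
{
    const unsigned int shift = (imm8 <= 7) ? imm8 : 7;
    return (UINT64_C(0x8182848890A0C000) << (shift * 8))
         ^ UINT64_C(0x8080808080808080);
}

/* Reference arithmetic shift built from a logical shift plus an explicit
   sign fill, so it does not rely on >> of negative values. */
static inline uint8_t srai8_reference(uint8_t x, unsigned int imm8)
{
    const unsigned int shift = (imm8 <= 7) ? imm8 : 7;
    const uint8_t logical = (uint8_t)(x >> shift);
    const uint8_t sign_fill = (x & 0x80u) ? (uint8_t)~(0xFFu >> shift) : 0;
    return (uint8_t)(logical | sign_fill);
}

/* Returns 1 if the matrix model agrees with the reference for every byte
   value and for counts 0..9 (counts above 7 exercise the clamp). */
static inline int srai8_matrix_matches_reference(void)
{
    for (unsigned int imm8 = 0; imm8 < 10; ++imm8)
        for (unsigned int x = 0; x < 256; ++x)
            if (affine_byte(srai8_matrix(imm8), (uint8_t)x)
                != srai8_reference((uint8_t)x, imm8))
                return 0;
    return 1;
}
```

With a count of 0 the construction yields `0x0102040810204080`, the identity matrix for this instruction, which is a quick way to see why the trailing XOR with `0x8080808080808080` is there.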

Wunkolo commented 3 years ago

Oh wow, yeah, the shift amount should certainly be explicitly std::min-ed for safety and compatibility. I originally wrote this code for a JIT-emitted assembly environment where the shift amount comes from a decoded ARM instruction, so I was emitting the instruction directly and doing the matrix calculation "offline", and hadn't considered C's own bit-shifting behavior when writing the non-JIT version. C leaves shifts by a count greater than or equal to the operand width undefined, but x86's scalar shift instructions actually mask the shift count: only the low 5 bits are used for 8/16/32-bit operands (modulo 32) and the low 6 bits for 64-bit operands (modulo 64). So shifting a 32-bit value by "35" would equate to a shift by 3. Thanks for the info! I'll update my post with some info about that.
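A small sketch of the difference, in portable C. Both function names are hypothetical. The first models how a 32-bit scalar x86 shift treats its count (only the low 5 bits are used), the second is one well-defined way to handle oversized counts in C without relying on that masking.

```c
#include <stdint.h>

/* Models the count handling of a 32-bit scalar x86 SHR:
   the count is reduced modulo 32 before shifting. */
static inline uint32_t shr32_x86_model(uint32_t value, unsigned int count)
{
    return value >> (count & 31u);
}

/* Well-defined C alternative: counts of 32 or more shift every bit out.
   Writing `value >> count` directly would be undefined for those counts. */
static inline uint32_t shr32_saturating(uint32_t value, unsigned int count)
{
    return (count < 32u) ? (value >> count) : 0u;
}
```

For a count of 35, the masked model shifts by `35 & 31`, which is 3, while the saturating version returns 0. Which behavior is wanted depends on what is being emulated, which is why the clamp has to be written out explicitly in C.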

rosbif commented 3 years ago

I added the shift count test to be consistent with existing SSE shift intrinsics (e.g. `_mm_srai_epi16`), which handle reasonably high shift counts as expected. However, from memory, only the low byte of the shift count is used, so shift counts > 255 are taken modulo 256.

johnplatts commented 1 year ago

GFNI is not needed to implement int8 shift operations on x86 platforms that support MMX/SSE2/AVX as:

- `_mm_slli_pi8`, `_mm_slli_epi8`, `_mm256_slli_epi8`, `_mm_slli_pi16`, `_mm_slli_epi16`, and `_mm256_slli_epi16` can all be implemented using a 16-bit shift operation followed by a bitwise AND operation
- `_mm_srai_pi8`, `_mm_srai_epi8`, `_mm256_srai_epi8` can be implemented as follows (where `imm8` is the shift amount):
  - doing an unsigned 16-bit right shift operation
  - masking out the upper bits of the unsigned 16-bit right shift operation by doing a bitwise AND of each element by `(0xFF >> imm8)`
  - doing a bitwise XOR of each element of the bitwise AND result by `(0x80 >> imm8)`
  - subtracting `(0x80 >> imm8)` from each element of the bitwise XOR result
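The steps listed above can be sketched with SSE2 intrinsics roughly as follows. This is an illustration under stated assumptions rather than code from the post or the PR: the helper names `slli_epi8_sse2`, `srai_epi8_sse2` and `lane0` are hypothetical, and `imm8` is assumed to be in the range 0 to 7.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Left shift of each int8 lane: 16-bit shift, then clear the bits that
   crossed over from the neighbouring byte. Assumes 0 <= imm8 <= 7. */
static inline __m128i slli_epi8_sse2(__m128i a, int imm8)
{
    const __m128i count = _mm_cvtsi32_si128(imm8);
    const __m128i keep  = _mm_set1_epi8((char)((0xFF << imm8) & 0xFF));
    return _mm_and_si128(_mm_sll_epi16(a, count), keep);
}

/* Arithmetic right shift of each int8 lane. Assumes 0 <= imm8 <= 7. */
static inline __m128i srai_epi8_sse2(__m128i a, int imm8)
{
    const __m128i count = _mm_cvtsi32_si128(imm8);
    const __m128i keep  = _mm_set1_epi8((char)(0xFF >> imm8));
    const __m128i sign  = _mm_set1_epi8((char)(0x80 >> imm8));

    __m128i r = _mm_srl_epi16(a, count); /* unsigned 16-bit right shift   */
    r = _mm_and_si128(r, keep);          /* drop bits from the high byte  */
    r = _mm_xor_si128(r, sign);          /* flip the shifted sign bit     */
    return _mm_sub_epi8(r, sign);        /* subtract to sign-extend       */
}

/* Extracts the lowest byte lane, for inspecting results. */
static inline uint8_t lane0(__m128i v)
{
    return (uint8_t)_mm_cvtsi128_si32(v);
}
```

The XOR followed by the subtract is the usual sign-extension identity `(x ^ m) - m`, with `m` set to the position the sign bit lands on after the shift. For example, `lane0(srai_epi8_sse2(_mm_set1_epi8((char)0x80), 1))` gives `0xC0`.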

Wunkolo commented 1 year ago

@johnplatts:

> GFNI is not needed to implement int8 shift operations on x86 platforms that support MMX/SSE2/AVX as:
>
> * `_mm_slli_pi8`, `_mm_slli_epi8`, `_mm256_slli_epi8`, `_mm_slli_pi16`, `_mm_slli_epi16`, and `_mm256_slli_epi16` can all be implemented using a 16-bit shift operation followed by a bitwise AND operation
>
> * `_mm_srai_pi8`, `_mm_srai_epi8`, `_mm256_srai_epi8` can be implemented as follows (where `imm8` is the shift amount):
>
>   * doing an unsigned 16-bit right shift operation
>   * masking out the upper bits of the unsigned 16-bit right shift operation by doing a bitwise AND of each element by `(0xFF >> imm8)`
>   * doing a bitwise XOR of each element of the bitwise AND result by `(0x80 >> imm8)`
>   * subtracting `(0x80 >> imm8)` from each element of the bitwise XOR result

This is known! If you click the PR at the end of the write-up, this GFNI method is provided as a much faster single-instruction alternative to the 16-bit-shift + masking (`psllw` + `pand` + etc.) implementation.

I provided a similar contribution to Ryujinx recently as well: https://github.com/Ryujinx/Ryujinx/pull/3669