renzibei opened this issue 2 years ago
@renzibei, good observation. We've tried multiple strategies for implementing this optimization on Arm. Here are two of them:
#include <arm_neon.h>

uint16_t Match128Neon(uint8_t hash, const uint8_t* ptr) {
  // Load the 16 control bytes.
  const uint8x16_t ctrl = vld1q_u8(ptr);
  // Fill a vector with the hash byte.
  const uint8x16_t match = vdupq_n_u8(hash);
  // Compare the ctrl vector with the hash vector. We get 0xFF in each lane
  // that matches and 0 in each lane that doesn't.
  const uint8x16_t comparison = vceqq_u8(ctrl, match);
  // Reduction. First, compute a bitwise AND of each byte with a special
  // bitmask that preserves a set bit in the position we want it in the reduced
  // bitmask. The final bitmask will need to be 16 bits wide, but since we're
  // working on byte-sized lanes, we can't place bits in positions 8-15; we'll
  // have to shift those into the right position later.
  const uint8x16_t bitmask = {1, 2, 4, 8, 16, 32, 64, 128,
                              1, 2, 4, 8, 16, 32, 64, 128};
  const uint8x16_t masked = vandq_u8(comparison, bitmask);
  // Extract the high and low halves; do a horizontal add of each half.
  // Shift the high mask by eight bits and OR the two masks together.
  const uint8x8_t low = vget_low_u8(masked);
  const uint8x8_t high = vget_high_u8(masked);
  return vaddv_u8(low) | (vaddv_u8(high) << 8);
}
#include <arm_neon.h>

uint16_t Match64Neon(uint8_t hash, const uint8_t* ptr) {
  // Fill a vector with the hash byte and load the 8 control bytes.
  const uint8x8_t match = vdup_n_u8(hash);
  const uint8x8_t ctrl = vld1_u8(ptr);
  // 0xFF in each matching lane, 0 in each lane that doesn't match.
  const uint8x8_t comparison = vceq_u8(ctrl, match);
  // AND each lane with the bit that represents it in the reduced mask, then
  // horizontally add to collapse the lanes into a single 8-bit mask.
  const uint8x8_t bitmask = {1, 2, 4, 8, 16, 32, 64, 128};
  const uint8x8_t masked = vand_u8(comparison, bitmask);
  return vaddv_u8(masked);
}
There were a few additional variants I can't locate. None improved performance. Ultimately, we found that the relatively high latency of moving data between vector and general-purpose registers made this general approach impractical. But it's of course possible we overlooked something!
Happily, the scalar implementation appears to work quite well.
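For reference, the scalar path boils down to the classic SWAR byte-match trick. A minimal sketch of the general idea (illustrative, not the exact absl code):

#include <cstdint>
#include <cstring>

// Sketch of the well-known SWAR byte-match trick (the general idea behind a
// portable group match; not the exact absl code). Returns a word with the
// high bit set in each byte whose control value equals `hash`; borrows can
// produce false positives in neighboring bytes, which is tolerable in a Swiss
// table because candidate slots are verified against the full key anyway.
uint64_t MatchPortable(uint8_t hash, const uint8_t* ptr) {
  uint64_t ctrl;
  std::memcpy(&ctrl, ptr, sizeof(ctrl));    // load 8 control bytes
  const uint64_t lsbs = 0x0101010101010101ULL;
  const uint64_t msbs = 0x8080808080808080ULL;
  const uint64_t x = ctrl ^ (lsbs * hash);  // bytes equal to `hash` become 0
  return (x - lsbs) & ~x & msbs;            // flag the zero bytes
}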
The absl hash table on the x86 architecture uses SIMD (SSE2) instructions to filter possible matching keys faster. However, I found that there is no corresponding SIMD implementation for the Arm platform.
At first, I thought the absl community simply had not been motivated to optimize for the less-used Arm platform, so I tried to implement it myself with NEON SIMD instructions. I quickly ran into a problem: with SSE, the reduction to a bitmask is a single _mm_movemask_epi8 instruction, but NEON has no direct counterpart, and every alternative I could find needs several more instructions, which adds considerable latency. I nevertheless implemented the same approach with NEON and, as expected, it ended up slightly slower than the portable C++ code.
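For comparison, the x86 match is roughly the following (a sketch of the idea, not the exact absl code); note that after the compare, the entire reduction is the single movemask instruction:

#include <emmintrin.h>
#include <stdint.h>

// Sketch of the SSE2 match (illustrative, not the exact absl code): the
// 16-way comparison collapses to a 16-bit mask, and the reduction itself is a
// single _mm_movemask_epi8 instruction.
uint16_t Match128Sse2(uint8_t hash, const uint8_t* ptr) {
  const __m128i ctrl = _mm_loadu_si128((const __m128i*)ptr);
  const __m128i match = _mm_set1_epi8((char)hash);
  const __m128i comparison = _mm_cmpeq_epi8(ctrl, match);  // 0xFF per matching byte
  return (uint16_t)_mm_movemask_epi8(comparison);          // one bit per byte lane
}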
So I would like to ask the Googlers here: has anyone tried to implement a SIMD adaptation of the Swiss table for the Arm architecture, and did you run into the same problems? And is the reason there is still no SIMD-optimized version for Arm, as I suspect, that the platform lacks instructions that can implement the Swiss table efficiently?