abseil / abseil-cpp

Abseil Common Libraries (C++)
https://abseil.io
Apache License 2.0

Discussion: the possibility to provide efficient SIMD implementation of Swiss Table on the Arm platform #1096

Open renzibei opened 2 years ago

renzibei commented 2 years ago

The absl hash table uses SIMD (SSE2) instructions on the x86 architecture to filter possible matching keys faster. However, I found that there is no corresponding SIMD implementation on the Arm platform.

At first, I thought the absl community was not motivated to optimize it for the less widely used Arm platform. So I tried to implement it myself using NEON SIMD instructions on the Arm architecture, but I quickly ran into a problem. With the SSE instruction set, the match-mask extraction takes only one instruction (_mm_movemask_epi8). There is no direct counterpart in the NEON instruction set, and every alternative I can find requires several more instructions, which introduces much higher latency.
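For context, the x86 fast path boils down to something like the following (a simplified sketch of the idea, not the exact Abseil code):

#include <emmintrin.h>
#include <stdint.h>

// Build a 16-bit mask with one bit set per control byte that equals `hash`.
uint16_t Match128Sse2(uint8_t hash, const uint8_t* ptr) {
  const __m128i match = _mm_set1_epi8(static_cast<char>(hash));
  const __m128i ctrl = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr));
  const __m128i comparison = _mm_cmpeq_epi8(ctrl, match);
  // pmovmskb collapses the 16 comparison lanes into a 16-bit integer mask in
  // a single instruction; this is the step NEON has no counterpart for.
  return static_cast<uint16_t>(_mm_movemask_epi8(comparison));
}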

Anyway, I tried to reproduce the same approach with SIMD instructions on the Arm platform, but as expected the result is slightly slower than the portable C++ code.

So I would like to ask Googlers whether anyone has tried a SIMD adaptation of the Swiss Table for the Arm architecture and run into similar problems. Is the reason there is still no SIMD-optimized version for Arm, as I suspect, that the platform lacks instructions that can implement the Swiss Table match efficiently?

atdt commented 2 years ago

@renzibei, good observation. We've tried multiple strategies for implementing this optimization on Arm. Here are two of them:

128-bit groups:

#include <arm_neon.h>
#include <stdint.h>

// Match `hash` against 16 control bytes at `ptr`; one bit per matching lane.
uint16_t Match128Neon(uint8_t hash, const uint8_t* ptr) {
  // Fill a vector with the hash byte.
  const uint8x16_t match = vdupq_n_u8(hash);

  // Load the 16 control bytes.
  const uint8x16_t ctrl = vld1q_u8(ptr);

  // Compare the ctrl vector with the hash vector. We get 0xFF in each lane
  // that matches and 0 in each lane that doesn't.
  const uint8x16_t comparison = vceqq_u8(ctrl, match);

  // Reduction. First, compute a bitwise AND of each byte with a special
  // bitmask that preserves a set bit in the position we want it in the reduced
  // bitmask. The final bitmask will need to be 16 bits wide, but since we're
  // working on byte-sized lanes, we can't place bits in positions 8-15; we'll
  // have to shift those into the right position later.
  const uint8x16_t bitmask = { 1, 2, 4, 8, 16, 32, 64, 128,
                               1, 2, 4, 8, 16, 32, 64, 128 };
  const uint8x16_t masked = vandq_u8(comparison, bitmask);

  // Extract the high and low halves; do a horizontal add of each half.
  // Shift the high mask by eight bits and OR the two masks together.
  const uint8x8_t low = vget_low_u8(masked);
  const uint8x8_t high = vget_high_u8(masked);
  return vaddv_u8(low) | (vaddv_u8(high) << 8);
}

64-bit groups:

#include <arm_neon.h>
#include <stdint.h>

// Match `hash` against 8 control bytes at `ptr`; one bit per matching lane.
uint16_t Match64Neon(uint8_t hash, const uint8_t* ptr) {
  const uint8x8_t match = vdup_n_u8(hash);
  const uint8x8_t ctrl = vld1_u8(ptr);
  // 0xFF in each matching lane, 0 elsewhere.
  const uint8x8_t comparison = vceq_u8(ctrl, match);
  // Keep one distinct bit per lane, then horizontally add to form the mask.
  const uint8x8_t bitmask = {1, 2, 4, 8, 16, 32, 64, 128};
  const uint8x8_t masked = vand_u8(comparison, bitmask);
  return vaddv_u8(masked);
}

There were a few additional variants I can't locate. None improved performance. Ultimately, we found that the relatively high latency of moving data between vector and general-purpose registers made this general approach impractical. But it's of course possible we overlooked something!

Happily, the scalar implementation appears to work quite well.
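
For reference, the portable path boils down to a SWAR-style byte match along these lines (a simplified sketch with illustrative names, not the exact Abseil code):

#include <stdint.h>
#include <string.h>

// Match `hash` against 8 control bytes at `ptr` using only general-purpose
// registers. The result has 0x80 set in every byte position that matched.
uint64_t MatchPortable(uint8_t hash, const uint8_t* ptr) {
  constexpr uint64_t kMsbs = 0x8080808080808080ULL;
  constexpr uint64_t kLsbs = 0x0101010101010101ULL;
  uint64_t ctrl;
  memcpy(&ctrl, ptr, sizeof(ctrl));      // a single 8-byte load
  uint64_t x = ctrl ^ (kLsbs * hash);    // bytes equal to `hash` become zero
  // Classic "find a zero byte" bit trick. It can report rare false positives,
  // but a Swiss Table probe verifies each candidate with a full key
  // comparison anyway, so they only cost an extra check.
  return (x - kLsbs) & ~x & kMsbs;
}

This does one load, one multiply, and a handful of ALU operations per 8-byte group, entirely in general-purpose registers, which is why it holds up well against the NEON variants above.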