Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

XXH32/XXH64 modernization #607

Closed (easyaspi314 closed this issue 2 years ago)

easyaspi314 commented 2 years ago

Idea: XXH32 and XXH64 could be enhanced like so:

Pros:

Cons:

Cyan4973 commented 2 years ago

We can probably make multiple small steps progressively in this direction. Performance matters, since it's an essential property attached to xxhash. I'm sure there are several improvements or code simplifications that wouldn't impact performance, or only barely; but if a "dumb" compiler, say MSVC /O2, sees significant speed regressions, that matters too. So, as usual, it's a matter of balance.

Also, this effort might be partially linked to #550 .

easyaspi314 commented 2 years ago

> but if a "dumb" compiler, say MSVC /O2, sees significant speed regressions...

(╯°□°)╯︵ ┻━┻

```
C:\code\xxhash> xxhsum.exe -b1
xxhsum.exe 0.8.1 by Yann Collet
compiled as 32-bit i386 + SSE2 little endian with MSVC 19.29.30137.00
Sample of 100 KB...
 1#XXH32      :     102400 ->    48731  it/s ( 4758.9 MB/s)
C:\code\xxhash> xxhsum-outline-reroll.exe -b1
xxhsum-outline-reroll.exe 0.8.1 by Yann Collet
compiled as 32-bit i386 + SSE2 little endian with MSVC 19.29.30137.00
Sample of 100 KB...
 1#XXH32      :     102400 ->    26091  it/s ( 2548.0 MB/s)
```

Why is MSVC x86 allergic to unrolling fixed-iteration loops?

Edit: Outlining and extracting without rerolling seems to be fine though...
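For reference, a stripped-down standalone sketch of the two loop shapes being compared (the `read32`/`round32` helpers here are placeholders, not the real xxhash internals): MSVC x86 /O2 handles the manually unrolled body fine, but leaves the rerolled fixed-count inner loop as an actual loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder stand-ins for XXH_get32bits() / XXH32_round(). */
static uint32_t read32(const uint8_t* p) { uint32_t v; memcpy(&v, p, 4); return v; }
static uint32_t round32(uint32_t acc, uint32_t input)
{
    acc += input * 2246822519U;          /* XXH_PRIME32_2 */
    acc  = (acc << 13) | (acc >> 19);    /* rotl 13 */
    return acc * 2654435761U;            /* XXH_PRIME32_1 */
}

/* Manually unrolled: roughly the shape the current code has. */
static const uint8_t* bulk_unrolled(const uint8_t* p, size_t nbBlocks, uint32_t acc[4])
{
    do {
        acc[0] = round32(acc[0], read32(p + 0));
        acc[1] = round32(acc[1], read32(p + 4));
        acc[2] = round32(acc[2], read32(p + 8));
        acc[3] = round32(acc[3], read32(p + 12));
        p += 16;
    } while (--nbBlocks);
    return p;
}

/* Rerolled: the fixed 4-iteration inner loop that MSVC x86 keeps rolled. */
static const uint8_t* bulk_rerolled(const uint8_t* p, size_t nbBlocks, uint32_t acc[4])
{
    do {
        for (int lane = 0; lane < 4; lane++)
            acc[lane] = round32(acc[lane], read32(p + 4 * lane));
        p += 16;
    } while (--nbBlocks);
    return p;
}
```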

easyaspi314 commented 2 years ago

This is what I was thinking. It uses some of the naming styles from XXH3.

```c
/*!
 * @internal
 * @brief Seeds the accumulator lanes for @ref XXH32().
 *
 * @param acc  The 4 accumulator lanes from XXH32's internal state
 * @param seed The initial seed for the hash.
 */
XXH_FORCE_INLINE void
XXH32_resetAccs(xxh_u32 acc[4], xxh_u32 const seed)
{
    XXH_ASSERT(acc != NULL);
    acc[0] = seed + XXH_PRIME32_1 + XXH_PRIME32_2;
    acc[1] = seed + XXH_PRIME32_2;
    acc[2] = seed + 0;
    acc[3] = seed - XXH_PRIME32_1;
}

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH32().
 *
 * @param input, len Directly passed from @ref XXH32(). @p len must be >= 16.
 * @param acc   The 4 accumulator lanes from XXH32's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 16]`
 */
XXH_FORCE_INLINE xxh_u8 const*
XXH32_hashLong(xxh_u8 const* input, size_t len, xxh_u32 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 16;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    do {
        /* Note: MSVC x86 refuses to unroll this automatically. */
        acc[0] = XXH32_round(acc[0], XXH_get32bits(input + 0));
        acc[1] = XXH32_round(acc[1], XXH_get32bits(input + 4));
        acc[2] = XXH32_round(acc[2], XXH_get32bits(input + 8));
        acc[3] = XXH32_round(acc[3], XXH_get32bits(input + 12));
        input += 16;
    } while (--nbBlocks);
    return input;
}

/*!
 * @internal
 * @brief Merges the accumulator lanes to a single value for @ref XXH32().
 *
 * @param acc The 4 accumulator lanes from XXH32's internal state
 * @return The merged value
 */
XXH_FORCE_INLINE xxh_u32
XXH32_mergeAccs(xxh_u32 const acc[4])
{
    XXH_ASSERT(acc != NULL);
    return XXH_rotl32(acc[0], 1)  + XXH_rotl32(acc[1], 7)
         + XXH_rotl32(acc[2], 12) + XXH_rotl32(acc[3], 18);
}

/*!
 * @internal
 * @brief The implementation for @ref XXH32().
 *
 * @param input, len, seed Directly passed from @ref XXH32().
 * @param align Whether @p input is aligned.
 * @return The calculated hash.
 */
XXH_FORCE_INLINE xxh_u32
XXH32_endian_align(xxh_u8 const* input, size_t len, xxh_u32 seed, XXH_alignment align)
{
    xxh_u32 h32;

    if (input == NULL) XXH_ASSERT(len == 0);

    if (len >= 16) {
        xxh_u32 acc[4];
        XXH32_resetAccs(acc, seed);
        input = XXH32_hashLong(input, len, acc, align);
        h32 = XXH32_mergeAccs(acc);
    } else {
        h32 = seed + XXH_PRIME32_5;
    }

    h32 += (xxh_u32)len;

    return XXH32_finalize(h32, input, len % 16, align);
}
```
Cyan4973 commented 2 years ago

It looks good to me

easyaspi314 commented 2 years ago

I think for XXH64, we should just use a nested loop for the bulk loop, as long as MSVC x64 unrolls it (but MSVC x64 is more liberal in unrolling anyways)
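Roughly this shape, as a standalone sketch (placeholder `read64`/`round64` helpers, not the actual draft): one 32-byte stripe per outer iteration, with a fixed 4-lane inner loop we'd rely on the compiler to unroll.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder stand-ins for XXH_get64bits() / XXH64_round(). */
static uint64_t read64(const uint8_t* p) { uint64_t v; memcpy(&v, p, 8); return v; }
static uint64_t round64(uint64_t acc, uint64_t input)
{
    acc += input * 14029467366897019727ULL;   /* XXH_PRIME64_2 */
    acc  = (acc << 31) | (acc >> 33);         /* rotl 31 */
    return acc * 11400714785074694791ULL;     /* XXH_PRIME64_1 */
}

/* Nested-loop bulk: 4 lanes per 32-byte stripe, inner loop left to the compiler. */
static const uint8_t* bulk64(const uint8_t* p, size_t nbStripes, uint64_t acc[4])
{
    do {
        for (int lane = 0; lane < 4; lane++) {
            acc[lane] = round64(acc[lane], read64(p));
            p += 8;
        }
    } while (--nbStripes);
    return p;
}
```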

64-bit arithmetic is going to be hot garbage on MSVC x86 anyways thanks to _allmul calls, and GCC and Clang know how to unroll it.

Side note: Extracting XXH64's internals in the same way somehow gave a slight boost to ARMv7-a with Clang 13 (1.5GB/s -> 1.7GB/s), even though it was inlined and unrolled just like before. 🤔

easyaspi314 commented 2 years ago

Draft at easyaspi314:modern_xxh32_xxh64. I will make a PR once I do some benchmarking.

I also changed the mem32/mem64 fields to unsigned char arrays, which shouldn't break the binary ABI.
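For illustration, the kind of field change I mean (other members of the state struct omitted, names approximate): the array keeps the same size and offset, only the declared element type changes, so the layout stays identical.

```c
#include <stdint.h>

/* Before (approximate): staging buffer stored as 32-bit words. */
struct state_before {
    uint32_t mem32[4];        /* 16 bytes of buffered input */
    uint32_t memsize;
};

/* After (approximate): same 16 bytes at the same offset, now byte-addressed. */
struct state_after {
    unsigned char mem32[16];  /* 16 bytes of buffered input */
    uint32_t memsize;
};
```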

easyaspi314 commented 2 years ago

Should we remove XXH_OLD_NAMES as well?

Cyan4973 commented 2 years ago

> Should we remove XXH_OLD_NAMES as well?

Let's plan that for v0.9.0

easyaspi314 commented 2 years ago

On a side note, I was toying with a mixed NEON/scalar XXH64.

On my Pixel 4a, Clang and GCC get the same 2804 MB/s normally, but with half NEON and half scalar, Clang gets 3156 MB/s and GCC gets 2925 MB/s.

Since I already have the code, I might as well make ARMv7-A do full NEON, which actually gets 2704 MB/s on Clang compared to ~1 GB/s normally.

However, the implementation is pretty ugly:

hybrid xxh64 neon:

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM) \
 || defined(_M_ARM64) || defined(_M_ARM64EC)
#  define XXH_HAS_NEON
#  if defined(__GNUC__) || defined(__clang__)
#    include <arm_neon.h>
#  else
#    include <arm64_neon.h>   /* assumed: MSVC's ARM64 NEON header */
#  endif

XXH_FORCE_INLINE uint64x2_t XXH_neon_mul64(uint32x2x2_t x, uint32x2_t y)
{
    uint64x2_t cross;
    /* grade school truncating multiply */
    cross = vmull_lane_u32(x.val[0], y, 1);
    cross = vmlal_lane_u32(cross, x.val[1], y, 0);
    cross = vshlq_n_u64(cross, 32);
    return vmlal_lane_u32(cross, x.val[0], y, 0);
}
#endif

#if defined(__aarch64__) || defined(__arm64__) || defined(_M_ARM64) || defined(_M_ARM64EC)
/* aarch64 does half NEON and half scalar */
#  define XXH64_SCALAR_ROUNDS 2
#  define XXH64_NEON_ROUNDS 1
#elif defined(XXH_HAS_NEON)
/* armv7-a uses full NEON */
#  define XXH64_SCALAR_ROUNDS 0
#  define XXH64_NEON_ROUNDS 2
#else
/* Everything else uses full scalar */
#  define XXH64_SCALAR_ROUNDS 4
#endif

/*!
 * @internal
 * @brief The core bulk processing loop for @ref XXH64().
 *
 * @param input, len Directly passed from @ref XXH64(). @p len must be >= 32.
 * @param acc   The 4 accumulator lanes from XXH64's internal state
 * @param align Whether @p input is aligned.
 * @return `&input[len - len % 32]`
 */
static xxh_u8 const*
XXH64_hashLong(xxh_u8 const* input, size_t len, xxh_u64 acc[4], XXH_alignment align)
{
    size_t nbBlocks = len / 32;
    XXH_ASSERT(nbBlocks != 0 && input != NULL && acc != NULL);
    {
        size_t i;
#ifdef XXH_HAS_NEON
        uint64x2_t accNeon[XXH64_NEON_ROUNDS];
        uint32x2_t const prime2 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_2));
        uint32x2_t const prime1 = vreinterpret_u32_u64(vdup_n_u64(XXH_PRIME64_1));
        /* Load NEON lanes */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            accNeon[i] = vld1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i]);
        }
#endif
        do {
            for (i = 0; i < XXH64_SCALAR_ROUNDS; i++) {
                acc[i] = XXH64_round(acc[i], XXH_get64bits(input));
                input += 8;
            }
#ifdef XXH_HAS_NEON
            for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
                /* interleaved load, putting input in place for mul64 */
                uint32x2x2_t pair = vld2_u32((uint32_t const *)input);
                /* input * PRIME64_2 */
                uint64x2_t tmp = XXH_neon_mul64(pair, prime2);
                uint64x2_t xacc = accNeon[i];
                /* acc += input * PRIME64_2 */
                xacc = vaddq_u64(xacc, tmp);
                /* rotl(xacc, 31) >> 32 without dependency */
                pair.val[1] = vshrn_n_u64(xacc, 64 - 31 - 32);
                /* rotl(xacc, 31) */
                tmp = vshlq_n_u64(xacc, 31);
                xacc = vsriq_n_u64(tmp, xacc, 64 - 31);
                /* xacc & 0xFFFFFFFF */
                pair.val[0] = vmovn_u64(xacc);
                /* xacc *= PRIME64_1 */
                accNeon[i] = XXH_neon_mul64(pair, prime1);
                input += 16;
            }
#endif
        } while (--nbBlocks);
#ifdef XXH_HAS_NEON
        /* Store NEON lanes back */
        for (i = 0; i < XXH64_NEON_ROUNDS; i++) {
            vst1q_u64(&acc[XXH64_SCALAR_ROUNDS + 2 * i], accNeon[i]);
        }
#endif
    }
    return input;
}
```

Side side note: Would a mixed SIMD/scalar benefit XXH3 as well? The integer pipeline is basically unused during hashLong, and we might benefit from doing a few lanes scalar.

Edit: Holy shit, it does (at least on aarch64). Doing a 6:2 split on the NEON path on clang makes it jump from 8.8 GB/s to 10.2 GB/s.
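The general shape of the idea, as a rough standalone sketch (the lane split and helper names are placeholders, not the XXH3 code): the first few lanes would go through the NEON path while the remaining lanes reuse a scalar per-lane round, keeping the integer pipeline busy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ACC_NB      8   /* XXH3 keeps 8 x 64-bit accumulator lanes              */
#define NEON_LANES  6   /* hypothetical split: 6 lanes NEON, 2 lanes scalar     */

/* Scalar per-lane round, same shape as XXH3's scalar accumulate step
 * (assumes a little-endian target for the memcpy reads). */
static void scalarRound(uint64_t* acc, const uint8_t* input,
                        const uint8_t* secret, size_t lane)
{
    uint64_t data_val, key64;
    memcpy(&data_val, input  + 8 * lane, 8);
    memcpy(&key64,    secret + 8 * lane, 8);
    {
        uint64_t const data_key = data_val ^ key64;
        acc[lane ^ 1] += data_val;   /* swap adjacent lanes */
        acc[lane]     += (uint64_t)(uint32_t)data_key * (data_key >> 32);
    }
}

/* Mixed accumulate step: NEON handles lanes [0, NEON_LANES), the rest stay scalar. */
static void accumulate_mixed(uint64_t acc[ACC_NB], const uint8_t* input,
                             const uint8_t* secret)
{
    /* ... NEON path over lanes [0, NEON_LANES) would go here ... */
    for (size_t lane = NEON_LANES; lane < ACC_NB; lane++)
        scalarRound(acc, input, secret, lane);
}
```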

Cyan4973 commented 2 years ago

For XXH64, I would rather preserve code simplicity; the very minor performance difference doesn't seem worth it.

For XXH3, on the other hand, since we already manage multiple specialized code paths, a ~+15% performance increase is definitely large enough to justify updating the aarch64 implementation. A complex bonus question though: will it be beneficial (to varying degrees) on all aarch64 chips, or beneficial for some and detrimental for others? Difficult to tell.

easyaspi314 commented 2 years ago

It only seems to affect AArch64, but XXH3 runs incredibly well with a 6:2 ratio in #632, even (mostly) fixing the lackluster performance from GCC (30% faster, but still slower than Clang lol).

XXH64 definitely isn't worth it, especially if it still can't beat XXH32.

Cyan4973 commented 2 years ago

Is this topic still active? Should we keep this issue open? I'm referring to the XXH32/XXH64 modernization effort in the title, not the later topics appearing in the thread.