Open vlovich opened 1 year ago
There are two factors: the C code vendored here is older than SMHasher's copy (the hash function hasn't changed, only some throughput optimisations), and build.rs compiles it with conservative optimisation flags. There is no FFI overhead, and any additional inlining in smhasher due to LTO is marginal.
If you have a good idea for optionally building with more specific flags and instructing the compiler to favour speed over code size, that should help take advantage of the new code. Until then, just updating the C code (while a good idea) won't be that beneficial.
Do you know if there's an idiom (with features, I assume?) to optionally compile C code for specific microarchitectures, or with more emphasis on execution speed in tight loops?
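Roughly what I have in mind is a feature-gated build.rs; the following is only a sketch, assuming the C code is built with the cc crate and inventing a hypothetical native-tuning feature name:

// build.rs (sketch): opt into more aggressive C flags behind a Cargo feature.
fn main() {
    let mut build = cc::Build::new();
    build.file("umash.c");

    // Cargo exposes enabled features to build scripts as CARGO_FEATURE_* env
    // vars, so a hypothetical `native-tuning` feature shows up like this.
    if std::env::var_os("CARGO_FEATURE_NATIVE_TUNING").is_some() {
        build
            .opt_level(3) // favour speed over code size
            .flag_if_supported("-march=native")
            .flag_if_supported("-mtune=native");
    }

    build.compile("umash");
    println!("cargo:rerun-if-changed=umash.c");
}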
Re inlining not mattering: unfortunately, while that's the case in real life for string hashes, inlining can have huge (artificial) benefits in microbenchmarks. The "latency" figures for 4-byte xxh3 are pretty suspicious: <10 cycles when the strong_avalanche finaliser chains 3 multiplications serially, on top of loading the hashed string from memory. I think your benchmark might have been optimised away, or lets the compiler and/or the CPU overlap independent hash calls, which usually isn't representative of the workload one cares about.
You can introduce a dependency between iterations by computing the hashed slice's start as a function of the previous hash value; same for the seed argument. Blackboxing the length would probably be more realistic.
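A minimal sketch of that kind of chaining, using xxhash-rust's xxh3_64_with_seed purely as an example:

// Sketch: derive each iteration's offset and seed from the previous hash so
// consecutive calls form a serial dependency chain the CPU cannot overlap.
fn dependent_hash_loop(data: &[u8], iters: u64) -> u64 {
    let len = 4; // the short-input case under discussion
    let mut prev: u64 = 0;
    for _ in 0..iters {
        let off = (prev as usize) % (data.len() - len);
        prev = xxhash_rust::xxh3::xxh3_64_with_seed(&data[off..off + len], prev);
    }
    prev // return the chained value so the whole loop stays live
}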
The latency for short umash also doesn't make sense to me… I'd suspect legacy unVEXed SIMD from the generic compiler flags, but the 4-8 byte code path doesn't even use SIMD. Are you perchance rebuilding the parameter struct every iteration?
Do you know if there's an idiom (with features, I assume?) to optionally compile C code for specific microarchitectures, or with more emphasis on execution speed in tight loops?
Maybe something like function multi-versioning / function target cloning?
I believe GCC and clang both support it (i.e. attributes in the code, no extra build flags I think).
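On the Rust side, the analogous idiom is runtime feature detection plus #[target_feature] clones of the hot routine; here is a sketch with a stand-in hash body (not umash's):

// Portable baseline; the body here is only a placeholder hash.
fn hash_block_portable(block: &[u8]) -> u64 {
    block
        .iter()
        .fold(0u64, |h, &b| h.wrapping_mul(31).wrapping_add(u64::from(b)))
}

// The same routine recompiled with AVX2 enabled, analogous to a target_clones copy.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn hash_block_avx2(block: &[u8]) -> u64 {
    hash_block_portable(block)
}

// Dispatch at runtime on detected CPU features, like the multi-versioned C would.
fn hash_block(block: &[u8]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: we just verified AVX2 support on this CPU.
            return unsafe { hash_block_avx2(block) };
        }
    }
    hash_block_portable(block)
}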
Are you perchance rebuilding the parameter struct every iteration?
No. The benchmark body looks like:
#[cfg(feature = "bench_umash")]
lazy_static! {
    static ref UMASH_PARAMS: umash::Params = umash::Params::new();
}

... snip ....

#[cfg(feature = "bench_umash")]
("umash64", |input| {
    black_box(UMASH_PARAMS.hash(input));
}),
#[cfg(feature = "bench_umash")]
("umash128", |input| {
    black_box(UMASH_PARAMS.fingerprint(input));
}),
The surrounding framework code is too complex to paste, but what it's doing is creating a random 256 MiB array & then picking out subsets of memory slices from it to make sure I'm evaluating the algorithm's performance on uncached memory as that tends to be the more typical use case (at least for my problem domain).
<10 cycles when the strong_avalanche finaliser chains 3 multiplications serially, on top of loading the hashed string from memory. I think your benchmark might have been optimised away, or lets the compiler and/or the CPU overlap independent hash calls, which usually isn't representative of the workload one cares about.
I can't speak to the CPU overlapping. It's possible, although I expect that since there's 1 hash call in the body of a criterion bench_iter, there's enough in between hash calls to prevent that? Regardless of the behaviour on 4 bytes, the 100-byte case still shows a 10x discrepancy and 64 KiB still shows > 2x. I don't think the testing methodology can fully explain away that 64 KiB number, whatever else you think about the other values.
My memory bandwidth on this system is ~100-110 GB/s.
Does umash do any kind of explicit vectorization? I think xxhash_rust has an explicit avx2 path which may explain the discrepancy.
Do you know if there's an idiom (with features, I assume?) to optionally compile C code for specific microarchitectures, or with more emphasis on execution speed in tight loops?
Maybe something like function multi-versioning / function target cloning?
The latest version of umash.c can internally dispatch on cpuid for long hashes. It's not obvious that the dispatching makes sense for smaller hashes, and AVX-SSE transitions caused by unVEXed instructions can still be an issue there.
The surrounding framework code is too complex to paste, but what it's doing is creating a random 256 MiB array & then picking out subsets of memory slices from it to make sure I'm evaluating the algorithm's performance on uncached memory as that tends to be the more typical use case (at least for my problem domain).
Ok, so each 4-byte hash call is expected to hit uncached memory (or L3, at best), right? < 2 ns latency doesn't make sense for an L3 hit or a main memory hit; that's probably a bit faster than an L2 hit on your machine. Even umash's 20 ns is too fast for a main memory hit, but could be about right for an L3 hit. I think you're not actually benchmarking what you're looking for.
I can't speak to the CPU overlapping. It's possible, although I expect that since there's 1 hash call in the body of a criterion bench_iter, there's enough in between hash calls to prevent that?
I don't believe that's enough, no: criterion will run multiple iterations (https://bheisler.github.io/criterion.rs/book/user_guide/timing_loops.html#iter) before taking a timestamp. Unless you really are interested in the aggregate throughput of short hash calls, you must introduce data dependencies between consecutive iterations (for example, https://github.com/backtrace-labs/umash/blob/32d3cc7ba1f3e6701008d97a512b57e1b60bb676/bench/runner.c#L103-L105).
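A rough sketch of what that could look like with criterion's iter_custom, again using xxhash-rust only as an example:

use std::time::Instant;
use criterion::{black_box, Criterion};

// Sketch: time a serially-dependent chain of calls, so the result reflects
// per-call latency rather than the throughput of overlapped iterations.
fn bench_xxh3_latency(c: &mut Criterion, data: &[u8]) {
    c.bench_function("xxh3_64/4-byte/latency", |b| {
        b.iter_custom(|iters| {
            let mut prev: u64 = 0;
            let start = Instant::now();
            for _ in 0..iters {
                // Offset and seed both depend on the previous hash value.
                let off = (prev as usize) % (data.len() - 4);
                prev = xxhash_rust::xxh3::xxh3_64_with_seed(&data[off..off + 4], prev);
            }
            let elapsed = start.elapsed();
            black_box(prev); // keep the chain alive, outside the timed region
            elapsed
        })
    });
}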
The black_box function is also mostly useful to hide constants from the compiler; criterion's stable fallback definitely works for that use case. While https://bheisler.github.io/criterion.rs/book/user_guide/known_limitations.html says real_blackbox can more reliably foil dead code elimination, I'm not convinced https://dev-doc.rust-lang.org/beta/src/core/hint.rs.html#112-124 does what criterion wants. The inline asm tries to force LLVM to store the value to the stack... but it's not marked volatile, so I'm pretty sure LLVM could just drop it all as dead code. The inline asm also doesn't have a memory clobber (unless that's implicit?), so LLVM can assume that, while the asm can read the black-boxed value, it won't overwrite anything in memory, including the value on the stack.
I understand ad hoc static overloads are still hard in Rust, so a "+m"(dummy) constraint and a volatile annotation are probably what the Rust core team is looking for.
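For concreteness, a sketch of that kind of stronger barrier in current asm! syntax; this only illustrates "volatile plus a memory side effect on the value's slot" and is not the actual core or criterion implementation:

use core::arch::asm;
use core::mem::MaybeUninit;

// Sketch: the empty asm block is volatile (no `pure` option) and, without
// `nomem`/`readonly`, is assumed to read and write memory through the pointer,
// so the compiler must materialise the value and cannot drop the store as dead.
pub fn opaque<T>(value: T) -> T {
    let mut slot = MaybeUninit::new(value);
    unsafe {
        asm!(
            "/* {ptr} */", // emits no instructions; the operand pins the slot
            ptr = in(reg) slot.as_mut_ptr(),
            options(nostack),
        );
        slot.assume_init()
    }
}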
Regardless of the behaviour on 4 bytes, the 100-byte case still shows a 10x discrepancy and 64 KiB still shows > 2x. I don't think the testing methodology can fully explain away that 64 KiB number, whatever else you think about the other values.
40 ns for an L3 miss sounds pretty good to me. A priori, the xxh3 loop (2.5 ns for a main memory access) sounds like measurement error. For the 64K numbers, I expect the difference is mostly because the vendored C code in this repo is older than smhasher's.
Does umash do any kind of explicit vectorization? I think xxhash_rust has an explicit avx2 path which may explain the discrepancy.
Yes, it's all explicitly vectorised for inputs > 8 bytes... but this repo deliberately avoids 256-bit AVX, because of downclocking and voltage transition hiccups on older microarchitectures. 3 years later, that's somewhat less of an issue, but I think the specialised throughput-optimised subroutines should still be opt in.
If you want, you can copy umash.c, umash.h, and umash_long.inc from https://github.com/backtrace-labs/umash into a fork of this repo, and compile with -O3, -march=native, -mtune=native. That'll give you a quick idea of how much there is to gain from the 256-bit version with VPCLMUL. https://github.com/flier/rust-fasthash/tree/master might also work for you; I see it has a solution to the target microarchitecture problem. I'll look into copying the approach if someone can confirm that's how Rust does it.
Despite what smhasher indicates, the Rust version isn't able to get anywhere near that level of performance. I benchmarked with Rust 1.72 on Linux 6.5 and a 13900 with varying key sizes:
It takes quite large data for umash to reach peak performance (I didn't test data beyond 64 KiB), and even then the performance isn't up to snuff compared to the native Rust xxhash-rust crate:
Performance for xxh3-64/128 seems to converge between 100 and 500 bytes.
According to smhasher, umash should only be ~1.4x slower than xxh3 for small keys and on par for ~250 KiB keys, which makes me think the discrepancy here comes from FFI or from having a stale version of the C code.