backtrace-labs / umash-sys

Rust FFI bindings for umash

Not competitive with xxh and other algorithms #2


vlovich commented 1 year ago

Despite what SMHasher indicates, the Rust version can't get anywhere near that level of performance. I benchmarked with Rust 1.72 on Linux 6.5 and an i9-13900, with varying key sizes:

umash64/4               time:   [18.884 ns 18.960 ns 19.038 ns]
                        thrpt:  [200.38 MiB/s 201.20 MiB/s 202.01 MiB/s]
umash64/100             time:   [41.217 ns 41.407 ns 41.603 ns]
                        thrpt:  [2.2386 GiB/s 2.2492 GiB/s 2.2596 GiB/s]
umash64/64000           time:   [1.5422 µs 1.5494 µs 1.5571 µs]
                        thrpt:  [38.280 GiB/s 38.470 GiB/s 38.649 GiB/s]
umash128/4              time:   [20.927 ns 21.042 ns 21.162 ns]
                        thrpt:  [180.26 MiB/s 181.29 MiB/s 182.29 MiB/s]
umash128/100            time:   [45.772 ns 45.915 ns 46.086 ns]
                        thrpt:  [2.0208 GiB/s 2.0284 GiB/s 2.0347 GiB/s]
umash128/64000          time:   [2.6298 µs 2.6457 µs 2.6623 µs]
                        thrpt:  [22.389 GiB/s 22.529 GiB/s 22.665 GiB/s]

It takes quite large inputs for umash to reach peak performance (I didn't test beyond 64 kB), and even then it doesn't measure up to the native-Rust xxhash-rust crate:

xxh3-64/4               time:   [1.3894 ns 1.3919 ns 1.3945 ns]
                        thrpt:  [2.6713 GiB/s 2.6765 GiB/s 2.6813 GiB/s]
xxh3-64/100             time:   [3.2518 ns 3.2560 ns 3.2612 ns]
                        thrpt:  [28.558 GiB/s 28.603 GiB/s 28.640 GiB/s]
xxh3-64/64000           time:   [1.0922 µs 1.1007 µs 1.1094 µs]
                        thrpt:  [53.727 GiB/s 54.150 GiB/s 54.573 GiB/s]
xxh3-128/4              time:   [1.7857 ns 1.7994 ns 1.8146 ns]
                        thrpt:  [2.0530 GiB/s 2.0703 GiB/s 2.0862 GiB/s]
xxh3-128/100            time:   [4.8149 ns 4.8217 ns 4.8304 ns]
                        thrpt:  [19.280 GiB/s 19.315 GiB/s 19.343 GiB/s]
xxh3-128/64000          time:   [1.1099 µs 1.1181 µs 1.1265 µs]
                        thrpt:  [52.912 GiB/s 53.309 GiB/s 53.703 GiB/s]

Performance for xxh3-64/128 seems to converge between 100 and 500 bytes.

According to SMHasher, umash should only be ~1.4x slower than xxh3 for small keys and on par for ~250 KiB keys, which makes me think the discrepancy here comes from FFI overhead or from a stale version of the C code.

pkhuong commented 1 year ago

There are two factors: the C code is older than SMHasher's copy (the hash function hasn't changed, only some throughput optimisations), and build.rs compiles it with conservative optimisation flags. There is no FFI overhead; any additional inlining in SMHasher due to LTO is marginal.

If you have a good idea for optionally building with more specific flags and instructing the compiler to favour speed over code size, that should help take advantage of the new code. Until then, just updating the C code (while a good idea) won't be that beneficial.

Do you know if there's an idiom (with features, I assume?) to optionally compile C code for specific microarchitectures, or with more emphasis on execution speed in tight loops?
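For concreteness, here is a minimal sketch of one shape that opt-in could take, assuming build.rs uses the cc crate and gating the aggressive flags behind a hypothetical `native` Cargo feature (both assumptions, not this repo's actual setup):

```rust
// build.rs — hypothetical sketch: default builds stay conservative and
// portable; aggressive flags require an explicit opt-in feature.
fn main() {
    let mut build = cc::Build::new();
    build.file("umash.c");
    build.opt_level(3); // favour speed over code size (-O3)

    // Cargo exposes each enabled feature as a CARGO_FEATURE_<NAME> env var.
    if std::env::var_os("CARGO_FEATURE_NATIVE").is_some() {
        // Tune for the build machine; the resulting binary isn't portable.
        build.flag_if_supported("-march=native");
        build.flag_if_supported("-mtune=native");
    }

    build.compile("umash");
    println!("cargo:rerun-if-changed=umash.c");
}
```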

pkhuong commented 1 year ago

Re inlining not mattering: unfortunately, while that's the case in real life for string hashes, inlining can have huge (artificial) benefits in microbenchmarks. The "latency" figures for 4-byte xxh3 are pretty suspicious: <10 cycles, when the strong_avalanche finaliser chains 3 multiplications serially, on top of loading the hashed string from memory. I think your benchmark might have been optimised away, or lets the compiler and/or the CPU overlap independent hash calls, which usually isn't representative of the workload one cares about.

You can introduce a dependency between iterations by computing the hashed slice's start as a function of the previous hash value; same for the seed argument. Black-boxing the length would probably be more realistic.
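A minimal sketch of that idea, assuming the `umash::Params::hash` wrapper API that appears later in this thread; the offset derivation is just one way to serialise iterations:

```rust
use std::hint::black_box;

// Hypothetical latency loop: each iteration's input offset depends on the
// previous hash value, so the CPU can't overlap consecutive hash calls.
fn hash_latency(params: &umash::Params, data: &[u8], len: usize, iters: u64) -> u64 {
    let mut h: u64 = 0;
    for _ in 0..iters {
        let start = (h as usize) % (data.len() - len);
        // Black-boxing the length keeps the compiler from specialising
        // the size-dependent code paths for one constant length.
        h = params.hash(&data[start..start + black_box(len)]);
    }
    h // return the chained value so the whole loop can't be eliminated
}
```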

pkhuong commented 1 year ago

The latency for short umash also doesn't make sense to me… I'd suspect legacy unVEXed SIMD from the generic compiler flags, but the 4–8 byte code path doesn't even use SIMD. Are you by any chance rebuilding the parameter struct every iteration?

vlovich commented 1 year ago

> Do you know if there's an idiom (with features, I assume?) to optionally compile C code for specific microarchitectures, or with more emphasis on execution speed in tight loops?

Maybe something like function multi-versioning / function target cloning?

https://stackoverflow.com/questions/71000786/how-to-tell-gccs-target-clones-to-compile-for-all-simd-levels

I believe GCC and Clang both support it (i.e., attributes in the code; no extra build flags, I think).

> Are you by any chance rebuilding the parameter struct every iteration?

No. The benchmark body looks like:

#[cfg(feature = "bench_umash")]
lazy_static! {
    static ref UMASH_PARAMS: umash::Params = umash::Params::new();
}
... snip ....
    #[cfg(feature = "bench_umash")]
    ("umash64", |input| {
        black_box(UMASH_PARAMS.hash(input));
    }),
    #[cfg(feature = "bench_umash")]
    ("umash128", |input| {
        black_box(UMASH_PARAMS.fingerprint(input));
    }),

The surrounding framework code is too complex to paste, but it creates a random 256 MiB array and then picks slices out of it, to make sure I'm evaluating the algorithm's performance on uncached memory, as that tends to be the more typical use case (at least for my problem domain).
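Roughly, the setup amounts to something like this sketch (a paraphrase of the description above, not the actual framework code; the `rand` usage is illustrative):

```rust
use rand::Rng;

// Fill one large random buffer, then draw each benchmark input from a
// fresh random offset so reads mostly miss the caches.
fn make_inputs(len: usize, count: usize) -> (Vec<u8>, Vec<usize>) {
    let mut rng = rand::thread_rng();
    let data: Vec<u8> = (0..(256 << 20)).map(|_| rng.gen()).collect(); // 256 MiB
    let offsets = (0..count)
        .map(|_| rng.gen_range(0..data.len() - len))
        .collect();
    (data, offsets)
}
```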

> <10 cycles, when the strong_avalanche finaliser chains 3 multiplications serially, on top of loading the hashed string from memory. I think your benchmark might have been optimised away, or lets the compiler and/or the CPU overlap independent hash calls, which usually isn't representative of the workload one cares about.

I can't speak to the CPU overlapping. It's possible, although I expect that since there's one hash call in the body of a criterion bench_iter, there's enough in between hash calls to prevent that? Regardless of the behaviour on 4 bytes, the 100-byte case still shows a 10x discrepancy, and 64 kB still shows >2x. I don't think the testing methodology can fully explain away that 64 kB number, whatever else you think about the other values.

My memory bandwidth on this system is ~100-110 GB/s.

Does umash do any kind of explicit vectorization? I think xxhash_rust has an explicit AVX2 path, which may explain the discrepancy.

pkhuong commented 1 year ago

> > Do you know if there's an idiom (with features, I assume?) to optionally compile C code for specific microarchitectures, or with more emphasis on execution speed in tight loops?
>
> Maybe something like function multi-versioning / function target cloning?

The latest version of umash.c can internally dispatch on cpuid for long hashes. It's not obvious that the dispatching makes sense for smaller hashes, and AVX-SSE transitions caused by unVEXed instructions can still be an issue there.

> The surrounding framework code is too complex to paste, but it creates a random 256 MiB array and then picks slices out of it, to make sure I'm evaluating the algorithm's performance on uncached memory, as that tends to be the more typical use case (at least for my problem domain).

OK, so each 4-byte hash call is expected to hit uncached memory (or L3, at best), right? <2 ns latency doesn't make sense for an L3 or a main-memory hit; that's probably a bit faster than an L2 hit on your machine. Even umash's 20 ns is too fast for a main-memory hit, but could be about right for an L3 hit. I think you're not actually benchmarking what you're looking for.

> I can't speak to the CPU overlapping. It's possible, although I expect that since there's one hash call in the body of a criterion bench_iter, there's enough in between hash calls to prevent that?

I don't believe that's enough, no: criterion runs multiple iterations (https://bheisler.github.io/criterion.rs/book/user_guide/timing_loops.html#iter) before taking a timestamp. Unless you really are interested in the aggregate throughput of short hash calls, you must introduce data dependencies between consecutive iterations (for example, https://github.com/backtrace-labs/umash/blob/32d3cc7ba1f3e6701008d97a512b57e1b60bb676/bench/runner.c#L103-L105).

The black_box function is also mostly useful for hiding constants from the compiler; criterion's stable fallback definitely works for that use case. While https://bheisler.github.io/criterion.rs/book/user_guide/known_limitations.html says real_blackbox can more reliably foil dead-code elimination, I'm not convinced https://dev-doc.rust-lang.org/beta/src/core/hint.rs.html#112-124 does what criterion wants. The inline asm tries to force LLVM to store the value to the stack... but it's not marked volatile, so I'm pretty sure LLVM could just drop it all as dead code. The inline asm also doesn't have a memory clobber (unless that's implicit?), so LLVM can assume that, while the asm can read the black-boxed value, it won't overwrite anything in memory, including the value on the stack.

I understand ad hoc static overloads are still hard in Rust, so a "+m"(dummy) constraint and a volatile annotation are probably what the Rust core team is looking for.
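In current asm! terms, the stronger black_box being described might look something like this sketch (my translation of the "+m"-plus-volatile idea, not actual core or criterion code):

```rust
use std::arch::asm;

/// Hypothetical pessimistic black_box: hand the value's address to an asm
/// block. Without `options(pure)` the block counts as volatile, and without
/// `options(nomem)` LLVM must assume it can both read and write memory, so
/// the value is really materialised and can't be assumed unchanged after.
#[inline]
fn opaque<T>(value: &mut T) {
    unsafe {
        // The template only mentions the pointer inside an assembler
        // comment; the conservative default options do the real work.
        asm!("/* {0} */", in(reg) value as *mut T);
    }
}
```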

> Regardless of the behaviour on 4 bytes, the 100-byte case still shows a 10x discrepancy, and 64 kB still shows >2x. I don't think the testing methodology can fully explain away that 64 kB number, whatever else you think about the other values.

40 ns for an L3 miss sounds pretty good to me. A priori, the xxh3 loop (2.5 ns for a main-memory access) sounds like measurement error. For the 64K numbers, I expect the difference is mostly because the vendored C code in this repo is older than SMHasher's.

> Does umash do any kind of explicit vectorization? I think xxhash_rust has an explicit AVX2 path, which may explain the discrepancy.

Yes, it's all explicitly vectorised for inputs > 8 bytes... but this repo deliberately avoids 256-bit AVX, because of downclocking and voltage-transition hiccups on older microarchitectures. Three years later, that's somewhat less of an issue, but I think the specialised throughput-optimised subroutines should still be opt-in.

If you want, you can copy umash.c, umash.h, and umash_long.inc from https://github.com/backtrace-labs/umash into a fork of this repo, and compile with -O3, -march=native, -mtune=native. That'll give you a quick idea of how much there is to gain from the 256-bit version with VPCLMUL. https://github.com/flier/rust-fasthash/tree/master might also work for you; I see it has a solution to the target-microarchitecture problem. I'll look into copying the approach if someone can confirm that's how Rust does it.