Avoid stack allocation in xxhash64

apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator

https://datafusion.apache.org/comet

Apache License 2.0

819 stars 163 forks source link

Avoid stack allocation in xxhash64 #547

Closed andygrove closed 4 months ago

andygrove commented 5 months ago

What is the problem the feature request solves?

Here is the current xxhash64 implementation:

pub(crate) fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: u64) -> u64 {
    // TODO: Rewrite with a stateless hasher to reduce stack allocation?
    let mut hasher = XxHash64::with_seed(seed);
    hasher.write(data.as_ref());
    hasher.finish()
}

As the TODO comment states, it may be better to implement without stack allocation.

Here is Spark's version for reference:

  public static long hashInt(int input, long seed) {
    long hash = seed + PRIME64_5 + 4L;
    hash ^= (input & 0xFFFFFFFFL) * PRIME64_1;
    hash = Long.rotateLeft(hash, 23) * PRIME64_2 + PRIME64_3;
    return fmix(hash);
  }

Describe the potential solution

No response

Additional context

No response

parthchandra commented 5 months ago

Are we expecting a performance improvement as a result? (I doubt stack allocations are a performance bottleneck)

advancedxy commented 5 months ago

Linking this to #517 too.

Are we expecting a performance improvement as a result? (I doubt stack allocations are a performance bottleneck)

+1. I think we may need to figure out the root cause first. It might that the implementation of XxHash64 introduces some performance bottleneck, rather the stack allocation. I can try to reproduce and check this issue this weekend if I have some spare time.

andygrove commented 5 months ago

Thanks @advancedxy. Yes, it definitely needs more investigation.

andygrove commented 5 months ago

Here are benchmark results comparing murmur3 to xxhash64. xxhash64 is 3x slower (not sure if that is the expectation) but what is more interesting is that there is a warning during warmup (Unable to complete 100 samples in 5.0s).

     Running benches/hash.rs (target/release/deps/hash-3beecac83c235dce)
Gnuplot not found, using plotters backend
hash/murmur3/8192       time:   [342.78 µs 345.32 µs 349.49 µs]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking hash/xxhash64/8192: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.0s, enable flat sampling, or reduce sample count to 70.
hash/xxhash64/8192      time:   [992.21 µs 994.27 µs 996.88 µs]
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

andygrove commented 5 months ago

@advancedxy I am working on re-implementing this now in a simpler way as an experiement ... I will aim to have a PR up by end of day

parthchandra commented 5 months ago

xxhash64 is 3x slower (not sure if that is the expectation)

xxhash64 is by no means slower than murmur3. Here's a sample benchmark: https://lz4.github.io/lz4-java/1.3.0/xxhash-benchmark/

advancedxy commented 5 months ago

I am working on re-implementing this now in a simpler way as an experiement ... I will aim to have a PR up by end of day

Thanks for your effort.

Here are benchmark results comparing murmur3 to xxhash64. xxhash64 is 3x slower (not sure if that is the expectation)

I don't think this is expected as @parthchandra already pointed out xxhash64 should be much faster than murmur3. Even the TPC-H q8 is not related to XXHash64, we should still investigate it further and improve its performance. I can help with that, though it's not urgent then.