DoumanAsh / xxhash-rust

Rust implementation of xxhash
Boost Software License 1.0

Initialization is too costly #42

Closed arthurprs closed 1 month ago

arthurprs commented 1 month ago

XXH3 is great as it's widely deployed and portable, but the streaming implementation suffers from a high initialization cost, which hurts in some use cases.

Consider the example below, which hashes a u64. A significant amount of time is spent zeroing out the buffer and, above all, memcpy'ing the default secret.

use std::hash::{Hash, Hasher};

fn hash(i: u64) -> u64 {
    let mut hasher = xxhash_rust::xxh3::Xxh3::new();
    i.hash(&mut hasher);
    hasher.finish()
}

xorps xmm0, xmm0
movups xmmword ptr [rsp + 232], xmm0 <<< zeroing the buffer
movups xmmword ptr [rsp + 216], xmm0
movups xmmword ptr [rsp + 200], xmm0
movups xmmword ptr [rsp + 184], xmm0
movups xmmword ptr [rsp + 168], xmm0
movups xmmword ptr [rsp + 152], xmm0
movups xmmword ptr [rsp + 136], xmm0
movups xmmword ptr [rsp + 120], xmm0
movups xmmword ptr [rsp + 104], xmm0
movups xmmword ptr [rsp + 88], xmm0
movups xmmword ptr [rsp + 72], xmm0
movups xmmword ptr [rsp + 56], xmm0
movups xmmword ptr [rsp + 40], xmm0
movups xmmword ptr [rsp + 24], xmm0
movups xmmword ptr [rsp + 8], xmm0
mov qword ptr [rsp + 248], 0
movaps xmm0, xmmword ptr [rip + .L__unnamed_2]
movaps xmmword ptr [rsp + 256], xmm0
movaps xmm0, xmmword ptr [rip + .L__unnamed_2+16]
movaps xmmword ptr [rsp + 272], xmm0
movaps xmm0, xmmword ptr [rip + .L__unnamed_2+32]
movaps xmmword ptr [rsp + 288], xmm0
movaps xmm0, xmmword ptr [rip + .L__unnamed_2+48]
movaps xmmword ptr [rsp + 304], xmm0
lea rdi, [rsp + 320]
lea rsi, [rip + .L__unnamed_3]
mov edx, 192
call qword ptr [rip + memcpy@GOTPCREL] <<< memcopying the secret
xorps xmm0, xmm0
movaps xmmword ptr [rsp + 512], xmm0
movups xmmword ptr [rsp + 522], xmm0
mov qword ptr [rsp + 520], 8
mov qword ptr [rsp], 0
mov word ptr [rsp + 536], 8
mov rdi, rsp
call qword ptr [rip + xxhash_rust::xxh3::Xxh3::digest@GOTPCREL]
add rsp, 552
.cfi_def_cfa_offset 8
ret

While the buffer zeroing can be mitigated with MaybeUninit, it's surprisingly hard to convince the compiler not to use a memcpy call for those 192 bytes.

I tried to sketch a solution but didn't find it nice enough.

pub type Xxh3 = Xxh3Core<Aligned64<[u8; DEFAULT_SECRET_SIZE]>>;

// Can be static for the default seed + secret
pub type Xxh3Ref<'a> = Xxh3Core<&'a Aligned64<[u8; DEFAULT_SECRET_SIZE]>>;

#[derive(Clone)]
///XXH3 Streaming algorithm
///
///Internal state uses rather large buffers, therefore it might be beneficial
///to store hasher on heap rather than stack.
///Implementation makes no attempts at that, leaving choice entirely to user.
pub struct Xxh3Core<T> {
    acc: Acc,
    custom_secret: T,
    buffer: Aligned64<[MaybeUninit<u8>; INTERNAL_BUFFER_SIZE]>,
    buffered_size: u16,
    nb_stripes_acc: usize,
    total_len: u64,
    seed: u64,
}

pub unsafe trait Xxh3Secret {
    fn array(&self) -> &[u8; DEFAULT_SECRET_SIZE];

    #[inline]
    fn slice(&self) -> &[u8] {
        &self.array()[..]
    }

    #[inline]
    fn len(&self) -> usize {
        self.array().len()
    }
}
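
For illustration, here is a hypothetical, stripped-down version of the same idea (names and field layout are placeholders, not the crate's actual internals): by making the secret type generic, the default-configured hasher can borrow one static secret instead of copying 192 bytes into every instance.

const DEFAULT_SECRET_SIZE: usize = 192;

#[repr(align(64))]
pub struct Aligned64<T>(pub T);

// In the real crate these would be the canonical XXH3 secret bytes;
// zeroes keep the sketch short.
static DEFAULT_SECRET: Aligned64<[u8; DEFAULT_SECRET_SIZE]> =
    Aligned64([0u8; DEFAULT_SECRET_SIZE]);

pub struct Xxh3Core<T> {
    custom_secret: T,
    // ...rest of the streaming state elided
}

// Owns its secret: construction copies the 192 bytes.
pub type Xxh3Owned = Xxh3Core<Aligned64<[u8; DEFAULT_SECRET_SIZE]>>;

// Borrows a static secret: construction stores only a pointer.
pub type Xxh3Ref = Xxh3Core<&'static Aligned64<[u8; DEFAULT_SECRET_SIZE]>>;

pub fn default_hasher() -> Xxh3Ref {
    Xxh3Core { custom_secret: &DEFAULT_SECRET }
}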

DoumanAsh commented 1 month ago

This is generally acceptable because you're supposed to re-use the state by resetting it, which performs only the minimal necessary operations.

I can remove the buffer zeroing though, as it is there mostly for the sake of ergonomics, given how ugly it is to work with uninit memory in Rust.

The secret though is a bit more problematic to fix without introducing a breaking change, although I wish the compiler would elide the secret copy.

DoumanAsh commented 1 month ago

I will keep the changes to the state in this PR: https://github.com/DoumanAsh/xxhash-rust/pull/43

For the secret I need to consider if there is some way to ensure copy elision, because it is just too dumb that the compiler cannot elide the secret copy.

DoumanAsh commented 1 month ago

Btw, if you only need to hash the full input, you should always use the one-shot functions: https://docs.rs/xxhash-rust/0.8.11/xxhash_rust/xxh3/fn.xxh3_64.html
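
For reference, a minimal usage sketch of the one-shot path (no streaming state, so there is no buffer or secret to set up per call):

use xxhash_rust::xxh3::xxh3_64;

fn main() {
    // Hashes the whole input in one call; nothing is buffered or copied.
    let digest = xxh3_64(b"hello world");
    println!("{:x}", digest);
}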

arthurprs commented 1 month ago

This is generally acceptable because you're supposed to re-use the state by resetting it, which performs only the minimal necessary operations.

Right, but it becomes impractical if you need to do it in a codepath that's accessed from multiple threads. That said, I haven't tried storing it in a thread local, but I suppose the call to __tls_get_addr will erase most of the improvements.
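
As a rough sketch of that thread-local idea (assuming the hasher's reset method, which restores the initial state without rebuilding the buffer or secret), something like this could amortize the construction cost per thread:

use std::cell::RefCell;
use std::hash::{Hash, Hasher};

use xxhash_rust::xxh3::Xxh3;

thread_local! {
    // Construction (buffer + secret setup) is paid once per thread.
    static XXH3: RefCell<Xxh3> = RefCell::new(Xxh3::new());
}

fn hash<T: Hash>(value: &T) -> u64 {
    XXH3.with(|cell| {
        let mut hasher = cell.borrow_mut();
        hasher.reset(); // back to the initial state
        value.hash(&mut *hasher);
        hasher.finish()
    })
}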

For the secret I need to consider if there is some way to ensure copy elision, because it is just too dumb that the compiler cannot elide the secret copy.

Indeed, at a minimum I was expecting the compiler to transform the memcpy into a SIMD copy.

Btw, if you only need to hash the full input, you should always use the one-shot functions: docs.rs/xxhash-rust/0.8.11/xxhash_rust/xxh3/fn.xxh3_64.html

More often than not, the APIs are T: Hash based, though.

DoumanAsh commented 1 month ago

More often than not, the APIs are T: Hash based, though.

Unfortunately, std's Hash trait is fundamentally broken for efficient use (due to how Hasher is designed rather than Hash itself): it is designed to be used within HashMap, so it expects the hasher to have a small state. I would not design efficient code around it.
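
To make the HashMap point concrete, here is a sketch of plugging the streaming hasher into a std HashMap (assuming the crate's Xxh3Builder, which implements BuildHasher): every hash operation builds a fresh Xxh3, which is exactly where the construction cost shows up.

use std::collections::HashMap;

use xxhash_rust::xxh3::Xxh3Builder;

fn main() {
    // Each insert/lookup constructs a new Xxh3 through the builder.
    let mut map: HashMap<&str, u32, Xxh3Builder> =
        HashMap::with_hasher(Xxh3Builder::new());
    map.insert("key", 1);
    assert_eq!(map.get("key"), Some(&1));
}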

Nevertheless, changing the secret type would require a breaking change to the API itself, unless I just add a new stateful hasher which takes the secret by reference.

arthurprs commented 1 month ago

.. unless I just add a new stateful hasher which takes the secret by reference

That was my thinking with the snippet in the first message. ~99% of use cases will use the default secret and seed, which can refer to the static secret constant. The rest is likely to change only the seed and keep the secret.

arthurprs commented 1 month ago

Btw, #43 simplifies the u64 example to

push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 640
movaps xmm0, xmmword ptr [rip + .L__unnamed_2+48]
movaps xmmword ptr [rsp + 304], xmm0
movaps xmm0, xmmword ptr [rip + .L__unnamed_2+32]
movaps xmmword ptr [rsp + 288], xmm0
movaps xmm0, xmmword ptr [rip + .L__unnamed_2+16]
movaps xmmword ptr [rsp + 272], xmm0
movaps xmm0, xmmword ptr [rip + .L__unnamed_2]
movaps xmmword ptr [rsp + 256], xmm0
lea rdi, [rsp + 320]
lea rsi, [rip + .L__unnamed_3]
mov edx, 192
call qword ptr [rip + memcpy@GOTPCREL]
xorps xmm0, xmm0
movaps xmmword ptr [rsp + 512], xmm0
movups xmmword ptr [rsp + 522], xmm0
mov qword ptr [rsp + 520], 8
mov qword ptr [rsp], 0
mov word ptr [rsp + 536], 8
mov rdi, rsp
call qword ptr [rip + xxhash_rust::xxh3::Xxh3::digest@GOTPCREL]
mov rsp, rbp
pop rbp
ret

With this reduction in size, we can consider inlining digest. I'll follow up with a PR for that.

DoumanAsh commented 1 month ago

I added a new stateful hasher that uses the defaults.

The secret is moved to a const, but we need to check the code generation: https://github.com/DoumanAsh/xxhash-rust/pull/43/commits/be6dc27971c5b09834ae4ef6ebd368eeff12bb68

I'm not sure what is best to use here, const or static, to avoid unnecessary copies.
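
For context on that question, a small standalone illustration (placeholder bytes, not the real secret): a const is inlined at each use site, so taking a reference may materialize a separate copy of the data, while a static is a single allocation with one address.

const SECRET_CONST: [u8; 192] = [0u8; 192];
static SECRET_STATIC: [u8; 192] = [0u8; 192];

fn from_const() -> &'static [u8; 192] {
    // The const value is promoted to an anonymous static here, but every
    // distinct use site can end up with its own copy of the bytes.
    &SECRET_CONST
}

fn from_static() -> &'static [u8; 192] {
    // Always refers to the same single allocation.
    &SECRET_STATIC
}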

arthurprs commented 1 month ago

Performance-wise, I think it'd be roughly the same. In this case, code simplicity should take precedence.

DoumanAsh commented 1 month ago

I think I got it into decent shape after extracting common functions between Xxh3 and Xxh3Default. This shit is ugly as fuck, but it should at least simplify maintenance for me.

I will make a new release soon, after running some performance tests.

DoumanAsh commented 1 month ago

Well, this is certainly a change (although considering how much it jumps around, I suppose I cannot say it is the most reliable comparison):

u64

xxh3_64 Rust            time:   [319.54 ns 322.11 ns 324.86 ns]
                        change: [+4.5583% +5.2329% +5.9772%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

xxh3_64 Rust Stateful   time:   [427.21 ns 428.09 ns 429.03 ns]
                        change: [+6.0619% +7.3238% +8.1032%] (p = 0.00 < 0.05)
                        Performance has regressed.

xxh3_64 Rust Default Stateful
                        time:   [390.05 ns 390.64 ns 391.27 ns]
                        change: [+2.6818% +3.8782% +4.6410%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

u128

xxh3_128 Rust           time:   [361.54 ns 362.67 ns 364.07 ns]
                        change: [+3.9054% +5.0466% +5.8583%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

xxh3_128 Rust Stateful  time:   [449.88 ns 451.16 ns 452.60 ns]
                        change: [+4.0180% +5.3911% +6.6149%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

xxh3_128 Rust Default Stateful
                        time:   [426.30 ns 429.42 ns 433.03 ns]
                        change: [+2.8970% +3.9519% +4.8923%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)

DoumanAsh commented 1 month ago

I released 0.8.12 with this new hasher. Feel free to use it. Thanks for helping me out.
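
A minimal usage sketch of the new default-state hasher (assuming the Xxh3Default name and constructor from 0.8.12, as benchmarked below):

use std::hash::{Hash, Hasher};

use xxhash_rust::xxh3::Xxh3Default;

fn hash_u64(i: u64) -> u64 {
    // The default-state hasher does not carry a per-instance copy of the
    // secret, so constructing it on the fly is cheap.
    let mut hasher = Xxh3Default::new();
    i.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    println!("{:x}", hash_u64(42));
}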

arthurprs commented 1 month ago

FYI, I also benchmarked the time to hash a u64 using Xxh3::new + finalize (times are for 10 iterations).

before
test bench ... bench:         121.76 ns/iter (+/- 7.05) (Xxh3)
after
test bench     ... bench:          81.91 ns/iter (+/- 2.80) (Xxh3)
test bench_def ... bench:          45.38 ns/iter (+/- 1.29) (Xxh3Default)

The default one is a lot faster (~2.7x), but the standard one is also better (~1.5x). String hashes won't see gains as impressive, but regardless, this illustrates the improvement.