Closed: stepancheg closed this issue 2 years ago

Would be helpful to mention in the readme how this is different from twox-hash. Is it faster? Is it easier to use? Or just an alternative implementation?
I didn't run proper benchmarks, only some for my own reference. twox-hash uses an older implementation as its basis, so it lacks a proper xxh3 implementation.
In addition to that, it doesn't provide a one-shot version of the hashing algorithm, which tends to be several times faster than the streaming one.
In general, xxhash-rust closely matches the original C implementation in terms of performance, although there can be some minor performance degradations which I didn't really investigate.
So, to answer the concrete questions:
> Is it faster?
Generally speaking, yes (albeit with the streaming variant of xxh64 you only gain a marginal improvement).
> Is it easier to use?
Hashing functions are pretty trivial to use in general, so it is pretty much the same: you have a single function for the one-shot variant and structs that implement the Hasher trait.
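For illustration, here is a minimal sketch of both styles with xxhash-rust (assuming the crate's xxh3 feature is enabled):

```rust
use std::hash::Hasher;

use xxhash_rust::xxh3::{xxh3_64, Xxh3};

fn main() {
    let data = b"hello world";

    // One-shot: a single function call over the whole input.
    let oneshot = xxh3_64(data);

    // Streaming: feed the input in chunks through the Hasher impl.
    let mut hasher = Xxh3::new();
    hasher.write(&data[..5]);
    hasher.write(&data[5..]);
    let streamed = hasher.finish();

    // Both paths produce the same digest for the same input.
    assert_eq!(oneshot, streamed);
    println!("xxh3_64 = {oneshot:016x}");
}
```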
> Or just an alternative implementation?
It is just an implementation of the xxhash algorithm, but it is not simply an alternative, because it also provides const fn variants of all algorithms.
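For example, a digest can be computed at compile time with the const variant (a sketch assuming the const_xxh3 feature is enabled; the tag name is made up for illustration):

```rust
use xxhash_rust::const_xxh3::xxh3_64;

// Evaluated at compile time thanks to the const fn variant.
const FORMAT_TAG: u64 = xxh3_64(b"my-file-format-v1");

fn main() {
    println!("compile-time xxh3_64 tag = {FORMAT_TAG:016x}");
}
```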
> it doesn't provide one-shot version of hashing algorithm
Looks like it does:
Ah, my bad, for some reason I thought it didn't. Well, then there is that.
I'm interested in comparison benchmarks between the two crates.
@firasuke The only algorithm that is worth benchmarking is xxh3, whose latest version is not available in twox-hash, so there is really not much to benchmark (twox-hash uses a WIP version of xxh3). Sorry, I kinda missed your message.
P.S. Just for reference, I strive to achieve as much parity with the C implementation as possible (which at the current moment is very close to perfect).
@DoumanAsh Thanks for the reply, and no worries!
I already switched to xxhash-rust as it felt more minimal and had XXH3_128bits working like a charm.
> The only algorithm that is worth benchmarking is xxh3
I'm not super familiar with xxh3 (or xxhash in general). I want a good hashing function for data integrity - potentially to replace CRC32. Why xxh3?
In the first place, CRC32 is not intended as a general-purpose hashing function, so I can only suggest using xxhash if you need a non-cryptographic hashing function.
xxh3 should be your choice due to its improved performance compared to the older hashing functions in the xxh family. You might consider xxh32 on platforms where you lack 64-bit (or even 32-bit) integers.
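As a point of reference, the one-shot xxh32 call is a single function taking the input and a 32-bit seed (a sketch assuming the xxh32 feature is enabled; the seed value here is arbitrary):

```rust
use xxhash_rust::xxh32::xxh32;

fn main() {
    // One-shot 32-bit digest of the whole input, with seed 0.
    let digest = xxh32(b"some payload", 0);
    println!("xxh32 = {digest:08x}");
}
```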
Yeah; I'm just throwing a checksum at the end of a new binary file format (for CRDTs) for the sake of detecting silent data corruption. CRC32c is probably fine, but if xxhash would be a better choice, then now's the time to make that decision. xxh3 looks good on paper, but I'm also targeting wasm, where code size matters. And xxh32 adds much less code size than xxh3. A 32-bit checksum is plenty for what I'm using it for.
But if xxh3 is "the future", then I'm worried that xxh32 might stay niche and not be widely implemented going forward? Stability & portability are very important. O_o I have no idea what the best call is here.
xxh32/xxh64 are good for code size, I suppose, but if you want something akin to a checksum you need to reduce the potential for collisions. Only xxh3's 128-bit variant allows that (128 bits is wide enough for collisions to be practically non-existent in the case of xxh3, while the 64-bit variant has some small chance of collision). I do not think the xxh3 code is all that big, honestly, not to mention that I specifically designed the library so you can enable/disable specific algorithms.
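For reference, a sketch of pulling in only the algorithm you need through Cargo features (feature names as exposed by xxhash-rust; double-check against the crate docs for your version):

```toml
[dependencies]
# Enable only the xxh3 algorithm; leave xxh32/xxh64 out to keep code size down.
xxhash-rust = { version = "0.8", features = ["xxh3"] }
```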
Considering your purpose is to use the hash as a checksum, I believe xxh3's 128-bit variant is the best option. A hash is not inherently designed to be used as a checksum, but as long as collisions are practically non-existent it should work for your case. See the wiki of the original xxHash implementation: https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison#testing-128-bit-hashes-=
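A minimal sketch of what that could look like, appending the 128-bit digest as a 16-byte trailer (assuming the xxh3 feature; the framing itself is made up for illustration):

```rust
use xxhash_rust::xxh3::xxh3_128;

// Append the 128-bit digest (as 16 little-endian bytes) to the payload.
fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&xxh3_128(payload).to_le_bytes());
    out
}

// Verify and strip the trailing checksum, returning the payload on success.
fn unseal(framed: &[u8]) -> Option<&[u8]> {
    if framed.len() < 16 {
        return None;
    }
    let (payload, tag) = framed.split_at(framed.len() - 16);
    let expected = xxh3_128(payload).to_le_bytes();
    (tag == expected.as_slice()).then_some(payload)
}

fn main() {
    let framed = seal(b"patch bytes");
    assert_eq!(unseal(&framed), Some(&b"patch bytes"[..]));
}
```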
P.S. A 32-bit hash is not enough for your purpose; do not look at code size, WASM is not optimized for that anyway.
P.P.S. If you worry about code size, always enable optimizations for code size:
```toml
[profile.release]
opt-level = "z"     # Optimize for size.
lto = true          # Enable Link Time Optimization.
codegen-units = 1   # Reduce number of codegen units to increase optimizations.
panic = "abort"     # Abort on panic.
strip = true        # Automatically strip symbols from the binary.
```
Thanks! I've spent a lot of time tweaking compilation flags. My CRDT library is ~250kb, and I'm keeping a careful eye on it because I want it to get smaller, not bigger.
In theory TCP / filesystems should keep data consistent anyway. The checksum is really mostly there out of paranoia. Years ago I read that checksums fail in practice (even in situations they shouldn't) much more often than you expect. So I'm a bit curious to see if / when my checksums start failing.
CRDT patches get tiny sometimes since we send one patch per keystroke. A 128-bit hash would make sense for a 1 MB file or something, but when I'm sending 20 bytes of patch content, adding a 16-byte 128-bit xxh3 hash seems excessive.
I'll stick with crc32c for now. Thanks for humoring me! I really appreciate it!
Ah, I understand; in this case, yeah, the best option is to choose between xxh32 and crc32. Unfortunately I'm not really familiar with their collision rates; both should have a problem like that, due to 32 bits not being enough for a perfect hash, but which is worse... that's unclear to me. So yeah, sticking with crc32 is probably fine since it was designed for checksums rather than general-purpose hashing.