Closed donovan-h-parks closed 11 months ago
Hi Donovan,
Thanks for this bug report. It looks like you're correct. I've tested out the change briefly and can concur that the new results don't affect results that much. Much thanks for hunting this down and testing it.
I think I will not push the bug fix until skani v0.3.0 for two reasons. Firstly, it does give slightly different results. Secondly, old sketches generated by skani sketch
will fail silently after this bug fix, which is very bad, so I'll need to do a bit of engineering.
Because this breaks backward compatibility, I won't push this to the main branch immediately. I swapped in the new hash function for a new branch https://github.com/bluenote-1577/skani/tree/v0.3.0-breaking that you're welcome to use.
P.S. I am slightly curious how you ended up finding this bug...
Hi Jim,
Glad to hear the change has minimal impact. It isn't particularly clear why one would want to complement the key as an initial operation, but I imagine this might have some useful mathematical properties (e.g.. allows an easy proof the hash has an inverse). Agreed that the change breaking old sketches silently is a real issue so fully support a v0.3.0 roll out.
I've been playing around with MinFracHash stuff similar to sourmash starting from the Finch codebase. Somehow I ended up down the rabbit hole of optimizing the HashMap and the Hash function which sent me digging into the minimap2 code and eventually skani and sylph (looking forward to seeing this published!).
Have you taken a look at ntHash at all? https://github.com/bcgsc/ntHash
Cheers, Donovan
Yes, because the hash function is invertible, it has guaranteed good collision properties, so it may give a better performance hashmap. I've found tradeoffs between good hash functions (Thomas Wang's/minimap2's seems to be good), fast hash functions (FxHash is faster than Thomas Wang's) -- but to be honest, the reason I use minimap2's hash function is:
I looked at nthash before, but last I checked, but it appears that file I/O and accessing/allocating memory becomes a bigger bottleneck at some point compared to hashing, so I stopped trying to optimize. I could be completely wrong though, as I am far from an expert on code optimization
sylph will hopefully be out within the next month or so... the software will not see many algorithmic changes from here on out, but maybe a few new features will be added. Feel free to experiment with it :)
All good points, and I'm also (informally) seeing that file I/O can be the limiting factor.
Hi.
I think there is a bug in the implementation of Thomas Wang's integer hash function. Specifically, the first line should be
key = (!key).wrapping_add(key << 21);
instead ofkey = !key.wrapping_add(key << 21);
. I'm basing this off of Heng Li's C implementation here: https://gist.github.com/lh3/974ced188be2f90422cc#file-inthash-cAnd the Rust port description here: https://aebou.rbind.io/post/a-rust-glimpse-at-thomas-wang-integer-hash-function/
In my testing, this does change the resulting hashes, though from my testing it doesn't seem to materially impact results in my application.
Cheers, Donovan