bluenote-1577 / skani

Fast, robust ANI and aligned fraction for (metagenomic) genomes and contigs.
MIT License
159 stars 10 forks source link

Implementation of Thomas Wang's integer hash function #20

Closed donovan-h-parks closed 11 months ago

donovan-h-parks commented 11 months ago

Hi.

I think there is a bug in the implementation of Thomas Wang's integer hash function. Specifically, the first line should be key = (!key).wrapping_add(key << 21); instead of key = !key.wrapping_add(key << 21);. I'm basing this off of Heng Li's C implementation here: https://gist.github.com/lh3/974ced188be2f90422cc#file-inthash-c

And the Rust port description here: https://aebou.rbind.io/post/a-rust-glimpse-at-thomas-wang-integer-hash-function/

In my testing, this does change the resulting hashes, though from my testing it doesn't seem to materially impact results in my application.

Cheers, Donovan

bluenote-1577 commented 11 months ago

Hi Donovan,

Thanks for this bug report. It looks like you're correct. I've tested out the change briefly and can concur that the new results don't affect results that much. Much thanks for hunting this down and testing it.

I think I will not push the bug fix until skani v0.3.0 for two reasons. Firstly, it does give slightly different results. Secondly, old sketches generated by skani sketch will fail silently after this bug fix, which is very bad, so I'll need to do a bit of engineering.

Because this breaks backward compatibility, I won't push this to the main branch immediately. I swapped in the new hash function for a new branch https://github.com/bluenote-1577/skani/tree/v0.3.0-breaking that you're welcome to use.

P.S. I am slightly curious how you ended up finding this bug...

donovan-h-parks commented 11 months ago

Hi Jim,

Glad to hear the change has minimal impact. It isn't particularly clear why one would want to complement the key as an initial operation, but I imagine this might have some useful mathematical properties (e.g.. allows an easy proof the hash has an inverse). Agreed that the change breaking old sketches silently is a real issue so fully support a v0.3.0 roll out.

I've been playing around with MinFracHash stuff similar to sourmash starting from the Finch codebase. Somehow I ended up down the rabbit hole of optimizing the HashMap and the Hash function which sent me digging into the minimap2 code and eventually skani and sylph (looking forward to seeing this published!).

Have you taken a look at ntHash at all? https://github.com/bcgsc/ntHash

Cheers, Donovan

bluenote-1577 commented 11 months ago

Yes, because the hash function is invertible, it has guaranteed good collision properties, so it may give a better performance hashmap. I've found tradeoffs between good hash functions (Thomas Wang's/minimap2's seems to be good), fast hash functions (FxHash is faster than Thomas Wang's) -- but to be honest, the reason I use minimap2's hash function is:

  1. It is battle-tested on minimizers within minimap2, so I have a tiny bit more confidence that it doesn't do bad things like hash AAAAAAAAAA to a low number
  2. I found it quite easy to vectorize (i.e. with SIMD instructions), which makes skani's k-mer processing a bit faster.

I looked at nthash before, but last I checked, but it appears that file I/O and accessing/allocating memory becomes a bigger bottleneck at some point compared to hashing, so I stopped trying to optimize. I could be completely wrong though, as I am far from an expert on code optimization

sylph will hopefully be out within the next month or so... the software will not see many algorithmic changes from here on out, but maybe a few new features will be added. Feel free to experiment with it :)

donovan-h-parks commented 11 months ago

All good points, and I'm also (informally) seeing that file I/O can be the limiting factor.