bcgsc / ntHash

Fast hash function for DNA/RNA sequences
http://bcgsc.github.io/ntHash/
MIT License
96 stars 13 forks source link

Unstable hash values for k>64 #41

Closed shenwei356 closed 1 year ago

shenwei356 commented 2 years ago

Hi, thanks for inventing this great algorithm! And glad to see ntHash2 is also published.

I've used ntHash in several projects of mine, specifically, using a Golang implementation (by @will-rowe) of ntHash v1.0.4. It's really fast!

I just find that the hash values are unstable for k > 64. I guess it's related to the algorithm itself, cause it happens to be 64, not another number. I am not familiar with C/C++, so I can't check the original implementation. Here are steps with tools using the Golang implementation to reproduce the issue.

$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" | seqkit stats
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   DNA          1       65       65       65       65

# compute canonical ntHash with https://github.com/shenwei356/unikmer
$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
14872199115326727818  ****

# now we add one base in the start position.
$ echo -ne ">s\naACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
5042997269030491403
13219011773654478434  ****

$ echo -ne ">s\ncACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
5252439041897790003
1248486797404628174  ****

$ echo -ne ">s\ngACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
6432712771380638299
10232415768797241538  ****

$ echo -ne ">s\ntACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
    | unikmer count -H -K -k 65 -l | unikmer view
5774425108696765737
11299031900349869606  ****

While for k=64, it's stable (7718595180140858881):

$ echo -ne ">s\nCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
7718595180140858881

$ echo -ne ">s\naCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
8752650216170443135
7718595180140858881

$ echo -ne ">s\ncCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
9222164657016850147
7718595180140858881

$ echo -ne ">s\ntCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"  \
    | unikmer count -H -K -k 64 -l | unikmer view
8329673441993552238
7718595180140858881

echo -ne ">s\ntCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n"   \
    | unikmer count -H -K -k 64 -l | unikmer view
8329673441993552238
7718595180140858881
parham-k commented 2 years ago

Hi @shenwei356 and thanks for your comments! This was a known issue with the early versions of ntHash (see this issue and also ntHash2's paper). We've fixed the periodicity problem in ntHash2. The Golang implementation probably contains the old bug that causes your issue.

Unfortunately, we don't have a Golang implementation for ntHash2. That said, ntHash2 can be used in Python with btllib.

mohamadi commented 2 years ago

Thanks @shenwei356 for reporting this! We'll try to sync up with @will-rowe and @luizirber to have new releases for Golang and Rust implementation.

shenwei356 commented 2 years ago

I see, thank you all! Currently, I'll limit the k-mer size below 64 for my projects.

parham-k commented 2 years ago

Apparently, SWIG (the tool used in btllib to generate Python wrappers) also supports Go. I'll try it out on ntHash and post any updates here.