bcgsc / ntHash

Fast hash function for DNA/RNA sequences
http://bcgsc.github.io/ntHash/
MIT License

NtHash object and ntf64(seq, k) produce different forward hash values when k is a multiple of 4 #44

Closed VeryAmazed closed 1 year ago

VeryAmazed commented 1 year ago

Hello. While testing a library I am writing on top of ntHash, I noticed that the first call of roll() on an NtHash object and ntf64(seq, k) (line 100 in nthash_lowlevel.hpp) return different forward hash values. By contrast, ntr64(seq, k) (line 112 in nthash_lowlevel.hpp) returns the same reverse hash value as the object. The mismatch only occurs when k is a multiple of 4.

I believe the first call of roll() on the NtHash object calls ntmc64(seq + pos, k, hash_num, forward_hash, reverse_hash, posN, hashes_array.get()) (line 397 in lowlevel.hpp). In effect, then, the forward hash values produced by ntmc64() and ntf64() differ, but only when k is a multiple of 4, while the reverse hash values produced by ntmc64() and ntr64() are the same. Because the ntHash code is quite complicated, I'm not 100% sure whether this is intentional.

The code I used to test this is in the repository linked below. ntHash_testing.cpp is the test code, and output.txt contains all of the test results. https://github.com/VeryAmazed/ntHash_Testing
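For readers who want to reproduce this kind of check, the comparison can be sketched as a small harness that runs two hash routines over the same k-mer for a range of k and reports where they disagree. This is only an illustration, not the code from the linked repository, and it does not call ntHash: `toy_direct`, `toy_variant`, `KmerHash`, and `mismatching_k` are all hypothetical names, and `toy_variant` contains an artificial discrepancy that merely mimics the reported symptom.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Signature shared by the two routines under comparison:
// returns a hash of the first k characters of seq.
using KmerHash = std::function<std::uint64_t(const std::string&, unsigned)>;

// Hypothetical stand-in for a one-shot hash (NOT ntHash):
// rotate the running value left by one bit, then XOR in a per-character value.
std::uint64_t toy_direct(const std::string& seq, unsigned k) {
    std::uint64_t h = 0;
    for (unsigned i = 0; i < k; ++i) {
        h = (h << 1) | (h >> 63);
        h ^= static_cast<std::uint64_t>(static_cast<unsigned char>(seq[i])) *
             0x9E3779B97F4A7C15ULL;
    }
    return h;
}

// Hypothetical stand-in for a second code path. The bit flip for
// k % 4 == 0 is artificial and only imitates the symptom in this issue.
std::uint64_t toy_variant(const std::string& seq, unsigned k) {
    const std::uint64_t h = toy_direct(seq, k);
    return (k % 4 == 0) ? (h ^ 1ULL) : h;
}

// Returns every k in [1, max_k] for which the two routines disagree.
std::vector<unsigned> mismatching_k(const std::string& seq, unsigned max_k,
                                    const KmerHash& a, const KmerHash& b) {
    assert(a && b);                // both callables must be set
    assert(max_k <= seq.size());   // the sequence must hold a k-mer of every tested length
    std::vector<unsigned> out;
    for (unsigned k = 1; k <= max_k; ++k) {
        if (a(seq, k) != b(seq, k)) {
            out.push_back(k);
        }
    }
    return out;
}
```

To run the same check against ntHash, one callable would wrap the forward hash obtained after the first roll() on an NtHash object and the other would wrap ntf64(seq, k); the harness itself makes no assumption about either.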

parham-k commented 1 year ago

Thanks for testing this! The output of the low-level functions should match that of the class members. I'll add your case as a unit test and try to fix it in a future release. However, the intended functionality is implemented in the NtHash class, and the low-level functions aren't reliable when used independently.