bcgsc / ntHash

Fast hash function for DNA/RNA sequences
http://bcgsc.github.io/ntHash/
MIT License
96 stars 13 forks source link

nthash.hpp: faster ntBase via building with tetramer #18

Closed jwcodee closed 4 years ago

jwcodee commented 4 years ago

Faster ntBase implementation using existing tables with tetramer, trimer and dimer hashes. ntBase i now 3.5x faster.

hmohamadi commented 4 years ago

Hi @jowong4, can you share the performance improvement stats?

jwcodee commented 4 years ago
Hashing 1 million randomly generated 128-mers. Time in seconds version run 1 run 2 run 3 run 4 run 5 avg
Before 0.847 0.825 0.820 0.830 0.822 30.829
After 0.288 0.280 0.270 0.275 0.267 0.276

~3x faster

hmohamadi commented 4 years ago

@jowong4 Thanks for sending the stats. To have more robust results I'd like to see the performance on 1 billion k-mers with different lengths similar to what we did in the ntHash paper.

jwcodee commented 4 years ago

@hmohamadi sorry this was much longer than expected. I ended up redoing this a couple times and used your test code in the end.

old

(base) [jowong@hpce705 ntHash]$ ./nttest -k 64 -q 10000000000 -l 64  -j 48  -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=64
nhash=1
ntbase  nthash
10020   5279.52
(base) [jowong@hpce705 ntHash]$ ./nttest -k 96 -q 10000000000 -l 96  -j 48  -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=96
nhash=1
ntbase  nthash
12747.8 11301.8
(base) [jowong@hpce705 ntHash]$ ./nttest -k 128 -q 10000000000 -l 128  -j 48  -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=128
nhash=1
ntbase  nthash
14728.2 11121.1

New


(base) [jowong@hpce705 ntHash]$  ./nttest -k 64 -q 10000000000 -l 64  -j 48  -i reads.fa genes.fa                    
CPU time (sec) for hash algorithms for kmer=64                                                                       
nhash=1                                                                                                              
ntbase  nthash                                                                                                      
6595.33 3304.04                                                                                                      
(base) [jowong@hpce705 ntHash]$  ./nttest -k 96 -q 10000000000 -l 96  -j 48  -i reads.fa genes.fa                   
CPU time (sec) for hash algorithms for kmer=96                                                                       
nhash=1                                                                                                              
ntbase  nthash                                                                                                       
5713.5  5525.11                                                                                                     
 (base) [jowong@hpce705 ntHash]$  ./nttest -k 128 -q 10000000000 -l 128  -j 48  -i reads.fa genes.fa                  
CPU time (sec) for hash algorithms for kmer=128                                                                     
nhash=1                                                                                                              
ntbase  nthash                                                                                                       
8274.17 6487.31    

Note: The server was a bit slow during k64 run so that is why the result is worse than k96. That said. both runs were done at the same time so they should still reflect reality. Second, proportionally speaking, it is ~1/4 less rolls between new and oldbut, in absolute numbers, the number of rolls between old and new are increasing as k gets larger so you expect more gains with higher k. Thirdly, note that these times include IO which means if we deduct the same amount of time to the values, the difference should be much larger.