Closed jwcodee closed 4 years ago
Hi @jowong4, can you share the performance improvement stats?
Hashing 1 million randomly generated 128-mers. Time in seconds | version | run 1 | run 2 | run 3 | run 4 | run 5 | avg | |
---|---|---|---|---|---|---|---|---|
Before | 0.847 | 0.825 | 0.820 | 0.830 | 0.822 | 30.829 | ||
After | 0.288 | 0.280 | 0.270 | 0.275 | 0.267 | 0.276 |
~3x faster
@jowong4 Thanks for sending the stats. To have more robust results I'd like to see the performance on 1 billion k-mers with different lengths similar to what we did in the ntHash paper.
@hmohamadi sorry this was much longer than expected. I ended up redoing this a couple times and used your test code in the end.
old
(base) [jowong@hpce705 ntHash]$ ./nttest -k 64 -q 10000000000 -l 64 -j 48 -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=64
nhash=1
ntbase nthash
10020 5279.52
(base) [jowong@hpce705 ntHash]$ ./nttest -k 96 -q 10000000000 -l 96 -j 48 -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=96
nhash=1
ntbase nthash
12747.8 11301.8
(base) [jowong@hpce705 ntHash]$ ./nttest -k 128 -q 10000000000 -l 128 -j 48 -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=128
nhash=1
ntbase nthash
14728.2 11121.1
New
(base) [jowong@hpce705 ntHash]$ ./nttest -k 64 -q 10000000000 -l 64 -j 48 -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=64
nhash=1
ntbase nthash
6595.33 3304.04
(base) [jowong@hpce705 ntHash]$ ./nttest -k 96 -q 10000000000 -l 96 -j 48 -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=96
nhash=1
ntbase nthash
5713.5 5525.11
(base) [jowong@hpce705 ntHash]$ ./nttest -k 128 -q 10000000000 -l 128 -j 48 -i reads.fa genes.fa
CPU time (sec) for hash algorithms for kmer=128
nhash=1
ntbase nthash
8274.17 6487.31
Note: The server was a bit slow during k64 run so that is why the result is worse than k96. That said. both runs were done at the same time so they should still reflect reality. Second, proportionally speaking, it is ~1/4 less rolls between new and oldbut, in absolute numbers, the number of rolls between old and new are increasing as k gets larger so you expect more gains with higher k. Thirdly, note that these times include IO which means if we deduct the same amount of time to the values, the difference should be much larger.
Faster ntBase implementation using existing tables with tetramer, trimer and dimer hashes. ntBase i now 3.5x faster.