qingpeng opened this issue 9 years ago
Thanks @qingpeng. I'm guessing this has to be due to the different hash function implementations, as apart from that the HLL should be faster in every possible way...
Hi @qingpeng , thanks for the tests!
The catch for a faster HLL is using multiple OpenMP threads; by default only one is used. I'm still figuring out the best way to add this to the script, but if you run with
OMP_NUM_THREADS=16 python sandbox/unique-kmers.py
you can set how many threads are being used (16 in this case).
If you're running on HPCC or a Linux box it should work, but clang (OSX default compiler) doesn't have OpenMP support (yet).
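If you'd rather set it from Python than in the shell, something along these lines should work (a small sketch; exporting the variable before launching Python, as in the command above, remains the most reliable route, since OMP_NUM_THREADS is only read when the OpenMP runtime starts):

# Hedged sketch: put the thread count in the environment before khmer does
# any parallel work, otherwise the OpenMP runtime may already have started
# with its default setting.
import os

os.environ['OMP_NUM_THREADS'] = '16'

import khmer  # noqa: E402 -- imported only after the thread count is set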
I am not sure whether @luizirber has already done something like this, but here is an example of applying it to some real data.
I tested the two approaches for counting the number of unique k-mers in a soil metagenomic data set (wisconsin_restored.trimmed.fasta.gz, 11G, about 14 billion unique k-mers).
Using the hashbits approach, I used 1 hashtable with a size of 1e11, which consumed 25G of memory. The count is 14541376383, with an fp rate of 0.073. Time log: 3916.08user 207.59system 1:08:52elapsed 99%CPU (0avgtext+0avgdata 97720112maxresident)k
Using the HLL approach, I used the default parameters. The count is 14777747688, with an error rate of 0.01. Time log: 12415.68user 45.12system 3:28:12elapsed 99%CPU (0avgtext+0avgdata 36256maxresident)k
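For reference, the two runs were driven by code roughly along these lines (a minimal sketch assuming khmer's Hashbits and HLLCounter Python APIs and an assumed k-mer size of 32, which is not stated above; exact names and signatures may differ between khmer versions):

import khmer
import screed

K = 32  # assumed k-mer size; not stated above
INPUT = 'wisconsin_restored.trimmed.fasta.gz'

# Hashbits approach: a single presence table with 1e11 entries, as in the run above.
hashbits = khmer.Hashbits(K, int(1e11), 1)
for record in screed.open(INPUT):
    hashbits.consume(record.sequence)
print('hashbits count:', hashbits.n_unique_kmers())

# HLL approach: HyperLogLog counter with a 1% target error rate (the default).
hll = khmer.HLLCounter(0.01, K)
for record in screed.open(INPUT):
    hll.consume_string(record.sequence)
print('HLL estimate:', hll.estimate_cardinality())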
Firstly, HLL consumed much less memory, which is expected and very nice (36M vs 97G). So we generally don't need to worry about memory usage, in contrast to the hashbits approach, where I have to designate a specific hash table size and hope it is large enough to handle the data set, which is sometimes annoying.
Secondly, HLL is slower than hashbits (roughly 3.5 hours vs 1 hour). Is this expected? The speed is fine for now, but it would be great if it could be optimized. I have some other, larger data sets to count (like a 70G zipped file with about 100 billion unique k-mers). :)
Thanks for your effort, @luizirber
Hope this is helpful.
@ctb