Open jianshu93 opened 2 years ago
Hi, ATCG is represented using only 2 bits per base (00 is 0, 01 is 1, 10 is 2 and 11 is 3): https://github.com/lh3/kmer-cnt/blob/e2574719cfb784915d80eb5828e78dfae4cfdd7b/kc-c1.c#L36
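To make the 2-bit idea concrete, here is a minimal sketch of packing a k-mer into a single 64-bit integer. It is not the code from kc-c1.c or from fastANI; the function names `base_to_code` and `pack_kmer` are hypothetical, and the mapping A=0, C=1, G=2, T=3 is an assumed (conventional) ordering.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map a nucleotide to a 2-bit code.
 * Assumed ordering: A=0, C=1, G=2, T=3. Any other character returns 4. */
static int base_to_code(char b)
{
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Hypothetical helper: pack a k-mer (1 <= k <= 32) into a 64-bit integer,
 * 2 bits per base, with the first base in the most significant position.
 * Returns 0 on success, -1 if k is out of range or a non-ACGT base is seen. */
static int pack_kmer(const char *seq, size_t k, uint64_t *out)
{
    uint64_t x = 0;
    if (k == 0 || k > 32)
        return -1;
    for (size_t i = 0; i < k; ++i) {
        int c = base_to_code(seq[i]);
        if (c > 3)
            return -1;
        x = (x << 2) | (uint64_t)c;
    }
    *out = x;
    return 0;
}
```

With this layout a k-mer of up to 32 bases fits in 8 bytes, compared with k bytes when stored as a character string.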
Thanks Chirag, I also noticed this in kc-c1.c.
Why is all-versus-all consuming so much memory? Is there any way to reduce it if the DNA strings are already compressed?
Thanks
Jianshu
Or do we need to implement compression for fastANI?
Thanks.
Jianshu
Hello Chirag,
If there is no need to do string compression for fastANI, I will close this issue.
Thanks,
Jianshu
Sorry Jianshu, I am not clear on what string compression means in this context. FastANI maintains a k-mer database extracted from all genomes, which is subsequently queried during the mapping stage.
Hello Chirag,
Does fastANI compress k-mer/minimizer strings by default? I did not see this when checking the code. I noticed that the k-mer counting code in Heng Li's repo (based on kseq.h) (https://github.com/lh3/kmer-cnt/blob/master/kc-c1.c) encodes AGCT as 0, 1, 2, 3. We could actually do better and represent AGCT using only 2 bits of memory per base (00, 01, 10, 11). Since fastANI consumes a lot of memory when running all-versus-all, I am wondering whether this could save a lot of memory. There are several Rust libraries that compress k-mers into 2 bits per base and save a lot of memory (https://github.com/jean-pierreBoth/kmerutils/blob/master/src/base/alphabet.rs). I noticed there is also one for C++: https://github.com/dassencio/dna-compression
Thanks,
Jianshu
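For reference, the kind of whole-sequence compression asked about above can be sketched as follows: four bases are stored per byte, so a sequence of n bases needs (n + 3) / 4 bytes instead of n. This is only an illustration under assumptions. The names `nt_code`, `packed_size`, `pack_dna` and `unpack_base` are hypothetical, the A=0, C=1, G=2, T=3 mapping is assumed, and nothing here is taken from fastANI or the linked libraries. Whether it would reduce fastANI's memory use depends on how fastANI stores its k-mers, which this thread does not establish.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: 2-bit code for a base (assumed A=0, C=1, G=2, T=3),
 * or 4 for anything else. */
static int nt_code(char b)
{
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Number of bytes needed to hold n bases at 2 bits each. */
static size_t packed_size(size_t n)
{
    return (n + 3) / 4;
}

/* Pack n bases into a newly allocated buffer, 4 bases per byte, with base i
 * stored at bit offset 2 * (i % 4) of byte i / 4.
 * Returns NULL on allocation failure or if a non-ACGT base is found.
 * The caller frees the result. */
static uint8_t *pack_dna(const char *seq, size_t n)
{
    size_t bytes = packed_size(n);
    uint8_t *buf = calloc(bytes ? bytes : 1, 1);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i) {
        int c = nt_code(seq[i]);
        if (c > 3) {
            free(buf);
            return NULL;
        }
        buf[i / 4] |= (uint8_t)(c << (2 * (i % 4)));
    }
    return buf;
}

/* Recover base i (as an uppercase character) from a packed buffer. */
static char unpack_base(const uint8_t *buf, size_t i)
{
    return "ACGT"[(buf[i / 4] >> (2 * (i % 4))) & 3];
}
```

Ambiguous bases such as N have no 2-bit code, so this sketch rejects them; a real implementation would need a policy for them (skipping, masking, or a side table).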