ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

minimizer/kmer string compression #107

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Hello Chirag,

Does fastANI compress k-mer/minimizer strings by default? I did not see it when I checked the code. I noticed that the k-mer counting code in Heng Li's repo (based on kseq.h) (https://github.com/lh3/kmer-cnt/blob/master/kc-c1.c) encodes AGCT as 0, 1, 2, 3, etc. We could go further and represent each base using only 2 bits of memory (00, 01, 10, 11). Since fastANI consumes a lot of memory when running all versus all, I am wondering whether this could save a lot of memory. There are several Rust libraries that compress k-mers into 2 bits per base and save a lot of memory (https://github.com/jean-pierreBoth/kmerutils/blob/master/src/base/alphabet.rs). I noticed there is also one for C++: https://github.com/dassencio/dna-compression

Thanks,

Jianshu

cjain7 commented 1 year ago

Hi, ATCG is being represented using only 2 bits (00 is 0, 01 is 1, 10 is 2, and 11 is 3): https://github.com/lh3/kmer-cnt/blob/e2574719cfb784915d80eb5828e78dfae4cfdd7b/kc-c1.c#L36
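For readers unfamiliar with 2-bit encoding, here is a minimal sketch of the idea in Python. It is not code from FastANI or kmer-cnt: the helper names `pack_kmer` and `unpack_kmer` are hypothetical, and the A=0, C=1, G=2, T=3 assignment is an assumption chosen for illustration.

```python
# Illustrative only: hypothetical helpers, not taken from FastANI or kmer-cnt.
# Assumed mapping: A=0 (00), C=1 (01), G=2 (10), T=3 (11).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_kmer(kmer):
    """Pack a k-mer into one integer, 2 bits per base, first base in the highest bits."""
    value = 0
    for base in kmer.upper():
        # Raises KeyError for N or any other non-ACGT symbol.
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack_kmer(value, k):
    """Recover the k-mer string from its packed integer; k must be supplied."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 3])  # read the lowest 2 bits
        value >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    packed = pack_kmer("ACGT")
    print(packed, format(packed, "08b"))  # 27 00011011
    print(unpack_kmer(packed, 4))         # ACGT
```

With this scheme a 16-mer occupies 32 bits, compared with 16 bytes when stored as one ASCII character per base.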

jianshu93 commented 1 year ago

Thanks Chirag, I also noticed this in kc-c1.c.

Why does all-versus-all consume so much memory? Is there any way to reduce it, given that the DNA string is already compressed?

Thanks

Jianshu

jianshu93 commented 1 year ago

Or do we need to implement compression for fastANI?

Thanks.

Jianshu

jianshu93 commented 1 year ago

Hello Chirag,

If there is no need to do string compression for fastANI, I will close this issue.

Thanks,

Jianshu

cjain7 commented 1 year ago

Sorry Jianshu, I am not clear on what string compression means in this context. FastANI maintains a k-mer database extracted from all genomes, which is subsequently queried during the mapping stage.
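As a rough illustration of what a k-mer database that is built once and then queried can look like, here is a toy sketch in Python. The thread does not describe FastANI's actual data structure, so everything here is an assumption: the function names `build_kmer_index` and `query_fragment` are hypothetical, and the index stores every k-mer rather than a sampled subset.

```python
from collections import defaultdict

# Toy model only: FastANI's real index is not described in this thread.


def build_kmer_index(genomes, k):
    """Map each k-mer to the (genome name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in genomes.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def query_fragment(index, fragment, k):
    """Collect all indexed occurrences of the k-mers found in a query fragment."""
    hits = []
    for pos in range(len(fragment) - k + 1):
        hits.extend(index.get(fragment[pos:pos + k], []))
    return hits


if __name__ == "__main__":
    genomes = {"g1": "ACGTAC", "g2": "GTACGG"}
    index = build_kmer_index(genomes, k=4)
    print(query_fragment(index, "CGTAC", k=4))
    # [('g1', 1), ('g1', 2), ('g2', 0)]
```

In this toy version, each index entry carries a genome name and a position in addition to the k-mer itself, so the index grows with the total number of indexed k-mers across all genomes, however compactly each k-mer is encoded.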