ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

kmer size selection #42

Closed dacheampong closed 5 years ago

dacheampong commented 5 years ago

I have been using fastANI for some time now and wanted to see how my results differ with k-mer size. I noticed that the help text limits the k-mer size to 16 (which is also the default): ''' -k , --kmer kmer size <= 16 [default : 16] '''

I tried different k-mer sizes (13, 16, 18, 21) on a set of four bacterial genomes, as shown below:

k-mer size = 13
4
x.fna |  |  |
y.fna | NA |  |
z.fna | NA | NA |
m.fna | NA | 75.762955 | NA

k-mer size = 16
4
x.fna |  |  |
y.fna | 75.345757 |  |
z.fna | NA | NA |
m.fna | 74.794037 | 75.225105 | NA

k-mer size = 18
4
x.fna |  |  |
y.fna | 74.216934 |  |
z.fna | 74.254578 | 74.217453 |
m.fna | 74.190994 | 74.626328 | 74.193939

k-mer size = 21
4
x.fna |  |  |
y.fna | 75.161926 |  |
z.fna | 75.197411 | 75.177795 |
m.fna | 75.154465 | 75.465622 | 75.155106

Why was I able to get all pairwise ANI values with k-mer sizes above the documented maximum (16), yet mostly NA with a k-mer size below 16?

I am using fastANI v1.1. Example command:

fastANI -k 13 --ql query.list --rl reference.list -o out --matrix
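For anyone repeating this comparison, here is a minimal Python sketch for tallying how many pairs got a value in one of the lower-triangular matrices above. It assumes the layout shown in this post (genome count on the first line, then one row per genome listing values against the earlier genomes) with whitespace- or tab-separated fields; the actual delimiter in the `--matrix` output file is an assumption here, and `parse_lower_triangular` is a hypothetical helper, not part of fastANI.

```python
def parse_lower_triangular(text):
    """Parse a lower-triangular matrix given as text.

    Assumed layout (taken from the post, not from fastANI docs):
    first line is the genome count, then one row per genome with the
    genome name followed by values against each earlier genome.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    names, pairs = [], {}
    for row in lines[1:n + 1]:
        fields = row.split()
        name, values = fields[0], fields[1:]
        # Pair each value with the earlier genome it refers to.
        for other, value in zip(names, values):
            pairs[(other, name)] = None if value == "NA" else float(value)
        names.append(name)
    return names, pairs


# The k-mer size = 16 matrix from the post, written tab-separated.
K16 = """4
x.fna
y.fna\t75.345757
z.fna\tNA\tNA
m.fna\t74.794037\t75.225105\tNA
"""

names, pairs = parse_lower_triangular(K16)
missing = [pair for pair, value in pairs.items() if value is None]
print(f"{len(pairs) - len(missing)} of {len(pairs)} pairs have an ANI value")
for (a, b), value in pairs.items():
    print(a, b, "NA" if value is None else f"{value:.2f}")
```

Running the same tally over each k-mer size makes it easy to see which pairs switch between NA and a reported value.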

cjain7 commented 5 years ago

FastANI is designed to report ANI values in the 80-100% range; outside that range, it's not reliable. NA shouldn't be interpreted as a poor result; it just means that the ANI would be less than 80%. For those pairs, you should compute distances using protein sequences (AAI). Also see CompareM, which may be helpful for distant genomes. Lastly, I don't think a k-mer size of more than 16 would give you better results. The implementation uses 32-bit hashes, so k-mer sizes beyond 16 won't be effective.
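The 32-bit point can be illustrated with a simple counting argument: there are 4^k possible DNA k-mers and 2^32 possible 32-bit hash values, and 4^16 equals 2^32 exactly. The Python sketch below only does that arithmetic; it says nothing about how FastANI actually hashes k-mers, and `kmers_per_hash_value` is a hypothetical name used for illustration.

```python
HASH_BITS = 32                # hash width mentioned in the comment above
HASH_SPACE = 2 ** HASH_BITS   # number of distinct 32-bit hash values


def kmers_per_hash_value(k):
    """Average number of distinct DNA k-mers per 32-bit hash value."""
    kmer_space = 4 ** k       # A, C, G or T at each of the k positions
    return kmer_space / HASH_SPACE


for k in (13, 16, 18, 21):
    print(f"k={k}: 4^k = {4 ** k}, k-mers per hash value = {kmers_per_hash_value(k)}")
```

At k = 16 the two spaces are the same size. Beyond that, many distinct k-mers must share a hash value on average (16 at k = 18, 1024 at k = 21), so the extra k-mer length cannot be fully represented in the hash.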