ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

Recommended N50 #47

Closed donovan-h-parks closed 5 years ago

donovan-h-parks commented 5 years ago

Hello,

The README suggests that both the query and reference genomes should have an N50 >10 Kb. I'm a little unclear why a lower N50, say 5 Kb, is problematic when the query fragment size is only 3 Kb. I can imagine that if both the query and reference genomes have a low N50, correctly matching homologous regions could be compromised. However, if just one of the reference or query has a low N50, it would seem the method would still work well enough. Just wondering if you have some insights that would help me understand how N50 impacts results and whether N50 >10 Kb should be considered a hard requirement.

Thanks.

cjain7 commented 5 years ago

Yes, it should still work well enough in the case you describe. No, I don't think it is a hard requirement.

10 Kb is just a safe cutoff that we thought we should recommend to users (it is also supported by a couple of our own experiments). In general, FastANI needs enough query fragments longer than 3 Kb that can be mapped to the reference. Note that fragments in the reference need to be long enough as well: if only a fraction of a query fragment maps, the estimated identity may come out lower (it is currently estimated using k-mer Jaccard similarity). Let me know if that answers your question.
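To illustrate the point about partial mappings, here is a minimal sketch (not FastANI's actual code) of how a k-mer Jaccard estimate drops when the reference contig covers only half of a 3 Kb query fragment, even though the covered half is identical. The k-mer size of 16, the helper names, and the Mash-style conversion from Jaccard to identity are assumptions made for illustration; the thread only states that identity is estimated from k-mer Jaccard similarity.

```python
import math
import random


def kmers(seq, k=16):
    """Return the set of all k-mers in seq (k=16 is an assumed value)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mash_identity(j, k=16):
    """Mash-style identity estimate from a Jaccard value (assumed conversion)."""
    if j <= 0:
        return 0.0
    return 1.0 + math.log(2 * j / (1 + j)) / k


rng = random.Random(0)
# A hypothetical 3 Kb query fragment of random bases.
fragment = "".join(rng.choice("ACGT") for _ in range(3000))

full_ref = fragment          # reference contig spans the whole fragment
short_ref = fragment[:1500]  # reference contig ends halfway through

q = kmers(fragment)
j_full = jaccard(q, kmers(full_ref))
j_half = jaccard(q, kmers(short_ref))

print(f"full coverage: J={j_full:.3f}, identity={mash_identity(j_full):.4f}")
print(f"half coverage: J={j_half:.3f}, identity={mash_identity(j_half):.4f}")
```

With full coverage the Jaccard value is 1, while with half coverage it falls to roughly one half, so the estimated identity is lower despite there being no mismatches in the region that does map. This is the effect a fragmented reference can have on individual fragments.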

donovan-h-parks commented 5 years ago

Sorry for the delayed reply. I've been on vacation.

Thank you for the clarification. This is in line with my understanding of the method and the informal results I have.