ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

Multiple ANI per query/reference genome pairs #74

Open rebeccagarner opened 4 years ago

rebeccagarner commented 4 years ago

Hi there, I ran fastANI to calculate pairwise ANI between a few hundred MAGs by supplying the same genome list as both queries and references. The output file contains multiple rows with ANI stats for the same query-reference genome pair. Below is a subset of the output showing all 24 ANI results reported for a single query-reference pair. Is this expected output for fastANI, and if so, what is the significance of multiple ANI values for each pair of genomes? Many thanks!

mag001 mag004 75.5428 455 2348
mag001 mag004 75.1953 110 2348
mag001 mag004 75.1627 135 2348
mag001 mag004 75.1547 212 2348
mag001 mag004 75.1537 385 2348
mag001 mag004 75.1532 232 2348
mag001 mag004 75.1513 298 2348
mag001 mag004 75.1504 283 2348
mag001 mag004 75.148 112 2348
mag001 mag004 75.1459 237 2348
mag001 mag004 75.1445 142 2348
mag001 mag004 75.1435 154 2348
mag001 mag004 75.1404 258 2348
mag001 mag004 75.1399 252 2348
mag001 mag004 75.1388 255 2348
mag001 mag004 75.1376 54 2348
mag001 mag004 75.136 251 2348
mag001 mag004 75.134 365 2348
mag001 mag004 75.1323 209 2348
mag001 mag004 75.1303 88 2348
mag001 mag004 75.1302 206 2348
mag001 mag004 75.1301 143 2348
mag001 mag004 75.1219 493 2348
mag001 mag004 75.1171 203 2348

cjain7 commented 4 years ago

Assuming you supplied the set of query genomes using --ql and the set of target genomes using --rl, FastANI compares all query genomes to all target genomes. For instance, if both files have 2 genomes each, it does 4 comparisons (q1-r1, q1-r2, q2-r1, q2-r2). When the query and target genome sets are the same, you should expect to see each pair of distinct genomes reported twice, once in each direction (q1-q1, q1-q2, q2-q1, q2-q2).
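As a rough illustration of the row count described above, here is a minimal Python sketch that enumerates the all-vs-all pairs. It only models how many comparisons to expect. It is not FastANI's code, and the function name `expected_comparisons` is made up for this example.

```python
from itertools import product


def expected_comparisons(queries, references):
    """Return every (query, reference) pair for an all-vs-all run."""
    return list(product(queries, references))


# Two queries against two different references: 2 x 2 = 4 comparisons.
pairs = expected_comparisons(["q1", "q2"], ["r1", "r2"])
print(pairs)

# The same list used as both queries and references: still 4 rows,
# and each ordered pair such as (q1, q2) appears exactly once.
same = expected_comparisons(["q1", "q2"], ["q1", "q2"])
print(same)
```

Under this model, an ordered pair like (q1, q2) can only appear more than once if a genome is listed more than once in one of the input lists.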

If you are seeing 24 comparisons between a pair, then I'd think there may be some duplicates in your input set. You may want to check your input set and list each genome only once.
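One way to check a genome list for repeated entries is a small script like the sketch below. It assumes the list file holds one genome path per line. The helper names and the example paths are hypothetical. It compares normalised path strings only, so it will not catch two differently named files that contain the same genome.

```python
import os
from collections import Counter


def find_duplicate_entries(paths):
    """Return entries that occur more than once after normalising each path."""
    # Skip blank lines and collapse redundant pieces such as "./".
    normalised = [os.path.normpath(p.strip()) for p in paths if p.strip()]
    counts = Counter(normalised)
    return {path: n for path, n in counts.items() if n > 1}


def read_genome_list(list_file):
    """Read a genome list file, assumed to hold one path per line."""
    with open(list_file) as handle:
        return handle.read().splitlines()


# Hypothetical entries; the third is the first one written differently.
example = ["mags/mag001.fa", "mags/mag004.fa", "mags/./mag001.fa", ""]
duplicates = find_duplicate_entries(example)
print(duplicates)
```

For a real list, something like `find_duplicate_entries(read_genome_list("genome_list.txt"))` would apply, where the file name is a placeholder.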

Also see #36

rebeccagarner commented 4 years ago

Thank you kindly for clarifying the expected output: the pairwise comparisons you describe make a lot more sense than the output I obtained. I double-checked the genome list that I supplied as both the --ql and --rl arguments, and each row has the file path and name of a unique MAG (there are no duplicate/repeated genomes). The genome list is formatted so that each row holds the file path and name of a genome .fa file, as specified in fastANI's documentation. I'll try to rerun the program using a small subset of the MAGs and let you know if the issue repeats itself. Thanks again for your quick response!
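When rerunning on a subset, a quick tally of output rows per query-reference pair can confirm whether the repetition persists. The sketch below assumes the output is tab-delimited with the query in the first column and the reference in the second. The function names are made up for this example, and the sample rows are the first three from the output posted above.

```python
import csv
import io
from collections import Counter


def count_rows_per_pair(handle):
    """Count output rows for each (query, reference) pair."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:  # ignore blank or malformed lines
            counts[(row[0], row[1])] += 1
    return counts


def repeated_pairs(counts):
    """Keep only the pairs that were reported more than once."""
    return {pair: n for pair, n in counts.items() if n > 1}


# First three rows of the posted output, written with tabs (assumed delimiter).
sample = io.StringIO(
    "mag001\tmag004\t75.5428\t455\t2348\n"
    "mag001\tmag004\t75.1953\t110\t2348\n"
    "mag001\tmag004\t75.1627\t135\t2348\n"
)
repeats = repeated_pairs(count_rows_per_pair(sample))
print(repeats)
```

To run it on a real output file, open the file and pass the handle to `count_rows_per_pair`. An empty result from `repeated_pairs` would mean every pair was reported once.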

cjain7 commented 4 years ago

You're welcome. Also, please make sure you are using the latest version.