Hi @RyanCook94,
Thank you for the package and the regular updates of the bacteriophage genome databases and other resources. They have been incredibly helpful for my analysis. I have a question about genome comparison that falls outside the scope of the INPHARED package, but I thought you might have some insight.
I have a set of metagenomically identified phage contigs, some complete and some incomplete. I compared them against the IMG/VR and RefSeq phage databases using both Mash and the pairwise ANI comparison scripts provided by CheckV. The CheckV scripts calculate pairwise ANI and then cluster in a way similar to UCLUST.
The pairwise ANI clustering failed to cluster any of my contigs with sequences from these databases at 95% ANI + 85% alignment fraction. This is somewhat surprising, as my sample is from coastal waters. However, when I calculated ANI from the Mash distance using the formula ((1 - mash_dist) * 100), more than 25 contigs showed similarity above 99%, with 996 of 1000 hashes matching.
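To make the two criteria concrete, here is a minimal sketch of the conversion described above. It assumes Mash's default k-mer size of 21 (the k actually used is not stated here) and the standard Mash distance formula; the function names are hypothetical and are not taken from Mash or CheckV.

```python
import math


def mash_distance(shared_hashes: int, sketch_size: int, k: int = 21) -> float:
    """Mash distance from the fraction of shared hashes (Jaccard estimate j).

    Uses D = -(1/k) * ln(2j / (1 + j)); k = 21 is an assumed default.
    """
    j = shared_hashes / sketch_size
    if j == 0:
        return 1.0  # no shared hashes: treated as maximal distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def mash_ani_percent(mash_dist: float) -> float:
    """The conversion quoted in the post: (1 - mash_dist) * 100."""
    return (1 - mash_dist) * 100


def passes_cluster_threshold(ani: float, af: float,
                             min_ani: float = 95.0,
                             min_af: float = 85.0) -> bool:
    """Alignment-based criterion: both ANI and alignment fraction must pass."""
    return ani >= min_ani and af >= min_af


if __name__ == "__main__":
    dist = mash_distance(996, 1000)
    print(f"Mash distance: {dist:.6f}")
    print(f"Mash-derived ANI: {mash_ani_percent(dist):.2f}%")
```

Under these assumptions, 996 of 1000 shared hashes gives a Mash-derived ANI of roughly 99.99%. The Mash-derived value has no alignment-fraction component, so only the alignment-based result can be checked against the 85% criterion with something like passes_cluster_threshold.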
I understand that the two algorithms differ, but why is there such a large difference in the output? In this case, which method should be considered more reliable?
Thank you