RyanCook94 / inphared

Providing up-to-date phage genome databases, metrics and useful input files for a number of bioinformatic pipelines.
GNU Affero General Public License v3.0

Regarding genome comparison #23

Closed: ShailNair closed this issue 6 months ago

ShailNair commented 1 year ago

Hi @RyanCook94,

Thank you for the package and for regularly updating the bacteriophage genome databases and other resources. They have been incredibly helpful for my analysis. My question concerns genome comparison and falls outside the scope of the INPHARED package, but I thought you might have some insight.

I have a set of metagenomically identified phage contigs (both complete and incomplete). I compared them against the IMG/VR and RefSeq phage databases using two methods: Mash, and the pairwise ANI comparison scripts provided with CheckV. The CheckV scripts calculate pairwise ANI and then cluster the sequences in a manner similar to UCLUST.

With the pairwise ANI clustering method, none of my contigs clustered with sequences from either database at 95% ANI and 85% alignment fraction. This is somewhat surprising, as my sample is from coastal waters. However, when I calculated ANI from the Mash distance using the formula ((1 - mash_dist) * 100), more than 25 contigs showed similarity above 99%, with 996 of 1000 hashes matching.
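To make the Mash side of the comparison concrete, here is a minimal sketch of how a shared-hash count maps to a distance and then to the ANI estimate above. It assumes Mash's default k-mer size (k = 21) and the distance formula from the Mash paper, D = -(1/k) * ln(2j / (1 + j)), where j is the fraction of shared hashes. The function names are hypothetical and are not part of Mash or CheckV.

```python
import math


def mash_distance(shared_hashes: int, sketch_size: int, k: int = 21) -> float:
    """Estimate the Mash distance from shared sketch hashes.

    Assumption: j = shared_hashes / sketch_size is used as the Jaccard
    estimate, and k = 21 is the k-mer size used when sketching.
    """
    j = shared_hashes / sketch_size
    if j <= 0:
        return 1.0  # no shared hashes: treat as maximally distant
    return min(1.0, -math.log(2 * j / (1 + j)) / k)


def mash_ani_percent(mash_dist: float) -> float:
    """Convert a Mash distance to an ANI-like percentage: (1 - D) * 100."""
    return (1 - mash_dist) * 100


# 996 of 1000 hashes shared, as in the hits described above
dist = mash_distance(996, 1000)
ani = mash_ani_percent(dist)
print(f"mash_dist={dist:.6f}  ANI~{ani:.2f}%")
```

This estimate is derived only from shared k-mer hashes and contains no alignment fraction term, so it cannot be checked directly against the 85% alignment fraction cutoff used in the pairwise ANI clustering.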

I understand that the two algorithms differ, but why do their results diverge so much? In this case, which method should be considered more reliable?

Thank you