ParBLiSS / FastANI

Fast Whole-Genome Similarity (ANI) Estimation
Apache License 2.0

[Feature Request] Many-vs-Many upper triangle of pairwise genome ANI calculation #127

Open jolespin opened 9 months ago

jolespin commented 9 months ago

A new tool called skani has a very convenient option that avoids a lot of duplicate computation: https://github.com/bluenote-1577/skani/wiki/skani-basic-usage-guide#skani-triangle---all-to-all-ani-computation

Here's a screenshot: [image not preserved]

Would it be possible for FastANI to use this functionality as well to only calculate the upper triangle?

cjain7 commented 9 months ago

The current implementation of FastANI indexes all genomes in the reference list at the preprocessing stage. The index is not changed afterwards when each query genome is processed. As a result, doing n^2 computations is more convenient for us.

We can try periodically recomputing the index with fewer genomes. I am not sure how much time we will gain by this.
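In the meantime, one workaround is to run the usual all-vs-all comparison and collapse the output to one row per unordered pair afterwards. The sketch below is only an illustration. It assumes the tab-separated output has the query, reference, and ANI value in its first three columns, and the helper names `read_rows` and `upper_triangle` are hypothetical. Averaging the two directions is a choice made here, not something FastANI does. This trims the result only; it saves no computation or memory.

```python
import csv
import io


def read_rows(handle):
    """Read (query, reference, ANI) from the first three tab-separated columns."""
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) >= 3:
            yield fields[0], fields[1], fields[2]


def upper_triangle(rows):
    """Collapse ordered rows to one ANI value per unordered genome pair.

    Self-comparisons are dropped. When both directions are present,
    their ANI values are averaged (an assumption made for this sketch).
    """
    sums = {}
    for query, ref, ani in rows:
        if query == ref:
            continue
        key = tuple(sorted((query, ref)))
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + float(ani), count + 1)
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up rows standing in for an all-vs-all output file.
example = io.StringIO(
    "a.fna\tb.fna\t98.0\t100\t120\n"
    "b.fna\ta.fna\t97.0\t99\t118\n"
    "a.fna\ta.fna\t100.0\t120\t120\n"
)
result = upper_triangle(read_rows(example))
print(result)
```

To use it on a real run, pass an open file handle for the output file to `read_rows` instead of the in-memory example.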

jolespin commented 9 months ago

Would it be possible to provide a single list, index all of the genomes once, and then query each non-redundant pair?

I was thinking of implementing a wrapper that enumerates the pairs myself and calls FastANI for each one, but then realized the index would be rebuilt for each process.

I'm currently having memory issues with FastANI, and I think the n^2 comparisons might be the culprit.
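For scale, the pair enumeration itself is simple. This is a minimal sketch with hypothetical file names, showing that n genomes give n(n-1)/2 unordered pairs instead of n^2 ordered comparisons. It does not call FastANI, and as noted above, launching one process per pair would rebuild the index every time.

```python
from itertools import combinations


def upper_triangle_pairs(genomes):
    """Return each unordered pair of genomes once, without self-comparisons."""
    return list(combinations(genomes, 2))


genomes = ["g1.fna", "g2.fna", "g3.fna", "g4.fna"]  # hypothetical file names
pairs = upper_triangle_pairs(genomes)

n = len(genomes)
# Ordered all-vs-all comparisons versus unordered pairs.
print(n * n, len(pairs))  # prints "16 6"
```

For a list of 1k genomes, that is 1,000,000 ordered comparisons against 499,500 unordered pairs.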

jolespin commented 9 months ago

Also, if you provide a list of 1k genomes for --rl and the same list for --ql, does it calculate the index of each genome twice?