Open jolespin opened 11 months ago
The current implementation of FastANI indexes all genomes in the reference list at the preprocessing stage. The index is not changed afterwards when each query genome is processed. As a result, doing n^2 computations is more convenient for us.
We can try periodically recomputing the index with fewer genomes. I am not sure how much time we will gain by this.
Would it be possible to provide a single list, index all of the genomes once, and then query each non-redundant pair?
I was thinking of writing a wrapper that enumerates the pairs itself and calls FastANI on each one, but then realized the index would be rebuilt for every process.
I'm currently having memory issues with FastANI, and I think the n^2 comparisons might be the culprit.
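For reference, here is a minimal sketch of the wrapper idea described above, assuming the genome paths are already in a Python list. The file names and the helper names `unique_pairs` and `fastani_command` are hypothetical; `-q`, `-r`, and `-o` are FastANI's single query, single reference, and output options. The sketch only builds the commands and does not run them, since launching each one separately is exactly where the per-process indexing cost would come in.

```python
from itertools import combinations

def unique_pairs(genomes):
    """Return every unordered pair of distinct genomes exactly once."""
    return list(combinations(genomes, 2))

def fastani_command(query, reference, output):
    """Build the argument list for one pairwise FastANI call (not executed here)."""
    return ["fastANI", "-q", query, "-r", reference, "-o", output]

# Hypothetical genome files, for illustration only.
genomes = ["a.fna", "b.fna", "c.fna", "d.fna"]

pairs = unique_pairs(genomes)
commands = [
    fastani_command(query, reference, f"pair_{i}.txt")
    for i, (query, reference) in enumerate(pairs)
]

n = len(genomes)
print(f"{len(pairs)} unique pairs instead of {n * n} ordered comparisons")
```

For n genomes this gives n(n-1)/2 pairs, so a list of 1,000 genomes would need 499,500 comparisons rather than 1,000,000.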
Also, if you provide a list of 1k genomes for --rl and the same list for --ql, does it calculate the index of each genome twice?
A new tool called skani has a very convenient option that avoids a lot of duplicate computation: https://github.com/bluenote-1577/skani/wiki/skani-basic-usage-guide#skani-triangle---all-to-all-ani-computation
Would it be possible for FastANI to use this functionality as well to only calculate the upper triangle?
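To illustrate what I mean, here is a rough sketch of how upper-triangle results could be expanded back into a full matrix afterwards. It assumes ANI(a, b) can be treated as equal to ANI(b, a), which is only an approximation, and that self-comparisons are set to 100. The function name `full_matrix`, the file names, and the ANI values are all made up for the example.

```python
def full_matrix(genomes, upper):
    """Expand upper-triangle ANI values into a symmetric matrix.

    Assumes ANI is symmetric and that the diagonal (self-ANI) is 100.0.
    Pairs missing from `upper` are left as None.
    """
    index = {genome: i for i, genome in enumerate(genomes)}
    n = len(genomes)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 100.0  # assumed self-ANI
    for (a, b), ani in upper.items():
        i, j = index[a], index[b]
        matrix[i][j] = ani
        matrix[j][i] = ani  # mirror into the lower triangle
    return matrix

# Hypothetical genomes and ANI values, for illustration only.
genomes = ["a.fna", "b.fna", "c.fna"]
upper = {
    ("a.fna", "b.fna"): 98.5,
    ("a.fna", "c.fna"): 81.2,
    ("b.fna", "c.fna"): 80.9,
}

matrix = full_matrix(genomes, upper)
for row in matrix:
    print(row)
```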