A few of the commands need to access specific contig sequences while iterating over all contigs—for example, we go through all contigs → for those that match some condition, retrieve the contig sequence → do some more stuff using the sequence.
When there are thousands of contigs, I think this process will become slow (or at least slower than needed) because we retrieve sequences using fasta_utils.get_single_seq()—which in turn goes through, in the worst case, every sequence in the FASTA file. Since this is done during an iteration over all contigs, it's one of those O(|contigs|^2) situations.
It is possible to speed this up by indexing the FASTA file to allow for "random access," which should cut the runtime of accessing a given contig down to ~O(1) (and thus cut the total runtime down to O(|contigs|). It would probably make sense to just use an existing library that supports indexing / random access in FASTA files (rather than re-inventing the wheel); pyfaidx seems good. (We currently read FASTA files using scikit-bio, and we can probably keep that around for most things, but I don't think it supports indexing / random access.)
A few of the commands need to access specific contig sequences while iterating over all contigs—for example, we go through all contigs → for those that match some condition, retrieve the contig sequence → do some more stuff using the sequence.
When there are thousands of contigs, I think this process will become slow (or at least slower than needed) because we retrieve sequences using
fasta_utils.get_single_seq()
—which in turn goes through, in the worst case, every sequence in the FASTA file. Since this is done during an iteration over all contigs, it's one of thoseO(|contigs|^2)
situations.It is possible to speed this up by indexing the FASTA file to allow for "random access," which should cut the runtime of accessing a given contig down to ~
O(1)
(and thus cut the total runtime down toO(|contigs|)
. It would probably make sense to just use an existing library that supports indexing / random access in FASTA files (rather than re-inventing the wheel); pyfaidx seems good. (We currently read FASTA files using scikit-bio, and we can probably keep that around for most things, but I don't think it supports indexing / random access.)