fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License
8 stars 1 forks source link

Use pyfaidx to access contigs in FASTA files #64

Open fedarko opened 2 years ago

fedarko commented 2 years ago

A few of the commands need to access specific contig sequences while iterating over all contigs—for example, we go through all contigs → for those that match some condition, retrieve the contig sequence → do some more stuff using the sequence.

When there are thousands of contigs, I think this process will become slow (or at least slower than needed) because we retrieve sequences using fasta_utils.get_single_seq()—which in turn goes through, in the worst case, every sequence in the FASTA file. Since this is done during an iteration over all contigs, it's one of those O(|contigs|^2) situations.

It is possible to speed this up by indexing the FASTA file to allow for "random access," which should cut the runtime of accessing a given contig down to ~O(1) (and thus cut the total runtime down to O(|contigs|). It would probably make sense to just use an existing library that supports indexing / random access in FASTA files (rather than re-inventing the wheel); pyfaidx seems good. (We currently read FASTA files using scikit-bio, and we can probably keep that around for most things, but I don't think it supports indexing / random access.)