fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes
BSD 3-Clause "New" or "Revised" License

More efficient contig subsetting #62

Open fedarko opened 1 year ago

fedarko commented 1 year ago

Currently, the user can "focus on" certain contigs for most downstream commands by providing as input a FASTA file that contains fewer sequences than were used upstream. This is easy to reason about.

However, it's inefficient: every contig in this smaller FASTA file is stored twice on disk (once in the smaller FASTA file, and once in the original FASTA file). This gets worse as you create multiple sets of contigs to focus on.

It would be nice to modify things so that, for all commands that support subsetting (everything that takes a FASTA file as input besides align and fdr fix, I think?), the user could optionally provide a simple list of contig names to subset to (e.g. a file where each line is just a contig name, something like edge_6104\nedge_1671\nedge_2358\n).
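For illustration, here is a rough sketch of what the subsetting step could look like. This is not existing strainFlye code: the function names `load_contig_names` and `iter_fasta_subset` are hypothetical, and it assumes that a contig's name is the first whitespace-delimited token of its FASTA header. It reads the original FASTA file once and yields only the requested contigs, so no second FASTA file needs to be written.

```python
from typing import Iterable, Iterator, Set, Tuple


def load_contig_names(path: str) -> Set[str]:
    """Read a file with one contig name per line (blank lines are skipped)."""
    with open(path) as f:
        names = [line.strip() for line in f if line.strip()]
    if len(names) != len(set(names)):
        raise ValueError(f"Duplicate contig names in {path}")
    return set(names)


def iter_fasta_subset(
    fasta_path: str, names: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for each requested contig in the FASTA file.

    Assumption: the contig name is the first whitespace-delimited token
    of the header line. Raises ValueError if a requested name is absent.
    """
    remaining = set(names)
    name = None
    chunks = []
    with open(fasta_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header ends the previous record; emit it if wanted.
                if name in remaining:
                    remaining.discard(name)
                    yield name, "".join(chunks)
                header = line[1:].split()
                name = header[0] if header else ""
                chunks = []
            elif name in remaining:
                # Only keep sequence lines for contigs we care about.
                chunks.append(line)
    # Emit the final record in the file, if it was requested.
    if name in remaining:
        remaining.discard(name)
        yield name, "".join(chunks)
    if remaining:
        raise ValueError(f"Contigs not found in FASTA: {sorted(remaining)}")
```

A command that accepts both a FASTA file and a name list could then call `iter_fasta_subset(fasta_path, load_contig_names(names_path))` in place of reading every sequence. Raising an error on names that are missing from the FASTA file is a design choice made here to catch typos early; silently ignoring them would be another option.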

This would take a fair amount of work and isn't urgent, but it'd be nice.