Closed by clami66 2 years ago
I have now tested `samtools faidx` vs `seqtk subseq`. On average, `samtools faidx` is 17 times faster than `seqtk subseq`, and there seems to be no large penalty for extracting more than one ID (which I believe we rarely need to do?); see attached figures. Indexing a 300 GB database took ~20 min using ~9 GB of memory, so no big deal. I will add an indexing rule and change Breadth_Of_Coverage accordingly.
In the authentication pipeline we have:
seqtk subseq {params.malt_fasta} {output.name_list} > {output.fasta}
This will take a while (I stopped the test after a few minutes) when running on a full-blown DB, and would be repeated for all tax IDs found for each sample.
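To illustrate why an index avoids repeating that full scan for every tax ID, here is a minimal Python sketch. It is a toy under stated assumptions, not how `samtools faidx` or `seqtk` are implemented: the function names `build_offset_index` and `fetch_record` are hypothetical, and the index is kept in memory rather than in a `.fai` file. The FASTA is scanned once to record where each record starts, and later lookups seek directly to that offset.

```python
def build_offset_index(fasta_path):
    """Scan the FASTA once and map each record name to its header's byte offset.

    Hypothetical helper for illustration; a real .fai index stores more fields.
    """
    index = {}
    with open(fasta_path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                parts = line[1:].split()
                if parts:  # skip headers that have no name
                    index[parts[0].decode()] = offset
            offset += len(line)
    return index


def fetch_record(fasta_path, index, name):
    """Seek straight to one record and read until the next header or end of file."""
    lines = []
    with open(fasta_path, "rb") as handle:
        handle.seek(index[name])
        lines.append(handle.readline())  # the header line itself
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            lines.append(line)
    return b"".join(lines).decode()
```

The one-time scan is the expensive part; after that, each extraction only reads the bytes of the requested record, which is consistent with the indexing cost being paid once per database.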
The help for `seqtk subseq` mentions:

So maybe we could do just that? In which case we would need to add the constraint that the `malt_fasta` DB is indexed.
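A possible shape for that constraint, sketched in Python under some assumptions: the helper names (`is_indexed`, `faidx_extract_command`, `extract_sequences`) are hypothetical, the index is assumed to sit next to the FASTA as `<fasta>.fai`, and the installed samtools is assumed to be recent enough to support the `--region-file` and `--output` options of `samtools faidx`. Compressed (bgzip) databases, which need an additional `.gzi` file, are not handled here.

```python
import os
import subprocess


def is_indexed(fasta_path):
    """Return True if a samtools-style .fai index sits next to the FASTA."""
    return os.path.exists(fasta_path + ".fai")


def faidx_extract_command(fasta_path, name_list, out_fasta):
    """Build the samtools call that would replace the seqtk subseq command.

    name_list is a text file with one sequence name (region) per line.
    """
    return [
        "samtools", "faidx", fasta_path,
        "--region-file", name_list,
        "--output", out_fasta,
    ]


def extract_sequences(fasta_path, name_list, out_fasta):
    """Refuse to run without an index, then extract the listed sequences."""
    if not is_indexed(fasta_path):
        raise FileNotFoundError(
            f"{fasta_path}.fai not found; run 'samtools faidx {fasta_path}' first"
        )
    subprocess.run(
        faidx_extract_command(fasta_path, name_list, out_fasta), check=True
    )
```

Failing early when the `.fai` is missing keeps the extraction step from silently triggering a ~20 min indexing run on a large database; in a Snakemake workflow the same effect could be had by listing the index file as an input of the extraction rule.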