NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License

seqtk subseq is slow on large databases #58

Status: Closed (clami66 closed this issue 2 years ago)

clami66 commented 2 years ago

In the authentication pipeline we have:

seqtk subseq {params.malt_fasta} {output.name_list} > {output.fasta}

This takes a long time on a full-blown DB (I stopped the test after a few minutes), and it would be repeated for every tax ID found in each sample.

The help for seqtk subseq mentions:

Note: Use 'samtools faidx' if only a few regions are intended.

So maybe we could do just that? In that case we would need to add the constraint that the malt_fasta DB is indexed.
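For context on why an index helps when only a few sequences are needed, here is a minimal Python sketch of index-based retrieval. It is not samtools' implementation; it only mimics the five-column idea of a .fai index (length, byte offset, bases per line, bytes per line) so that a record can be read with a single seek instead of a scan of the whole file. The function names build_fai and fetch are hypothetical, and the sketch assumes uniform line lengths within each record, as faidx does.

```python
import io


def build_fai(fasta):
    """Scan a binary FASTA stream once.

    Returns {name: [length, offset, linebases, linewidth]}, loosely
    following the columns of a .fai index (illustrative only).
    """
    index = {}
    name = None
    pos = 0  # byte position of the start of the current line
    for line in fasta:
        if line.startswith(b">"):
            # The sequence name is the first whitespace-delimited token.
            name = line[1:].split()[0].decode()
            # Sequence data starts right after the header line.
            index[name] = [0, pos + len(line), 0, 0]
        elif name is not None and line.strip():
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                # First sequence line defines bases/line and bytes/line.
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        pos += len(line)
    return index


def fetch(fasta, index, name):
    """Return one whole sequence by seeking directly to its offset."""
    length, offset, linebases, linewidth = index[name]
    if length == 0:
        return ""
    # Full lines include their line terminator; the last partial line does not.
    full_lines, rest = divmod(length, linebases)
    fasta.seek(offset)
    raw = fasta.read(full_lines * linewidth + rest)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


# Tiny in-memory example standing in for the (hypothetical) malt_fasta DB.
demo = io.BytesIO(b">a desc\nACGT\nAC\n>b\nGGGG\nTTTT\n>c\nA\n")
demo_index = build_fai(demo)
print(fetch(demo, demo_index, "b"))
```

The index is built once per database; afterwards each lookup costs one seek and one read, regardless of where the record sits in the file.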

percyfal commented 2 years ago

I have now tested samtools faidx vs seqtk subseq. On average samtools faidx is 17 times faster than seqtk subseq, and there seems to be no large penalty for extracting more than one ID (which I believe we rarely need to do?); see attached figures. Indexing a 300 GB database took ~20 min using ~9 GB of memory, so no big deal. I will add an indexing rule and change Breadth_Of_Coverage accordingly.
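As a rough illustration of the two steps described above (index once, then extract by name list), the following Python sketch only builds the command lines and does not run them. The helper faidx_commands and the file paths are hypothetical, and this is not the rule that was added to the workflow. The -r (region file) and -o (output) options are used as I understand them from the samtools faidx manual; check them against the installed samtools version.

```python
import shlex


def faidx_commands(malt_fasta, name_list, out_fasta):
    """Build (not run) the indexing and extraction commands.

    Assumption: `samtools faidx -r FILE` reads one region/name per line
    and `-o FILE` writes the extracted FASTA to FILE.
    """
    # One-off step: creates the index next to the database (malt_fasta + ".fai").
    index_cmd = ["samtools", "faidx", malt_fasta]
    # Per tax ID step: replaces `seqtk subseq malt_fasta name_list > out_fasta`.
    extract_cmd = ["samtools", "faidx", "-r", name_list, "-o", out_fasta, malt_fasta]
    return index_cmd, extract_cmd


# Hypothetical paths, for illustration only.
index_cmd, extract_cmd = faidx_commands("db.fasta", "name_list.txt", "out.fasta")
print(shlex.join(index_cmd))
print(shlex.join(extract_cmd))
```

Keeping the indexing command separate mirrors the idea of a dedicated indexing rule: the expensive step runs once per database, while the extraction step runs once per tax ID.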

[Attached figures: sequence retrieval runtime plot; sequence retrieval memory plot]