NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License

The Malt_QuantifyAbundance rule takes a long time to run #68

Closed clami66 closed 1 year ago

clami66 commented 2 years ago

As @zoepochon mentioned on slack, this rule sometimes takes days to run. It is a simple bash loop in which tax IDs are grepped out of gzipped sam files, one zgrep per tax ID, so each sam file is decompressed and scanned once for every ID. It might become impractical to run as the number of tax IDs grows.

Another issue we noticed while looking at this: the grepping is done with zgrep "|tax|$i" ..., which is a plain substring match, so e.g. tax ID 123 will match |tax|123 but also |tax|12345 and so on, inflating the counts for any ID that is a prefix of another.
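To illustrate the prefix problem, here is a minimal sketch comparing a substring test with a pattern that refuses to match when more digits follow the ID. The function names and the example reference string are hypothetical, and it assumes that a tax ID in the sam line is always followed by a non-digit character or the end of the line.

```python
import re


def substring_match(line, tax_id):
    # Mimics a plain substring search for |tax|<id>: also hits longer IDs.
    return f"|tax|{tax_id}" in line


def exact_match(line, tax_id):
    # Negative lookahead rejects the match if another digit follows the ID.
    pattern = rf"\|tax\|{re.escape(str(tax_id))}(?!\d)"
    return re.search(pattern, line) is not None


# Hypothetical reference name, only for illustration.
line = "read1\t0\tref|tax|12345|\t1"
print(substring_match(line, 123))  # substring test
print(exact_match(line, 123))      # boundary-aware test
```

With the substring test, a line carrying tax ID 12345 is also counted for 123; the boundary-aware pattern only counts it for 12345.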

I propose we redo this counting step in python, so that all tax IDs are counted at once in a single pass over the sam file (instead of, say, 100 grep operations when looking for 100 tax IDs).
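A rough sketch of what the single-pass version could look like, not a final implementation. The names count_tax_ids and count_tax_ids_in_file are hypothetical. It assumes tax IDs are numeric and appear as |tax|<digits> in the sam lines, that a line should be counted at most once per tax ID (like a line-based grep count), and that header lines starting with @ should be skipped. That last point may differ from what the current bash loop does.

```python
import gzip
import re
from collections import Counter

# Greedy \d+ captures the whole ID, so 123 and 12345 are never confused.
TAX_RE = re.compile(r"\|tax\|(\d+)")


def count_tax_ids(lines, wanted):
    """Count, for each wanted tax ID, the lines that mention it."""
    wanted = {str(t) for t in wanted}
    counts = Counter({t: 0 for t in wanted})
    for line in lines:
        if line.startswith("@"):
            # Assumption: sam header lines are not alignments, skip them.
            continue
        # set() so that a line is counted once per tax ID.
        for tax_id in set(TAX_RE.findall(line)):
            if tax_id in wanted:
                counts[tax_id] += 1
    return counts


def count_tax_ids_in_file(path, wanted):
    # Stream the gzipped sam file as text; it is read exactly once.
    with gzip.open(path, "rt") as handle:
        return count_tax_ids(handle, wanted)
```

Each line costs one regex scan plus a set lookup per ID found, so the runtime depends on the size of the sam file rather than on the file size times the number of tax IDs.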