As @zoepochon mentioned on Slack, this rule can take a day or more to run. It is a simple bash loop that greps each tax ID out of the gzipped SAM files, one ID at a time. It may become impractical to run as the number of tax IDs grows.
Another issue we noted while looking at this: the grepping is done with `zgrep \"|tax|$i\" ...`, which is a substring match. As a result, tax ID 123 matches `|tax|123` but also `|tax|12345` and so on, so the count for any tax ID that is a prefix of another one is inflated.
I propose we redo this counting step in Python, so that all tax IDs are counted at once in a single pass over the SAM file (instead of, say, 100 grep operations when looking for 100 tax IDs).
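Here is a minimal sketch of what the single-pass counting could look like. It is not a drop-in replacement for the rule: the function name `count_tax_ids` is hypothetical, and it assumes that a tax ID is the run of digits directly after `|tax|`, that header lines (starting with `@`) should be skipped, and that a line counts at most once per tax ID, as it would with grep.

```python
import gzip
import re
from collections import Counter

# Assumption: a tax ID is the full run of digits that directly follows "|tax|".
# Capturing the whole run means 123 can no longer match inside 12345.
TAX_ID_PATTERN = re.compile(r"\|tax\|([0-9]+)")


def count_tax_ids(sam_gz_path, tax_ids):
    """Count, for each requested tax ID, the alignment lines that mention it.

    Hypothetical helper: the file is read once, whatever the number of tax IDs.
    """
    wanted = {str(tax_id) for tax_id in tax_ids}
    # Start every requested ID at 0 so that absent IDs are still reported.
    counts = Counter({tax_id: 0 for tax_id in wanted})
    with gzip.open(sam_gz_path, "rt") as handle:
        for line in handle:
            # Assumption: SAM header lines (starting with "@") are not counted.
            if line.startswith("@"):
                continue
            # A line counts at most once per tax ID, like a grep line match.
            for tax_id in set(TAX_ID_PATTERN.findall(line)) & wanted:
                counts[tax_id] += 1
    return counts
```

Calling `count_tax_ids("sample.sam.gz", [123, 12345])` (with a hypothetical file name) returns a `Counter` keyed by tax ID as a string. The set lookup keeps the cost per line roughly constant as the list of tax IDs grows.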