Closed nick-youngblut closed 4 years ago
Would it work to just filter out low-complexity reads with fastp prior to the initial bowtie2 rule? If so, it seems that one could remove the rules:
`kz` is Komplexity (https://github.com/eclarke/komplexity). Komplexity is in the environment yaml, but perhaps it's not installing correctly? Are you running this and getting errors?
The reason for aligning first and then filtering is speed. Most of a metagenomic dataset will not align to the euk marker dataset, so aligning first means filtering far fewer sequences. Filtering first means processing all or most of the fastq twice.
I missed that in the yaml. Thanks!
It could possibly be faster to filter first, given that the pipeline currently uses several rules to i) convert the bam to fastq, ii) find the low-complexity sequences, and iii) filter the low-complexity sequences out. Pre-filtering would instead be a single filtering step prior to running bowtie2 (e.g., `kz --filter < seqs.fq > seqs_filt.fq`). Maybe I'm missing something about the workflow?
It is several rules, but in practice I've found that several lightweight rules running on a very small dataset (usually hundreds to thousands of sequences) are much, much faster than one rule processing 50 million or more sequences. I've tested both ways, and filtering second runs faster.
Your `find_low_complexity` snakemake rule uses `kz`, which seems to be an unlisted dependency. I haven't been able to find any software named "kz". How can the user install it? Is there any way to avoid that dependency (e.g., using a python script instead)?
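For illustration, a dependency-free python stand-in for `kz --filter` might look like the sketch below. It scores each read by its fraction of distinct k-mers (Komplexity uses a similar unique-k-mer measure) and keeps reads above a cutoff; the `k=4` and `threshold=0.55` values here are illustrative assumptions, not Komplexity's actual defaults.

```python
def kmer_complexity(seq, k=4):
    """Distinct k-mers divided by total k-mers (1.0 = maximally complex)."""
    if len(seq) < k:
        return 0.0
    total = len(seq) - k + 1
    distinct = len({seq[i:i + k] for i in range(total)})
    return distinct / total

def filter_fastq(lines, k=4, threshold=0.55):
    """Yield 4-line FASTQ records whose sequence passes the complexity cutoff.

    `lines` is any iterable of FASTQ lines (e.g. an open file handle).
    The threshold is a hypothetical example value.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if kmer_complexity(record[1], k) >= threshold:
                yield from record
            record = []

reads = [
    "@read1", "AAAAAAAAAAAAAAAA", "+", "IIIIIIIIIIIIIIII",  # homopolymer: low complexity
    "@read2", "ACGTAGCTTGACCGTA", "+", "IIIIIIIIIIIIIIII",  # high complexity
]
kept = list(filter_fastq(reads))  # only read2 survives the cutoff
```

This would be slower than Komplexity (which is written in Rust), but for a small post-alignment read set the difference may not matter.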