allind / EukDetect

kz dependency #5

Closed by nick-youngblut 4 years ago

nick-youngblut commented 4 years ago

Your find_low_complexity Snakemake rule uses kz, which seems to be an unlisted dependency. I haven't been able to find any software named "kz". How can the user install it? Is there any way to avoid that dependency (e.g., by using a Python script instead)?

nick-youngblut commented 4 years ago

Would it work to just filter out low complexity reads with fastp prior to the initial bowtie2 rule? If that would work, then it seems that one could remove the rules:

allind commented 4 years ago

Kz is Komplexity (https://github.com/eclarke/komplexity). Komplexity is in the environment yaml, but perhaps it's not installing correctly? Are you running this and getting errors?

The reason for aligning then filtering is speed. Most of a metagenomic dataset will not align to the euk marker dataset, so aligning then filtering means filtering many fewer sequences. Filtering then aligning means processing all/most of the fastq twice.

nick-youngblut commented 4 years ago

I missed that in the yaml. Thanks!

It could possibly be faster to filter first, given that the pipeline currently uses several rules to i) convert the BAM to FASTQ, ii) find the low-complexity sequences, and iii) filter out the low-complexity sequences. Pre-filtering would instead be a single filtering step before running bowtie2 (e.g., kz --filter < seqs.fq > seqs_filt.fq). Maybe I'm missing something about the workflow?

allind commented 4 years ago

It is several rules, but in practice I've found that having several rules that aren't heavy computation running on a very small dataset (usually hundreds to thousands of sequences) is much, much faster than one rule processing 50 million or more sequences. I've tested both ways and filtering second runs faster.
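
For readers wondering what the Python alternative asked about at the top of the thread might look like, here is a minimal sketch of a low-complexity filter applied to the small set of reads that survive alignment. It assumes complexity is scored as the number of distinct k-mers divided by read length, which is my understanding of Komplexity's measure. The function names, k=4, and the 0.55 cutoff are illustrative assumptions, not values confirmed in this thread or taken from EukDetect.

```python
def kmer_complexity(seq: str, k: int = 4) -> float:
    """Fraction of distinct k-mers relative to read length (assumed scoring)."""
    seq = seq.upper()
    if len(seq) < k:
        return 0.0  # too short to contain a single k-mer
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(kmers) / len(seq)


def filter_low_complexity(reads, k: int = 4, threshold: float = 0.55):
    """Keep (name, sequence) pairs scoring at or above the threshold.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    return [(name, seq) for name, seq in reads
            if kmer_complexity(seq, k) >= threshold]


if __name__ == "__main__":
    reads = [
        ("homopolymer", "A" * 20),
        ("mixed", "ACGTTGCAAGCTTAGGCATC"),
    ]
    for name, seq in filter_low_complexity(reads):
        print(name)
```

On a few hundred to a few thousand aligned reads, a pure-Python loop like this is cheap. On a full 50-million-read FASTQ it would not be, which is consistent with the align-then-filter ordering described above.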