dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
757 stars 295 forks source link

Add filter length option to split-paired-reads.py #1677

Open Sanrrone opened 7 years ago

Sanrrone commented 7 years ago

Dears, sometimes the reads that filter-abund and normalize-by-median give (and then processed by split-paired-reads.py), are too short for the assembler giving some errors, so is necessary filter that reads before the assembly step but there is no option to do and using external software is required. will be a very time-save option that split-paired-reads.py have an parameter to filter the reads of specific length.

regards

standage commented 7 years ago

Thank you Sandro.

Unless I'm mistaken, normalize-by-median.py doesn't trim reads at all. Rather it chooses whether to keep or discard the entire read based on the abundances of its constituent k-mers. Scripts like filter-abund.py and trim-low-abund.py do trim reads, and if trimming produces a read that is shorter than the specified --ksize the read is discarded entirely.

If I understand, you're suggesting we provide an option for users to specify their own length threshold that reads must pass to be reported?

I think this is a good idea, but would probably be more appropriate in the trimming/filtering scripts than in the split-paired-reads.py script.

ctb commented 7 years ago

Also see script 'extract-long-sequences.py' which does this - not to say it wouldn't be clearer elsewhere, tho :)

Sanrrone commented 7 years ago

thanks to everyone, you are right @ctb , extract-long-sequences.py do the work but the next step will be paired the reads that now are singletons (in case you have .pe.fastq), so again extra tools might be required, I think the best way is that @standage says.

I follow the protocol where mention that is necessary a quality control (prinseq), before do a digital normalization. About the filter, the max size of the k-mer in normalize-by-median.py is 32, you say that if a read is less than 32 the read if descarted, that is ok, but if the read is longer than 32 but not longer to overlap with their pair is also keep (my case). this actually I solved using prinseq that filter the length and conserve the pairs, that's why I put this commit, just for no calling prinseq again.