sfiligoi closed 5 months ago
@ch4rr0 Could you please review?
Hello Igor, I will take a look today.
The resulting output file is huge, putting a lot of strain on the IO system. Reducing the IO cost at the source would be highly preferred.
CC @wasade
Thanks, @sfiligoi!
@ch4rr0, shaving IO natively within bowtie2 would be pleasant
@BenLangmead, thoughts?
Just a reminder....
I think this kind of straightforward postprocessing is best left to awk and similar tools. Otherwise we accumulate too many command-line options that make later changes trickier.
I know that this is in tension with the fact that Bowtie had the --suppress option for this purpose: https://bowtie-bio.sourceforge.net/manual.shtml#bowtie-options-suppress. But I think keeping it simple is key.
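For readers following along, here is a minimal sketch of the kind of postprocessing suggested above, written in Python rather than awk. It is not code from this PR. The function names `omit_primary_seq` and `filter_sam` are hypothetical, and the sketch assumes that every record without the SAM secondary flag bit (0x100) counts as primary, which means unaligned reads are blanked as well.

```python
import sys

SECONDARY = 0x100   # SAM FLAG bit marking a secondary alignment
SEQ, QUAL = 9, 10   # zero-based column indices of the SEQ and QUAL fields


def omit_primary_seq(line):
    """Return a SAM line with SEQ and QUAL set to '*' on non-secondary records.

    Assumption: any record without the 0x100 bit is treated as primary.
    """
    if line.startswith("@"):
        return line                      # header line, pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return line                      # not a full SAM record, leave as-is
    if int(fields[1]) & SECONDARY:
        return line                      # secondary alignment, keep SEQ/QUAL
    fields[SEQ] = "*"
    fields[QUAL] = "*"
    return "\t".join(fields) + "\n"


def filter_sam(src, dst):
    """Stream SAM text from src to dst, one line at a time."""
    for line in src:
        dst.write(omit_primary_seq(line))
```

Calling `filter_sam(sys.stdin, sys.stdout)` from a small script would let bowtie2's standard output be piped straight through the filter, so the full SAM never has to be staged on disk. The alignments still cross a pipe and get parsed a second time, which is the overhead the proposed option aims to avoid.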
Unfortunately, --suppress does not work with -S/--sam.
Correct
Hi @BenLangmead, this option is valuable to our efforts with Qiita (https://qiita.ucsd.edu/). Qiita currently houses .sam output from 50-100k metagenomic samples, which are typically mapped against a few databases. The overall volume of data is large, and reprocessing occurs periodically. We currently post-process to reduce the storage burden, but avoiding the significant IO needed to stage .sam temporarily for filtering would be an appreciable runtime improvement.
I appreciate your comments; I suggest that awk, mawk, or a similar tool would be a good expedient, or feel free to use a fork with your change. We do not plan to integrate this feature into the master branch.
Thanks, @BenLangmead! We appreciate the follow-up, and all of the incredible work that has gone, and continues to go, into bowtie2!
Add --sam-omit-prim-seq, with the same semantics as --sam-omit-sec-seq but operating on primary alignments.
Addresses #457