dpark01 closed this issue 6 years ago.
This would be nice. I'd also like to revisit unaligned duplicate removal. I've done some reading on it, but have yet to commit to a path of action (writing something of our own, maybe MinHash-based, or using something off the shelf; a rough sketch of the MinHash idea is below). Brian Bushnell of JGI has a new duplicate removal tool, Clumpify, aimed mostly at optical duplicates (via spatial proximity), but it acts on fastq files.

There's also cd-hit, which Simon added, but it has unresolved bugs at the moment. A tool that acts only on fastq files isn't a big deal, since we have design patterns for dealing with that all over our code base, including in our current duplicate removal step.
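For reference, the MinHash idea would look something like the following. This is purely an illustration of the approach, not working viral-ngs code; `K`, `NUM_HASHES`, and the 0.75 threshold are arbitrary placeholders, not tuned values:

```python
# Illustrative sketch of a MinHash-based near-duplicate check for reads.
# None of this exists in viral-ngs; names and parameters are assumptions.
import hashlib

K = 16           # k-mer size (assumption)
NUM_HASHES = 8   # signature length (assumption)

def minhash_signature(seq):
    # Salt one hash function NUM_HASHES times rather than maintaining
    # independent hash functions; fall back to the whole read if it is
    # shorter than K.
    kmers = [seq[i:i + K] for i in range(len(seq) - K + 1)] or [seq]
    sig = []
    for salt in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(
                hashlib.md5(b"%d:" % salt + kmer.encode()).digest()[:8],
                "big")
            for kmer in kmers))
    return tuple(sig)

def is_near_duplicate(sig_a, sig_b, threshold=0.75):
    # The fraction of matching signature slots estimates the Jaccard
    # similarity of the two reads' k-mer sets.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a) >= threshold
```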
It may also be worth revisiting the depletion step (my gut is to replace bmtagger with bwa mem), but that's a separate issue.
I found this helpful for thinking about duplicates, from an Illumina doc somewhere. It does not include tile-edge duplicates (mostly a NextSeq issue).
The bug in cd-hit is that it basically only removes identical duplicates, but it's also unfriendly to piped fastq, so I wouldn't recommend it generally.
In most cases, mvicuna duplicate removal is only called on a single sequencing library, because much of the time that we run `taxon_filter.py deplete_human`, it is only being called on a single library of a single sample (e.g. auto-demux_plus post DNAnexus upload, any MiSeq-based analyses; even snakemake executions of large data sets spanning multiple runs will break up jobs by run prior to merging bams). But sometimes people will have a single large merged BAM file they want to run through the entire workflow (e.g. @tomkinsc). In those cases, `read_utils.rmdup_mvicuna_bam` (which is part of `taxon_filter.deplete_human`) is smart enough to break out the BAM file into read groups and run duplicate removal on a per-library basis (since that's the correct way to remove duplicates).

mvicuna DupRm is single-threaded and never utilizes more than one core. If we have more than one core available, and more than one library exists in the BAM file's read group headers, we might as well do the same concurrent.futures thing we do elsewhere in viral-ngs: call mvicuna on the libraries in parallel (with the ProcessPoolExecutor's worker count capped at the number of cores given to it) and save a little time that way. We could then call the FilterSamReadsTool once on the concatenated read list. Rough sketch below.
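Something like this, where `split_bam_by_library`, `dedup_one_library`, `concat_read_lists`, and `filter_sam_reads` are hypothetical stand-ins for the existing read-group splitting, the single-library mvicuna DupRm call, and the Picard FilterSamReads wrapper (they are not actual viral-ngs functions):

```python
# Rough sketch of the proposed parallelization; illustrative only.
import concurrent.futures
import os

def rmdup_mvicuna_bam_parallel(in_bam, out_bam, threads=None):
    threads = threads or os.cpu_count()
    # One set of per-library reads per LB tag in the read group headers
    # (hypothetical helper).
    libraries = split_bam_by_library(in_bam)
    dup_lists = []
    # mvicuna DupRm is single-threaded, so run one process per library,
    # capped at the number of cores we were given.
    with concurrent.futures.ProcessPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(dedup_one_library, lib) for lib in libraries]
        for fut in concurrent.futures.as_completed(futures):
            dup_lists.append(fut.result())
    # Concatenate the per-library duplicate read lists and make a single
    # FilterSamReads call against the original BAM (hypothetical helpers).
    merged = concat_read_lists(dup_lists)
    filter_sam_reads(in_bam, merged, out_bam)
```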
As mentioned above, this optimization wouldn't matter most of the time, but in some use cases, it would be nice.