crazyhottommy closed this issue 6 years ago
Tommy,
Thanks for your interest in samblaster. While read-group awareness is an often-discussed feature, as of now samblaster is not read-group aware. This is because, in the typical usage scenario, samblaster is used in a pipe right after the alignment step (say, with BWA MEM), while the reads are still grouped by read-id (QNAME) and before the file is sorted by genome position. Such an alignment run usually includes reads from a single read-group. In the pipeline you describe, the final merge and re-mark-dup step for all the read-groups of a sample is done on a position-sorted file (unless you specifically re-sort by read-id, use samblaster to mark duplicates, and then re-sort by position).
In case I do add read-group awareness, I am curious. How many read-groups would be sufficient for your samples?
Thanks Greg
Hi Greg,
Thanks for your reply. I typically use samblaster in a pipe while reads are still grouped by id and not yet position sorted: https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/lancet/Snakefile#L297 but that's for samples with a single read group.
If there are multiple read groups, I have to align each one separately with its own @RG, then merge the bams (note that picard MergeSamFiles can handle the read groups as well, while samtools merge infers the read group from the filename, which may not be what you want) https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/multiRG/Snakefile#L362
and then mark duplicates using Picard markduplicates https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/multiRG/Snakefile#L397
For this particular set of samples, I have 5 read groups. I think the reads usually come from a flow-cell with 8 lanes, so 8 is usually enough? Does the number of read groups affect the implementation?
Picard MarkDuplicates uses 50G of RAM for a 20G WES bam, which is just way too much. For WGS, I have to split by chromosome and run MarkDuplicates on each piece. (I do indel realignment, base recalibration and the mutect call by chromosome anyway.)
Thanks again! Tommy
Yes, breaking the position-sorted files up by chromosome is an old trick to either save on resources, or increase parallelism if you have large enough machines (or both).
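As a rough illustration of that trick, here is a minimal Python sketch that only builds the per-chromosome `samtools view` command lines and does not run them. The function name, the output naming scheme, and the chromosome list are assumptions made for this example, and the input is assumed to be a position-sorted, indexed BAM.

```python
from typing import List


def per_chromosome_commands(bam: str, chromosomes: List[str]) -> List[List[str]]:
    """Build one `samtools view` command per chromosome.

    Assumes `bam` is position sorted and indexed, so that region
    queries such as `samtools view in.bam chr1` are possible.
    """
    # Hypothetical naming scheme: sample.bam -> sample.chr1.bam
    stem = bam[:-4] if bam.endswith(".bam") else bam
    commands = []
    for chrom in chromosomes:
        out = f"{stem}.{chrom}.bam"
        # -b writes BAM output, -o names the output file
        commands.append(["samtools", "view", "-b", "-o", out, bam, chrom])
    return commands


if __name__ == "__main__":
    for cmd in per_chromosome_commands("sample.bam", ["chr1", "chr2"]):
        print(" ".join(cmd))
```

Each resulting file can then go through duplicate marking (or any later per-chromosome step) on its own, either one after another to cap memory use or in parallel.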
If you can combine all the reads from each read-group separately before alignment, then use samblaster to mark duplicates as usual, you will be marking each read-group independently as desired. If so, I don't see why you would need to re-mark duplicates at the end. Does that solve your problem?
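To make the suggestion concrete, here is a small Python sketch that assembles one shell pipeline per read group: align with `bwa mem`, pipe through samblaster while the reads are still grouped by QNAME, then position-sort. The helper name, file names, and @RG field values are placeholders invented for this example; only `bwa mem -R`, a bare `samblaster` reading stdin, and `samtools sort -o` are used.

```python
import shlex


def align_and_mark_command(ref, fastq1, fastq2, rg_id, sample, library, out_bam):
    """Return a shell pipeline string for a single read group.

    The @RG line uses literal backslash-t sequences, which bwa mem
    converts to tabs when it writes the header.
    """
    rg = f"@RG\\tID:{rg_id}\\tSM:{sample}\\tLB:{library}\\tPL:ILLUMINA"
    stages = [
        ["bwa", "mem", "-R", rg, ref, fastq1, fastq2],
        ["samblaster"],                            # marks duplicates on QNAME-grouped SAM
        ["samtools", "sort", "-o", out_bam, "-"],  # position-sort afterwards
    ]
    # Quote every argument so the @RG string survives the shell intact
    return " | ".join(" ".join(shlex.quote(arg) for arg in stage) for stage in stages)


if __name__ == "__main__":
    print(align_and_mark_command(
        "ref.fa", "lane1_R1.fq.gz", "lane1_R2.fq.gz",
        "lane1", "sampleA", "libA", "sampleA.lane1.bam",
    ))
```

Running this once per read group marks duplicates within each read group only; it does not compare reads across read groups.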
Hi @GregoryFaust, thanks again.
From the link https://software.broadinstitute.org/gatk/documentation/article.php?id=3060
- Merge read groups and mark duplicates per sample (aggregation + dedup): Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, but for the sake of efficiency we combine this with the per-readgroup duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command).
The example data becomes:
sampleA.merged.dedup.bam sampleB.merged.dedup.bam
To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).
If the same library is sequenced in different lanes, one wants to merge the per-lane bams and then run markduplicates on the merged file.
See https://www.biostars.org/p/57143/. It also seems one can skip the per-lane markduplicates and only do it for the merged bam: https://gatkforums.broadinstitute.org/gatk/discussion/6199/picard-mark-duplicates-handling-of-library-information
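To illustrate why the merged round matters, here is a toy Python model. It is not Picard's or samblaster's actual algorithm; it simply assumes a duplicate is any later read with the same library, chromosome, position and strand as an earlier one, whichever lane (read group) it came from. The `Read` record and its fields are made up for this sketch.

```python
from collections import namedtuple

# Hypothetical minimal read record, for illustration only
Read = namedtuple("Read", "name read_group library chrom pos strand")


def mark_duplicates_by_library(reads):
    """Return names of reads flagged as duplicates in this toy model.

    The key deliberately ignores read_group, so copies of the same
    library fragment sequenced in different lanes collapse together.
    """
    seen = set()
    duplicates = []
    for read in reads:
        key = (read.library, read.chrom, read.pos, read.strand)
        if key in seen:
            duplicates.append(read.name)
        else:
            seen.add(key)
    return duplicates


if __name__ == "__main__":
    reads = [
        Read("r1", "lane1", "libA", "chr1", 100, "+"),
        Read("r2", "lane2", "libA", "chr1", 100, "+"),  # same library, other lane
        Read("r3", "lane3", "libB", "chr1", 100, "+"),  # different library
    ]
    print(mark_duplicates_by_library(reads))
```

In this model a per-lane pass would never see r1 and r2 together, which is the case the merged duplicate-marking round is meant to catch.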
Tommy
Hi,
Is samblaster read-group aware? I am following the GATK best practices: https://software.broadinstitute.org/gatk/documentation/article.php?id=3060
Picard MarkDuplicates handles that. I want to know if samblaster handles it as well. https://gatkforums.broadinstitute.org/gatk/discussion/6199/picard-mark-duplicates-handling-of-library-information
Thanks! Tommy