GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

Is samblaster readgroup aware? #32

Closed crazyhottommy closed 6 years ago

crazyhottommy commented 7 years ago

Hi,

Is samblaster read group aware? I am following GATK best practice. https://software.broadinstitute.org/gatk/documentation/article.php?id=3060

Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files.

Picard Mark Duplicates handles that. want to know if samblaster handles that as well. https://gatkforums.broadinstitute.org/gatk/discussion/6199/picard-mark-duplicates-handling-of-library-information

Thanks! Tommy

GregoryFaust commented 6 years ago

Tommy,

Thanks for your interest in samblaster. While read-group awareness this is an often discussed feature, as of now samblaster is not read-group aware. This is because the typical usage scenario, samblaster is used in a pipe right after the alignment step (say, with BWA MEM) while the reads are still grouped by read-id (QNAME) before the file is sorted by genome position. Such an alignment run usually includes reads from a single read-group. In the pipeline you describe, the final merge and re-mark dup step for all the read-groups for a sample is done on a position sorted file (unless you specifically resort by read-id, use samblaster to mark duplicates, and then resort by position).

In case I do add read-group awareness, I am curious. How many read-groups would be sufficient for your samples?

Thanks Greg

crazyhottommy commented 6 years ago

Hi Greg,

Thanks for your reply. I typically uses samblaster in a pipe when reads are still grouped by id not position sorted: https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/lancet/Snakefile#L297 but that's for samples with a single read group.

if there are multiple read groups, I have to align by @RG first for each bam, merge the bam (note that picard MergeSamFiles can handle the readgroup as well, while samtools merge infer readgroup from filename which may not be what you want) https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/multiRG/Snakefile#L362

and then mark duplicates using Picard markduplicates https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/blob/multiRG/Snakefile#L397

For this particular set of samples, I have 5 read groups. I think usually it is from a flow-cell with 8 lanes, so usually 8 is enough? Does the number of readgroup affect the implementation?

picard markduplicates uses 50G ram for a 20G WES bam, which is just way too big... for WGS, I have to split by chromosome and do the markduplicates.( I do indel realign, base recalibration and mutect call by chromosome anyways...)

Thanks again! Tommy

GregoryFaust commented 6 years ago

Yes, breaking the position sorted files by chromosome is an old trick to either saves on resources, or increase parallelism if you have large enough machines (or both).

If you can combine all the reads from each read-group separately before alignment, then use samblaster to mark duplicates as usual, you will be marking each read-group independently as desired. If so, I don't see why you would need to re mark duplicates at the end. Does that solve your problem?

crazyhottommy commented 6 years ago

Hi @GregoryFaust, thanks again.

From the link https://software.broadinstitute.org/gatk/documentation/article.php?id=3060

  1. Merge read groups and mark duplicates per sample (aggregation + dedup) Once you have pre-processed each read group individually, you merge read groups belonging to >the same sample into a single BAM file. You can do this as a standalone step, bur for the sake of >efficiency we combine this with the per-readgroup duplicate marking step (it's simply a matter of >passing the multiple inputs to MarkDuplicates in a single command).

The example data becomes:

sampleA.merged.dedup.bam sampleB.merged.dedup.bam To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

if the same library is sequenced in different lanes, one wants to merge the bam per lane and then markduplicates.

https://www.biostars.org/p/57143/ and it seems one can skip the per lane markduplicates and only do it for the merged bam. https://gatkforums.broadinstitute.org/gatk/discussion/6199/picard-mark-duplicates-handling-of-library-information

Tommy