broadinstitute / gtex-pipeline

GTEx & TOPMed data production and analysis pipelines
BSD 3-Clause "New" or "Revised" License

MarkDuplicates Does Not Affect Subsequent Analyses? #29

Closed: maegsul closed this issue 4 years ago

maegsul commented 4 years ago

Hi, first of all thanks a lot for this incredibly useful repository.

I am following the GTEx v9 pipeline for expression quantification and eQTL mapping (using the branch, because the v8 run_rnaseqc.py gives me 0 counts for all exons/transcripts/genes).

To see the effect of the MarkDuplicates step on quantification, I took an original BAM file and a version of it processed by MarkDuplicates (which, as far as I know, keeps all the reads and changes only the second column of the BAM file, the FLAG field).
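
For reference, the FLAG change can be sketched in a few lines of Python. The SAM specification defines bit 0x400 (decimal 1024) as "PCR or optical duplicate"; the helper names below are hypothetical and only illustrate the bit arithmetic, not how MarkDuplicates decides which reads to mark.

```python
# Bit defined by the SAM specification for "PCR or optical duplicate".
DUPLICATE_FLAG = 0x400  # decimal 1024


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def mark_duplicate(flag: int) -> int:
    """Return the FLAG value with the duplicate bit set (hypothetical helper)."""
    return flag | DUPLICATE_FLAG


# 99 = paired, properly paired, mate on reverse strand, first in pair.
original_flag = 99
marked_flag = mark_duplicate(original_flag)

print(original_flag, is_duplicate(original_flag))  # 99 False
print(marked_flag, is_duplicate(marked_flag))      # 1123 True
```

Every other bit of the FLAG is left unchanged, so the read still carries the same pairing and strand information.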

Then I ran "run_rnaseqc.py" on both the original BAM file and the MarkDuplicates output. When I compare the results, they appear to be identical.

Is this expected? Is the MarkDuplicates step not meant to affect expression quantification and later steps? If so, what is its purpose?

Thanks!

francois-a commented 4 years ago

Hi,

That's correct. The MarkDuplicates step is included to provide this annotation in the BAM files, but duplicates are not excluded from downstream quantifications in the core pipeline. This is due to ambiguities in resolving the source of duplicates, which can be biological or technical (see for example here).
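
A toy sketch of why the two outputs match: a counter that never looks at the duplicate bit returns the same totals whether or not reads are marked. The records, gene names, and the `count_reads` function are made up for illustration and are not the actual RNA-SeQC logic.

```python
from collections import Counter

DUPLICATE_FLAG = 0x400  # SAM spec: PCR or optical duplicate


def count_reads(records, exclude_duplicates=False):
    """Count reads per gene from (read_name, flag, gene) tuples.

    Hypothetical stand-in for a quantifier; duplicates are only
    skipped when exclude_duplicates is True.
    """
    counts = Counter()
    for _name, flag, gene in records:
        if exclude_duplicates and flag & DUPLICATE_FLAG:
            continue
        counts[gene] += 1
    return counts


# Same reads before and after marking; only the FLAG of r2 differs.
original = [("r1", 0, "GENE_A"), ("r2", 0, "GENE_A"), ("r3", 16, "GENE_B")]
marked = [("r1", 0, "GENE_A"), ("r2", 1024, "GENE_A"), ("r3", 16, "GENE_B")]

# Duplicates kept: counts are identical for both inputs.
print(count_reads(original) == count_reads(marked))  # True

# Duplicates excluded: the marked input now loses one GENE_A read.
print(count_reads(marked, exclude_duplicates=True)["GENE_A"])  # 1
```

The annotation therefore only changes results for tools that explicitly filter on that bit.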

maegsul commented 4 years ago

Thank you very much for the clarification François!