Closed: naumenko-sa closed this issue 3 years ago
Which metric is right? In a recent project I got ~90% duplicates in all samples in the MultiQC report, and I initially thought it was due to bad library preparation or degradation.
I found that this has been discussed before: https://github.com/bcbio/bcbio-nextgen/issues/3029 https://github.com/bcbio/bcbio-nextgen/issues/2461
Dup % = Duplication_Rate_of_Mapped: https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/qc/multiqc.py#L462

It is calculated from `bammarkduplicates` output:
https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/bam/readstats.py#L37
https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/qc/qualimap.py#L219
So it is a proper metric, and a high duplication rate in an RNA-seq dataset just means the library is well saturated.
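To make the metric concrete, here is a minimal sketch of how a duplication rate can be derived from a duplicate-marked BAM, assuming standard SAM FLAG semantics (0x4 = unmapped, 0x400 = PCR/optical duplicate). The flag lists are illustrative, not taken from a real BAM, and `duplication_rate` is a hypothetical helper, not bcbio's code:

```python
# SAM FLAG bits (per the SAM specification)
UNMAPPED = 0x4
DUPLICATE = 0x400

def duplication_rate(flags):
    """Fraction of mapped reads carrying the duplicate flag."""
    mapped = [f for f in flags if not f & UNMAPPED]
    dups = [f for f in mapped if f & DUPLICATE]
    return len(dups) / len(mapped) if mapped else 0.0

# 9 of 10 mapped reads marked duplicate; 2 unmapped reads are ignored
flags = [DUPLICATE] * 9 + [0x0] + [UNMAPPED] * 2
print(duplication_rate(flags))  # 0.9
```

A tool like `bammarkduplicates` sets the 0x400 flag; the rate is then just the flagged fraction of mapped reads.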
FastQC gives a close estimate in work/qc/sample/fastqc/fastqc_data.txt: in my case it is 86.3% by bcbio and 85.2% by FastQC, but multiqc_report.html shows 14.7%. I think it is calculated as 14.7% ~ 100 - 85.2%.
Maybe in multiqc/fastqc module:
https://github.com/ewels/MultiQC/blob/master/multiqc/modules/fastqc/fastqc.py#L230
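The suspected arithmetic can be sketched in a couple of lines. This is a hedged illustration of the complement relationship described above, using the numbers from this thread; it is not the actual MultiQC code:

```python
# FastQC's duplication estimate for this sample (from fastqc_data.txt)
fastqc_dup_estimate = 85.2

# If the reported "% Dups" is the complement of that estimate,
# the value in the report ends up inverted
multiqc_percent_dups = 100.0 - fastqc_dup_estimate
print(round(multiqc_percent_dups, 1))  # 14.8, matching the ~14.7% seen
```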
For now, we are just ignoring the FastQC duplicate statistics in the General Stats Table (the column is switched off). If anybody wants to get to the bottom of this issue and fix it properly, let us know.
Here is the bug in Dup % = Duplication_Rate_of_Mapped: https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/qc/multiqc.py#L462

With `keep_dups=False`, `dup_count` counts the reads *without* the duplicate SAM flag, not the duplicated ones:

```python
dup_count = readstats.number_of_mapped_reads(data, dup_align_bam, keep_dups=False)
tot_count = readstats.number_of_mapped_reads(data, dup_align_bam, keep_dups=True)
```

Therefore `rate = dup_count / tot_count` should be `rate = 1 - dup_count / tot_count`.
For example:
- Dup = 88.7%, % Dups = 9.1%
- Dup = 76.5%, % Dups = 11.0%

and so on.
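The proposed fix can be sketched as follows. `number_of_mapped_reads` here is a pure-Python stand-in that mirrors the `keep_dups` semantics of bcbio's helper but operates on a list of SAM flag integers rather than a BAM file, so the numbers are illustrative:

```python
def number_of_mapped_reads(flags, keep_dups=True):
    """Count mapped reads; drop duplicate-flagged ones if keep_dups=False."""
    UNMAPPED, DUPLICATE = 0x4, 0x400
    mapped = [f for f in flags if not f & UNMAPPED]
    if not keep_dups:
        mapped = [f for f in mapped if not f & DUPLICATE]
    return len(mapped)

# 1000 mapped reads, 887 marked duplicate (88.7% duplication)
flags = [0x400] * 887 + [0x0] * 113

dup_count = number_of_mapped_reads(flags, keep_dups=False)  # non-duplicates!
tot_count = number_of_mapped_reads(flags, keep_dups=True)

rate = 1 - dup_count / tot_count  # was: dup_count / tot_count
print(round(rate, 3))  # 0.887
```

With the original `dup_count / tot_count`, this sample would report 11.3% instead of 88.7%, which is the inversion described above.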