bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Dup % and FASTQC % Dups in bulk RNA-seq multiqc report are anti-correlated = misleading #3406

Closed: naumenko-sa closed this issue 3 years ago

naumenko-sa commented 3 years ago

For example: Dup = 88.7% vs % Dups = 9.1%; Dup = 76.5% vs % Dups = 11.0%; and so on.

kokyriakidis commented 3 years ago

Which metric is right? In a recent project I got 90% dups in all samples in the MultiQC report. I thought it was due to bad library preparation or degradation issues.

naumenko-sa commented 3 years ago

I found that this has been discussed before: https://github.com/bcbio/bcbio-nextgen/issues/3029 https://github.com/bcbio/bcbio-nextgen/issues/2461

Dup % = Duplication_Rate_of_Mapped https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/qc/multiqc.py#L462

It is calculated by:

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/bam/readstats.py#L37 https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/qc/qualimap.py#L219

So Dup % is a proper metric, and a high duplication rate in an RNA-seq dataset just means the library is well saturated.
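To make the definition concrete, here is a minimal illustration of what "duplication rate of mapped reads" means, written with pysam purely as a sketch; the actual bcbio calculation goes through the readstats.number_of_mapped_reads and qualimap helpers linked above, so treat this only as an explanation of the metric, not as the project's code:

import pysam

def duplication_rate_of_mapped(bam_path):
    # Fraction of mapped reads carrying the duplicate flag (illustrative only).
    total_mapped = 0    # analogue of number_of_mapped_reads(..., keep_dups=True)
    nondup_mapped = 0   # analogue of number_of_mapped_reads(..., keep_dups=False)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total_mapped += 1
            if not read.is_duplicate:
                nondup_mapped += 1
    if total_mapped == 0:
        return 0.0
    # duplicates / mapped == 1 - non-duplicates / mapped
    return 1.0 - float(nondup_mapped) / total_mapped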

FastQC gives a close estimate in work/qc/sample/fastqc/fastqc_data.txt: in my case it is 86.3% by bcbio and 85.2% by FastQC, but in multiqc_report.html there is 14.7%. I think it is calculated as 14.7% ~ 100 - 85.2%, maybe in the MultiQC fastqc module: https://github.com/ewels/MultiQC/blob/master/multiqc/modules/fastqc/fastqc.py#L230
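For illustration, a rough sketch of where that 14.7% could come from, assuming MultiQC derives its % Dups from the "#Total Deduplicated Percentage" line that FastQC writes into fastqc_data.txt (an assumption that would need to be checked against the MultiQC module linked above):

def fastqc_percent_dups(fastqc_data_path):
    # Sketch: % Dups as 100 minus FastQC's "Total Deduplicated Percentage".
    with open(fastqc_data_path) as handle:
        for line in handle:
            if line.startswith("#Total Deduplicated Percentage"):
                remaining_after_dedup = float(line.rstrip().split("\t")[1])
                # e.g. 100 - 85.2 = 14.8, close to the 14.7% shown in multiqc_report.html
                return 100.0 - remaining_after_dedup
    return None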

For now, we are simply ignoring the FastQC duplicate statistics in the General Stats Table (the column is switched off). If anybody wants to get to the bottom of this issue and fix it properly, let us know.

sib-bcf commented 3 years ago

Here is the bug in Dup % (Duplication_Rate_of_Mapped): https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/qc/multiqc.py#L462

dup_count is counting the reads without the duplicate SAM flag, instead of the duplicated ones:

dup_count = readstats.number_of_mapped_reads(data, dup_align_bam, keep_dups=False)
tot_count = readstats.number_of_mapped_reads(data, dup_align_bam, keep_dups=True)

Therefore rate = dupes / total should be rate = 1 - dupes / total
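In other words, a sketch of the correction in terms of the variables above (the exact line in multiqc.py may differ, so treat this as pseudocode for the fix rather than a tested patch):

# keep_dups=False drops duplicate reads, so this is the *non-duplicate* mapped count
dup_count = readstats.number_of_mapped_reads(data, dup_align_bam, keep_dups=False)
tot_count = readstats.number_of_mapped_reads(data, dup_align_bam, keep_dups=True)

# buggy: rate = 1.0 * dup_count / tot_count   (fraction of non-duplicates)
# fixed: duplicates are the mapped reads removed by keep_dups=False
rate = 1.0 - 1.0 * dup_count / tot_count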