[User Story] Record optical duplicates in QC report

pbiology commented 12 months ago

Need

As the lab head of unit I want to be able to trend the optical duplicates in samples processed by BALSAMIC so that we can see if the levels change when we change methods an instruments in the lab. Most recently due to the new NovaSeq X.

Suggested approach

Start collecting the optical duplicates per case and record in multiQC. This should be done both for panels, WES and WGS.

Considered alternatives

None

Deviation

None

System requirements assessed

[ ] Yes, I have reviewed the system requirements
[x] No

Requirements affected by this story

N/A

Risk assessment needed

[ ] Needed
[x] Not needed

Risk assessment

Gathering new metrics doesn't change the analysis in any way so no risk.

SOUPs

N/A

Can be closed when

[ ] BALSAMIC MultiQC reports includes optical duplicates

Blockers

Fix detect optical duplicates in TGA workflow #1291

Anything else?

Do we need to include something new in which documents to save?

mathiasbio commented 11 months ago

I think collecting this statistic already existed for TGA-cases (panel and exome) which was using picard markdups, but was lacking for WGS because the sentieon dedup report wasn't picked up by multiqc, however in the development version after creating more unified alignment workflows where the sentieon dedup tool is used for dedup in all workflows (and where the report was modified to be acceptable by multiqc) the same duplication stats which includes the optical duplicates is reported by all workflows. However! For some reason no optical duplicates are detected in TGA...

Below are the top 2 examples from TGA tumor and normal sample, and bottom an example WGS test sample run on the develop branch.

LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
Unknown Library	7776	40365947	661574	10128	7047	14367911	0	0,355994	42248922
Unknown Library	6150	25823111	540829	8460	5402	6595770	0	0,255495	41515496
Unknown Library	231373	347284974	3329457	817093	84083	31283787	2697729	0,090172	1960368099

These values are also included in the multiqc_data.json including some small calculations that multiqc has done on the stats. Here's the example from the WGS case from multiqc_data.json

LIBRARY:	Unknown Library,
UNPAIRED_READS_EXAMINED:	231373
READ_PAIRS_EXAMINED:	347284974
SECONDARY_OR_SUPPLEMENTARY_RDS:	3329457
UNMAPPED_READS:	817093
UNPAIRED_READ_DUPLICATES:	84083
READ_PAIR_DUPLICATES:	31283787
READ_PAIR_OPTICAL_DUPLICATES:	2697729
PERCENT_DUPLICATION:	0,090172
ESTIMATED_LIBRARY_SIZE:	1960368099
READS_IN_DUPLICATE_PAIRS:	62567574
READS_IN_UNIQUE_PAIRS:	632002374
READS_IN_UNIQUE_UNPAIRED:	147290
READS_IN_DUPLICATE_PAIRS_OPTICAL:	5395458
READS_IN_DUPLICATE_PAIRS_NONOPTICAL:	57172116
READS_IN_DUPLICATE_UNPAIRED:	84083
READS_UNMAPPED:	817093

mathiasbio commented 11 months ago

There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology

mathiasbio commented 11 months ago

I will investigate a little why there are no optical dups in TGA first however...

mathiasbio commented 11 months ago

Ok...I think I see the issue...for TGA we modify the headers of the reads in the fastq by adding the extracted UMI-sequence, which creates problems for how Picard MarkDuplicates extracts the flowcell tile information.

https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-

--READ_NAME_REGEX <String>    MarkDuplicates can use the tile and cluster positions to estimate the rate of optical
                              duplication in addition to the dominant source of duplication, PCR, to provide a more
                              accurate estimation of library size. By default (with no READ_NAME_REGEX specified),
                              MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below). 
                              Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without
                              optical duplicate counts, library size estimation will be less accurate. If the read name
                              does not follow a standard Illumina colon-separation convention, but does contain tile and
                              x,y coordinates, a regular expression can be specified to extract three variables:
                              tile/region, x coordinate and y coordinate from a read name. The regular expression must
                              contain three capture groups for the three variables, in order. It must match the entire
                              read name.   e.g. if field names were separated by semi-colon (';') this example regex
                              could be specified      (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no
                              READ_NAME_REGEX is specified, the read name is split on ':'.   For 5 element names, the
                              3rd, 4th and 5th elements are assumed to be tile, x and y values.   For 7 element names
                              (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. 
                              Default value: <optimized capture of last three ':' separated fields as numeric values>.

So it seems that the problem is slightly worse than not just tracking the optical duplicates, we have probably not detected and marked them ever since the UMIs were added to the TGA.

pbiology commented 11 months ago

There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology

I would say so? Perhaps @karlnyr has some other input?

mathiasbio commented 11 months ago

I think it would be good to try to fix the MarkDuplicates issue for Optical Duplicate detection in TGA as well. Hopefully it's as easy as adding a new --READ_NAME_REGEX to Sentieon Dedup for the TGA cases to adjust for the UMI-sequences in the readheader

mathiasbio commented 9 months ago

Blocked by this: https://github.com/Clinical-Genomics/BALSAMIC/issues/1291

pbiology commented 9 months ago

Updated the issues to the User Story format

mathiasbio commented 1 month ago

This issue will be solved by PR https://github.com/Clinical-Genomics/BALSAMIC/pull/1358

Clinical-Genomics / BALSAMIC