Open pbiology opened 12 months ago
I think collecting this statistic already existed for TGA-cases (panel and exome) which was using picard markdups, but was lacking for WGS because the sentieon dedup report wasn't picked up by multiqc, however in the development version after creating more unified alignment workflows where the sentieon dedup tool is used for dedup in all workflows (and where the report was modified to be acceptable by multiqc) the same duplication stats which includes the optical duplicates is reported by all workflows. However! For some reason no optical duplicates are detected in TGA...
Below are the top 2 examples from TGA tumor and normal sample, and bottom an example WGS test sample run on the develop branch.
LIBRARY | UNPAIRED_READS_EXAMINED | READ_PAIRS_EXAMINED | SECONDARY_OR_SUPPLEMENTARY_RDS | UNMAPPED_READS | UNPAIRED_READ_DUPLICATES | READ_PAIR_DUPLICATES | READ_PAIR_OPTICAL_DUPLICATES | PERCENT_DUPLICATION | ESTIMATED_LIBRARY_SIZE |
---|---|---|---|---|---|---|---|---|---|
Unknown Library | 7776 | 40365947 | 661574 | 10128 | 7047 | 14367911 | 0 | 0,355994 | 42248922 |
Unknown Library | 6150 | 25823111 | 540829 | 8460 | 5402 | 6595770 | 0 | 0,255495 | 41515496 |
Unknown Library | 231373 | 347284974 | 3329457 | 817093 | 84083 | 31283787 | 2697729 | 0,090172 | 1960368099 |
These values are also included in the multiqc_data.json including some small calculations that multiqc has done on the stats. Here's the example from the WGS case from multiqc_data.json
LIBRARY: | Unknown Library, |
---|---|
UNPAIRED_READS_EXAMINED: | 231373 |
READ_PAIRS_EXAMINED: | 347284974 |
SECONDARY_OR_SUPPLEMENTARY_RDS: | 3329457 |
UNMAPPED_READS: | 817093 |
UNPAIRED_READ_DUPLICATES: | 84083 |
READ_PAIR_DUPLICATES: | 31283787 |
READ_PAIR_OPTICAL_DUPLICATES: | 2697729 |
PERCENT_DUPLICATION: | 0,090172 |
ESTIMATED_LIBRARY_SIZE: | 1960368099 |
READS_IN_DUPLICATE_PAIRS: | 62567574 |
READS_IN_UNIQUE_PAIRS: | 632002374 |
READS_IN_UNIQUE_UNPAIRED: | 147290 |
READS_IN_DUPLICATE_PAIRS_OPTICAL: | 5395458 |
READS_IN_DUPLICATE_PAIRS_NONOPTICAL: | 57172116 |
READS_IN_DUPLICATE_UNPAIRED: | 84083 |
READS_UNMAPPED: | 817093 |
There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology
I will investigate a little why there are no optical dups in TGA first however...
Ok...I think I see the issue...for TGA we modify the headers of the reads in the fastq by adding the extracted UMI-sequence, which creates problems for how Picard MarkDuplicates extracts the flowcell tile information.
https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard-
--READ_NAME_REGEX <String> MarkDuplicates can use the tile and cluster positions to estimate the rate of optical
duplication in addition to the dominant source of duplication, PCR, to provide a more
accurate estimation of library size. By default (with no READ_NAME_REGEX specified),
MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below).
Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without
optical duplicate counts, library size estimation will be less accurate. If the read name
does not follow a standard Illumina colon-separation convention, but does contain tile and
x,y coordinates, a regular expression can be specified to extract three variables:
tile/region, x coordinate and y coordinate from a read name. The regular expression must
contain three capture groups for the three variables, in order. It must match the entire
read name. e.g. if field names were separated by semi-colon (';') this example regex
could be specified (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no
READ_NAME_REGEX is specified, the read name is split on ':'. For 5 element names, the
3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names
(CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.
Default value: <optimized capture of last three ':' separated fields as numeric values>.
So it seems that the problem is slightly worse than not just tracking the optical duplicates, we have probably not detected and marked them ever since the UMIs were added to the TGA.
There are no percent optical duplicates however, but it can be calculated quite simply from these values. Is that sufficient to close this issue? @pbiology
I would say so? Perhaps @karlnyr has some other input?
I think it would be good to try to fix the MarkDuplicates issue for Optical Duplicate detection in TGA as well. Hopefully it's as easy as adding a new --READ_NAME_REGEX to Sentieon Dedup for the TGA cases to adjust for the UMI-sequences in the readheader
Blocked by this: https://github.com/Clinical-Genomics/BALSAMIC/issues/1291
Updated the issues to the User Story format
This issue will be solved by PR https://github.com/Clinical-Genomics/BALSAMIC/pull/1358
Need
As the lab head of unit I want to be able to trend the optical duplicates in samples processed by BALSAMIC so that we can see if the levels change when we change methods an instruments in the lab. Most recently due to the new NovaSeq X.
Suggested approach
Start collecting the optical duplicates per case and record in multiQC. This should be done both for panels, WES and WGS.
Considered alternatives
None
Deviation
None
System requirements assessed
Requirements affected by this story
N/A
Risk assessment needed
Risk assessment
Gathering new metrics doesn't change the analysis in any way so no risk.
SOUPs
N/A
Can be closed when
Blockers
Fix detect optical duplicates in TGA workflow #1291
Anything else?