[User Story] Deduplication with consensus collapse

mathiasbio commented 7 months ago

Need

As a clinician I want to be able to detect variants to a low allele frequency, as cheaply as possible, and with as few false positives as possible.

Using Sentieon Dedup from the new version of Sentieon 202308 gives the extra flag of --consensus which allows deduplication based on UMI tags as well as read position, and creation of consensus reads based on the duplicates. This has 2 benefits:

It allows us to rescue reads that are falsely discarded as duplicates, coverage has been observed to improve by ~ 28% in some samples (see table below) for a comparison of mean coverage with simple dedup, and using UMIs.
It allows creation of better quality reads when duplicates are found.

Suggested approach

We would need to update Sentieon first to have access to this feature:

Update Sentieon https://github.com/Clinical-Genomics/BALSAMIC/issues/1384
Introduce new consensus option in balsamic. (has begun, see linked PR)-
Carry out this assessment to ensure that we understand how TNscope handles strandedness after UMI collapse into singletons: https://github.com/Clinical-Genomics/BALSAMIC/issues/1438

Considered alternatives

No response

Deviation

No response

System requirements assessed

[ ] Yes, I have reviewed the system requirements

Requirements affected by this story

No response

Risk assessment needed

[ ] Needed
[ ] Not needed

Risk assessment

No response

SOUPs

No response

Can be closed when

No response

Blockers

Update Sentieon (https://github.com/Clinical-Genomics/BALSAMIC/issues/1250)
This update is also blocked by some bugs in the Dedup UMI consensus approach. (Sentieon has been notified and have confirmed the issue, and they will fix it in the next release)

Anything else?

No response

mathiasbio commented 7 months ago

At the moment it seems that using Sentieon Dedup with --consensus option collapses read-pairs when the mates are overlapping into a singleton. I have emailed Sentieon to ask what their thoughts are on this approach, and if there is any way to preserve the reads as pairs.

I'm a bioinformatician working at Clinical Genomics Stockholm and at the moment running either the "umi consensus" tool or "Dedup" with "--consensus" option supplied overlapping read-pairs are merged into singletons.

Is there a reason it was implemented this way for the benefit of downstream analysis? At the moment I'm thinking it would be more ideal to preserve the read pairs in order to have more standard data for structural variants calling, but I haven't fully thought this through.

mathiasbio commented 7 months ago

Sentieon reply from Don Freed:

The umi consensus tool (either umi consensus or Dedup) tries to output the most accurate representation of the original DNA fragment, before amplification and sequencing. For overlapping paired reads, the tool needs to consider the sequence and base quality of both pairs simultaneously (as they are not independent observations) to output the most accurate representation. Accordingly, the tools then outputs a single read, containing the combined information from both read pairs.

In theory, merging of overlapping reads should not reduce the ability of downstream tools to detect structural variation. However, this will depend on the SV caller's handling of the mixed paired-end and single-end data.

I also want to mention that output of overlapping paired reads is supported in the umi consensus tool with the hidden argument, --output_overlapped_pairs. You are welcome to try this argument but please be aware that this is a relatively untested feature, and may produce unexpected output.

However after discussing with @khurrammaqbool and @J35P312 , it seems that the singleton UMI collapse should not affect the SV calling negatively -- as long as the SV callers are sophisticated enough to make use of singletons and is capable of handling mixed read-pairs and singleton data.

mathiasbio commented 3 months ago

There are two issues at the moment with this:

The deduplication stats are missing
The bamfiles have integrity errors which Picard complains about

I have emailed Sentieon about both of these issues, and the summary is that Sentieon will try to fix both issues in their next release. That is to fix the integrity errors and to add more useful stats into their Dedup UMI consensus workflow so that we can calculate % Duplicates.

In the meantime I have implemented instead the regular UMI workflow without any filtering on size of the UMI group into the standard TGA workflow, which should mimic the Dedup approach with the exception of needing more tools. https://github.com/Clinical-Genomics/BALSAMIC/pull/1434

Sentieon however still recommends that we use the Dedup approach since they will deprecate the UMI tools. However...I don't know how we can do that when the tool isn't working. A solution could of course be to use the regular UMI tools for now and when they fix the issue, start using the Dedup solution, and maybe it will be fixed before we have a chance to make a new release. So maybe this work should be down-prioritised for now, and focus on the WGS type of features for release 16 instead.

An added level of complexity is that introducing these UMI collapse into standard TGA production requires us to rebuild the PONs for the existing panels, and it would be nice to create these PONs with a type of bamfile which will last longterm. But it may not matter if these PONs are built using the regular UMI tools or the Dedup-solution, so maybe it's not an issue.

From Sentieon

Hi Mathias,

Thank you for clarifying. With the --consensus mode it is intended that we do not output data for the standard dedup metrics (in the --metrics file) as the consensus mode is different from the regular dedup mode.

Our tools do have the ability to output some statistics on from the duplex consensus process. The statistics are similar to the following (with duplex umis): duplex stat: 1 3049173 2 3193300 3 111292 31 57295 32 57802

Here is the explanation for the metrics: 1: Single-strand consensus with no reads from the complementary UMI group. This occurs when all overlapping reads of the group have the same UMI tags 2: Duplex consensus from two strands of DNA (complementary UMI groups). 3: UMI sequences from R1 and R2 reads are the same. In this case, the tool will use the strand of the insert to determine whether the input reads are from the same strand. This group is further divided into: 31: All reads are from the same strand 32: Reads are from both strands

In the past, our tools have also output a histogram containing the number consensus reads from a single input read, two input reads, etc.: 1 70587769

2 15996493

3 3404417

4 691710

5 136508

6 26735

7 5237

8 1129

9 289

10 193

Would either of these metrics be helpful for your work? We can update the tools so that these metrics are output during the consensus dedup process.

Thanks, Don

My reply:

Hi!

Sorry for being slow to respond. Thanks for the reply!

Those stats that you're mentioning I don't see them in the metrics file that is output from Dedup with the UMI consensus mode activated. In the metrics file I have (v 202308.02) only this info exists:

LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE TargetPanel 34853 126325657 547304 554201 0 0 1151354 0.000000 0

Maybe we get this info when adding the argument --cell_barcode_counts ?

I think maybe this information could be useful to us. But I think in addition since we are used to keeping track of the percent duplicates in our samples it would be nice to have that metric, and I'm sure we can calculate it ourselves simply by counting the reads before and after UMI collapse, but I don't think we can avoid making some custom solutions then, also for adding it to multiqc where we collect all of our QC metrics for easy access by downstream tools.

I have also had some trouble with implementing this Dedup UMI consensus into our workflow, and at the moment I'm thinking we might keep the standard UMI workflow instead which seems to be working better, unless there is some major reason for using the Dedup option instead (probably it's faster than going through all the other UMI tools).

There are 2 issues that I have discovered while working on this, and I'll describe the workflow I have tried:

The issues:

Any time I'm running a picard tool on the bamfile, including ValidateSamFile I'm getting errors like this:

INVALID_UNALIGNED_MATE_START Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR::INVALID_UNALIGNED_MATE_START:Record 6, Read name UMI-ATG-CTA-8ebe1c0d-A01901:175:H23G3DSX7:4:2103:25003:36119-D1, The unaligned mate start position is 10004, should be 0

Which stops the execution. But I don't know why it's saying this because whenever I have looked at the reads in the bam file the unaligned mate is set to 0:

UMI-GGC-GGT-9b0ec04d-A01901:175:H23G3DSX7:1:1367:14416:10363-D1 0 1 35392 0 127M * 1 0 (unaligned mate position???) TAATCCCAACACTTTGGGAGGCTGAGGTGGGAGGATCGCTTGAGGCCAGGAGTTCAAGACCAGCCTGAGCAACATAGTGAGACTTTGTCTCTATAAAAAATAAATAAATAAATAAAAACAACTTGTC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ RG:Z:sample AS:i:127 XS:i:127 XZ:Z:0,1 XA:Z:19,+77001,127M,0;9,+35175,127M,1; RX:Z:GGT-GGC XR:Z:GGC-GGT MI:Z:GGC-GGT-9b0ec04d

MISSING_TAG_NM When running ValidateSamFile I'm also seeing a lot of these, which seem to be true for all UMI consensus reads I've looked at:

WARNING::MISSING_TAG_NM:Record 153, Read name UMI-GGC-GGT-9b0ec04d-A01901:175:H23G3DSX7:1:1367:14416:10363-D1, NM tag (nucleotide differences) is missing

When extracting the reads from the bamfile I can see that the NM field is indeed missing.

Too high phred scores after UMI collapse

A final issue I've had so far, is not unique to the Dedup consensus but is true of the regular UMI workflow as well, but is not really a fault of the tools exactly and I have resolved it by implementing a manual script. But it's basically that the quality-scores from the UMI collapsed reads are too high for Manta. So I have been forced to implement a custom script to cap the quality to 70. Just a heads up!

The workflow:

I concatenate fastq pairs per lane

UMI extract:

sentieon-genomics-202308.02/bin/sentieon umi extract -d '3M2S+T,3M2S+T' concat.sample_1.pre_umi.fastq.gz concat.sample_2.pre_umi.fastq.gz | gzip > sample_umiextract_interleaved.fastq.gz

Fastq quality-trimming. We want to do this because we're not reliably doing error-correction here, only deduplicating and sometimes getting error-correction, AND because the new Novaseq platform tends to get us lower insert-sizes and more read-through adapters that UMIs won't remove.

fastp --thread 4 --in1 sample_umiextract_interleaved.fastq.g --interleaved_in --out1 sample_1.fp.fastq.gz --out2 sample_2.fp.fastq.gz --json sample_fastp.json --html sample_fastp.html --overrepresentation_analysis --trim_tail1 1 --n_base_limit 50 --length_required 25 --low_complexity_filter --trim_poly_g --dont_eval_duplication

Alignment:

sentieon-genomics-202308.02/bin/sentieon bwa mem -R '@RG\tID:sample\tSM:NORMAL\tLB:TargetPanel\tPL:ILLUMINA' -K 1000000 -t 8 -C human_g1k_v37.fasta sample_1.fp.fastq.gz sample_2 | /sentieon-genomics-202308.02/bin/sentieon util sort -r human_g1k_v37.fasta --sam2bam -o sample_align_sort.bam -i - ;

Dedup with UMI and consensus

sentieon-genomics-202308.02/bin/sentieon driver -t 8 -i sample_align_sort.bam --algo LocusCollector --umi_ecc_dist 0 --consensus --umi_tag XR --fun score_info normal.sample.dedup.score ;

sentieon-genomics-202308.02/bin/sentieon driver -t 8 -r human_g1k_v37.fasta -i sample_align_sort.bam --algo Dedup --score_info normal.sample.dedup.score --metrics normal.sample.dedup.metrics normal.sample.dedup.bam ;

Conclusion

Either way, if there workflow doesn't become much faster with the Dedup approach to UMI collapse we might stick with the regular UMI workflow which seems to be working fine. But it would of course be nice if we could shave off some time by running Dedup instead of the long string of tools which is the UMI workflow, so if you have any advice it would be great!

Thanks!

Kind regards, Mathias

Sentieon reply:

Hi Mathias,

No problem, thank you for the reply here.

UMI Metrics Currently the LocusCollect/Dedup flow will not produce detailed metrics when run with a UMI tag in consensus mode. This should be fixed significantly in the next release of our software package, and we will at least add the metrics from the sentieon umi consensus workflow. This should address your need to track the percentage of duplicate reads, although the new output may not be compatible with MultiQC initially.

I would recommend sticking with the LocusCollector/Dedup flow for UMI processing, if possible. Our development effort is currently focused on the LocusCollector/Dedup workflow and the umi consensus workflow will be deprecated at some point in the future.

Picard Validation Errors/Warnings I am able to confirm the "INVALID_UNALIGNED_MATE_START" validation error using an internal dataset and I've filed a ticket with our engineering team so that this can be addressed. It is likely that this will also be fixed in the next release of our software package (although this may be pushed out, depending on our release timeline).

The warning, "MISSING_TAG_NM" is expected. After creation of consensus reads, this tag is not re-generated. This will not affect Sentieon's variant callers (which do not use this tag), but it could potentially affect other tools that depend on this tag.

The workflow looks good to me. Using an adapter trimming tool is recommended if the reads contain additional adapter sequences that are not removed during UMI extraction. Regarding the high phred-scores, post-processing the BQs to be compatible with Manta seems like an OK workaround.

Best regards, Don

mathiasbio commented 2 months ago

I continued the conversation with Sentieon developers and they suggested that while the error with the mate is a bug, it can be solved by running:

samtools collate -@ 4 -O -u deduped.bam /data/work/dfreed/tmp/collate-tmp | \
    samtools fixmate --reference $ref -@ 4 - - | \
    sentieon util sort -i - -o deduped_fm.bam

Which I have implemented in the PR: https://github.com/Clinical-Genomics/BALSAMIC/pull/1358

Clinical-Genomics / BALSAMIC