fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
309 stars 67 forks source link

Too up-regulated scores after fgbio CallDuplexConsensusReads #975

Closed VasLem closed 5 months ago

VasLem commented 5 months ago

Hi, I have a question regarding CallDuplexConsensusReads. The Phred score for my 5 input samples (which are by the way using a targets and baits targeted sequencing kit, and have about 900 fold enrichment on the targeted regions) is in the range of 35, shown in gray in the plot below. Once the function is called, with default parameters, I am getting transformed reads with very variable scores, shown in purple in the plot. But what concerns me is the huge upscaling of some reads, that end up having a Phred score close to 90. I delved into the maths described here . I understand that the product of probabilities increases the more the reads agree on a single base. For a base with an original Phred score of 35, a Phred score of 90 is translated into a decrease of 5 magnitudes. Does this mean that about 100k reads are on consensus for that base? Thank you in advance!

fastqc_per_sequence_quality_scores_plot

karlkashofer commented 5 months ago

Hi ! I also see this, in higher depth sequenced targeted gene panels with duplex barcodes.

And it seems strelka chokes on these bases:

[2024-04-07T20:27:56.638567Z] [35329e598677] [23_1] [WorkflowRunner] [ERROR] [2024-04-07T20:27:27.747766Z] [35329e598677] [23_1] [CallGenome+callGenomeSegment_chromId_000_chr1_0007] FATAL_ERROR: Attempting to lookup basecall quality score 89 which exceeds the maximum cached basecall quality score of 70

It seems strelka does not allow bases with quality scores > 70 https://github.com/Illumina/strelka/tree/master/docs/userGuide#alignment-files

This does not happen with CallMolecularConsensusReads.

Is there any way to work around this ?

clintval commented 5 months ago

@VasLem you can determine exactly how many reads are used to build up a consensus base using the per-base SAM tags. You will want to look at the arrays stored in the ac and bc tags to see how many reads you've observed per strand. Duplex sequencing further increases accuracy of the final called base by comparing the consensus of both strands and creating a final consensus base call (or No-call if the result is ambiguous).

https://github.com/fulcrumgenomics/fgbio/blob/3a74fd28ff630b417410953541ff107a1b7b0fb7/src/main/scala/com/fulcrumgenomics/umi/ConsensusTags.scala#L57-L77

@karlkashofer you will have to patch Strelka source code yourself and recompile the tool like is done in this PR for manta: