fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
315 stars 68 forks source link

Call consensus reads in a 2*75bp sequencing setting (fwd and reverse do not overlap) #911

Closed ktrien closed 1 year ago

ktrien commented 1 year ago

Dear,

We are testing UMI-adapters (5M2S on both sides) in combination with our current workflow (2*75bp sequencing). For bio-informatic analysis afterwards, we relied on many fgbio tools.

Concerning 'Call consensus reads' Our library length, including adapters is +/-350-420bp. As such we have an insert size of +/- 200bp.

We initially tried the callduplexconsensusreads tool but since we sequence only 2*75bp, most of our forward and reverse reads do not overlap, and as such for most of the reads, duplex consensus reads cannot be assembled.

Callmolecularconsensusreads might be a better option, but still many reads are filtered out.

Do you have a suggestion how to improve our callconsensusreads when forward and reverse do not overlap?

For your information, our current settings are

1.CallMolecularConsensusReads default settings 'min-reads' => 3,

  1. FilterConsensusReads def options { 'input' => "#{prefix}.aligned.merged.consensus.bam", 'output' => "#{prefix}.aligned.merged.filtered.consensus.bam", 'ref' => '/data/hg38.fa', 'min-reads' => 3 0 0, 'max-read-error-rate' => 0.05, 'max-base-error-rate' => 0.1, 'min-base-quality' => 30, 'max-no-call-fraction' => 0.1

Thanks in advance!

nh13 commented 1 year ago

Can you let us know which library prep kit your testing? Given you tried using CallDuplexConsensusReads, I am assuming the kit is a duplex-sequencing kit, whereby you sequence both strands of a double-stranded source molecule, such that you able to identify the top and bottom strand through the UMIs on both ends of the read pair. If so, then CallDuplexConsensusReads is the correct tool. It does not need for the forward and reverse reads in a read pair to overlap to make a duplex consensus. Perhaps you mean that you do not get enough consensus reads from the top and bottom strand of a source molecule, so that you really are only getting single-strand consensus reads? You can have CallDuplexConsensusReads output single strand consensus reads by setting --min-reads 1 1 0 (see the usage for more info on this option). Does that help?

ktrien commented 1 year ago

Dear,

Thanks for the additional information. We use the Twist Bioscience 'capture' NGS library prep (double stranded probes, 2*75bp) and assumed indeed to be a duplex-sequencing kit. Good to know that forward and reverse reads do not need to overlap, we understand now.

We played around with different parameter settings and also tried --min-reads 1 1 0. As expected, we retain a 10% 'UMI-pipeline' output (0.7M reads)', compared to the pipeline without UMI-filtering (8.9M reads).

One final question about: A duplex consensus read is made, based on 1) the plus-strand consensus and 2) minus strand consensus. Does this mean that in 'counting numbers' this duplex consensus reads counts for '1'? Or are both the plus-strand consensus AND minus-strand consensus (whether or not corrected by the duplex consensus comparison) retained and as such '2 reads' are retained in the final bam?

Thanks for the clarification. Kind regards, Katrien

nh13 commented 1 year ago

The consensus BAM will have a single read pair (two records, one for R1 and one for R2) per consensus, regardless of the number of raw reads that contributed to the consensus on each strand. So this counts as one molecule (or one template or one consensus read pair).