fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License

Option to keep MolecularConsensusReads in CallDuplexConsensusReads #977

Closed karlkashofer closed 5 months ago

karlkashofer commented 5 months ago

I work with Agilent HSXT2 data, which has molecular tags (UMIs) on both sides of the double-stranded insert.

When I deduplicate with CallMolecularConsensusReads, I still see reads that were derived from the two strands of the same molecule, with their tags reversed (see attached image). [Image: screenshot from 2024-04-07 19-33-19]

When I deduplicate with CallDuplexConsensusReads these get merged, but I lose all the reads where I don't have the second strand, presumably due to this passage in the usage:

Because of the nature of duplex sequencing, this tool does not support fragment reads - if found in the input they are ignored. Similarly, read pairs for which consensus reads cannot be generated for one or other read (R1 or R2) are omitted from the output.

Is it possible to have all reads with information from both strands collapsed by CallDuplexConsensusReads, while the reads with information from only one strand are still written to the output as molecular consensus reads?

Sorry if I am missing something obvious. Cheers, KK

nh13 commented 5 months ago

The quote refers to fragment reads that are not paired, which is not to be confused with different reads observing opposite strands of the original source molecule. To retain consensus reads that have observations from only one strand of the original source molecule, use the --min-reads 1 1 0 option as described in the usage. Please read the usage for examples that describe how three values can be given.
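For readers who script their pipelines, the invocation suggested above could be assembled roughly as follows. This is only a sketch: the helper name `build_duplex_command` and the BAM file names are hypothetical, and it assumes `fgbio` is on the PATH and that the input has already been grouped by UMI.

```python
import shlex


def build_duplex_command(input_bam, output_bam, min_reads=(1, 1, 0)):
    """Return the argument list for a CallDuplexConsensusReads run.

    The file names passed in are placeholders chosen by the caller;
    min_reads holds the three values discussed in this thread.
    """
    return [
        "fgbio", "CallDuplexConsensusReads",
        "--input", input_bam,
        "--output", output_bam,
        # Three values: total, one strand, the other strand.
        "--min-reads", *[str(n) for n in min_reads],
    ]


# Hypothetical file names, shown only to illustrate the resulting command line.
command = build_duplex_command("grouped.bam", "duplex_consensus.bam")
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` once the paths point at real files.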

karlkashofer commented 5 months ago

So for CallDuplexConsensusReads:

--min-reads 1 1 1 => only keep consensus reads which have at least one observation on each strand

--min-reads 1 1 0 => keep double strand consensus and single strand consensus reads, drop reads without multiple observations

--min-reads 1 0 0 => keep double strand consensus and single strand consensus reads and also keep reads with only single observation on one strand (this is what I want)

Is this correct? Thanks for your help. Cheers, KK

nh13 commented 5 months ago

If you specify three values, the first value requires a minimum # of observations across both strands, the second value requires a minimum # of observations on one of the strands, and the third requires a minimum # of observations on the other strand.

For 1 1 1, the second and third values are more stringent than the first, since if we require each strand to have at least one observation, then of course there must be at least two observations across both strands.

For 1 1 0, this does not drop any read, since the first value requires at least one observation across both strands, the second value requires at least one observation on one of the specific single strands, and the third value doesn't require any observations on the other specific single strand.

For 1 0 0, this does not drop any read either, since the first value requires at least one observation across both strands, and the second and third values do not require any observations on a specific single strand.

Does that make sense?
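The three-value check described in this comment can be sketched in a few lines of Python. This is an illustration of the explanation, not fgbio's actual code: the function name `passes_min_reads` is hypothetical, and it assumes the second value is compared against whichever strand has more observations and the third against the strand with fewer.

```python
def passes_min_reads(strand_a, strand_b, min_reads):
    """Return True if a source molecule would be kept.

    strand_a, strand_b: number of observations on each strand.
    min_reads: (total, one strand, other strand) thresholds.
    Assumption: the better-covered strand is checked against the
    second value, the other strand against the third.
    """
    min_total, min_first, min_second = min_reads
    larger, smaller = max(strand_a, strand_b), min(strand_a, strand_b)
    return (
        strand_a + strand_b >= min_total
        and larger >= min_first
        and smaller >= min_second
    )


# A molecule observed once, on a single strand only.
for thresholds in [(1, 1, 1), (1, 1, 0), (1, 0, 0)]:
    print(thresholds, passes_min_reads(1, 0, thresholds))
```

Under these assumptions a molecule seen once on a single strand is dropped by 1 1 1 but kept by both 1 1 0 and 1 0 0, which is why those two settings behave the same here.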