fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
309 stars 67 forks source link

variant calling with paired-end sequencing #968

Closed emmecola closed 6 months ago

emmecola commented 6 months ago

Dear fgbio developers

We are using fgbio for a sequencing project, and I need a clarification regarding the consensus generation. Reading from the documentation, I understand that “this tool calls each end of a pair independently, and does not jointly call bases that overlap within a pair”.

If I understand correctly, this basically means that the two ends of a read pair will generate two separate consensus sequences, even if they are tagged with the same UMI and map to the same genomic location (they may even overlap).

If that’s the case, when I call the variants on these consensus sequences, the two ends of the same read pair will count as two molecules. Correct? So, looking at the final VCF file, one cannot distinguish a variant detected in two different molecules from a variant that is detected in the two ends of a read pair. Is my reasoning correct?

Thanks!

nh13 commented 6 months ago

We did add support for consensus calling the overlapping bases in a read pair which is turned on by default in both CallMolecularConsensusReads and CallDuplexConsensusReads: https://github.com/fulcrumgenomics/fgbio/pull/805. There exists a separate tool CallOverlappingConsensusBases that can also perform this correction. The former tools use the consensus method (see the latter's documentation) on the raw reads prior to building the consensus for each read independently.

I hope that helps!

emmecola commented 6 months ago

Thank you! So the information coming from the overlapping mate read is taken into account, but in any case two separate consensus sequences will be reported in the output file. Can you confirm that my understanding is correct?

I think this is an important point to consider when calling variants on the consensus sequences: as far as I can see, the real number of distinct template molecules supporting a mutation might be lower than the number of consensus sequences carrying the mutation, simply because the same molecule is represented twice in the consensus file.

nh13 commented 6 months ago

You are correct that two read pairs are output. But variant callers (like GATK) do not count the same bace twice when reporting a mutation from an overlapping read pair. This is regardless of whether the read is a raw read pair or made from a consensus.