d-cameron commented 1 year ago

In this first wave, we will define the minimal representation needed to encode SVs.

High-level design: two sub-components of SV-VRS. The first is a minimal representation that encodes the structural delta between the reference and a sample; the second is a higher-level grouping of these low-level building blocks. E.g. a breakpoint & CN loss can be grouped together and classified as a simple deletion. These groupings quickly become complicated (especially in cancer) and I expect this is where most of the discussion will be.

This minimal representation:

contains sufficient information to round-trip a genome in a meaningful manner (e.g. VRS-encoded CHM13 using hg38 sequences can theoretically be converted back to CHM13 fasta)
A genome is composed of DNA segments (CN segments), and DNA adjacencies (breakpoints). This corresponds to a 'breakpoint graph' with segments as nodes, and
Supports all existing variant call types
Supports ambiguity
- intrinsic (i.e. homology at a breakpoint)
- calling technology imprecision (e.g. optical mapping / CN resolution; linked reads; only read-pair support)
- unplaced segments (i.e. CN gain - already handled in VRS 1.0)
- Single breakend call due to mapper ambiguity on one side
Normalized/unambiguous representation
- Not actually possible in practice.
- I'll raise this as its own issue to split out the discussion on that
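To make the round-trip requirement concrete, here is a minimal sketch of rebuilding a sample sequence by walking reference segments joined by adjacencies. The walk format `(sequence, start, end, forward)`, the 0-based half-open coordinates, and the function names are all assumptions for illustration, not part of any VRS proposal.

```python
# Hypothetical sketch: reconstruct a derived sequence from reference segments.
# Coordinates are assumed to be 0-based, half-open.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def reconstruct(reference, walk):
    """Concatenate the segments of a walk; each join is a DNA adjacency."""
    parts = []
    for name, start, end, forward in walk:
        segment = reference[name][start:end]
        parts.append(segment if forward else reverse_complement(segment))
    return "".join(parts)


reference = {"chrA": "ACGTACGTAC"}

# Deletion of chrA[4:7]: one novel adjacency joins position 4 to position 7.
deletion_walk = [("chrA", 0, 4, True), ("chrA", 7, 10, True)]

# Inversion of chrA[4:7]: two novel adjacencies flank the inverted segment.
inversion_walk = [
    ("chrA", 0, 4, True),
    ("chrA", 4, 7, False),
    ("chrA", 7, 10, True),
]

deleted = reconstruct(reference, deletion_walk)
inverted = reconstruct(reference, inversion_walk)
```

In this toy model the segments carry the copy number (how often each one is traversed) and the joins between consecutive walk entries are the breakpoints.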

Minimal model

Breakend

Minimal: {sequence, pos, orientation}

Example: {chr12, 1000, ConnectedAfterPosition }
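As a rough illustration, the minimal breakend could be sketched as an immutable record. Only `ConnectedAfterPosition` appears above; the second orientation name and the class names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Orientation(Enum):
    # Only ConnectedAfterPosition is named in the proposal;
    # ConnectedBeforePosition is an assumed counterpart.
    CONNECTED_AFTER_POSITION = "ConnectedAfterPosition"
    CONNECTED_BEFORE_POSITION = "ConnectedBeforePosition"


@dataclass(frozen=True)
class Breakend:
    sequence: str             # sequence identifier, e.g. "chr12"
    pos: int                  # position on that sequence
    orientation: Orientation  # which side of pos is joined to other DNA


example = Breakend("chr12", 1000, Orientation.CONNECTED_AFTER_POSITION)
```

Making the record frozen keeps it hashable, which helps if identifiers are later derived from its content.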

Ambiguity

As above but instead of an intrinsic sequence-based ambiguity, it's a caller-based margin of error.

Which alignment convention: left/right/centre/most likely/weighted likelihood/undefined?
Left/right breakend margins can be different
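One way to picture caller imprecision with unequal margins is as a nominal position plus separate left and right error bounds. The field names and the inclusive interval are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionMargin:
    # Hypothetical fields: how far the true position may lie to either
    # side of the reported one. The two margins need not be equal.
    left: int = 0
    right: int = 0


def position_interval(pos, margin):
    """Return the inclusive (lowest, highest) position the caller allows."""
    if margin.left < 0 or margin.right < 0:
        raise ValueError("margins must be non-negative")
    return (pos - margin.left, pos + margin.right)


# A read-pair-only call: poorly resolved, and skewed to the right.
interval = position_interval(1000, PositionMargin(left=20, right=150))
```

Which point inside the interval gets reported as `pos` is exactly the open alignment-convention question above.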

Breakpoint

Minimal: {breakend1, breakend2}

Technically speaking the breakends are unordered, but we can define a sort order to ensure id stability.
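A sketch of how a sort order gives id stability: whichever way round a caller reports the two breakends, the canonical form is the same. The lexicographic key (sequence name, then position, then orientation) is an assumption; any total order would do.

```python
from typing import NamedTuple


class Breakend(NamedTuple):
    sequence: str
    pos: int
    orientation: str


def canonical_breakpoint(a, b):
    """Return the two breakends in a fixed order.

    NamedTuples compare field by field, so this sorts by sequence name,
    then position, then orientation (an assumed ordering).
    """
    return tuple(sorted((a, b)))


left = Breakend("chr12", 1000, "ConnectedAfterPosition")
right = Breakend("chr2", 500, "ConnectedBeforePosition")

# Both input orders collapse to the same canonical pair.
same = canonical_breakpoint(left, right) == canonical_breakpoint(right, left)
```

Note that plain string ordering puts "chr12" before "chr2"; that is harmless for id stability, but it is not the natural contig order.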

inserted sequence:

can be:
- literal sequence
- derived sequence
  - (i.e. if the donor site can be identified)
  - Does VRS support calling variants within this derived sequence (e.g. a SNV in a translocated LINE)
left/right flanking literal sequences (e.g. manta/DRAGEN)
- My preference is for these 'insertions' to be represented as a pair of single breakends
An estimated gap size
- Needed to support BioNano optical mapping gaps

homology

Intrinsic, sequence-based homology bounds can be defined relative to the breakpoint position. SPDI normalisation doesn't really work so we're left with using one of the left/right/centre-alignment conventions (all of which are in use).

Which alignment convention: left/right/centre?
- I'm personally in favour of centre alignment as that has the least deviation
Only defined when there is no caller ambiguity
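The three conventions can be compared with a small sketch. It assumes the homology is given as an inclusive range `[lo, hi]` of equivalent breakpoint positions, and that centre alignment rounds down when the range has an even number of positions; both are assumptions.

```python
def align_position(lo, hi, convention):
    """Pick the reported position within an inclusive homology range."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    if convention == "left":
        return lo
    if convention == "right":
        return hi
    if convention == "centre":
        return lo + (hi - lo) // 2  # rounds down: an assumed tie-break
    raise ValueError(f"unknown convention: {convention}")


def relative_homology_bounds(lo, hi, convention):
    """Express the homology range relative to the reported position."""
    pos = align_position(lo, hi, convention)
    return (lo - pos, hi - pos)


centre = relative_homology_bounds(100, 106, "centre")
left = relative_homology_bounds(100, 106, "left")
right = relative_homology_bounds(100, 106, "right")
```

With centre alignment the largest offset to any equivalent position is half the homology length, versus the full length for left or right alignment, which is the "least deviation" argument.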

Single breakend

Minimal: {breakend}

Need a flag to indicate if this is terminal and ends a linear chromosome
Inserted sequence as per breakpoint definition.
- Note that the inserted sequence is the sequence as far as the evidence supports (asm/read sequence)

Two use cases for this representation model:

Unplaced reference sequence
- A breakpoint exists at the given position, but the aligner cannot unambiguously determine where the other side is.
Novel sequence insertion
- 'Clean' insertions can be represented as breakpoints, but if the insertion is part of a more complex event (e.g. viral-integration induced genomic rearrangements) the two 'sides' of the insertion won't be at the same location and thus must be represented as two separate single breakends.
- Less of an issue with VRS compared to VCF, as VRS isn't tied to a single reference so, unlike VCF, a viral insertion can be represented in breakpoint notation in VRS.

Phasing

For VCF, I added PSL/PSO fields in which phasing membership is encoded, and I am in favour of some sort of equivalent representation. E.g.

{ unique_phasing_identifier, ordinal}

This allows a partial ordering of cis-phased variants. This isn't consistent with the VRS genotype model - mostly because the very idea of a traditional genotype breaks down when SVs are present.
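A sketch of the partial order this gives: variants sharing a phasing identifier are cis-phased and ordered by ordinal, while variants in different phase sets are simply not comparable. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhaseTag:
    phase_set_id: str  # the unique_phasing_identifier
    ordinal: int       # position of the variant within that phase set


def compare_phase(a: PhaseTag, b: PhaseTag) -> Optional[int]:
    """Return -1, 0 or 1 within a phase set, or None if not comparable."""
    if a.phase_set_id != b.phase_set_id:
        return None
    return (a.ordinal > b.ordinal) - (a.ordinal < b.ordinal)


# Tandem duplication: X on the first copy, Y on the second copy of the
# same haplotype; Z on the haplotype without the duplication.
snp_x = PhaseTag("hapA", 1)
snp_y = PhaseTag("hapA", 2)
snp_z = PhaseTag("hapB", 1)

x_vs_y = compare_phase(snp_x, snp_y)
x_vs_z = compare_phase(snp_x, snp_z)
```

Here X precedes Y on the same haplotype, while nothing is asserted about the order of X relative to Z.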

The example I typically use is a diploid genome with a tandem duplication, with SNP X on the first copy, SNP Y on the second copy, and SNP Z on the copy without the tandem duplication. The genome is triploid in this region, with SNPs X & Y both cis (both occur on the same chromosome) and trans (on each DUP copy they're cis-phased with REF), depending on how you define cis/trans.

Extended phasing support

VCF does not support partial phasing restrictions. That is, in polyploid scenarios (e.g. plants, cancer), you can determine that two variants are trans phased relative to each other, but not be able to determine whether they are cis/trans phased with the other variants/copies. In such polyploid partial phasing scenarios, one could determine that two variants are trans phased (because you encounter reads with only one of the variants) but, since the sample is polyploid at this locus, they could also be cis phased on one or more of the other copies.

Need to represent breakpoint copy number
- Need to support both germline (integer) and somatic (floating point) CN
  - Is sample sub-clonality support in/out of scope?

Do we want VRS to support representing this level of phasing information? I'd like it to, but I'd also like feedback before designing a model for this.

Reference

Since variants are defined as a delta, a base structure over which these variants are defined needs to be specified. VRS has chosen to define sequence as a per-variant property but this is structurally ambiguous. For example, if a DNA segment X is in different locations in hg19 and hg38 and a complete SV call set is defined but reported in a combination of hg19 and hg38 coordinates, then the actual genomic structure of the individual is ambiguous because it's unclear where DNA segment X should be placed in the reconstructed genome (since the hg19<->hg38 coordinate transform implicitly adds SVs at each liftover boundary and it's unclear which of these should be used in the reconstruction).

More generally, we need to define what a breakpoint actually means in VRS. There are two interpretations: one is the existence of a DNA adjacency in which the flanking sequence matches the sequence defined by {sequence, pos} (local/weak definition); the other is that there exists a genomic structure implicitly defined by sequence and the breakpoint encodes a DNA adjacency not part of this genomic structure. I would argue that the current version of VRS is closer to the former than the latter, although CNVs do imply the latter interpretation.

Terminology

A note on terminology: pretty much every noun in this model has differing meanings/interpretations, each of which is in wide-spread usage in some part of the wider community. If there's any word you disagree with, please provide feedback as input from the community is important and I'm only intimately familiar with the NGS variant calling terminology.

Rapsssito commented 1 year ago

@d-cameron, thanks a lot for the explanation. I very much agree with this approach. While in VCF the SVs are an afterthought, in VRS the SVs must be at the center, especially the breakpoints/breakends. I have a few comments:

Technically speaking the breakends are unorder but we can define a sort order to ensure id stability.

If we decide to force an order, I would say it should match the order in which the contigs appear in the original alignment file (or the reference file) and then by position in the contig (ascending). However, is it necessary we decide an order? Can we just specify {breakend1, breakend2} is an unordered tuple?

SINGLE BREAKEND. Unplaced reference sequence: A breakpoint exists at the given position, but the aligner cannot unambiguously determine where the other side is.

Are you talking about GRIDSS 2 SGL? If this is the case, you mention "the aligner cannot unambiguously determine where the other side is", but would not that be the variant caller?

A note on terminology: pretty much every noun in this model has differing meanings/interpretations, each of which is in wide-spread usage in some part of the wider community. (...)

Should we then create a quick glossary so everyone is on the same page regarding the terminology? It would be useful to include: breakend, breakpoint, CN, aligner and variant caller.

A more general question. Do we aim to remove the "artificial" difference between INDELs and SVs with this approach?

github-actions[bot] commented 10 months ago

This issue was marked stale due to inactivity.

ga4gh / vrs

SV-VRS wave 1 discussion: minimal representation #425