Closed d-cameron closed 2 months ago
@d-cameron, thanks a lot for the explanation. I very much agree with this approach. While in VCF the SVs are an afterthought, in VRS the SVs must be at the center, especially the breakpoints/breakends. I have a few comments:
Technically speaking the breakends are unordered but we can define a sort order to ensure id stability.
If we decide to force an order, I would say it should match the order in which the contigs appear in the original alignment file (or the reference file) and then by position in the contig (ascending). However, is it necessary that we decide on an order? Can we just specify that {breakend1, breakend2} is an unordered tuple?
SINGLE BREAKEND. Unplaced reference sequence: A breakpoint exists at the given position, but the aligner cannot unambiguously determine where the other side is.
Are you talking about GRIDSS 2 SGL? If so, you mention "the aligner cannot unambiguously determine where the other side is", but would that not be the variant caller?
A note on terminology: pretty much every noun in this model has differing meanings/interpretations, each of which is in wide-spread usage in some part of the wider community. (...)
Should we then create a quick glossary so everyone is on the same page regarding the terminology? It would be useful to include: breakend, breakpoint, CN, aligner and variant caller.
A more general question. Do we aim to remove the "artificial" difference between INDELs and SVs with this approach?
This issue was marked stale due to inactivity.
In this first wave, we will define the minimal representation needed to encode SVs.
High-level design: two sub-components of SV-VRS. The first is a minimal representation that encodes the structural delta between the reference and a sample; the second is a higher-level grouping of these low-level building blocks. E.g. a breakpoint & CN loss can be grouped together and classified as a simple deletion. These groupings quickly become complicated (especially in cancer), and I expect this is where most of the discussion will be.
This minimal representation:
Minimal model
Breakend
Minimal: {sequence, pos, orientation}
Example: {chr12, 1000, ConnectedAfterPosition }
Ambiguity
As above but instead of an intrinsic sequence-based ambiguity, it's a caller-based margin of error.
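As a rough illustration of the minimal breakend above, here is a sketch of {sequence, pos, orientation} as an immutable record. The class names, the field types, and the `CONNECTED_BEFORE_POSITION` member are assumptions made for this example; only `ConnectedAfterPosition` appears in the proposal, and nothing here is part of VRS.

```python
from dataclasses import dataclass
from enum import Enum


class Orientation(Enum):
    # Taken from the example in the text.
    CONNECTED_AFTER_POSITION = "ConnectedAfterPosition"
    # Assumed counterpart, not named in the proposal.
    CONNECTED_BEFORE_POSITION = "ConnectedBeforePosition"


@dataclass(frozen=True)
class Breakend:
    """Minimal breakend: {sequence, pos, orientation}."""
    sequence: str             # reference sequence identifier, e.g. "chr12"
    pos: int                  # position on that sequence
    orientation: Orientation  # which side of pos the adjacency attaches to


# The example from the text: {chr12, 1000, ConnectedAfterPosition}
example = Breakend("chr12", 1000, Orientation.CONNECTED_AFTER_POSITION)
```

Making the record frozen keeps it hashable, so two breakends with the same three values compare equal and can be used as set members or dictionary keys.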
Breakpoint
Minimal: {breakend1, breakend2}
Technically speaking the breakends are unordered but we can define a sort order to ensure id stability.
inserted sequence:
homology
Intrinsic, sequence-based homology bounds can be defined relative to the breakpoint position. SPDI normalisation doesn't really work so we're left with using one of the left/right/centre-alignment conventions (all of which are in use).
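The point about defining a sort order for id stability can be sketched as follows. This is a minimal illustration under stated assumptions: the sort key (sequence name, then position, then orientation, compared lexicographically) is an arbitrary choice, and the SHA-256 digest over JSON is only a stand-in for whatever identifier scheme VRS would actually use. The names `canonical_pair` and `breakpoint_id` are hypothetical.

```python
import hashlib
import json
from typing import NamedTuple, Tuple


class Breakend(NamedTuple):
    sequence: str
    pos: int
    orientation: str


def canonical_pair(a: Breakend, b: Breakend) -> Tuple[Breakend, Breakend]:
    """Return the two breakends in a fixed order, whatever order they arrive in."""
    # Assumed sort key: plain tuple comparison of (sequence, pos, orientation).
    first, second = sorted((a, b))
    return first, second


def breakpoint_id(a: Breakend, b: Breakend) -> str:
    """Illustrative identifier that does not depend on argument order."""
    first, second = canonical_pair(a, b)
    payload = json.dumps([list(first), list(second)], separators=(",", ":"))
    # Stand-in digest only; not the VRS identifier algorithm.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


left = Breakend("chr12", 1000, "ConnectedAfterPosition")
right = Breakend("chr5", 2000, "ConnectedBeforePosition")
same_id = breakpoint_id(left, right) == breakpoint_id(right, left)
```

Any total order gives the same stability guarantee. Sorting by the contig order of a reference file instead of by name would only change the sort key, not the structure of the sketch.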
Single breakend
Minimal: {breakend}
Two use cases for this representation model:
Phasing
For VCF, I added the PSL/PSO fields to record phasing membership, and I am in favour of some sort of equivalent representation. E.g.
{ unique_phasing_identifier, ordinal}
This allows a partial ordering of cis-phased variants. This isn't consistent with the VRS genotype model - mostly because the very idea of a traditional genotype breaks down when SVs are present.
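To make the partial ordering concrete, here is a small sketch of how a {unique_phasing_identifier, ordinal} tag could be compared. It assumes that two variants are comparable only when they share a phasing identifier and that a lower ordinal sorts first; `PhaseTag` and `compare` are hypothetical names, not part of VCF or VRS.

```python
from typing import NamedTuple, Optional


class PhaseTag(NamedTuple):
    phase_set: str  # the unique phasing identifier
    ordinal: int    # position within that phase set (assumed: lower sorts first)


def compare(a: PhaseTag, b: PhaseTag) -> Optional[int]:
    """Return -1, 0 or 1 if both tags share a phase set, else None (incomparable)."""
    if a.phase_set != b.phase_set:
        return None
    return (a.ordinal > b.ordinal) - (a.ordinal < b.ordinal)


x = PhaseTag("ps1", 1)
y = PhaseTag("ps1", 2)
z = PhaseTag("ps2", 1)
```

Here `compare(x, y)` orders the two variants because they share `ps1`, while `compare(x, z)` returns `None`: nothing is asserted about variants in different phase sets, which is what makes the ordering partial rather than total.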
The example I typically use is a diploid genome with a tandem duplication, with SNP X on the first copy, SNP Y on the second copy, and SNP Z on the copy without the tandem duplication. The genome is triploid in this region, with SNP X & Y both cis (both occur on the same chromosome) and trans (on each DUP copy they're cis-phased with REF), depending on how you define cis/trans.
Extended phasing support
VCF does not support partial phasing restrictions. That is, in polyploid scenarios (e.g. plants, cancer), you may be able to determine that two variants are trans phased relative to each other, but not whether they are cis/trans phased with the other variants/copies. In such polyploid partial phasing scenarios, one could determine that two variants are trans phased (because you encounter reads with only one of the variants) but, since the sample is polyploid at this locus, they could also be cis phased on one or more of the other copies.
Do we want VRS to support representing this level of phasing information? I'd like it to, but I'd also like feedback before designing a model for this.
Reference
Since variants are defined as a delta, a base structure over which these variants are defined needs to be specified. VRS has chosen to define sequence as a per-variant property, but this is structurally ambiguous. For example, if a DNA segment X is in different locations in hg19 and hg38, and a complete SV call set is defined but reported in a combination of hg19 and hg38 coordinates, then the actual genomic structure of the individual is ambiguous because it's unclear where DNA segment X should be placed in the reconstructed genome (since the hg19<->hg38 coordinate transform implicitly adds SVs at each liftover boundary and it's unclear which of these should be used in the reconstruction).
More generally, we need to define what a breakpoint actually means in VRS. There are two interpretations: one is the existence of a DNA adjacency in which the flanking sequence matches the sequence defined by {sequence, pos} (local/weak definition); the other is that there exists a genomic structure implicitly defined by sequence and the breakpoint encodes a DNA adjacency that is not part of this genomic structure. I would argue that the current version of VRS is closer to the former than the latter, although CNVs do imply the latter interpretation.
Terminology
A note on terminology: pretty much every noun in this model has differing meanings/interpretations, each of which is in widespread usage in some part of the wider community. If there's any word you disagree with, please provide feedback: input from the community is important, and I'm only intimately familiar with the NGS variant calling terminology.