ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0
80 stars 34 forks source link

ReferenceLengthExpression: Documentation #426

Closed larrybabb closed 3 months ago

larrybabb commented 1 year ago

@ahwagner In today's VRS call I brought up the question of how one would go from a SPDI nomenclature to a VRS SeqLocation that had a ReferenceLengthExpression state.

I referenced the following clinvar variation example

variationID: 1802409
name: NM_018406.7(MUC4):c.7701_7702insTCAGTATCCACAGGTCATGCCACCCCTCTTCATGTCACCAACACTTCC (p.Ser2567_Ala2568insSerValSerThrGlyHisAlaThrProLeuHisValThrAsnThrSer)
type: Indel
spdi: NC_000003.12:195783878:GGAAGTGT:GGAAGTGTTGGTGACATGAAGAGGGGTGGCATGACCTGTGGATACTGAGGAAGTGT
b38_hgvs: NC_000003.12:g.195783886_195783887insTGGTGACATGAAGAGGGGTGGCATGACCTGTGGATACTGAGGAAGTGT
del_spdi: GGAAGTGT
ins_spdi: GGAAGTGTTGGTGACATGAAGAGGGGTGGCATGACCTGTGGATACTGAGGAAGTGT

In this example the 9th character in the ins_spdi seq is a T which is not the natural expansion of the del_spdi....

The repeating sequence in the ins_seq string is GGAAGTGTTGGTGACAT so if we put in 56 for the the RLE on a SeqLocation like the following

start: 195783878,
end: 195783886,
state: 56

How would the start/end impact the recreation of the inserted sequence?

I assumed that 56 would mean that we would take the original start-end ref and start repeating it out until we truncated it at 56. In this case that would create the sequence

GGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGT

which is not the right answer for this normalized variation

To date, we've been able to go directly from SPDI nomenclature to VRS since we would simply take the spdi position as the loc.start and add the length of the del_seq to get the loc.end and then assume the ins_seq of the spdi nomenclature is the loc.state LiteralSequenceExpression.

It seems like now we will need to re-normalize the SPDI in order to identify when the use case for RefLenExpr is appropriate.

Is this a big drawback? Or should we consider whether SPDI should include the RefLenExpr state in it's nomenclature so that we stay in sync with SPDI?

larrybabb commented 1 year ago

@ahwagner It seems like all VOCA normalized repeating sequence alleles have both a

-- contraction of one CCT trinucleotide
NC_000001.11:40819438:CTCCTCCT:CTCCT

-- expansion of one CCT trinucleotide
NC_000001.11:40819438:CTCCTCCT:CTCCTCCTCCT

So in order to use the SPDI nomenclature to deduce an RLE I think we'll need to find the repeating pattern by taking the diff between the two sequences and reducing it to a non-repeating pattern anchored from the rightmost character.

So above would be CCT in both cases which is already minimized, but if it were several trinucleotide expressions and we had a diff of CCTCCT. Then we would test the rightmost T then CT and then CCT at which point it would show up as a match for the previous patterns.

This must be close to what NCBI is doing to verify what is a Microsatellite. And they use "Deletion" for any 'contractions' (or at least it seems so - i haven't fully analyzed).

Anyway, I think we would need to bake something like this into our normalization process to make sure that we can go back and forth between VRS and SPDI - which to me is a pretty important feature to preserve.

ahwagner commented 11 months ago

Added examples to VRS 2.0a corresponding to your SPDI expressions in 191f809.

Here is a simple algorithm in Python for reconstituting the sequence of a normalized RLE Allele rleAllele in VRS 2.0:

from itertools import cycle
seqId = rleAllele.sequenceReference.refgetAccession
start = rleAllele.location.start
end = start + rleAllele.state.repeatSubunitLength
subseq = get_sequence(seqId, start, end) # sequence retrieval function, e.g. from SeqRepo
c = cycle(subseq)
derivedseq = ''
for i in range(rleAllele.state.length):
  derivedseq += next(c)
return derivedseq
larrybabb commented 11 months ago

@ahwagner should we close this issue now or should we consider adding some/any of this to our documentation? Please advise.

ahwagner commented 11 months ago

Issue isn't closed. I was thinking about documenting the above as well as the reverse direction: SPDI -> VRS RLE. The solution to generating the VRS RLE is straightforward; we can derive this from VOCA with no additional steps other than storing the subunit length and total length. But I don't think either direction is documented yet. Will update issue title and tag to reflect this.

ahwagner commented 9 months ago

This was clarifying for the definition:

anytime a variant can be derived solely from the reference, you use an RLE