Closed larrybabb closed 3 months ago
@ahwagner It seems like all VOCA normalized repeating sequence alleles have both a
-- contraction of one CCT trinucleotide
NC_000001.11:40819438:CTCCTCCT:CTCCT
-- expansion of one CCT trinucleotide
NC_000001.11:40819438:CTCCTCCT:CTCCTCCTCCT
So in order to use the SPDI nomenclature to deduce an RLE I think we'll need to find the repeating pattern by taking the diff between the two sequences and reducing it to a non-repeating pattern anchored from the rightmost character.
So above would be CCT in both cases, which is already minimized, but if it were several trinucleotide expressions and we had a diff of CCTCCT, then we would test the rightmost T, then CT, and then CCT, at which point it would show up as a match for the previous patterns.
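A minimal sketch of that reduction, assuming the shorter SPDI sequence is a prefix of the longer one (as in the two examples above). The function names `spdi_diff` and `minimal_repeat_unit` are hypothetical and not part of any VRS or NCBI library:

```python
def spdi_diff(del_seq: str, ins_seq: str) -> str:
    """Return the part of the longer sequence that extends past the shorter one."""
    shorter, longer = sorted((del_seq, ins_seq), key=len)
    if not longer.startswith(shorter):
        raise ValueError("expected one sequence to be a prefix of the other")
    return longer[len(shorter):]


def minimal_repeat_unit(diff: str) -> str:
    """Find the shortest right-anchored suffix that tiles the whole diff."""
    n = len(diff)
    for k in range(1, n + 1):
        if n % k:
            continue  # a unit of length k cannot tile the diff evenly
        unit = diff[-k:]  # candidate taken from the rightmost characters
        if unit * (n // k) == diff:
            return unit
    return diff


print(spdi_diff("CTCCTCCT", "CTCCT"))        # contraction example
print(spdi_diff("CTCCTCCT", "CTCCTCCTCCT"))  # expansion example
print(minimal_repeat_unit("CCTCCT"))         # tests T, then CT (skipped), then CCT
```

Candidates whose length does not divide the diff length are skipped, so for CCTCCT only T, CCT and the full string are actually compared.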
This must be close to what NCBI is doing to verify what is a Microsatellite. And they use "Deletion" for any 'contractions' (or at least it seems so - I haven't fully analyzed).
Anyway, I think we would need to bake something like this into our normalization process to make sure that we can go back and forth between VRS and SPDI, which to me is a pretty important feature to preserve.
Added examples to VRS 2.0a corresponding to your SPDI expressions in 191f809.
Here is a simple algorithm in Python for reconstituting the sequence of a normalized RLE Allele rleAllele in VRS 2.0:

    from itertools import cycle, islice

    def reconstitute_sequence(rleAllele):
        # Fetch the reference subunit that the repeat is built from
        seqId = rleAllele.sequenceReference.refgetAccession
        start = rleAllele.location.start
        end = start + rleAllele.state.repeatSubunitLength
        subseq = get_sequence(seqId, start, end)  # sequence retrieval function, e.g. from SeqRepo
        # Cycle through the subunit and stop once the RLE length is reached
        return ''.join(islice(cycle(subseq), rleAllele.state.length))
@ahwagner should we close this issue now or should we consider adding some/any of this to our documentation? Please advise.
Issue isn't closed. I was thinking about documenting the above as well as the reverse direction: SPDI -> VRS RLE. The solution to generating the VRS RLE is straightforward; we can derive this from VOCA with no additional steps other than storing the subunit length and total length. But I don't think either direction is documented yet. Will update issue title and tag to reflect this.
This was clarifying for the definition:
anytime a variant can be derived solely from the reference, you use an RLE
@ahwagner In today's VRS call I brought up the question of how one would go from a SPDI nomenclature to a VRS SeqLocation that had a ReferenceLengthExpression state.

I referenced the following ClinVar variation example

In this example the 9th character in the ins_spdi seq is a T, which is not the natural expansion of the del_spdi ....

The repeating sequence in the ins_seq string is GGAAGTGTTGGTGACAT so if we put in 56 for the RLE on a SeqLocation like the following

How would the start/end impact the recreation of the inserted sequence?

I assumed that 56 would mean that we would take the original start-end ref and start repeating it out until we truncated it at 56. In this case that would create the sequence GGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGTGGAAGTGT which is not the right answer for this normalized variation.
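The mismatch can be reproduced with a short sketch. It assumes the reference span is GGAAGTGT, which is inferred from the expanded sequence written out above rather than taken from the ClinVar record, and `expand_reference` is a hypothetical helper name:

```python
from itertools import cycle, islice


def expand_reference(ref_subseq: str, length: int) -> str:
    """Repeat ref_subseq cyclically and truncate the result at `length` characters."""
    return "".join(islice(cycle(ref_subseq), length))


ref_subseq = "GGAAGTGT"             # assumed reference span (inferred, see lead-in)
observed_unit = "GGAAGTGTTGGTGACAT"  # repeating sequence reported for the ins_seq

naive = expand_reference(ref_subseq, 56)

print(len(naive))                       # total length requested by the RLE
print(naive[8], observed_unit[8])       # 9th character of each sequence
print(naive.startswith(observed_unit))  # whether the naive expansion matches
```

The two sequences agree on the first eight characters and diverge at the 9th, which is where the naive expansion stops describing this variation.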
To date, we've been able to go directly from SPDI nomenclature to VRS since we would simply take the spdi position as the loc.start and add the length of the del_seq to get the loc.end and then assume the ins_seq of the spdi nomenclature is the loc.state LiteralSequenceExpression.

It seems like now we will need to re-normalize the SPDI in order to identify when the use case for RefLenExpr is appropriate.

Is this a big drawback? Or should we consider whether SPDI should include the RefLenExpr state in its nomenclature so that we stay in sync with SPDI?
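For reference, the direct SPDI to VRS mapping described above (position as start, position plus deleted length as end, inserted sequence as a literal state) can be sketched as follows. The function name and the flat dictionary layout are illustrative assumptions, not the VRS schema, and the SPDI accession is passed through without being translated to a refget accession:

```python
def spdi_to_literal_allele(spdi: str) -> dict:
    """Map a SPDI string to start/end/state using the direct, non-RLE approach."""
    seq_id, position, deleted, inserted = spdi.split(":")
    start = int(position)
    # SPDI permits the deletion to be given as a sequence or as a length
    del_len = int(deleted) if deleted.isdigit() else len(deleted)
    return {
        "sequence": seq_id,  # accession as written in the SPDI (assumption: untranslated)
        "start": start,
        "end": start + del_len,
        "state": {"type": "LiteralSequenceExpression", "sequence": inserted},
    }


allele = spdi_to_literal_allele("NC_000001.11:40819438:CTCCTCCT:CTCCT")
print(allele["start"], allele["end"], allele["state"]["sequence"])
```

Nothing in this mapping inspects the reference, which is why it cannot by itself tell when a RefLenExpr state would be the appropriate one.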