Closed ahwagner closed 10 months ago
I propose that we address mixed repeats (and other related scenarios) using a new SequenceExpression subclass, ComposedSequenceExpression. The proposed class definition follows.
An expression of a sequence composed from multiple SequenceExpression components.
At least one SequenceExpression component is not a LiteralSequenceExpression. No components may be ComposedSequenceExpressions themselves.
Field | Type | Limits | Description |
---|---|---|---|
type | string | 1..1 | The ComposedSequenceExpression type. MUST be ComposedSequenceExpression. |
components | SequenceExpression (excluding ComposedSequenceExpression) | 2..m | An ordered list of SequenceExpression components used to compose the ComposedSequenceExpression. |
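The constraints in the definition above can be checked mechanically. The following is a minimal sketch, assuming expressions are plain dictionaries parsed from VRS JSON; the function name `validate_composed` is hypothetical and not part of any VRS library.

```python
def validate_composed(expr: dict) -> list:
    """Return a list of constraint violations (empty if the expression is valid)."""
    errors = []
    if expr.get("type") != "ComposedSequenceExpression":
        errors.append("type MUST be ComposedSequenceExpression")

    components = expr.get("components", [])
    # Limits are 2..m: at least two components are required.
    if len(components) < 2:
        errors.append("components requires at least 2 entries")

    types = [c.get("type") for c in components]
    # No component may itself be a ComposedSequenceExpression.
    if "ComposedSequenceExpression" in types:
        errors.append("components may not be ComposedSequenceExpressions")
    # At least one component must be something other than a literal.
    if components and all(t == "LiteralSequenceExpression" for t in types):
        errors.append("at least one component must not be a LiteralSequenceExpression")
    return errors


all_literal = {
    "type": "ComposedSequenceExpression",
    "components": [
        {"type": "LiteralSequenceExpression", "sequence": "GCG"},
        {"type": "LiteralSequenceExpression", "sequence": "GCA"},
    ],
}
print(validate_composed(all_literal))
```

Returning a list of violations instead of raising on the first one is only a convenience for reporting; either design would satisfy the definition.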
An example sequence expression for OPMD allele -1 from John Li and Rachel Kutner's researched proposal using this class:
{
  "type": "ComposedSequenceExpression",
  "components": [
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
      "count": { "type": "Number", "value": 11 }
    },
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCA" },
      "count": { "type": "Number", "value": 3 }
    },
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
      "count": { "type": "Number", "value": 1 }
    }
  ]
}
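To illustrate what the expression above denotes, here is a rough sketch that expands it into a literal sequence. It assumes every count is a definite Number (ranges or indefinite counts are rejected), and the `expand` helper is hypothetical and not taken from vrs-python.

```python
def expand(expr: dict) -> str:
    """Expand a sequence expression into the literal sequence it describes."""
    kind = expr["type"]
    if kind == "LiteralSequenceExpression":
        return expr["sequence"]
    if kind == "RepeatedSequenceExpression":
        count = expr["count"]
        # Assumption: only definite counts can be expanded to one sequence.
        if count["type"] != "Number":
            raise ValueError("only definite counts can be expanded")
        return expand(expr["seq_expr"]) * count["value"]
    if kind == "ComposedSequenceExpression":
        # Components are concatenated in array order.
        return "".join(expand(c) for c in expr["components"])
    raise ValueError(f"unsupported type: {kind}")


def repeat(unit: str, n: int) -> dict:
    """Build a RepeatedSequenceExpression for a literal unit repeated n times."""
    return {
        "type": "RepeatedSequenceExpression",
        "seq_expr": {"type": "LiteralSequenceExpression", "sequence": unit},
        "count": {"type": "Number", "value": n},
    }


opmd = {
    "type": "ComposedSequenceExpression",
    "components": [repeat("GCG", 11), repeat("GCA", 3), repeat("GCG", 1)],
}
sequence = expand(opmd)
print(len(sequence))  # 45 nucleotides: 33 + 9 + 3
```

The result depends on the order of `components`, which is why the array order matters for this class.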
Works well for communicating/representing the sequence. But could you show me how the composedSequence's relationship to a chromosomal position would be represented in your model?
Would it be in here? https://vrs.ga4gh.org/en/stable/terms_and_model.html?highlight=repeated#repeatedsequenceexpression
Thanks! Bret
What is the artificial (or generally acceptable) boundary between CNV versus tandem repeats? Or could the structure here also cover communication of a CNV where the sequence of each copy of X is known? E.g. a gene duplication event where the duplicates do not share the exact same sequence?
> Works well for communicating/representing the sequence. But could you show me how the composedSequence's relationship to a chromosomal position would be represented in your model?
@bheale no problem. The above ComposedSequenceExpression is used wherever a SequenceExpression (the parent class) is allowed. In the case of the above OPMD Allele -1, it would be used inside of an Allele to describe the state.
Represented on the transcript sequence NM_004643.3, the full Allele would look like this:
{
  "type": "Allele",
  "location": {
    "type": "SequenceLocation",
    "sequence_id": "ga4gh:SQ.sH4gymNtL5nxNdTE3evfxzZa4dg3fqDz",
    "interval": {
      "type": "SequenceInterval",
      "start": { "type": "Number", "value": 3 },
      "end": { "type": "Number", "value": 33 }
    }
  },
  "state": {
    "type": "ComposedSequenceExpression",
    "components": [
      {
        "type": "RepeatedSequenceExpression",
        "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
        "count": { "type": "Number", "value": 11 }
      },
      {
        "type": "RepeatedSequenceExpression",
        "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCA" },
        "count": { "type": "Number", "value": 3 }
      },
      {
        "type": "RepeatedSequenceExpression",
        "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
        "count": { "type": "Number", "value": 1 }
      }
    ]
  }
}
@larrybabb: we should consider adding indices to JSON arrays where order is meaningful
@tnavatar: this would make everyone's life easier in RDF land
Andreas & Andy: VRS is verbose, why not just do it?
Bob Freimuth: +1
We should introduce a special variable "index" that allows us to always represent these types of arrays consistently for computed identifier construction. This may actually require a special data class that allows us to explicitly contain other objects; so we can have ordered, identifiable objects.
It should be a red flag if there is no index special class in an array and it allows for non-identifiable objects.
We should have documentation to explain when to use this and why.
@larrybabb proposal: we should constrain this to only RepeatedSequenceExpressions as components. Unanimous agreement to start constrained on 12/13 call.
I think I need to better understand the issue behind the above proposal. It seems to me (and I believe that @reece and @bheale would agree) that since meaningful ordering is definitional behavior for JSON arrays, the introduction of a specialized "ordered container class" is effectively extra weight on what is already explicitly meant by an Array.
Yes, I agree with @ahwagner: arrays are already ordered, so a new container that does the same thing makes the new container incompatible with arrays and, at the same time, creates new obligations for implementations to ensure that the same index isn't used more than once.
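The trade-off in the two comments above can be sketched in a few lines. This is only an illustration: the indexed container shape (`index`/`value` keys) and the `from_indexed` helper are hypothetical, since the thread never settled on a concrete structure.

```python
import json

# A plain JSON array keeps its order through a serialize/parse round trip.
components = ["GCG", "GCA", "GCG"]
round_tripped = json.loads(json.dumps({"components": components}))["components"]
assert round_tripped == components

# Hypothetical indexed container: order is carried by an explicit field,
# so consumers must validate the indices before trusting them.
indexed = [
    {"index": 2, "value": "GCG"},
    {"index": 0, "value": "GCG"},
    {"index": 1, "value": "GCA"},
]


def from_indexed(items: list) -> list:
    """Recover the ordered values, rejecting duplicate or missing indices."""
    indices = [item["index"] for item in items]
    if sorted(indices) != list(range(len(items))):
        raise ValueError("indices must be unique and contiguous from 0")
    ordered = sorted(items, key=lambda item: item["index"])
    return [item["value"] for item in ordered]


print(from_indexed(indexed))
```

The uniqueness and contiguity check is the extra obligation mentioned above; a plain array has no equivalent failure mode.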
Does GA4GH use XML representations as well as JSON?
GA4GH data standards come in a variety of representations; Google Protocol Buffers (protobuf), JSON Schema, and file-based are a few. The VRS Schema is implemented in JSON Schema, and consequently VRS JSON is the authoritative form and basis for the computed identifier algorithm.
The VRS information model has also been implemented as protobuf in the past, but I do not know of an XML implementation out there. However, I believe that even if there were an XML implementation, there would still not be a need for an explicit index element or attribute; XML element order in lists is assumed to be preserved (http://lists.xml.org/archives/xml-dev/201003/msg00045.html).
Thanks! Great link. Love the community. Happy New Year-ish, Bret
@ahwagner While it is great that JSON offers definitional behavior for arrays, we use a form of Canonical JSON tooling to sort our JSON, including the array elements, before digesting. This JSON canonicalization would interfere with the producer's "implicit" ordering that would be assumed from the array ordering, and thus risks misrepresenting the intended array order. By adding the index (which is arguably very lightweight), we reduce the assumptions and the risk that ordering is misconstrued.
I had the same concern, but as it turns out @reece was ahead of the ball on this and our serialization strategy only sorts if the array contains ids or digests; this is also how it is implemented in vrs-python.
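A rough sketch of the behavior described above, where only arrays of identifiers are sorted before digesting, might look like the following. This is not the vrs-python implementation; the `is_identifier` test (strings starting with `ga4gh:`) and the function names are assumptions made for illustration.

```python
import json


def is_identifier(value) -> bool:
    # Assumption: identifiers are CURIE-like strings with a "ga4gh:" prefix.
    return isinstance(value, str) and value.startswith("ga4gh:")


def canonicalize(value):
    """Sort arrays only when every element is an identifier; keep other arrays as-is."""
    if isinstance(value, dict):
        return {key: canonicalize(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [canonicalize(item) for item in value]
        if items and all(is_identifier(item) for item in items):
            return sorted(items)
        return items  # meaningful order, such as components, is preserved
    return value


def serialize(obj) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    text = json.dumps(canonicalize(obj), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


# Arrays of identifiers serialize identically regardless of input order.
a = serialize({"members": ["ga4gh:VA.b", "ga4gh:VA.a"]})
b = serialize({"members": ["ga4gh:VA.a", "ga4gh:VA.b"]})
print(a == b)
```

Under this rule, an array of component objects is never reordered, so the positional order of a ComposedSequenceExpression survives serialization.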
That's great. However, I am still a little uncomfortable with relying on the physical positional order in the message's array being significant, particularly when sharing with external systems. If others are comfortable with the reliance on understanding that arrays in our messages are strictly ordered then I can capitulate.
Still needed: documentation. Reopening to address prior to 1.3rc
This issue was marked stale due to inactivity.
Closing this since it was merged and the subsequent documentation task was added to a new ticket.
In communication from @rhdolin:
Direct link to examples: https://jira.hl7.org/secure/attachment/19521/19521_Repeat+Expansion+HL7+Proposal.pdf