ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0

Multiple repeat expansions in VRS #363

Closed ahwagner closed 10 months ago

ahwagner commented 2 years ago

In communication from @rhdolin:

Rachel Kutner from Epic has put together a thoroughly researched proposal for representing repeat expansions in FHIR Genomics. We've gone over the VRS model, and I had a question.

In scenarios where you have mixed / multiple repeats (e.g. 'CTG[30]CAG[50]' or 'NM_004643: GCG[6]GCA[3]GCG[1]'), how would VRS represent the multiple sequential repeats? Would you, for instance, create a RepeatedSequenceExpression for each of the repeats, and then group them together in a haplotype? Or is there some other approach you'd take in VRS?

Thanks!

Direct link to examples: https://jira.hl7.org/secure/attachment/19521/19521_Repeat+Expansion+HL7+Proposal.pdf

ahwagner commented 2 years ago

I propose that we address mixed repeats (and other related scenarios) using a new SequenceExpression subclass, ComposedSequenceExpression. The proposed class definition follows.

ComposedSequenceExpression

An expression of a sequence composed from multiple SequenceExpression components.

Constraints

At least one SequenceExpression component is not a LiteralSequenceExpression. No components may be ComposedSequenceExpressions themselves.

Information Model

| Field | Type | Limits | Description |
|---|---|---|---|
| type | string | 1..1 | The ComposedSequenceExpression type. MUST be ComposedSequenceExpression. |
| components | SequenceExpression (excluding ComposedSequenceExpression) | 2..m | An ordered list of SequenceExpression components used to compose the ComposedSequenceExpression. |
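A minimal sketch of how the constraints above could be checked, assuming the objects are plain dicts parsed from VRS JSON. The function name `validate_composed` is hypothetical and not part of any VRS tooling.

```python
from typing import Any


def validate_composed(expr: dict[str, Any]) -> list[str]:
    """Return a list of constraint violations (empty if none were found)."""
    problems = []
    if expr.get("type") != "ComposedSequenceExpression":
        problems.append("type MUST be ComposedSequenceExpression")

    components = expr.get("components", [])
    if len(components) < 2:
        problems.append("components requires at least 2 entries (2..m)")

    types = [c.get("type") for c in components]
    # No components may be ComposedSequenceExpressions themselves.
    if "ComposedSequenceExpression" in types:
        problems.append("components may not be ComposedSequenceExpressions")
    # At least one component is not a LiteralSequenceExpression.
    if components and all(t == "LiteralSequenceExpression" for t in types):
        problems.append("at least one component must not be a LiteralSequenceExpression")
    return problems


valid = {
    "type": "ComposedSequenceExpression",
    "components": [
        {"type": "RepeatedSequenceExpression",
         "seq_expr": {"type": "LiteralSequenceExpression", "sequence": "CTG"},
         "count": {"type": "Number", "value": 30}},
        {"type": "RepeatedSequenceExpression",
         "seq_expr": {"type": "LiteralSequenceExpression", "sequence": "CAG"},
         "count": {"type": "Number", "value": 50}},
    ],
}
only_literal = {
    "type": "ComposedSequenceExpression",
    "components": [{"type": "LiteralSequenceExpression", "sequence": "A"}],
}

print(validate_composed(valid))
print(validate_composed(only_literal))
```

The `valid` object corresponds to the 'CTG[30]CAG[50]' case from the original question; `only_literal` breaks both the 2..m limit and the non-literal constraint.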

Example

An example sequence expression for OPMD allele -1 from John Li and Rachel Kutner's researched proposal using this class:

{
  "type": "ComposedSequenceExpression",
  "components": [
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
      "count": { "type": "Number", "value": 11 }
    },
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCA" },
      "count": { "type": "Number", "value": 3 }
    },
    {
      "type": "RepeatedSequenceExpression",
      "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
      "count": { "type": "Number", "value": 1 }
    }
  ]
}
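
To illustrate what the composed expression above stands for, here is a rough sketch that expands it into a literal sequence. It assumes plain dicts parsed from JSON and only handles definite `Number` counts with literal repeat units; the helper name `expand` is hypothetical.

```python
def expand(expr):
    """Expand a sequence expression into its literal sequence string."""
    kind = expr["type"]
    if kind == "LiteralSequenceExpression":
        return expr["sequence"]
    if kind == "RepeatedSequenceExpression":
        count = expr["count"]
        if count["type"] != "Number":
            # Ranges of counts do not resolve to a single sequence.
            raise ValueError("only definite counts can be expanded")
        return expand(expr["seq_expr"]) * count["value"]
    if kind == "ComposedSequenceExpression":
        # Array order is the order of concatenation.
        return "".join(expand(c) for c in expr["components"])
    raise ValueError(f"unsupported expression type: {kind}")


def repeat(unit, n):
    """Build a RepeatedSequenceExpression dict for a literal unit."""
    return {
        "type": "RepeatedSequenceExpression",
        "seq_expr": {"type": "LiteralSequenceExpression", "sequence": unit},
        "count": {"type": "Number", "value": n},
    }


opmd = {
    "type": "ComposedSequenceExpression",
    "components": [repeat("GCG", 11), repeat("GCA", 3), repeat("GCG", 1)],
}

sequence = expand(opmd)
print(sequence, len(sequence))
```

Reordering `components` would yield a different sequence, which is why the list has to be treated as ordered.
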
bheale commented 2 years ago

Works well for communicating/representing the sequence. But could you show me how the composedSequence's relationship to a chromosomal position would be represented in your model?

Would it be in here? https://vrs.ga4gh.org/en/stable/terms_and_model.html?highlight=repeated#repeatedsequenceexpression

Thanks! Bret

bheale commented 2 years ago

What is the artificial (or generally acceptable) boundary between CNV versus tandem repeats? Or could the structure here also cover communication of a CNV where the sequence of each copy of X is known? E.g. a gene duplication event where the duplicates do not share the exact same sequence?

ahwagner commented 2 years ago

Works well for communicating/representing the sequence. But could you show me how the composedSequence's relationship to a chromosomal position would be represented in your model?

@bheale no problem. The above ComposedSequenceExpression is used wherever a SequenceExpression (the parent class) is allowed. In the case of the above OPMD Allele -1, it would be used inside of an Allele to describe the state.

Represented on the transcript sequence NM_004643.3, the full Allele would look like this:

{
  "type": "Allele",
  "location": {
    "type": "SequenceLocation",
    "sequence_id": "ga4gh:SQ.sH4gymNtL5nxNdTE3evfxzZa4dg3fqDz",
    "interval": {
      "type": "SequenceInterval",
      "start": { "type": "Number", "value": 3  },
      "end":   { "type": "Number", "value": 33 }
    }
  },
  "state": {
    "type": "ComposedSequenceExpression",
    "components": [
      {
        "type": "RepeatedSequenceExpression",
        "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
        "count": { "type": "Number", "value": 11 }
      },
      {
        "type": "RepeatedSequenceExpression",
        "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCA" },
        "count": { "type": "Number", "value": 3 }
      },
      {
        "type": "RepeatedSequenceExpression",
        "seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
        "count": { "type": "Number", "value": 1 }
      }
    ]
  }
}
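
As a quick sanity check on the Allele above, the sketch below compares the length of the replaced reference span with the length of the composed state. It assumes interbase coordinates for the interval, literal repeat units that carry their residues in a `sequence` field (as in the first example), and definite counts. The helpers `state_length` and `repeat` are hypothetical.

```python
def state_length(expr):
    """Length in residues of a sequence expression with definite counts."""
    kind = expr["type"]
    if kind == "LiteralSequenceExpression":
        return len(expr["sequence"])
    if kind == "RepeatedSequenceExpression":
        return state_length(expr["seq_expr"]) * expr["count"]["value"]
    if kind == "ComposedSequenceExpression":
        return sum(state_length(c) for c in expr["components"])
    raise ValueError(f"unsupported expression type: {kind}")


def repeat(unit, n):
    """Build a RepeatedSequenceExpression dict for a literal unit."""
    return {
        "type": "RepeatedSequenceExpression",
        "seq_expr": {"type": "LiteralSequenceExpression", "sequence": unit},
        "count": {"type": "Number", "value": n},
    }


allele = {
    "type": "Allele",
    "location": {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:SQ.sH4gymNtL5nxNdTE3evfxzZa4dg3fqDz",
        "interval": {
            "type": "SequenceInterval",
            "start": {"type": "Number", "value": 3},
            "end": {"type": "Number", "value": 33},
        },
    },
    "state": {
        "type": "ComposedSequenceExpression",
        "components": [repeat("GCG", 11), repeat("GCA", 3), repeat("GCG", 1)],
    },
}

interval = allele["location"]["interval"]
ref_len = interval["end"]["value"] - interval["start"]["value"]  # interbase span
alt_len = state_length(allele["state"])
print(ref_len, alt_len, alt_len - ref_len)
```

Under these assumptions the location spans 30 residues and the state describes 45, so the allele is a net gain of 15 residues (five extra triplets).
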
ahwagner commented 2 years ago

@larrybabb: we should consider adding indices to JSON arrays where order is meaningful

@tnavatar: this would make everyone's life easier in RDF land

Andreas & Andy: VRS is verbose, why not just do it?

Bob Freimuth: +1

We should introduce a special variable "index" that allows us to always represent these types of arrays consistently for computed identifier construction. This may actually require a special data class that allows us to explicitly contain other objects; so we can have ordered, identifiable objects.

It should be a red flag if there is no index special class in an array and it allows for non-identifiable objects.

ahwagner commented 2 years ago

We should have documentation to explain when to use this and why.

@larrybabb proposal: we should constrain this to allow only RepeatedSequenceExpressions as components. Unanimous agreement on the 12/13 call to start with this constraint.

ahwagner commented 2 years ago

I think I need to better understand the issue behind the above proposal. It seems to me (and I believe that @reece and @bheale would agree) that since meaningful ordering is definitional behavior for JSON arrays, the introduction of a specialized "ordered container class" is effectively extra weight on what is already explicitly meant by an Array.

reece commented 2 years ago

Yes, I agree with @ahwagner: arrays are already ordered, so a new container that does the same thing makes the new container incompatible with arrays and, at the same time, creates new obligations for implementations to ensure that the same index isn't used more than once.
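
A small sketch of the trade-off, using Python's standard `json` module. The `index`/`value` wrapper shape and the `ordered_values` helper are hypothetical, made up only to show the extra validation an explicit index would require.

```python
import json

# A plain JSON array keeps its element order through a parse/serialize round trip.
doc = '{"components": ["GCG", "GCA", "GCG"]}'
components = json.loads(doc)["components"]
round_tripped = json.loads(json.dumps(components))


def ordered_values(items):
    """Recover order from explicitly indexed items, rejecting bad indices."""
    indices = [item["index"] for item in items]
    # With explicit indices, every consumer must check for duplicates and gaps.
    if sorted(indices) != list(range(len(items))):
        raise ValueError("indices must be unique and contiguous from 0")
    return [item["value"] for item in sorted(items, key=lambda i: i["index"])]


print(round_tripped)
print(ordered_values([{"index": 1, "value": "GCA"}, {"index": 0, "value": "GCG"}]))
```

With a plain array there is nothing to validate; with indexed wrappers, a repeated or missing index becomes a new error case that every implementation has to handle.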

bheale commented 2 years ago

Does GA4GH use XML representations as well as JSON?


ahwagner commented 2 years ago

GA4GH data standards come in a variety of representations; Google Protocol Buffers (protobuf), JSON Schema, and file-based are a few. The VRS Schema is implemented in JSON Schema, and consequently VRS JSON is the authoritative form and basis for the computed identifier algorithm.

The VRS information model has also been implemented as protobuf in the past, but I do not know of an XML implementation out there. However, I believe that even if there were an XML implementation, there would still not be a need for an explicit index element or attribute; XML element order in lists is assumed to be preserved (http://lists.xml.org/archives/xml-dev/201003/msg00045.html).

bheale commented 2 years ago

Thanks! Great link. Love the community. Happy New yearish, Bret


larrybabb commented 2 years ago

@ahwagner While it is great that JSON offers definitional behavior for arrays, we use a form of Canonical JSON tooling to sort our JSON, including the array elements, before digesting. This JSON canonicalization would interfere with the producer's "implicit" ordering expressed by the array and thus risks misrepresenting the intended order. By adding the index (which is arguably very lightweight) we reduce the assumptions and the risk that the ordering is misconstrued.

ahwagner commented 2 years ago

I had the same concern, but as it turns out @reece was ahead of the ball on this and our serialization strategy only sorts if the array contains ids or digests; this is also how it is implemented in vrs-python.
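
A rough sketch of that idea, not the actual vrs-python code: object keys are sorted, but an array is sorted only when every element looks like a GA4GH identifier. The `ID_PATTERN` regular expression and the function names are assumptions made for this illustration.

```python
import json
import re

# Assumed shape of a GA4GH computed identifier: "ga4gh:<prefix>.<32-char digest>".
ID_PATTERN = re.compile(r"^ga4gh:[A-Z]+\.[0-9A-Za-z_\-]{32}$")


def canonicalize(value):
    """Sort identifier-only arrays; leave every other array in its given order."""
    if isinstance(value, dict):
        return {k: canonicalize(v) for k, v in value.items()}
    if isinstance(value, list):
        items = [canonicalize(v) for v in value]
        if items and all(isinstance(v, str) and ID_PATTERN.match(v) for v in items):
            return sorted(items)
        return items
    return value


def serialize(obj):
    """Compact JSON with sorted keys, suitable as input to a digest."""
    return json.dumps(canonicalize(obj), sort_keys=True, separators=(",", ":"))


ids = ["ga4gh:VA." + "b" * 32, "ga4gh:VA." + "a" * 32]
composed = {"components": [{"sequence": "GCG"}, {"sequence": "GCA"}]}

print(serialize({"members": ids}))
print(serialize(composed))
```

Under this scheme the `components` array keeps the producer's order, so canonicalization does not disturb the order of the repeats.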

larrybabb commented 2 years ago

That's great. However, I am still a little uncomfortable with relying on the physical positional order in the message's array being significant, particularly when sharing with external systems. If others are comfortable with the reliance on understanding that arrays in our messages are strictly ordered then I can capitulate.

ahwagner commented 1 year ago

Still needed: documentation. Reopening to address prior to 1.3rc

github-actions[bot] commented 10 months ago

This issue was marked stale due to inactivity.

larrybabb commented 10 months ago

closing this since it was merged and the subsequent documentation task was added to a new ticket.