Consider alternate mechanisms to define specialized qualifiers in Statement profiles

ga4gh / va-spec

An information model for representing variant annotations.

Consider alternate mechanisms to define specialized qualifiers in Statement profiles #134

Open mbrush opened 1 month ago

mbrush commented 1 month ago

The current core-im-source draft from Larry and Alex uses a plural label qualifiers but takes a single object as its type:

I assume this plural name is used because of how you implement statement-specific qualifiers in profiles like the Variant Pathogenicity Statement - where the qualifiers field takes a single "untyped" object within which several specific 'qualifying' properties are defined:

I would like to propose alternatives to the naming and nesting of properties in undefined objects as a way to specify profiled qualifier representations.

Alternative 1 (preferred): Directly specialize the core-im Statement.qualifier property in a given Profile schema to define specific qualifier-type properties. So for Variant Pathogenicity, where we want to specify three possible types of qualifiers (penetrance, moi, gene context), the VariantPathogenicityStatement profile would include the following three properties on the VarPathStatement object, all of which extend a core-im qualifier property:


VariantPathogenicityStatement:
  maturity: draft
  type: object
  inherits: va.core:Statement
  properties:  
    . . . 
    penetranceQualifier:
      extends: qualifier
      type: string
      enum:
        - high
        - low
        - risk allele
      description: >-
        The extent to which the variant impact is expressed by individuals …

    modeOfInheritanceQualifier:
      extends: qualifier
      type: string
      enum:
        - autosomal dominant
        - autosomal recessive
        - X-linked dominant
        - X-linked recessive
        - mitochondrial
      description: >-
        The pattern of inheritance expected for the pathogenic effects …

    geneContextQualifier:
      extends: qualifier
      $refCurie: gks.genes:Gene
      description: >-
          A gene context that qualifies the Statement.

This is to me a clear, simple, and direct implementation of how I have envisioned qualifier specialization working. At its core is the simple pattern where a profile holds several distinct qualifier properties that are direct properties of the Statement, and not nested in some untyped object. But I am not sure whether there is a technical or application-specific reason this won’t work.

Alternative 2 below was not considered / approved on the 5-29 call, so next step is to wait for Alex' team to draft an implementation of their approach to encoding Alternative 1 in a way the metaschema tooling can handle.

Or, might we consider refining the metaschema code to allow for multiple specializations of a single property - so we could directly adopt Alternative 1 above? If it is just a technical impediment that is preventing this useful feature, perhaps we could change it?

Alternative 2: Another approach that is closer to the current way things are specified, but explicitly declares object types, would involve a VariantPathogenicityQualifierSet class in the VarPath profile schema. In this class, we could define the three qualifier properties that the current schema defines within a nested untyped object. This class would extend a generic QualifierSet class we would want to add to the core-im. The existing `VarPathStatement.qualifiers` property would then simply reference this class as its type, e.g.:

In the core-im, we add:

"Statement":
  heritableProperties:
    "qualifierSet":
      "type": "#/$defs/QualifierSet"

"QualifierSet":
  "heritableProperties":
    "qualifiers":
      "type": "array"
      "items": Qualifier

TO DO: finish getting this right

In the VarPathStatement class:

  qualifierSet: 
      $ref: "#/$defs/VariantPathogenicityQualifierSet"

Definition of the VariantPathogenicityQualifierSet class in the varpath profile:

VariantPathogenicityQualifierSet:
  maturity: draft
  type: object
  inherits: va.core:QualifierSet
  description: >-
    Additional, optional properties that qualify a VariantPathogenicity Statement.
  properties:
    penetranceQualifier:
        extends: qualifier
        type: string
        enum:
           - high
           - low
           - risk allele
        description: >-
           The extent to which the variant impact is expressed by individuals …

    modeOfInheritanceQualifier:
        extends: qualifier
        type: string
        enum:
           - autosomal dominant
           - autosomal recessive
           - X-linked dominant
           - X-linked recessive
           - mitochondrial
        description: >-
           The pattern of inheritance expected for the pathogenic effects …

    geneContextQualifier:
        extends: qualifier
        $refCurie: gks.genes:Gene
        description: >-
           A gene context that qualifies the Statement.

While this is not as simple and clean as the first alternative, it is at least IMO a clearer and more explicit way to implement the current approach, in that it avoids what I find to be a confusing use of nested properties and untyped objects. However, it does require defining and profiling an additional core-im class (QualifierSet) in order to define a VarPathStatement profile. But again, if we go with Alternative 1 above, which I prefer, we avoid all of this.

larrybabb commented 1 month ago

I'm in favor of moving forward with Alternative 2 rather than spending the effort and time needed for Alternative 1. Alternative 2 resolves the main concerns: being able to tag which qualifiers are required vs optional, and being able to explicitly define the types and subtypes of qualifiers.

mbrush commented 1 month ago

Looking closer at the Alt 2 proposal - I actually think we run into the same problem here as for Alternative 1 - as we would need to extend a QualifierSet.qualifier core IM property three different times to define the three VarPath qualifiers needed in that profile.

We can see this clearly in the example above - where penetranceQualifier, modeOfInheritanceQualifier, and geneContextQualifier all extend the same QualifierSet.qualifiers property. (Recall, profiling does not allow for defining arbitrary new named properties in a profiled class - all named properties in a profile need to be inherited, or extend a core-im property.)

Given this, perhaps we wait for Alex to draft his proposal for implementing Alternative 1 and see what this looks like. He liked the spirit of that alternative, and said that it could be done using existing metaschema functionality.

mbrush commented 1 month ago

I know I am jumping the gun here given that we haven't seen Alex's proposal for implementing Alternative 1 - but is it crazy to consider extending the metaschema functionality to allow for what we need to implement Alternative 1 directly? i.e. the ability to specialize one core-im property into three different profile properties? Just like a class or a property in an ontology can have multiple subtypes? This would directly support implementing Alternative 1 as above, and seems like generally useful functionality for metaschema tooling to support.

I am not really qualified to propose such things, but naively it seems like this would just require a new keyword to use instead of extends that would support this functionality - maybe call it multiplies? The metaschema code could be updated to know that when it sees this multiplies keyword, it will be deriving multiple new properties in a profile that all inherit the characteristics of their parent core-im property, but overwrite these with any new constraints defined in the 'multiplied' property.

A profile schema based on this approach would be very clean and clear and easy for developers to create. For defining the three qualifiers in the VariantPathogenicityStatement, it would look something like this:

VariantPathogenicityStatement:
  maturity: draft
  type: object
  inherits: va.core:Statement
  properties:  
    subject: ...
    predicate: ...
    object: ...
    . . . 
    penetranceQualifier:
      multiplies: qualifier
      type: string
      enum:
        - high
        - low
        - risk allele
      description: >-
        The extent to which the variant impact is expressed by individuals …

    modeOfInheritanceQualifier:
      multiplies: qualifier
      type: string
      enum:
        - autosomal dominant
        - autosomal recessive
        - X-linked dominant
        - X-linked recessive
        - mitochondrial
      description: >-
        The pattern of inheritance expected for the pathogenic effects …

    geneContextQualifier:
      multiplies: qualifier
      $refCurie: gks.genes:Gene
      description: >-
          A gene context that qualifies the Statement.
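The derivation described above (each 'multiplied' property inherits the characteristics of its parent core-im property, then overwrites them with its own constraints) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the metaschema processor's actual code: the function name `expand_multiplies`, the `derivedFrom` bookkeeping key, and the contents of the core `qualifier` definition are all hypothetical.

```python
from copy import deepcopy

# Hypothetical stand-in for the core-im Statement properties; the real
# core-im definition of `qualifier` is not shown in this thread.
CORE_STATEMENT_PROPERTIES = {
    "qualifier": {
        "description": "A property that qualifies the Statement.",
        "maturity": "draft",
    },
}


def expand_multiplies(core_properties, profile_properties):
    """Derive profile properties from the core properties they 'multiply'.

    Each profile property carrying a `multiplies` key starts as a copy of
    the named core property, then has its own keys laid over the top.
    Properties without `multiplies` are copied through unchanged.
    """
    derived = {}
    for name, spec in profile_properties.items():
        parent_name = spec.get("multiplies")
        if parent_name is None:
            derived[name] = deepcopy(spec)
            continue
        if parent_name not in core_properties:
            raise KeyError(f"{name} multiplies unknown core property {parent_name!r}")
        merged = deepcopy(core_properties[parent_name])
        # Constraints declared in the profile override the inherited ones.
        for key, value in spec.items():
            if key != "multiplies":
                merged[key] = deepcopy(value)
        # Hypothetical bookkeeping key that keeps the link to the parent visible.
        merged["derivedFrom"] = parent_name
        derived[name] = merged
    return derived


profile = {
    "penetranceQualifier": {
        "multiplies": "qualifier",
        "type": "string",
        "enum": ["high", "low", "risk allele"],
    },
    "geneContextQualifier": {
        "multiplies": "qualifier",
        "$refCurie": "gks.genes:Gene",
        "description": "A gene context that qualifies the Statement.",
    },
}

result = expand_multiplies(CORE_STATEMENT_PROPERTIES, profile)
```

In this sketch several profile properties can name the same parent without colliding, because the result is keyed by the new property name rather than by the parent's name.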

Of course, this 'multiplies' functionality could also be used to implement Alternative 2 if we like that approach better. As noted above, this alternative would also need to specialize one core-im property into three profile properties.

mbrush commented 3 weeks ago

UPDATES:

5-29-24 call:

Alex and Javi approved of the simplicity/directness of Alternative 1 - including the idea that all qualifier properties would be captured at the same level as the core s-p-o properties that together express the Statement's core proposition.
- But Alex clarified that the metaschema 'extends' functionality cannot implement Alternative 1 exactly as shown above - as it cannot 'override' a single core-im property (e.g. qualifier) multiple times to create several derived/specialized qualifier properties in a Profile (e.g. penetranceQualifier, geneContextQualifier, etc).
- However, he did indicate that they could implement the spirit of this proposal in a different way. His team will explore / demonstrate how this could work.
Matt pointed out that another consideration that favors Alternative 1 is that there are some qualifiers that are required and critical components of a given statement type (e.g. the disease qualifier for a therapeutic-response statement), and others that are not required but can optionally provide additional detail or context to a statement (e.g. the alleleOrigin qualifier for a therapeutic-response statement).
- Alex would like these essential qualifiers to be emphasized by being shown at the same level as the core s-p-o properties (which is achieved by Alternative 1), and not nested down a level in a Qualifiers object (which would happen in Alternative 2).
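To make the difference in nesting concrete, here is a rough sketch of how the same statement data might be shaped under each alternative. All values are placeholders invented for illustration, and the helper `top_level_qualifiers` is hypothetical; only the property names come from the schema examples in this thread.

```python
# Alternative 1: qualifiers sit beside the core s-p-o properties.
flat_statement = {
    "subject": "<variant>",      # placeholder value
    "predicate": "<predicate>",  # placeholder value
    "object": "<condition>",     # placeholder value
    "penetranceQualifier": "low",
    "modeOfInheritanceQualifier": "autosomal dominant",
}

# Alternative 2: qualifiers are nested one level down in a qualifier set.
nested_statement = {
    "subject": "<variant>",
    "predicate": "<predicate>",
    "object": "<condition>",
    "qualifierSet": {
        "penetranceQualifier": "low",
        "modeOfInheritanceQualifier": "autosomal dominant",
    },
}


def top_level_qualifiers(statement):
    """Return the names of top-level properties that end in 'Qualifier'."""
    return sorted(key for key in statement if key.endswith("Qualifier"))
```

Under the flat shape a reader or tool sees the qualifiers directly on the statement; under the nested shape they are only visible after descending into `qualifierSet`.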

6-12-24 call:

Larry seemed to support Alternative 1 as the best long term solution, but pushed for short term implementation of Alternative 2 as a concrete improvement that would require minimal change/effort - given that Alternative 1 may require some developer time.
Matt still favors Alternative 1, but agrees Alternative 2 is a simple short term improvement if that is all that we can do right now.
- He reiterated the fact that defining qualifier specializations in yaml files is something that community modelers will need to be able to understand and do for themselves. So it needs to remain clear and simple to understand and implement.
- This conceptual clarity and ease of implementation was one of the main drivers for Alternative 1 proposed above. As was the clarity of the connection between the profiled qualifiers and the core-im qualifier property.
- Whatever solution Alex's team comes up with to implement this approach should be clear and simple to implement, and not obscure the connection between the core-im and profiled qualifiers.

larrybabb commented 3 weeks ago

I have a new proposal.

Why do we even need to put the qualifiers attribute in the Statement yaml at all? I'm not saying that it isn't important, but the reality is that if there isn't a qualifier needed on a given Statement profile then we wouldn't even want the attribute to begin with. And, when there are qualifier(s) needed they are specialized to be their own properties. Thus, trying to define some generic placeholder seems like a fool's errand as it has no value. The real value is finding a way to convey to Profilers and Abstractionists that the concept is valuable and important to the SPOQ design and while the SPO elements are fundamental the Q elements may or may not be given a specific profile.

Can we remove qualifiers as a formal attribute from Statement and find a way to show a Concept Attribute called Qualifier in its place? This would allow us to be super transparent and clear that the concept is fundamental yet too abstract to define until Profiling takes place.

This would allow us to avoid future-proofing abstract classes and instead direct folks to the standard way to qualify statements.

mbrush commented 3 weeks ago

Larry - can you explain further what you mean by “find a way to show a Concept Attribute called Qualifier in its place”?

larrybabb commented 3 weeks ago

After more discussion we are going to make changes to the metaschema processor to support approach 1.

ahwagner commented 3 weeks ago

@larrybabb and @mbrush is there a recording that documents why we will be investing the effort to make this change in the near-term?

mbrush commented 3 weeks ago

@ahwagner I don't think the discussion was recorded, but I will summarize here. To be clear, the solution that Larry and I decided we prefer was implementing Alternative 1 by extending the metaschema code with a new keyword and functionality that allows for specializing a core-im property into multiple sub-properties when profiling. Details and benefits are described in the comment above. If I recall, you liked the spirit of this proposal, and your main objection was that the current metaschema processor doesn't have a function to support specializing one core property into several sub-properties in a profile. Larry and I say let's just add this (seemingly straightforward) functionality, rather than try to define workarounds.

Our rationale:

- We generally agree that this new functionality would make qualifier specialization clean, clear, and easy for profile developers to define in the yaml (see yaml example above). If we can add it without too much effort, let's just do it now rather than spend time defining and implementing a short term fix that is not ideal.
- We noted that the utility of this functionality is not limited to Statement qualifiers, as there are analogous use cases for wanting to specialize a single core-im property into many sub-properties (e.g. the need to multiply the core-im `StudyResult.dataItems` property into several data type specific properties in the caf profile). This too becomes very concise, clear, and easy to define in the yaml with a 1:many specialization capability.
- One of the challenges to selling our approach will be ensuring profile developers and users can understand and easily create yaml profile definitions - so anything we can do to make metaschema-based specification of these more straightforward is a win.

Of course this is all dependent on your approval and willingness to devote developer time to making this enhancement. Larry estimated adding code to handle 1:m specialization would be about a day's work, but obviously you would know better here.

ahwagner commented 3 weeks ago

Sorry about the delay in a response here, I'm really bogged down in grant submissions and travel at the moment.

FWIW, the limitation here isn't technical; it would be easy enough to implement. The limitation is about breaking conventions. Extends has roughly meant "replaces the parent property with an extended version". It is not specific to VA-spec or the VA-spec core IM, it is a generic operation that works across products. This operation is different than "inherits from an abstract property", which is the behavior being described in this thread. I'm opposed to changing the definition and behavior of extends (or any keyword) to suit a specific use case, and not sufficiently motivated by the argument that it is easier for implementers to reuse extends in the same way as done for the subject and object properties of some core IM classes.

What I was going to propose was simply creating a Qualifier class and using the JSON Schema allOf keyword, which is used to address this specific type of situation, e.g.:

VariantPathogenicityStatement:
  maturity: draft
  type: object
  inherits: va.core:Statement
  properties:
    penetrance:
      description: >-
        The extent to which the variant impact is expressed by individuals …
      allOf:
        - $refCurie: va.core:Qualifier
        - type: string
          enum:
            - high
            - low
            - risk allele
    . . .

An alternative approach is to add functionality for another keyword (@mbrush suggested multiplies), and I think inherits (at the property level) would make sense here. But my preference would be to use the JSON Schema allOf approach first, since it would require no further development work and leverage standard patterns in JSON Schema.
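For readers less familiar with allOf: in JSON Schema an instance is valid against allOf only if it is valid against every listed subschema. The sketch below hand-rolls that rule for a tiny subset of keywords (type: string, enum, allOf) so it runs without any library. It makes two assumptions that are not from this thread: the $refCurie is treated as already resolved and inlined, and the Qualifier class is stood in for by an empty schema, which in JSON Schema accepts any value. What the real Qualifier class would contain has not been specified here.

```python
def satisfies(instance, schema):
    """Check an instance against a tiny subset of JSON Schema.

    Supports only `allOf`, `type: string`, and `enum`; every other
    keyword is ignored. Illustrative only, not a real validator.
    """
    # allOf: the instance must satisfy every subschema in the list.
    if "allOf" in schema:
        if not all(satisfies(instance, sub) for sub in schema["allOf"]):
            return False
    if schema.get("type") == "string" and not isinstance(instance, str):
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    return True


# Assumed placeholder for va.core:Qualifier (an empty schema accepts anything).
QUALIFIER = {}

# The penetrance property from the example above, with the reference inlined.
PENETRANCE = {
    "allOf": [
        QUALIFIER,
        {"type": "string", "enum": ["high", "low", "risk allele"]},
    ],
}
```

With these definitions `satisfies("high", PENETRANCE)` holds, while a value outside the enum or a non-string value is rejected by the second subschema.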

mbrush commented 2 weeks ago

@ahwagner - no worries, I know you are busy! To be clear, we are not proposing to change the definition or behavior of the extends keyword - per the first part of your response. 'Extends' stays as is, and is used for 1:1 property extension/replacement.

Our proposal is what you reference briefly at the very end of your response - to create a new keyword that is used specifically when a profiler wants to do 1:m extension/replacement of a property. For example, specialize the Statement.qualifiers property into modeOfInheritance and geneContext. Or specialize the StudyResult.dataItem property into focusAlleleCount and locusAlleleCount. Glad to hear that this approach makes good sense to you, and would be technically simple to implement!

I get the rationale behind your proposal as well - but was hoping you could flesh it out a bit by illustrating what this 'Qualifier' class-based proposal looks like in both the core-im yaml as well as a derived profile? I tried working this up myself but wasn't sure what you had in mind.

I also wanted to note one potential issue with your proposal concerning its creation of a new de novo property in the VarPath profile (penetrance) that does not 'extend' an existing core-im property. This violates our established Profiling rules - which require any 'new' property added to a profile to specialize/extend an existing core-im property (if it does not, it needs to be created using the Extension mechanism). Of course, in your example you could declare penetrance to 'extend' the core-im Statement.qualifier property - but there are other qualifier properties in this profile that would also need to extend qualifier - which as you say is not allowed. This is of course where the 'multiplies' keyword would help.

Finally, I wanted to make sure it was clear that the qualifier example is not the only use case for wanting to perform 1:m property specialization/extension. For example, I think it would also come into play to support specific data type properties created in StudyResult profiles (e.g. focusAlleleCount and locusAlleleCount in the CAF profile). So there is more general utility to adopting a 'multiplies' like functionality.

ahwagner commented 2 weeks ago

Our proposal is what you reference briefly at the very end of your response - to create a new keyword that is used specifically when a profiler wants to do 1:m extension/replacement of a property.

Hey @mbrush; just to clarify, this is not what I meant. Yes, a new keyword is possible (though again, I would prefer to use existing JSON Schema conventions); and no, this is not a proposal for a 1:m replacement of a property. To my knowledge, 1:m property replacement is not a pattern that is used in JSON Schema, pydantic, Active Model, or any other framework we use for modeling data in VICC resources. It might help me understand the importance of this pattern to see it applied in other data modeling languages. I am not aware of a 1:m property/slot replacement mechanism in LinkML, either; from what I understand (having spent very little time with this particular language), what I am proposing is most similar to the LinkML slot_uri property.