ga4gh / va-spec

An information model for representing variant annotations.
15 stars 2 forks source link

Technical approaches to ‘profiling’ of supporting classes (e.g. DataSet, Cohort/StudyGroup in the caf profile) #142

Open mbrush opened 1 month ago

mbrush commented 1 month ago

The Problem: (illustrated using the caf profile)

At present the caf profile does not directly use core-im DataSet or StudyGroup classes to capture data in the derivedFrom and cohort properties, respectively.

I realize that the DataSet class and Cohort/StudyGroup class were not part of the initial core-im that Alex and Larry created - which may explain the approach above? But now that the these classes are in the core-im that the caf profile imports, can we consider the best way to use them explicitly in this caf profile?


Proposed Solutions:

An assumption underlying the proposed solutions is that the ultimate goal here is to specify what subset of properties from the core-im DataSet and Cohort/StudyGroup classes are allowed use in the caf profile, and define constraints on how are they to be populated in this profile. Another assumption behind these proposals is that implementations do not want to have to pull in ALL attributes on the core-im classes they use - i.e. those declared directly on them in the core im, or inherited from ancestors in the core-im. The proposals below both address this concern.

Approach 1: use a new 'overwrites' functionality

The only difference between the current approach is this proposal is that it explicitly defines a DataSet class in the schema to hold properties/data captured by the derivedFrom property - rather than implying one through the definition of nested properties under the derivedFrom attribute, in an untyped/anonymous json object.

How it works:

Below is an example of how the caf-source yaml might look for this approach::

  CohortAlleleFrequencyStudyResult:                         
    maturity: draft
    type: object
    inherits: va.core:StudyResult                
    description: A StudyResult that reports measures related to the frequency of an Allele in a cohort
    properties:
      derivedFrom:
        $ref: "#/$defs/DataSet"                # Reference to the local definition of a caf-specific DataSet class below
        description: The dataset from which the CohortAlleleFrequencyStudyResult was reported.
        additionalProperties: false

   . . . 

  DataSet:                                          # definition of a local, caf-specific DataSet class that would overwrite the one in the core-im
    maturity: draft
    type: object
    overwrites: va.core:DataSet          # this would require a new 'overwrites' metaschema keyword and functionality 
    description: >-
      A collection of related data items or records that are organized together in a common format
      or structure, to enable their computational manipulation as a unit.
    properties:                                   # includes (and refines) only properties from the core-im DataSet class that are to be used in this caf-profile
      id:
        type: string
        description: ...
      type:
        type: string
        description: ...
      label:
        type: string
        description: ...
      version:
        type: string
        description: ...
    additionalProperties: false

Note that i think the extends keyword that is used on profiled properties performs this overwriting function for properties. The idea here is to have a keyword that similarly overwrites class definitions from the core-im - but in a way that follows VA/SEPIO profiling rules (e.g. all properties on these classes must come from its core-im 'parent', or extend a property on this 'parent').

Pros:

Cons:

Approach 2 below results in the same final outputs, but implements a solution further upstream by controlling what content gets imported into a profile schema in the first place. . .


Approach 2: core-im 'slim' imports:

This would import into the caf profile a core-im ‘slim’, defined as part of the profiling process, that would include only the subset of core classes and properties that will be directly used/specialized in the caf profile.

How it works:

image


Pros:

Cons:


IMO we always knew there would be tooling required to help implement the profiling approach in a way that preserves a single source of truth and reduces duplicative maintenance. I think the metaschema tooling is where it make sense to implement this functionality for now (e.g. with functions like 'overwrite', or 'create slim'). But longer term this is they type of think that the LinkML framework is set up to handle in a more robust and standard way.

I know we have many other priorities besides metaschema development/extension now - but at least speccing out how we want this to work in the future will help us manually craft profiles in a way that is consistent with how we want tooling to do it for us in the future.

larrybabb commented 3 weeks ago

@mbrush this is a very thoughtful analysis, thank you for laying it out so well.

I like much of what you are suggesting, but my engineering instinct is telling me that this is a bit premature to address yet. That said, I'm not trying to dissuade anyone from working on this or evolving the discussion and ultimate solution.

The maturity model process we have is meant to tag the classes and attributes that the early innovators and adopters are trying to apply. It is this very early and simple process whereby items will be tagged for "Trial Use". Anything that is not "Trial Use" or (eventually) "Normative" is essentially academic IMO. I do understand that we need these early implementers to identify the "Draft" artifacts that are beginning to be used so those are crucial as well.

We should first try to accomplish what's needed now with the maturity model maturity tag and see how far we can get before we add-in the tooling and enhancements that I believe we will ultimately need. Again, I'm thinking with my pragmatic engineering management hat on and looking around at the level of development resources we have to get things off the ground and this is something that will definitely need to be added once we prove to ourselves that using maturity tags is woefully insufficient.

larrybabb commented 1 week ago

@mbrush I have refactored the caf profile and applied most of your suggestions to the new CohortAlleleFrequencyStudyResult profile in the va-spec/profiles folder. Please review. I am not a fan of opening up the metaschema processor to applying the ideas above at this time. So I was able to find solid solutions that got me to the same ends (I believe). In any case, please review and consider archiving this issue for re-visiting later.

If we want to continue pursuing the idea of changes to the metaschema process to support these use cases, then I would suggest transferring this to the gks-metaschema repo as a discussion or issue there and then referencing this (archived) issue.

mbrush commented 1 week ago

I noted that in your updated CAF Profile, you have properties declared to inherit from core-im classes.
This doesn't seem right to me - I thought that classes inherit from classes, and properties extend properties.

image

Are you trying to specify that these properties take objects of a type that inherits from DataItem? If so, I think the yaml needs to be adjusted. Maybe we can take a pass at this in our next meeting.