Technical approaches to ‘profiling’ of supporting classes (e.g. DataSet, Cohort/StudyGroup in the caf profile)

mbrush commented 5 months ago

The Problem: (illustrated using the caf profile)

At present the caf profile does not directly use core-im DataSet or StudyGroup classes to capture data in the derivedFrom and cohort properties, respectively.

- e.g. the CohortAlleleFrequency.derivedFrom property does not reference/use a DataSet to capture the dataset description it holds. But such a class is implied by the nesting of properties defined under the generic object taken by the derivedFrom property (which are consistent with the properties defined for the core-im DataSet class).
- Similarly, the CohortAlleleFrequency.cohort property does not reference/use a Cohort/StudyGroup to capture the data it holds. But such a class is implied by the properties defined under the object taken by the 'cohort' property – which are consistent with the core-im model of a Cohort/StudyGroup.

I realize that the DataSet class and Cohort/StudyGroup class were not part of the initial core-im that Alex and Larry created - which may explain the approach above? But now that these classes are in the core-im that the caf profile imports, can we consider the best way to use them explicitly in this caf profile?

Proposed Solutions:

An assumption underlying the proposed solutions is that the ultimate goal here is to specify what subset of properties from the core-im DataSet and Cohort/StudyGroup classes are allowed for use in the caf profile, and to define constraints on how they are to be populated in this profile. Another assumption behind these proposals is that implementations do not want to have to pull in ALL attributes on the core-im classes they use - i.e. those declared directly on them in the core-im, or inherited from ancestors in the core-im. The proposals below both address this concern.

Approach 1: use a new 'overwrites' functionality

The only difference between the current approach and this proposal is that the proposal explicitly defines a DataSet class in the schema to hold properties/data captured by the derivedFrom property - rather than implying one through the definition of nested properties under the derivedFrom attribute, in an untyped/anonymous json object.

How it works:

To implement this, we would define a caf-specific DataSet class in the caf-source.yaml schema doc itself, where the subset of core-im DataSet properties to be used in the profile are defined, and any profile-specific constraints are added (e.g. cardinality, data types).
This definition would 'overwrite' the imported core-im DataSet class in the context of the caf profile - through the use of a new metaschema keyword ‘overwrites’ (instead of ‘inherits’).
Metaschema code is updated to implement the 'overwrites' functionality when downstream artifacts are generated - i.e. ignore the definition of the core-im class and the properties it defines or inherits, and instead use the class definition provided here in the profile. This means that the json schema, for example, would only include the properties defined in the profile, and not those in the upstream core-im.

Below is an example of how the caf-source yaml might look for this approach::

  CohortAlleleFrequencyStudyResult:                         
    maturity: draft
    type: object
    inherits: va.core:StudyResult                
    description: A StudyResult that reports measures related to the frequency of an Allele in a cohort
    properties:
      derivedFrom:
        $ref: "#/$defs/DataSet"                # Reference to the local definition of a caf-specific DataSet class below
        description: The dataset from which the CohortAlleleFrequencyStudyResult was reported.
        additionalProperties: false

   . . . 

  # Local, caf-specific DataSet class that would overwrite the one in the core-im.
  DataSet:
    maturity: draft
    type: object
    # Requires a new 'overwrites' metaschema keyword and functionality.
    overwrites: va.core:DataSet
    description: >-
      A collection of related data items or records that are organized together in a common format
      or structure, to enable their computational manipulation as a unit.
    # Includes (and refines) only the core-im DataSet properties that are to be used in this caf profile.
    properties:
      id:
        type: string
        description: ...
      type:
        type: string
        description: ...
      label:
        type: string
        description: ...
      version:
        type: string
        description: ...
    additionalProperties: false

Note that I think the extends keyword that is used on profiled properties performs this overwriting function for properties. The idea here is to have a keyword that similarly overwrites class definitions from the core-im - but in a way that follows VA/SEPIO profiling rules (e.g. all properties on these classes must come from their core-im 'parent', or extend a property on this 'parent').

Pros:

This one approach handles both the subsetting issue and the unwanted property inheritance issue.

Cons:

The downside I see here is that we are pulling in a bunch of stuff, and then later trimming content out before generating the actual json schema, or web docs.

Approach 2 below results in the same final outputs, but implements a solution further upstream by controlling what content gets imported into a profile schema in the first place. . .

Approach 2: core-im 'slim' imports

This would import into the caf profile a core-im ‘slim’, defined as part of the profiling process, that would include only the subset of core classes and properties that will be directly used/specialized in the caf profile.

How it works:

We could provide a technical solution to allow profile creators to tag elements of the core-im that they want to use in their profile, and tooling to read these tags and automatically generate such a slim/subset of the core-im file.
It is this 'slim' that would be imported into the caf profile schema (rather than importing the full core-im that includes a bunch of stuff they don't need).
There are precedents for this that use different slimming/subsetting approaches:
- Ontodog-like ‘slim’ approach - derive a s/s and put an x next to elements to be extracted into the slim
- BiolinkModel ‘subset’ approach: create an in_subset property and use this to tag elements in the core-im-source file with the name of specific profile(s) they are a part of. I prefer this because it keeps everything in one source of truth file, and it advertises the use of each element for all to see. An example of what this might look like:
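
A minimal sketch of how such tagging could look in the core-im source file. Everything here is an assumption for illustration: `in_subset` is not an existing metaschema keyword (it is modeled loosely on the Biolink Model convention), the subset name `caf` is a placeholder, and `releaseDate` stands in for any property a profile does not want.

```yaml
# Hypothetical excerpt of the core-im source file.
# 'in_subset' is a proposed tag, not a keyword the metaschema supports today.
DataSet:
  maturity: draft
  type: object
  in_subset: [caf]          # class is to be included in the 'caf' slim
  description: ...
  properties:
    id:
      type: string
      in_subset: [caf]      # property is to be included in the 'caf' slim
    label:
      type: string
      in_subset: [caf]
    version:
      type: string
      in_subset: [caf]
    releaseDate:            # placeholder property with no tag: left out of every slim
      type: string
```

Tagging at both the class and the property level keeps the selection visible in the single source file, which is the advantage argued for above.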

Tooling would generate core-im slims for each named subset tag, that contain only elements annotated in the source core-im to belong to that subset. It is these profile-specific core-im slims that would be imported into a profile, instead of the full core-im-source file.
As an aside, subset tag names would have to be standardized of course – perhaps using a profile registry that captures a formal name/abbreviation for each profile being created - to be used in places like the name of the schema file, and the name of the subset tag (see ticket #140).
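
As a rough illustration, assuming a hypothetical subset tag named `caf` applied to the core-im DataSet class and to its id, label, and version properties, the generated slim might contain only those elements. The file name in the comment and the exact property list are assumptions, not output of any existing tool.

```yaml
# Hypothetical generated slim, e.g. a file named after the subset tag.
# Only elements tagged for 'caf' in the core-im source survive; the tags
# themselves are dropped because they are no longer needed downstream.
DataSet:
  maturity: draft
  type: object
  description: ...
  properties:
    id:
      type: string
    label:
      type: string
    version:
      type: string
```

A profile would then import this file in place of the full core-im source, so nothing has to be trimmed out again when the json schema or web docs are generated.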

Pros:

It seems like the tooling support it requires for generating the slim yaml files could be built into the metaschema framework.
I think it is a clearer approach in the long run - resulting in profile schemas that are cleaner to generate, and easier to understand.
It also has the benefit of forcing development to proceed explicitly through the core-im - ensuring developers understand the core model, and use it appropriately and comprehensively in creating their profiles.
This goes with the theme I am pushing of tighter, explicit, consistent use of / reference to the core-im in the profile creation process - which I think is critical to reap max benefit of the VA-SEPIO profiling approach from the perspective of unified understanding and accurate use of variant knowledge of diverse types across all profiles.

Cons:

A drawback of this approach is that it does not deal with the issue of inheritance of unneeded properties from ancestors in the core-im. If we want this functionality, we would need to come up with a way to support it.
- e.g. perhaps through some code that lets profile creators 'demote' properties defined on abstract classes in the core-im to selected concrete subtypes when they are defining their slim. e.g. 'demote' the property specified_by from Information Entity down to Statement in the core-im slim, because I don't want this property to show up on other Information Entity subclasses in my profile, such as Method, or Document.
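
A sketch of what 'demoting' could produce, shown as two YAML documents: the relevant part of the full core-im, then a hypothetical slim. The class hierarchy is simplified (each class is shown inheriting directly from InformationEntity), and no demotion mechanism exists in the metaschema today; this only illustrates the intended result.

```yaml
# Document 1: full core-im (simplified). specified_by sits on the abstract
# parent, so every subclass inherits it.
InformationEntity:
  type: object
  properties:
    specified_by:
      description: ...
Statement:
  inherits: InformationEntity
Method:
  inherits: InformationEntity
---
# Document 2: hypothetical slim after demoting specified_by to Statement.
# Method (and Document, if included) would no longer carry the property.
InformationEntity:
  type: object
  properties: {}
Statement:
  inherits: InformationEntity
  properties:
    specified_by:
      description: ...
Method:
  inherits: InformationEntity
```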

IMO we always knew there would be tooling required to help implement the profiling approach in a way that preserves a single source of truth and reduces duplicative maintenance. I think the metaschema tooling is where it makes sense to implement this functionality for now (e.g. with functions like 'overwrite', or 'create slim'). But longer term this is the type of thing that the LinkML framework is set up to handle in a more robust and standard way.

I know we have many other priorities besides metaschema development/extension now - but at least speccing out how we want this to work in the future will help us manually craft profiles in a way that is consistent with how we want tooling to do it for us in the future.

larrybabb commented 4 months ago

@mbrush this is a very thoughtful analysis, thank you for laying it out so well.

I like much of what you are suggesting, but my engineering instinct is telling me that this is a bit premature to address yet. That said, I'm not trying to dissuade anyone from working on this or evolving the discussion and ultimate solution.

The maturity model process we have is meant to tag the classes and attributes that the early innovators and adopters are trying to apply. It is this very early and simple process whereby items will be tagged for "Trial Use". Anything that is not "Trial Use" or (eventually) "Normative" is essentially academic IMO. I do understand that we need these early implementers to identify the "Draft" artifacts that are beginning to be used so those are crucial as well.

NOTE: maybe we need to distinguish between true "Draft" (discussed in an academic setting but no implementation planned) vs "Draft" (implementation beginning or planned)?

We should first try to accomplish what's needed now with the maturity model's maturity tag and see how far we can get before we add in the tooling and enhancements that I believe we will ultimately need. Again, I'm thinking with my pragmatic engineering management hat on and looking around at the level of development resources we have to get things off the ground. This is something that will definitely need to be added once we prove to ourselves that using maturity tags is woefully insufficient.

larrybabb commented 4 months ago

@mbrush I have refactored the caf profile and applied most of your suggestions to the new CohortAlleleFrequencyStudyResult profile in the va-spec/profiles folder. Please review. I am not a fan of opening up the metaschema processor to applying the ideas above at this time. So I was able to find solid solutions that got me to the same ends (I believe). In any case, please review and consider archiving this issue for re-visiting later.

If we want to continue pursuing the idea of changes to the metaschema process to support these use cases, then I would suggest transferring this to the gks-metaschema repo as a discussion or issue there and then referencing this (archived) issue.

mbrush commented 4 months ago

I noted that in your updated CAF Profile, you have properties declared to inherit from core-im classes.
This doesn't seem right to me - I thought that classes inherit from classes, and properties extend properties.

Are you trying to specify that these properties take objects of a type that inherits from DataItem? If so, I think the yaml needs to be adjusted. Maybe we can take a pass at this in our next meeting.
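
One possible reading of that adjustment, as a sketch only: inheritance is stated on classes, while a property either extends a parent property or points at the class of object it takes. The `focusAllele` and `focus` names, the `Allele` reference, and the choice of DataItem as the parent of DataSet are illustrative assumptions, not the contents of the refactored profile.

```yaml
CohortAlleleFrequencyStudyResult:
  type: object
  inherits: va.core:StudyResult      # class-level: a class inherits from a class
  properties:
    focusAllele:                     # hypothetical profiled property
      extends: focus                 # property-level: a property extends a parent property
      $ref: "#/$defs/Allele"         # hypothetical target class
      description: ...
    derivedFrom:
      $ref: "#/$defs/DataSet"        # the property takes objects of this class
      description: ...

DataSet:
  type: object
  inherits: va.core:DataItem         # assumed parent; inheritance lives on the class, not on derivedFrom
  description: ...
```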

ga4gh / va-spec