mbrush opened 5 months ago
@mbrush this is a very thoughtful analysis, thank you for laying it out so well.
I like much of what you are suggesting, but my engineering instinct is telling me that this is a bit premature to address yet. That said, I'm not trying to dissuade anyone from working on this or evolving the discussion and ultimate solution.
The maturity model process we have is meant to tag the classes and attributes that the early innovators and adopters are trying to apply. It is this very early and simple process whereby items will be tagged for "Trial Use". Anything that is not "Trial Use" or (eventually) "Normative" is essentially academic, IMO. I do understand that we need these early implementers to identify the "Draft" artifacts that are beginning to be used, so those are crucial as well.
We should first try to accomplish what's needed now with the maturity model's `maturity` tag and see how far we can get before we add in the tooling and enhancements that I believe we will ultimately need. Again, I'm thinking with my pragmatic engineering-management hat on, looking at the level of development resources we have to get things off the ground; this is something that will definitely need to be added once we prove to ourselves that using `maturity` tags is woefully insufficient.
@mbrush I have refactored the `caf` profile and applied most of your suggestions to the new `CohortAlleleFrequencyStudyResult` profile in the va-spec/profiles folder. I am not a fan of opening up the metaschema processor to the ideas above at this time, so instead I found solid solutions that got me to the same ends (I believe). In any case, please review and consider archiving this issue for re-visiting later.
If we want to continue pursuing the idea of changes to the metaschema process to support these use cases, then I would suggest transferring this to the gks-metaschema repo as a discussion or issue there and then referencing this (archived) issue.
I noted that in your updated CAF Profile, you have properties declared to `inherit` from core-im classes. This doesn't seem right to me - I thought that classes `inherit` from classes, and properties `extend` properties. Are you trying to specify that these properties take objects of a type that inherits from `DataItem`? If so, I think the yaml needs to be adjusted - e.g. I'd expect something like the sketch below. Maybe we can take a pass at this in our next meeting.
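A rough sketch of what I mean (the property names and ref path here are just illustrative, not the actual caf yaml):

```yaml
# classes inherit from core-im classes...
CohortAlleleFrequencyStudyResult:
  inherits: StudyResult
  properties:
    # ...while properties extend properties defined on the parent class,
    # and can then constrain their range to a core-im class like DataItem
    scoreData:                                # hypothetical property name
      extends: score                          # hypothetical parent property
      $ref: "core-im.yaml#/$defs/DataItem"    # hypothetical ref path
```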
The Problem: (illustrated using the caf profile)
At present the caf profile does not directly use the core-im `DataSet` or `StudyGroup` classes to capture data in the `derivedFrom` and `cohort` properties, respectively. e.g.:

- The `CohortAlleleFrequency.derivedFrom` property does not reference/use a `DataSet` to capture the dataset description it holds. But such a class is implied by the nesting of properties defined under the generic object taken by the `derivedFrom` property, which are consistent with the properties defined for the core-im `DataSet` class.
- The `CohortAlleleFrequency.cohort` property does not reference/use a `Cohort`/`StudyGroup` to capture the data it holds. But such a class is implied by the properties defined under the object taken by the `cohort` property, which are consistent with the core-im model of a `Cohort`/`StudyGroup`.

I realize that the `DataSet` and `Cohort`/`StudyGroup` classes were not part of the initial core-im that Alex and Larry created, which may explain the approach above. But now that these classes are in the core-im that the caf profile imports, can we consider the best way to use them explicitly in this caf profile?

Proposed Solutions:
An assumption underlying the proposed solutions is that the ultimate goal here is to specify what subset of properties from the core-im `DataSet` and `Cohort`/`StudyGroup` classes are allowed for use in the caf profile, and to define constraints on how they are to be populated in this profile. Another assumption behind these proposals is that implementations do not want to have to pull in ALL attributes on the core-im classes they use - i.e. those declared directly on them in the core-im, or inherited from ancestors in the core-im. The proposals below both address this concern.

Approach 1: use a new 'overwrites' functionality
The only difference between the current approach and this proposal is that the latter explicitly defines a `DataSet` class in the schema to hold the properties/data captured by the `derivedFrom` property - rather than implying one through the definition of nested properties under the `derivedFrom` attribute, in an untyped/anonymous json object.

How it works:
Explicitly define a `DataSet` class in the caf-source.yaml schema doc itself, where the subset of core-im `DataSet` properties to be used in the profile is defined, and any profile-specific constraints are added (e.g. cardinality, data types).

Below is an example of how the caf-source yaml might look for this approach:
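(Rough sketch - the `overwrites` keyword is the proposed new functionality, and the specific properties and constraints shown are illustrative.)

```yaml
# in caf-source.yaml
$defs:
  DataSet:
    overwrites: DataSet   # proposed keyword - explicitly redefine the core-im DataSet
    description: The dataset from which the allele frequency data was derived.
    properties:
      # only the subset of core-im DataSet properties used in this profile
      id:
        type: string
      label:
        type: string
      version:
        type: string
      license:
        type: string
    required: [id, version]   # profile-specific cardinality constraints
```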
Note that I think the `extends` keyword that is used on profiled properties performs this overwriting function for properties. The idea here is to have a keyword that similarly overwrites class definitions from the core-im - but in a way that follows VA/SEPIO profiling rules (e.g. all properties on such a class must come from its core-im 'parent', or extend a property on this 'parent').

Pros:
Cons:
Approach 2 below results in the same final outputs, but implements a solution further upstream, by controlling what content gets imported into a profile schema in the first place.
Approach 2: core-im 'slim' imports
This would import into the caf profile a core-im ‘slim’, defined as part of the profiling process, that would include only the subset of core classes and properties that will be directly used/specialized in the caf profile.
How it works:
Define an `in_subset` property and use this to tag elements in the core-im-source file with the name of the specific profile(s) they are part of. I prefer this because it keeps everything in one source-of-truth file, and it advertises the use of each element for all to see. An example of what this might look like:
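(Rough sketch - this assumes an `in_subset` tagging property as proposed; the profile names shown are illustrative.)

```yaml
# in core-im-source.yaml
$defs:
  DataSet:
    in_subset: [caf]          # this class is part of the caf profile's slim
    properties:
      id:
        in_subset: [caf]      # properties are tagged individually as well
        type: string
      label:
        in_subset: [caf, some-other-profile]   # illustrative second profile
        type: string
      releaseDate:
        type: string          # untagged - excluded from the caf slim
```

Pros: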
Cons:
- Tagging alone doesn't provide a way to re-home properties within the slim (see the sketch below) - e.g. I may want to push `specified_by` from `Information Entity` down to `Statement` in the core-im slim, because I don't want this property to show up on other `Information Entity` subclasses in my profile, such as `Method` or `Document`.
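A rough sketch of the kind of re-homing I mean (the keywords and the range shown are illustrative):

```yaml
# core-im slim for the caf profile (hypothetical)
InformationEntity:
  properties:
    id:
      type: string
    # specified_by is omitted here...
Statement:
  inherits: InformationEntity
  properties:
    # ...and declared here instead, so Method and Document don't inherit it
    specified_by:
      $ref: "#/Method"   # illustrative range
```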
IMO we always knew there would be tooling required to help implement the profiling approach in a way that preserves a single source of truth and reduces duplicative maintenance. I think the metaschema tooling is where it makes sense to implement this functionality for now (e.g. with functions like 'overwrite' or 'create slim'). But longer term, this is the type of thing that the LinkML framework is set up to handle in a more robust and standard way.
I know we have many other priorities besides metaschema development/extension right now - but at least speccing out how we want this to work will help us manually craft profiles in a way that is consistent with how we want tooling to do it for us in the future.