Align high level class structure between core-source and va-spec models

mbrush commented 1 year ago

I reorganized high level class structure in va-spec to support the ValueEntity vs ExtensibleEntity distinction, and better align with what is in core-source model. But there is not yet complete alignment, and some elements needed for VA are missing from the core-source representation.

This issue compares the current high level class structure of the gks-metaschema and va-spec models, to facilitate alignment needed before the va-spec can drop these classes and re-use what is in core-source.

Diagram 1: The current core-source upper level class hierarchy.

Note that not all classes are shown as boxes in the diagram - some concrete subclasses are listed in the bottom section of abstract class boxes such as DomainEntity and ExtensibleEntity.

Classes in red are those that I suspect should be moved/re-organized, as proposed in Diagram 2 below.

Diagram 2: The core-source hierarchy after I amended what I suspect may be oversights?

Specific amendments made here:

Condition and TherapeuticCollection have been moved under Domain Entity (per ga4gh/va-spec#107)
Extension now inherits from ExtensibleEntity (per ga4gh/va-spec#108)
The Expression class defined in vod-source now inherits from ExtensibleEntity (per ga4gh/va-spec#109)

Diagram 3: The va-spec upper class hierarchy

. . . how I would refactor things in the VA IM to support the ValueEntity vs ExtensibleEntity distinction, and better align with what is in core-source model. (But as noted, this is not yet fully aligned with the organization of high level classes in the metaschema per diagrams above.)

Key differences / features of VA high level class organization to consider/align

The va model adds a description attribute to ExtensibleEntity
The va model introduces the notion of a 'UtilityEntity' to formally separate general purpose data type like classes (Coding, Extension, Expression, etc) from 'CoreEntities' such as Statement, Evidence Lines, Methods, etc. (i.e. the dedicated classes for representing Statements and their evidence and provenance)
- This parallels how we have organized and described these different categories of classes in our documentation.
- We think this distinction is very helpful conceptually, and also useful from a modeling perspective w.r.t. attribute inheritance.
- But several questions/considerations need to be addressed before moving ahead with this idea
The va model has additional attributes in the ExtensibleEntity/CoreEntity class representing generally useful properties like label, url, references, and xrefs.
- IMO these are things users generally may expect to find on any non-ValueEntity class, including many of the fields listed as examples in the current def of Entity (see ga4gh/va-spec#106) . . . e.g. people may want to add things like descriptions, xrefs, etc. to a Statement object.
- Note also that if/when we want to add these types of general info to a ValueEntity class, we would use the corresponding attribute in the Descriptor that wraps it (e.g. ValueObjectDescriptor.value_object_url)
The va model considers Proposition to be a Core Entity, not a Value Object

Questions / Issue with this Model: Note that some issues arise with this model when considering how the classes partitioned under ExtensibleEntity (e.g. Coding) may be used within ValueEntities (e.g. Proposition if this is treated as a ValueEntity). Some general questions to think about that might inform our thinking here:

Do we need to treat Propositions as ValueEntities . . . I get the use case for this from ClinGen/VICC, but seems inconsistent with the rest of the Core VA classes being extensible.
Do we envision Propositions and other ValueEntities using Codings? If so, they cannot be extensible, right? (but they are extensible in all proposals above)
- e.g. If we want to include a slot in a Disease value object to describe its mode of inheritance, this would likely be represented using a Coding that takes a HPO 'inheritance pattern' term.
- e.g. Propositions may also have attributes where we would want to use a Coding (e.g. a Molecular Consequence Proposition would probably include an SO term which would be represented as a Coding). If not, how would we represent things that would naturally use Codings? Resort to enums?
Are there creative ways we might handle value entities that let us use extensible utility classes in their representation, and/or include 'non-required' fields within them? e.g. exclude specific fields from being input into the id generation algorithm (I recall there is precedent for this in vrs, with the _id field?)
It may be that the Core Class vs Utility Class distinction (which is conceptually useful and made in our documentation) need not be formally made in the model itself, if it causes issues w.r.t. inheritance of attributes.

larrybabb commented 1 year ago

There's a lot to unpack here. But here are my thoughts once we come around to discussing this...

ValueObjectDescriptor - this class does not fit the semantics of UtilityEntity as I look at the other subclasses in that set. It may be a special EntityDescriptor class that directly descends from ExtensibleEntity? In any case, the interesting thing about Descriptors is that they all create wrappers for ValueObjects which can be extended, identified and tied to a given authority's record. So there could also be provenance, recordmetadata and even a method for a given ValueObjectDescriptor (IMO). Descriptors are really a kind of record-level statement about a ValueObject (again IMO).
I agree that Proposition does not have to be a ValueObject, but I still feel quite strongly that all of the attributes of a given concrete Proposition MUST be required. We can discuss. It is also worth noting that @ahwagner and I are coming around to the idea that while the Proposition is an incredibly important and useful semantic that provides the Definitional representation of a Statement, it does not necessarily need to be a separate class. We need to come up with a way of specifying our Statements such that the embedded Definition that computationally and precisely represents the basis behind the Statement is super-transparent to implementers. It may be best to keep it as a separate class for just that purpose, but every Statement will have one and only one Proposition and those Propositions will be tightly constrained with a full complement of required fields.

I hear your argument about optional fields. These are fine on classes that are not able to be computationally precise. We want to really try to achieve the notion of interoperability which is confounded IMO by flexibility in how data is represented. Optional fields, while necessary, should be segregated from the truly interoperable substructures if at all possible and reasonable.

mbrush commented 1 year ago

April 2023 Update: ClinGen/VICC are no longer pursuing the descriptor-based approach to value object representation in their initial implementation models. Value objects and descriptor objects will be collapsed into a single object, folding together non-essential decoration and essential identifying information. For objects where we want to compute identifiers, a separate specification will indicate the subset of fields to be used for this purpose.
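
As a rough illustration of that last point, the sketch below computes an identifier from a declared subset of fields and ignores everything else on the object. The field names, the `IDENTIFYING_FIELDS` table, the example values, and the SHA-256-over-sorted-JSON digest are all assumptions made up for illustration. This is not the VRS computed-identifier algorithm, and the real field subsets would come from the separate specification mentioned above.

```python
import hashlib
import json

# Hypothetical per-type spec: which fields feed the identifier.
# Anything not listed here is treated as non-essential decoration.
IDENTIFYING_FIELDS = {
    "Disease": ["type", "system", "code"],
}


def compute_id(obj: dict) -> str:
    """Digest only the identifying fields of obj; other fields are ignored."""
    keys = IDENTIFYING_FIELDS[obj["type"]]
    essential = {k: obj[k] for k in keys}
    # Sorted keys and fixed separators give a stable serialization.
    canonical = json.dumps(essential, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{obj['type']}:{digest[:24]}"


# Placeholder values, not real ontology terms.
bare = {"type": "Disease", "system": "https://example.org/codes", "code": "EX:0001"}
decorated = {**bare, "label": "example disease", "xrefs": ["EX2:42"]}

# Decoration does not change the identifier; identifying fields do.
assert compute_id(bare) == compute_id(decorated)
assert compute_id({**bare, "code": "EX:0002"}) != compute_id(bare)
```

With this shape, a single collapsed object can carry labels, xrefs, and extensions without affecting the computed identifier, as long as those fields stay out of the identifying subset.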

Given this development, we no longer need to make a class-level distinction between Value Objects and Extensible Entities, as in the diagrams above, and as in the current GKS foundation/core-source model. Every class should now be extensible. This IMO simplifies our high level class structure, and moves us past many of the concerns / alignment issues documented above in this ticket.

A much simpler aligned high-level class structure would look roughly as below:

Notes / Rationale:

Element is the most general abstract root class. It holds attributes that could apply to any concrete class in the model (be they a Utility class or an Entity class) . . . e.g. all objects can have types, or descriptions, and now that we collapse descriptors into value entities everything should be extensible.
Utility classes are essentially complex data type structures - re-usable collections of fields that can be plugged into other objects to capture related information.
- These are things that we don't want/need to identify, label, etc.
- The Utility class itself is abstract / organizational - implementers would always use one of the concrete subtypes with specifically defined fields (e.g. Coding, Expression).
Entity classes are either Core Entities or Domain Entities.
- Core Entities are classes representing general/universal types of knowledge artifacts, and the processes and agents involved in their generation. These are not specific to any domain or field of research.
- Domain Entities are things in a particular domain or field that knowledge is created about.
- The attributes in the Entity object align with those defined in the VRSATILE Value Object Descriptor class (plus two additional ones, starred). These will be inherited by VRS and VA classes that plug into this high level structure here (Domain Entities and VA Core Classes).
- On the VRS / Domain Entity side, this inheritance is consistent with the idea of collapsing descriptors into value entities - as these are fields that now need to live inside this single object.
- On the VA / Core Entities side - this is consistent with the general purpose attributes we currently define on our high level Entity class.
- The only thing to consider is if we want to also include the two additional attributes identified as useful for VA (references, and recordMeta), but not currently part of the ValueObjectDescriptor class . . . if these are not wanted on the VRS side, we are happy to push them down into a VA-specific class like Information Entity.
Finally, instead of the names Element and Entity, we might consider the names Entity and Identifiable Entity. But I prefer the simpler names in the diagram above.

ga4gh / va-spec