ga4gh / va-spec

An information model for representing variant annotations.

Complex Data Type to hold systematic names/aliases in Value Object Descriptors #67

Open mbrush opened 3 years ago

mbrush commented 3 years ago

A core requirement for the Value Object Descriptor (VOD) is the ability to record alternate names/expressions for the wrapped object. We proposed to use the following VOD attributes for this:

Value Object Descriptor:

This ticket is for both vetting this basic approach, and also deciding on the best name for the alias field (see above) and the complex data type it is bound to (see below).

The proposed complex data type itself is analogous to Identifier and Coding complex data types, that hold structured representations of instance identifiers and coded values, respectively. The attributes of this proposed complex data type will look something like the following:

To Be Named Complex Data Type:

mbrush commented 3 years ago

Working proposals for attribute and data type names:

Name for the current 'alias' attribute in the VOD object:

Name for the complex data type bound to this alias field

In defining a name and model for this Data Type, we should consider the intended scope and possible applications of the complex data type. Minimally, it will support names for any wrapped value object from a formal nomenclature, such as HGVS and ISCN. But if there are broader applications or utility for such a data type, we should consider them.

Thoughts @larrybabb @ahwagner @rrfreimuth? See line 72 in the data example for more concrete implementation context.

rrfreimuth commented 3 years ago

@mbrush said

  • label: string [0..1] - the preferred/primary name
  • alternateLabel: string [0..m] - other simple/informal names
  • alias : t.b.d. complex data type [0..m] - more systematic names/expressions based on formal nomenclatures, e.g. HGVS, ISCN

Before commenting on the proposed complex data type (and I do have thoughts on that), I'd like to understand further the intent for each of those attributes. Unless I'm missing something, I don't think there is a distinction between them. One user's "preferred/primary" name could be another user's "simple/informal" name could be another's "systematic name". If that's true, then each of those attributes should be able to support the representation of the same data. Typing two attributes as string and the third as a TBD complex data type would prevent lossless transformation.

If someone could affirm my thoughts or provide rationale for why the types need to be different, I'll then be able to comment on the proposed types themselves.

larrybabb commented 3 years ago

We need a singular "preferred" label, not a list of possible labels, so that the producer can specify the label they use for the concept (whatever it is). Likewise, we need a singular "description" field so that the producer can describe the concept based on the source, or on whatever they determine to be the best label and/or description of a given instance of a concept. The alternate labels are additional, "also known as" labels that have no structure and by which the concept might also be known (legacy or other names; they relate to the preferred label much as xrefs relate to the preferred identifier).

We need a rough consensus on this as it should not chew up valuable time that could be addressing much more important areas of the model. If someone has an implementation that this does not work for they should bring their system requirements forward (discuss how it is preventing them from delivering on their requirements with a real working system).

As far as the alias goes, this seems to have spiraled out of the suggestion that @ahwagner and I made about finding a "special" structure for things like ISCN, HGVS, or other nomenclatures that are not truly Codes or Identifiers. The idea is to provide a simple (system, version, value) construct that lets the producer use the "system" as a kind of namespace enumerating the specific type of nomenclature (e.g. ghgvs for g. hgvs nomenclature, chgvs ..., etc.). It would also open the door to specifying the version of the nomenclature spec, if that is helpful. The value, of course, is the string expression produced manually or automatically by some agent (often the produced value does not adhere to formal hgvs specs, especially for older forms curated from literature).

As we try to implement the sharing of variant data in our pilot systems, it seems clear and necessary to allow consumers of the messages to "find" the "genomic hgvs" nomenclature if provided in certain instances. So we determined that we needed to either stuff nomenclatures into codings or separate them into this specialized concept, which we think keeps things cleaner, as we are not 100% confident that these nomenclatures or systematic names are plain labels or legitimate codes/identifiers.

larrybabb commented 3 years ago

If we can't get a rough consensus on this plan in a very short time frame then I would suggest that we simply put nomenclatures into the code structures and move on. We should be able to make that work. Then we can revisit this once there's more compelling need to separate them out. Let's spend our time tackling the more critical issues as this seems pretty insignificant in the end.

mbrush commented 3 years ago

Thanks for this Larry, and @rrfreimuth I hope this makes the intention and distinctions clearer for you. Just a few things left to settle. Below I lay out the options I think we need to consider/choose from, based on feedback from the group, and questions that have been raised.

To Larry's point, let's only briefly consider these issues, and pick the solution most folks prefer for v0. But I will say that I place a slightly higher level of importance on this part of the model than Larry - as the structure we settle on has the potential to be used/populated in every VA Statement. And I suspect that things like HGVS labels will be an important 'hook' for users to find and understand the data.

Current Proposed Approach:

The VOD structure as it stands uses the four fields shown above to hold name/identifier related metadata. If we go with this approach, I think we just need to consider the name we give the field that holds systematic expressions (currently called alias), so it is clearer about its purpose. There is nothing in the term 'alias' that suggests it would hold only systematic names. Consider something like:

VOD:  (only the fields under consideration in this issue)
- label: string [0..1]                   #  holds the preferred/primary name
- alternateLabels: string [0..m]         #  holds other free-text names (non-systematic/structured) 
- expressionLabels: Expression  [0..m]   #  aka 'systematicLabels' - complex data type that holds systematic names/expressions based on formal nomenclatures, e.g. HGVS, ISCN
- xref: curie [0..m]                     #  holds identifiers of related database entities (not names)

One question this raises is whether the notion of a 'systematic name' includes names assigned by authoritative sources/terminologies (e.g. a knowledgebase, registry, or ontology), even if they are not derived through some formal grammar / heuristics like hgvs. For example, consider a case where our VOD wraps a gene, where the preferred label is taken from HGNC to be 'LEF1', and we want to capture the name 'TCF10' as the preferred name of this gene from EntrezGene. Would we capture this alternative name in the expressionLabels field instead of the alternateLabels field (so we could attach NCBIGene as the system)?

Alternate Approach: (back on the table, given Bob's concerns)

This question is moot if we go with an approach letting us capture all alternative labels using a single 'names' field that takes a complex 'Name' data type, where the 'Name.system' field is optional:

VOD: (only the fields under consideration in this issue)
- label: Name [0..1]    #  the preferred/primary name of the wrapped entity
- names: Name [0..m]      #  other names for the wrapped entity - be they informal/bespoke, assigned by an authority, or systematic expressions based on a formal nomenclature such as HGVS, ISCN
- xref: curie [0..m]      #  holds identifiers of related database entities (not names)

Name:
- value: string [1..1]             # the name
- system: string [0..1]            # the system or authority that created/assigned the name (if such a system was involved)
- systemURL: url [0..1]            # the url of this system
- systemVersion: string [0..1]     # the version of the system that assigned the name

In some respects this model is simpler, in that there is only one way/structure to capture alternative names. Consider also that the preferred label could be duplicated in a Name structure if the data creator wanted to record the system that provided it. But it is more complex in other respects, in that it requires a nested data structure to represent even non-systematic alternate names.
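As a rough illustration of this alternate approach, here is a minimal Python sketch of the 'Name' data type and the two VOD fields that use it, populated with the LEF1/TCF10 gene example from above. The class names, the snake_case attribute names, and the informal nickname are assumptions made for the sketch; they are not part of any agreed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Name:
    value: str                            # the name itself
    system: Optional[str] = None          # optional: who or what assigned the name
    system_url: Optional[str] = None      # 'systemURL' in the proposal
    system_version: Optional[str] = None  # 'systemVersion' in the proposal


@dataclass
class VOD:
    label: Optional[Name] = None                     # preferred/primary name
    names: List[Name] = field(default_factory=list)  # every other kind of name
    xref: List[str] = field(default_factory=list)    # CURIEs (identifiers, not names)


gene = VOD(
    label=Name("LEF1", system="HGNC"),
    names=[
        Name("TCF10", system="NCBIGene"),  # name assigned by an authority
        Name("an informal nickname"),      # hypothetical free-text name, no system
    ],
)
```

Note how even the free-text name has to be wrapped in a Name object, which is the extra nesting cost mentioned above.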

Not sure if this alternate proposal addresses the concerns @rrfreimuth has raised. But I hope we can spend just a few minutes considering these issues, and make a decision knowing it is v0, and we can iterate in v1 to address problems.

rrfreimuth commented 3 years ago

Good discussion. I had two primary questions and I think we're past the first one. @larrybabb If we consider the meaning of preferred/primary name to be scoped to a given message, then I don't have any issue with its use. I would be more concerned about applying those labels to a generalized knowledge base where the context of a particular message is lacking.

My second comment/question is about the data types used for each of the name-like attributes. If we use different types (e.g., string vs. Coding) then it will not be possible to round-trip messages losslessly because once the data in a Coding is represented as a simple string, we can't get it back again. Therefore, I wonder if it would be possible (and sensical) to use the same type for both the primary/preferred name and the alternate/alias name(s). That type would have to support a simple string, a code, and a nomenclature/grammar-derived name.
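To make the round-trip concern concrete, here is a small Python sketch. The `flatten` and `restore` helpers and the placeholder HGVS-style string are hypothetical, written only to show what is lost when a structured name is stored in a string-typed attribute.

```python
from typing import Dict, Optional


def flatten(name: Dict[str, Optional[str]]) -> str:
    """Collapse a structured name to the plain string a string-typed field can hold."""
    return name["value"]


def restore(text: str) -> Dict[str, Optional[str]]:
    """Best-effort inverse: the system cannot be recovered from the string alone."""
    return {"value": text, "system": None}


# Placeholder expression, not a real variant.
original = {"value": "NM_000000.0:c.1A>G", "system": "cHGVS"}
round_tripped = restore(flatten(original))

# The value survives, but the system is gone, so the round trip is lossy.
lossless = round_tripped == original
```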

I think Coding gets pretty close, but it lacks a convenient way to capture type or use as in the Identifier type. Adding an attribute like that to Coding might be sufficient, and it would allow us to capture "gHGVS" (and distinguish it from "cHGVS").

Please note that if we follow the pattern provided by Coding, we should try very hard to adhere to the same internal data types that are used in that model. That means system is a uri, not a string or a code (although we could discuss whether our use cases require URIs in that slot and whether requiring that type would be overkill).

Finally, I'm not sure we need to (or should) differentiate between names that derive from a nomenclature/grammar and those that do not. If we can find a single type to support all styles of names, then perhaps our attribute list collapses from 3 to 2. Just my $0.02.

mbrush commented 3 years ago

Thanks Bob - I feel like you are pushing for an approach at the opposite extreme from the original proposal, which requires many separate fields for names with different data types. I think the 'Alternate Proposal' above sits somewhere in between.

A couple responses to specific comments

it will not be possible to round-trip messages losslessly because once the data in a Coding is represented as a simple string, we can't get it back again

Sorry if being dense, but I'm not sure we need to worry about the round tripping scenario, or need the model to guard against it. Maybe we can flesh out on the call.


. . . That type would have to support a simple string, a code, and a nomenclature/grammar-derived name.

I think the data type would only need to support a 'simple string' and 'nomenclature/grammar-derived' version of a name, but not a 'code'. We are capturing names here (not identifiers, or codes), so there is no need to be able to hold a code for the name. This is why we are proposing a new complex data type here (to be named 'Expression', or 'Name' - depending on how we end up deciding to scope it).


I think Coding gets pretty close, but it lacks a convenient way to capture type or use as in the Identifier type. Adding an attribute like that to Coding might be sufficient, and it would allow us to capture "gHGVS" (and distinguish it from "cHGVS").

I think this is the purpose of the 'system' field - which would hold values like 'gHGVS' or 'cHGVS' or 'ISCN'.

larrybabb commented 3 years ago

We meant to separate out nomenclatures, not authorized names and symbols from coding authorities like HGNC or ensembl or any of the other thousands of authorities that mint and define labels, names, symbols, codes and identifiers.

Nomenclatures follow an expression-based syntax defined by a specification or guideline. I do not think it is wise or useful to use it for sharing gene symbols since symbols can change over time and even be reused on different genes at various points in time.

Nomenclatures are not as dependable as the more formal computed digests defined in VRS which are in line with identifiers.

I agree with the weakness in the name 'alias' so it may make more sense to use the name labelExpressions for the HGVS and/or ISCN expressions that producers contrive properly or improperly to represent different types of variants.

I do think we should move on and review this after we've tested it in some implementations so as to not belabor the issue.

larrybabb commented 3 years ago

The real value in providing a separate expression for hgvs is to give producers and consumers a distinct element to place and find hgvs or iscn expressions. There's no reason users can't put expressions in the alternateLabels attribute if they choose to. However, if they want and need to make a given type of expression findable, then the expressionLabels or labelExpressions would be a reasonable option. If it proves to be clumsy, useless, or confusing, we can change it in v0.1 when we get there.
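The findability argument can be sketched as a simple lookup. The example below is only an illustration: the `find_expressions` helper, the dict layout, and the placeholder expression strings are assumptions, not part of the spec.

```python
from typing import Dict, List


def find_expressions(vod: Dict, system: str) -> List[str]:
    """Return the values of all expressionLabels entries whose system matches."""
    return [
        expr["value"]
        for expr in vod.get("expressionLabels", [])
        if expr.get("system") == system
    ]


# Placeholder descriptor; the expression strings are not real variants.
vod = {
    "label": "placeholder label",
    "alternateLabels": ["c.1A>G"],  # free text: a consumer cannot tell what system this is
    "expressionLabels": [
        {"system": "gHGVS", "value": "NC_000000.0:g.1A>G"},
        {"system": "cHGVS", "value": "NM_000000.0:c.1A>G"},
    ],
}

genomic = find_expressions(vod, "gHGVS")
```

An expression placed only in alternateLabels is still carried in the message, but a consumer has no structured way to select it by nomenclature.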

mbrush commented 3 years ago

Update from 12-16-20 VA Call. Consensus reached to go with original approach. Roughly:

VOD:  (only the fields under consideration in this issue)
- label: string [0..1]                   # The preferred name for the wrapped value object as assigned by the creator of the VOD 
- alternateLabels: string [0..m]         #  holds other names for the object (simple, non-systematic names, not systematically-generated expressions from a formal nomenclature) 
- expressionLabels: Expression [0..m]    #  aka 'systematicLabels' -  holds labels representing systematic expressions that describe the wrapped object, as generated by formal nomenclatures (e.g. HGVS, ISCN, HLA)
- xref: curie [0..m]                     #  holds identifiers of related database entities (these are ids, not names)

and

Expression:
- value: string [1..1]             # the expression itself 
- system: string [1..1]            # the system or authority that created/assigned the expression 
- systemURL: url [0..1]            # the url of this system
- systemVersion: string [0..1]     # the version of the system that assigned the expression
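For implementers who want to test this, the agreed v0 shape could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the class names, the snake_case attribute names, and all example values (including the placeholder HGVS-style string and the `ex:0001` CURIE) are illustrative and not taken from the spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Expression:
    value: str                            # the expression itself [1..1]
    system: str                           # nomenclature that produced it [1..1]
    system_url: Optional[str] = None      # 'systemURL' [0..1]
    system_version: Optional[str] = None  # 'systemVersion' [0..1]


@dataclass
class ValueObjectDescriptor:
    label: Optional[str] = None                                      # [0..1]
    alternate_labels: List[str] = field(default_factory=list)        # [0..m]
    expression_labels: List[Expression] = field(default_factory=list)  # [0..m]
    xref: List[str] = field(default_factory=list)                    # CURIEs [0..m]


# Placeholder values only; none of these identify a real variant or record.
vod = ValueObjectDescriptor(
    label="example variant",
    alternate_labels=["legacy name"],
    expression_labels=[Expression(value="NM_000000.0:c.1A>G", system="cHGVS")],
    xref=["ex:0001"],
)
```

Unlike the 'Name' alternative, `system` is required here, so every expression label must declare the nomenclature it came from.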

Larry, Alex, and Dmitriy were primary advocates for this. Bob preferred one of the alternate approaches, but was ok moving ahead with this one. Testing of this model should include cases that explore the concerns Bob raised (see meeting minutes, and comments in ticket above).

Some work to do (possibly for v 0.1) to align the Expression data type with existing data types in the community (e.g. FHIR Coding or FHIR Identifier)

Also, consider if the expressionLabels field is generally applicable, and should be part of the generic VOD schema, or attached to subtypes where 'expressions' are relevant (e.g. Variation descriptors)