cbizon commented 6 years ago

The current use of DomainEntity in ValueSets is confusing and appears to add complexity without much benefit. This is one of a series of proposals for simplifying or modeling our approach. We (@bpow @larrybabb @mbrush ) are meeting Tuesday to decide on how to handle this stuff, so it would be good to collect all the proposals and comment on them ahead of time.

Current Status and Issues

When we have a property like IndividualAlleleInheritance.parentalConfirmation, we only want to allow that property to have a set of values. We do this in a two step process. First, we have a list of IRIs comprising a ValueSet. This set contains the allowed values of the property. Second, we make a type called ParentalConfirmation that derives from DomainEntity, and we state that the value of the parentalConfirmation property must be of this type.

Problem 1: A false sense of security The goal here is to use the type-safeness from the subclass of DomainEntity to make sure that the correct ValueSet is used to populate the property. But the types don't actually ensure that, because there's not an explicit (type-)connection between ParentalConfirmation (the DomainEntity) and ParentalConfirmation (the ValueSet). It is completely possible to make a message that is type-correct and uses an illegal value.

Problem 2: Complexity When looking at the documentation, users see that a given property must have the type ParentalConfirmation, and then they go to look it up and find that it just wraps elements from some ValueSet, which show up as "Examples". This is pretty baffling to users, but to hide it would mean the documentation is no longer playing fair with the reader.

Proposal: Don't use DomainEntity for this purpose

The problem is that we've put too much onto DomainEntity, which has a useful function for defining things like Family that probably don't have an IRI associated with them, but might.

I. Change annotation of properties For properties that now have DomainEntities as their type, change the type to "id" indicating that they will take an IRI. Add a column to the spreadsheet for ValueSets indicating the relevant value set that properties must draw from, and modify the documentation to show this as well.

2. Validate with enum: JSON schema can validate values using an enum: https://spacetelescope.github.io/understanding-json-schema/reference/generic.html Update the json-schema generation to create enums from the exclusive ValueSets and use them to validate messages.

3. Possibly use JSKOS to define ValueSets https://gbv.github.io/jskos/jskos.html

4. Retain DomainEntity as a class for its non-ValueSet use cases

larrybabb commented 6 years ago

Nice write-up @cbizon.

I think #1 above might be the best (the way I understand it at least). I think it is important to have a fairly generic "id" type for these concepts that we know to be "codes" which link to external concepts. These are codes we want to provide some level of control, sometimes fixed and sometimes extensible. Let's make sure we don't lose sight of the need for different instances of these value sets to have the feature which allows a sender to make up a local concept with/without a local code that is not in the prescribed list, as well as, adding one from a specific identifiersystem or some filtered area within a given identifier system.

I think #1 above may be the closest to this.

I believe #2 is too restrictive, it may work for some of the "fixed" ones we have defined, but I think it won't necessarily play nice in ontology land.

3 may still be viable, once we understand how strict SKOschemas and concepts need to be managed and if they allow for users of our message model to add custom codes on the fly.

rrfreimuth commented 6 years ago

Thanks for laying this out so nicely, Chris. As I consider these options, I find myself asking a more basic question: do we intend this model to include the constraints that might be used in a particular case? Asked another way, is the model documenting the only allowed values or is it listing examples of expected values?

We will certainly want to constrain values in some cases, as it would be harder to run the interpretation/scoring model without controlled value sets, but I wonder if we can sufficiently enumerate them all ahead of time. Will we need to support extensibility?

Thanks, Bob

From: cbizon [mailto:notifications@github.com] Sent: Thursday, January 18, 2018 1:36 PM To: clingen-data-model/clingen-interpretation Cc: Subscribed Subject: [clingen-data-model/clingen-interpretation] PROPOSAL: Remove use of DomainEntity in ValueSets (#185)

The current use of DomainEntity in ValueSets is confusing and appears to add complexity without much benefit. This is one of a series of proposals for simplifying or modeling our approach. We (@bpowhttps://github.com/bpow @larrybabbhttps://github.com/larrybabb @mbrushhttps://github.com/mbrush ) are meeting Tuesday to decide on how to handle this stuff, so it would be good to collect all the proposals and comment on them ahead of time.

Current Status and Issues

When we have a property like IndividualAlleleInheritance.parentalConfirmation, we only want to allow that property to have a set of values. We do this in a two step process. First, we have a list of IRIs comprising a ValueSet. This set contains the allowed values of the property. Second, we make a type called ParentalConfirmation that derives from DomainEntity, and we state that the value of the parentalConfirmation property must be of this type.

Problem 1: A false sense of security The goal here is to use the type-safeness from the subclass of DomainEntity to make sure that the correct ValueSet is used to populate the property. But the types don't actually ensure that, because there's not an explicit (type-)connection between ParentalConfirmation (the DomainEntity) and ParentalConfirmation (the ValueSet). It is completely possible to make a message that is type-correct and uses an illegal value.

Problem 2: Complexity When looking at the documentation, users see that a given property must have the type ParentalConfirmation, and then they go to look it up and find that it just wraps elements from some ValueSet, which show up as "Examples". This is pretty baffling to users, but to hide it would mean the documentation is no longer playing fair with the reader.

Proposal: Don't use DomainEntity for this purpose

The problem is that we've put too much onto DomainEntity, which has a useful function for defining things like Family that probably don't have an IRI associated with them, but might.

I. Change annotation of properties For properties that now have DomainEntities as their type, change the type to "id" indicating that they will take an IRI. Add a column to the spreadsheet for ValueSets indicating the relevant value set that properties must draw from, and modify the documentation to show this as well.

Validate with enum: JSON schema can validate values using an enum: https://spacetelescope.github.io/understanding-json-schema/reference/generic.html Update the json-schema generation to create enums from the exclusive ValueSets and use them to validate messages.
Possibly use JSKOS to define ValueSets https://gbv.github.io/jskos/jskos.html
Retain DomainEntity as a class for its non-ValueSet use cases

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/clingen-data-model/clingen-interpretation/issues/185, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ALAhBixr4hA88Z5EavGPhYjNe5O1XfoXks5tL5uTgaJpZM4RjcaA.

rrfreimuth commented 6 years ago

I think @larrybabb answered some of my questions in this post. Evidently we were responding in parallel.

Thanks, Bob

From: Larry Babb [mailto:notifications@github.com] Sent: Thursday, January 18, 2018 2:00 PM To: clingen-data-model/clingen-interpretation Cc: Subscribed Subject: Re: [clingen-data-model/clingen-interpretation] PROPOSAL: Remove use of DomainEntity in ValueSets (#185)

Nice write-up @cbizonhttps://github.com/cbizon.

I think #1https://github.com/clingen-data-model/clingen-interpretation/issues/1 above might be the best (the way I understand it at least). I think it is important to have a fairly generic "id" type for these concepts that we know to be "codes" which link to external concepts. These are codes we want to provide some level of control, sometimes fixed and sometimes extensible. Let's make sure we don't lose sight of the need for different instances of these value sets to have the feature which allows a sender to make up a local concept with/without a local code that is not in the prescribed list, as well as, adding one from a specific identifiersystem or some filtered area within a given identifier system.

I think #1https://github.com/clingen-data-model/clingen-interpretation/issues/1 above may be the closest to this.

I believe #2https://github.com/clingen-data-model/clingen-interpretation/issues/2 is too restrictive, it may work for some of the "fixed" ones we have defined, but I think it won't necessarily play nice in ontology land.

3https://github.com/clingen-data-model/clingen-interpretation/issues/3 may still be viable, once we understand how strict SKOschemas and concepts need to be managed and if they allow for users of our message model to add custom codes on the fly.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/clingen-data-model/clingen-interpretation/issues/185#issuecomment-358764326, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ALAhBkV1eK0AJBWPxeXjwWsH6tOJUJnNks5tL6LUgaJpZM4RjcaA.

cbizon commented 6 years ago

To be 100% clear, I wasn't trying to phrase 1,2,3,4 as choices, but as the components of a single proposal. Or, if you want to look at it this way, 1 is the proposal and 2,3,4 are attempts to shore up problems that might be caused by 1.

larrybabb commented 6 years ago

(Now I've posted a similar point to @rrfreimuth above in parallel - oh well)

I think another aspect of this that may be confusing us is the fact that we are defining a model for use in exchanging data, not storing it. As a message interchange there is the need to allow for its use in cases that are not ideal or tightly controlled. Providing flexibility on extending or creating concepts in systems that may not have the technical capability to register the concept in an ontology or public system needs to be accommodated.

If we are thinking that we can define ever value and limit our use of the message to a specific set of coded values for every attribute that needs them then we will most certainly build a message that deters adoption.

That said, I would say that we can define a flexible solution, but then provide our implementation for ClinGen to be more restrictive, much in the same way that HL7 standards publish general messaging structures and content and leaving specific groups large and small with the task of providing implementation guides that impose restrictions for a clearly scoped set of use cases.

Can we all agree that from a messaging standpoint, there is a need to support the exchange of dynamically and locally created concepts to be used in the place of elements which may be defined to promote a baseline set of concepts (in the form of a value set)? Then we can determine if we want to extend the use of these custom codes in our implementation guide that specifies how ClinGen is using the model to share/exchange ACMG interpretations between ClinVar-VCI-ClinGenKB?

larrybabb commented 6 years ago

@cbizon sorry for my misinterpretation. I didn't recognize that. Thanks for the clarification.

cbizon commented 6 years ago

In terms of extensibility, I think we want to allow for a range, which you currently model with the parameters on ValueSet. Some ValueSets are completely defined (yes/no or other cases where we can describe all possible values). Others maybe not. Others certainly not.

The restriction with the approach in this proposal is that if you want to use a new value, you need to phrase it in the form of an IRI. This isn't a huge restriction: it's not like they cost anything, and having a small hill to climb in order to fill in random values might actually be a feature...

rrfreimuth commented 6 years ago

Can we all agree that from a messaging standpoint, there is a need to support the exchange of dynamically and locally created concepts to be used in the place of elements which may be defined to promote a baseline set of concepts (in the form of a value set)? Then we can determine if we want to extend the use of these custom codes in our implementation guide that specifies how ClinGen is using the model to share/exchange ACMG interpretations between ClinVar-VCI-ClinGenKB? Yes! (my 2 cents only, of course)

Thanks, Bob

larrybabb commented 6 years ago

From @cbizon above.

The restriction with the approach in this proposal is that if you want to use a new value, you need to phrase it in the form of an IRI. This isn't a huge restriction: it's not like they cost anything, and having a small hill to climb in order to fill in random values might actually be a feature...

This is (one of the places) where I get hung up.

having a small hill to climb in order to fill in random values

@cbizon can we do an example or two of this so that I can build confidence that it supports the breadth of scenarios that the HL7 CodeableConcept does, for example? It's fine if it is more limiting. I just need to get on the same page with your understanding of how this works.

I think we should focus on this in our next meeting, if you can't throw up an example or two in this thread/issue.

cbizon commented 6 years ago

If somebody wants to use a new term we have a few options, I think:

1) We can demand that they contact us and we decide if it's an appropriate addition to the value set. If so, we add it, and they go and use it. In this way, we would be operating like eg. SO or somebody.

2) We don't demand it. Then the person has to just put in an IRI. It needs to be something that both sender and receiver understand, so it would be up to the user to negotiate with the receiver and make sure that they both understand it, and whether or not they wanted to make it dereferencable or not, and all that stuff.

I think that's all there is to it.

larrybabb commented 6 years ago

1 is not realistic. we don't have the resources to act as an agency for managing terms.

2 seems okay, but I'd like to talk about how to deal with terms that aren't necessarily negotiated in advance, in other words you may send stuff the other system does not precisely understand but the rest of the message still has significant value and vice versa.

Also, how would we handle situations where folks want to send "other" or "see comments" to enable the ability to send a less structured explanation of the concept because they've hit a scenarios that could not be foreseen.

bpow commented 6 years ago

Echoing my thanks to @cbizon for kicking off this discussion. I find that working out some of the thornier issues is better in text than on calls-- in part because it forces me to read through and process what others have written before I think through what I have to say.

I keep coming back to trying out different ways of describing what it is that we are trying to accomplish with ValueSets and CodeableConcepts (the latter is a name that isn't in our model anymore, but lives on in our thinking as adopted from the ideas of FHIR). I know that me trying to describe this is repetition, but I keep hoping that getting the right phrasing of the problem will make the solution evident...

We desire way of representing concepts in a type-safe-ish way where we can enumerate the accepted or preferred values, but at least in some cases allow these enumerations to be extended.

The other piece of CodeableConcept from FHIR that we would like to adopt is the ability to represent that two concepts from different systems involved in data interchange may be regarded as equivalent. We have discussed in the past that this type of statement feels a lot like an assertion, and maybe should be modeled as one, but I don't recall having reached any consensus on how to deal with this second point.

More in separate messages to maybe make commenting easier...

bpow commented 6 years ago

The proposal on the table as I read it takes any validation that concepts come from a ValueSet out of the model/data exchange format itself, and would require some other means for systems to verify correctness of the message. As @cbizon points out, our current model for valuesets probably doesn't go far enough to allow us to perform validation directly on the message using something like json-schema anyway, so maybe that's not too much of a loss.

Basically, I see it as though we would be back to having, within the json-ld context, properties be CodableConcept rather than CodeableConcept<ParentalConfirmation> or something like that. Is that accurate?

larrybabb commented 6 years ago

I think that's probably correct. As much as I didn't want to reverse (again) the decision we made to remove CodeableConcept. I'm not sure, but I think the SKOs-Concept concept may be a nice mapping to CodeableConcept, although I'm not clear it handles freetext type entries. I'm pretty sure every concept in a SKOs-scheme needs to be determined in advance.

bpow commented 6 years ago

Another way to characterize what we have been trying to do with the current model is to fit the FHIR ideas of CodeableConcepts into a fairly object-oriented model. I think there are a couple of fairly natural ways to try to represent enumerated values in the object-oriented world that can give you something like this type safety.

Abstract classes with subclasses

Have an abstract class that defines the overall type, and individual classes that represent the individual values. We don't really explicitly talk about having abstract classes in our model (although we do have some there implicitly...).

So, in this example, ParentalConfirmation would be an abstract class, and individual values like SEPIO-CG:99005 (maternal confirmation) would be concrete classes. I don't think this really makes sense for us because "maternal confirmation" isn't a class-- it is a value...

Specialized classes that can take restricted values as instances.

This is pretty much what a java Enum does (and similar in other languages), and I think the closest to what we have been doing in our current model (albeit with the confusion of the way we've named ValueSets and DomainEntities). For our use case, of course, we want the ability to extend some of these valuesets after creation. My characterization of this issue is in trying to figure out how much of this we want to support (and how to do so) in our messaging model.

bpow commented 6 years ago

I'm pretty sure that skos:concepts, and skos:conceptschemes, as part of the RDF world, can be extended quite easily. Perhaps even too easily for our tastes 😄

cbizon commented 6 years ago

I guess I'm not completely sold on the idea that we need either codeable concept or type safety integrated into our values / value sets. There is benefit, but I'm not sure it's worth the complexity cost.

larrybabb commented 6 years ago

Summation of notes from meeting on 1/30/18 (MattB, ChrisB, BradfordP & LarryB)...

From: Matthew Brush brushm@ohsu.edu Date: Tuesday, January 30, 2018 at 6:24 PM Subject: call notes/outcomes

Hi all, some quick and dirty call notes:

We agreed that a UserLabel object should always have a 'label' attribute.
We agreed that 'description' attribute on a UserLabel object should describe why the label was chosen (in some of the data examples the descriptions describe the ascertainment parameters in general)
We made no decision about rules for structuring the IRI of a user-defined term - other than that some conventions are needed here. Matt proposed approach used in ontology world where a shared namespace / prefix is defined for a set of terms, and each curator/system is assigned an ID range within this namespace. This avoids id collisions between data collected by different curators/systems.

Unresolved Issues:

(1) Capturing the context/scope in which a userLabel is defined - we would lose this scope when translating from json to rdf - because while the scope is implied in the nesting structure of the json, it is not explicitly encoded. Because the path from the statement to the UserLabel goes through the re-usable value term object (SEPIO:0000332) in the data, the scope in which separate UserLabels are defined for a single IRI can be lost in the RDF.
- (a) Bradford's original proposal dealt with this by completely separating a UserLabel object that references the value term in a ‘Thesaurus’ section of the json document. This UserLabel object is not itself referenced by the value term/borrowed IRI. This addresses the IRI pollution problem and the scoping problem, but makes for a convoluted message structure.
- (b) Matt's proposal nests the userLabel under the borrowed IRI, and adds a "scope" attribute to the UserLabel which references the ID of the object that contains it (not the ascertainment, but the parent ‘statement’ object). In this way the scope of the user label is explicitly stated, and thus can make it to rdf (and back to json) if this is an important requirement. The structure we came up with to define user labels for existing terms on the call was this (see context and comments on this toward the end of the google doc):
```
ascertainment:
  id: SEPIO:0000332
  label: ExAC ascertainment method
  type: Descriptor
  inValueSet: Ascertainment
  userLabel: 
id: CG-EX:UsrLabl008
label: Exome Aggregation Consortium ascertainment method
type: UserLabel
scope: CG-EX:AllFreq028
contribution: 
id: cg-user:12345
label: Bradford Powel
role: user
```
  The structure proposed to define entirely new terms that extend a given value set was something like this (scoping not an issue here, so the structure below should be sufficient):
```
ascertainment:  
  id: CG-USER:V0001
  label: GENIUS T2D ascertainment method
  type: Descriptor 
  extendsValueSet: Ascertainment 
  description: "Case group is composed of cases from six cohorts from the
  GENIUS T2D consortium. Case status is defined per-cohort"
```
  The issue of more specific DomainEntity types for tying these value terms is still open, and addressed below.
(2) Domain Entity typing of value set terms.
- (a) Bradford wants to support a use case of interoperability with other data sets where things like ascertainments are instantiated and described in the data. To support this, it would help to declare the type of thing that the attribute accepts (its range) using a domain entity type.
- (b) Action Item: Bradford to write up fully-fleshed examples highlighting the value and possible implementations of this proposal. Sure I missed something – but these are key issues and action items I believe.

bpow commented 6 years ago

I wonder whether it would be a useful distinction in our thinking and/or in our model between those concepts/"thingies" for which we specify additional structure and those for which we do not.

If I understand what @cbizon has in mind, the idea would be to replace all of the DomainEntity subtypes for which we do not have any additional attributes/structure to the parent DomainEntity type.

For purposes of cataloging things, the following types descend from DomainEntity and have additional structure in our model:

Criterion (has defaultStrength)
Gene (just has a geneSymbol, which arguably could be the 'label')
SequenceLocation (has referenceSequence, start, stop)
ReferenceSequence (has accession, produces, producedBy)
Haplotype (has 1..* allele)
Genotype (has haplotype, zygosity)
AssertionMethod (has url, version, scoringAlgorithm)
IdentifierSystem (has oid, iri, prefix)

Note that Haplotype is listed under "Evidence Items" in the web page, but descends from DomainEntity in the sheets document. Criterion, Genotype and IdentifierSystem descend from DomainEntity in the sheets, but do not have pages to represent their types in the webpage.

I'm not really sure that IdentifierSystem belongs as a DomainEntity, but interestingly enough, although it inherits from DomainEntity currently, it also defines its own label and description properties.

Part of the point of me cataloging this is to consider whether there is a more fundamental difference between these types (that are DomainEntity-plus) and the other DomainEntity subtypes that would be eliminated from our model under this proposal.

bpow commented 6 years ago

Following up on my earlier comment with some more focused questions:

Should there be a difference (at the model level) between things that, from the model's perspective, can just be specified by an IRI, and those that we require to have further structure?
Can we think of better names that clarify this? Maybe DomainEntity is OK for those things that have additional structure, but something like "DomainConcept" for things that do not have additional structure? Not saying this is a good idea, but would like to hear from people.
Would we imagine that we (or other implementers) might restrict items to a ValueSet even if they have additional structure defined either by us or by others? For instance, I could see Gene as being specified as coming from a ValueSet (where we would use the preferred system HGNC and items composed using their geneSymbols, but another system could be ensembl genes, or some other namespace that allows non-human genes...). Likewise, Criterion is associated with a valueset, but we also provide some additional attributes.

I'm really thinking that the differentiation at the type level is less "things that belong to ValueSets vs. things that don't" and more "things for which we define additional properties vs. things that we just represent with an IRI". Or maybe those two types of differentiation are orthogonal...

clingen-data-model / clingen-interpretation

PROPOSAL: Remove use of DomainEntity in ValueSets #185

3 may still be viable, once we understand how strict SKOschemas and concepts need to be managed and if they allow for users of our message model to add custom codes on the fly.

3https://github.com/clingen-data-model/clingen-interpretation/issues/3 may still be viable, once we understand how strict SKOschemas and concepts need to be managed and if they allow for users of our message model to add custom codes on the fly.

Abstract classes with subclasses

Specialized classes that can take restricted values as instances.

Summation of notes from meeting on 1/30/18 (MattB, ChrisB, BradfordP & LarryB)...