Open cmungall opened 1 year ago
Yes. This situation can be quite difficult. I don't have any great insight either. Sometimes you can find or create a "neutral label". E.g.:
For this example, this seems really clunky though. You can use numerical values, but now you have to make a choice about which unit to use. E.g.:
Silly word 'venti' was invented by one vendor (Starbucks) and to my knowledge isn't even used by anyone else. It should definitely NOT be a preferred label ;-)
@dr-shorthair I totally AGREE! :)
Thanks to @cmungall for thoughtful discussion (as usual). Yes, I think this captures the essence of the task that I have triggered. And yes, a desirable outcome is to at least have documented the strategy in use for each classification scope.
It may not be possible to use the same approach for all scopes, because of both technical mismatches and community sensitivities.
My colleagues seem to think that reconciliation is not possible in the case of soil classifications - see https://github.com/EnvironmentOntology/envo/issues/825#issuecomment-1289801787
Consider two systems (vocabularies, authoritative taxonomies, etc.) called X and Y, with terms X1..Xn and Y1..Ym respectively:
System X:
System Y:
To make it more concrete, but still abstracted away from ENVO, consider aligning the Starbucks ontology with the Joe's coffee ontology:
System X:
System Y:
There are variants of this problem. In one scenario, both Joe and Starbucks have fuzzy qualitative notions of their categories. In another, they have precise measurements, and these measurements are close but not exact. For example, you get 10 ml more in a Joe's small than in a Starbucks tall. There may be other nuances: the Starbucks measures include milk, Joe's do not.
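To illustrate the "close but not exact" variant, here is a minimal sketch of how a measured difference could drive the choice of mapping predicate. The volumes and both thresholds are hypothetical placeholders (only the 10 ml difference comes from the example above), and `mapping_predicate` is an illustrative helper, not part of any existing tool.

```python
def mapping_predicate(a_ml, b_ml, exact_ml=0.0, close_ml=25.0):
    """Pick a SKOS mapping predicate from the difference between two volumes.

    exact_ml and close_ml are assumed thresholds; a real project would have
    to agree on them (and on whether milk is counted) per scope.
    """
    diff = abs(a_ml - b_ml)
    if diff <= exact_ml:
        return "skos:exactMatch"
    if diff <= close_ml:
        return "skos:closeMatch"
    return None  # too far apart to map at all


# Hypothetical volumes: Joe's small holds 10 ml more than a Starbucks tall.
starbucks_tall_ml = 350
joes_small_ml = starbucks_tall_ml + 10
print(mapping_predicate(starbucks_tall_ml, joes_small_ml))
```

With a zero tolerance for exact matches, the 10 ml gap yields a close match rather than an exact one. Loosening `exact_ml` is precisely the kind of decision that should be documented per scope.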
The task is to bring these into a unified ontology, Z (e.g. ENVO). This unification must balance different goals that are sometimes in opposition.
The ultimate strategy used will be highly dependent on the domain, use cases, various sociotechnological aspects of adoption of systems X, Y, etc. Sometimes this may even be entangled with legal or geopolitical aspects.
However, we can still abstract some very general patterns for how we address this.
Eager merge and reconciliation strategy
Merge strategy, using X as terminological authority:
Z1 = X1, Y1; primary label: tall
Z2 = X2, Y2; primary label: venti
Z3 = X3, Y3; primary label: grande
Z4 = Y4; primary label: extra-large
(here = denotes skos exactMatch)
All Zn terms may be grouped under a parent class Z "size"
This process may involve manually unifying definitions and making them coherent with other parts of Z (favoring consistency with Z over precise wording choices in X and Y)
During the process of reconciliation it may be decided to slice and dice things differently
For example, Z4 may be labeled "extra-grande" to be more consistent with the rest of Z
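A rough sketch of the merge step, assuming the alignment X1/Y1, X2/Y2, X3/Y3 with Y4 left unpaired, and labels taken from the coffee example. The `eager_merge` function and the dictionary layout are illustrative assumptions, not an existing ENVO or OBO tool.

```python
# Source terms from the coffee example.
X = {"X1": "tall", "X2": "venti", "X3": "grande"}
Y = {"Y1": "small", "Y2": "medium", "Y3": "large", "Y4": "extra-large"}

# Assumed alignment: pairs judged equivalent; None means no counterpart.
ALIGNMENT = [("X1", "Y1"), ("X2", "Y2"), ("X3", "Y3"), (None, "Y4")]


def eager_merge(alignment, x_labels, y_labels):
    """Build one Z term per aligned group, using X as terminological authority."""
    merged = {}
    for i, (x_id, y_id) in enumerate(alignment, start=1):
        sources = [s for s in (x_id, y_id) if s is not None]
        # Prefer X's label; fall back to Y's when X has no term in the group.
        label = x_labels[x_id] if x_id is not None else y_labels[y_id]
        merged[f"Z{i}"] = {
            "label": label,
            "exact_match": sources,  # rendered as skos:exactMatch
            "parent": "size",
        }
    return merged


Z = eager_merge(ALIGNMENT, X, Y)
for z_id, term in Z.items():
    print(z_id, term["label"], term["exact_match"])
```

Relabelling Z4 to "extra-grande" for consistency would be a manual override applied after this step.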
Advantages:
Disadvantages:
Examples of this strategy:
The neuroanatomical atlas example is an informative case study. An ontology can never recapitulate the precision of a spatial atlas, nor should it. Annotations can still be made at the level of the atlas (using either region identifiers from the atlas or the coordinate system). The ontology does its job of reconciling different atlases, providing a unified view and allowing queries to transparently cover all systems.
It is not perfect. Compromises need to be made about whether region R uses boundary B1 or boundary B2 or intentionally conflates. But it is better than the alternatives.
The Mondo scenario is also a good case study because multiple complex decisions go into how different systems are unified, with different lumping and splitting decisions. Terminological choices about which label is primary (e.g. system X uses type 1, type 2, etc., while system Y uses gene nomenclature) can have massive ramifications for the primary stakeholders, i.e. patients.
Flat preservation non-reconciliation strategy
Z1 = X1; primary label: tall
Z2 = X2; primary label: venti
Z3 = X3; primary label: grande
Z4 = Y1; primary label: small
Z5 = Y2; primary label: medium
Z6 = Y3; primary label: large
Z7 = Y4; primary label: extra-large
All Zn terms may be grouped under a parent class Z "size"
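For comparison, a minimal sketch of the flat strategy: every source term gets its own Z term and no equivalences are asserted between X and Y. The `flat_preserve` helper and the dictionary layout are illustrative assumptions.

```python
# Source terms from the coffee example.
X = {"X1": "tall", "X2": "venti", "X3": "grande"}
Y = {"Y1": "small", "Y2": "medium", "Y3": "large", "Y4": "extra-large"}


def flat_preserve(*systems):
    """Mint one Z term per source term, keeping each source label as-is."""
    z_terms = {}
    n = 0
    for system in systems:
        for source_id, label in system.items():
            n += 1
            z_terms[f"Z{n}"] = {
                "label": label,
                "exact_match": [source_id],  # exactly one source per Z term
                "parent": "size",
            }
    return z_terms


Z = flat_preserve(X, Y)
for z_id, term in Z.items():
    print(z_id, term["label"], term["exact_match"])
```

The result has seven sibling terms under "size", with nothing in the structure to say that "tall" and "small" cover nearly the same thing.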
Advantages
Disadvantage
Examples of use:
Preservation with groupings
This is a variant of the previous one where we introduce grouping classes
Z0: size
This largely inherits advantages and disadvantages of the previous system
It somewhat alleviates the query problems of the previous system in that someone can query with the grouping class. But it is still confusing to users as the leaf nodes have so much overlap.
It arguably worsens the annotation problem as the curator now has 3 (largely indistinguishable, in the case of Z10 or Z11) classes to choose from.
It can also be hard to maintain and can lead to ragged lattices.
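To show how a grouping class helps with queries, here is a small sketch. The grouping identifiers `G1` and `G2`, their membership, and the annotation records are all hypothetical; only the leaf terms Z1, Z3, Z4 and Z6 come from the flat example.

```python
# child -> parent links; G1 and G2 are hypothetical grouping classes under Z0 (size).
PARENTS = {
    "Z1": "G1",  # tall
    "Z4": "G1",  # small
    "Z3": "G2",  # grande
    "Z6": "G2",  # large
    "G1": "Z0",
    "G2": "Z0",
}

# Hypothetical annotations: (item, term the curator chose).
ANNOTATIONS = [("order-a", "Z1"), ("order-b", "Z4"), ("order-c", "Z6")]


def ancestors(term, parents):
    """Walk the single-parent chain from a term up to the root."""
    found = []
    while term in parents:
        term = parents[term]
        found.append(term)
    return found


def query(term, annotations, parents):
    """Return items annotated with the term itself or any of its descendants."""
    return [
        item
        for item, annotated in annotations
        if annotated == term or term in ancestors(annotated, parents)
    ]


print(query("G1", ANNOTATIONS, PARENTS))  # ['order-a', 'order-b']
print(query("Z1", ANNOTATIONS, PARENTS))  # ['order-a']
```

Querying with the grouping class retrieves both the "tall" and the "small" order, whereas querying with a leaf misses its near-equivalent. The curator still has to pick between Z1, Z4 and G1 when annotating.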
Deciding on a strategy
It may be the case that ENVO does not dictate an overall strategy. I tend to favor the reconciliation approach which has been successful in other OBO ontologies.
However, we should at least be intentional and clear about which strategy we use in each individual scenario, and have a vocabulary for expressing that strategy.
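One possible shape for such a vocabulary, as a sketch only: an enumeration of the three strategies described above plus a per-scope registry. The scope name "coffee-size" and the strategy assigned to it are placeholders, not decisions that have been made for any ENVO branch.

```python
from enum import Enum


class Strategy(Enum):
    EAGER_MERGE = "eager merge and reconciliation"
    FLAT_PRESERVATION = "flat preservation, no reconciliation"
    PRESERVATION_WITH_GROUPINGS = "preservation with groupings"


# Hypothetical registry: one documented strategy per classification scope.
STRATEGY_BY_SCOPE = {
    "coffee-size": Strategy.EAGER_MERGE,
}


def strategy_for(scope):
    """Look up the documented strategy, failing loudly if none is recorded."""
    try:
        return STRATEGY_BY_SCOPE[scope]
    except KeyError:
        raise LookupError(f"No strategy documented for scope {scope!r}") from None


print(strategy_for("coffee-size").value)
```

Failing on an undocumented scope, rather than falling back to a default, keeps the choice explicit for every scope.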