EnvironmentOntology / envo

A community-driven ontology for the representation of environments
http://www.environmentontology.org
Creative Commons Zero v1.0 Universal

Document design pattern for bringing in alternative classifications #1377

Open cmungall opened 1 year ago

cmungall commented 1 year ago

Consider two systems (vocabularies, authoritative taxonomies, etc) called X and Y. They both have terms X1..Xn, Y1..Ym:

System X:

System Y:

To make it more concrete, but still abstracted away from ENVO, consider aligning the Starbucks ontology with the Joe's coffee ontology:

System X (Starbucks): X1 tall, X2 venti, X3 grande

System Y (Joe's): Y1 small, Y2 medium, Y3 large, Y4 extra-large

There are variants of this problem. In one scenario, both Joe and Starbucks have fuzzy qualitative notions of their categories. In another, they have precise measurements, and these measurements are close but not exact. E.g. you get 10 ml more in a Joe's small than a Starbucks tall. There may be other nuances: the Starbucks measures include milk, Joe's do not.

The task is to bring these into a unified ontology, Z (e.g. ENVO). This unification must balance different goals that are sometimes in opposition.

The ultimate strategy used will be highly dependent on the domain, use cases, various sociotechnological aspects of adoption of systems X, Y, etc. Sometimes this may even be entangled with legal or geopolitical aspects.

However, we can still abstract some very general patterns for how we address this.

Eager merge and reconciliation strategy

Merge strategy, using X as terminological authority:

Z1 = X1, Y1 primary label: tall
Z2 = X2, Y2 primary label: venti
Z3 = X3, Y3 primary label: grande
Z4 = Y4 primary label: extra-large

(here = denotes skos:exactMatch)

All Zn terms may be grouped under a parent class Z "size"

This process may involve manually unifying definitions and making them coherent with other parts of Z (favoring consistency with Z over precise wording choices in X and Y)

During the process of reconciliation it may be decided to slice and dice things differently

For example, Z4 may be labeled "extra-grande" to be more consistent with the rest of Z
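As a rough illustration of the merge, here is a minimal Python sketch. It assumes Z2 is meant to cover X2 and Y2, and it uses plain dictionaries rather than any real ontology tooling. The names `MERGED`, `build_source_index`, and `translate`, and the `order-N` annotations, are hypothetical and only serve the example.

```python
# Merged terms in Z, each listing the source terms it exactly matches.
# Assumption: Z2 covers X2 and Y2; labels follow system X where X has a term.
MERGED = {
    "Z1": {"label": "tall", "exact_match": ["X1", "Y1"]},
    "Z2": {"label": "venti", "exact_match": ["X2", "Y2"]},
    "Z3": {"label": "grande", "exact_match": ["X3", "Y3"]},
    "Z4": {"label": "extra-large", "exact_match": ["Y4"]},
}


def build_source_index(merged):
    """Invert the mapping: source term ID -> merged Z term ID."""
    index = {}
    for z_id, term in merged.items():
        for source_id in term["exact_match"]:
            # An exact match to two different Z terms would make them equivalent.
            if source_id in index:
                raise ValueError(
                    f"{source_id} is mapped to both {index[source_id]} and {z_id}"
                )
            index[source_id] = z_id
    return index


SOURCE_TO_Z = build_source_index(MERGED)


def translate(annotations, index=SOURCE_TO_Z):
    """Rewrite (item, source term) annotations into (item, Z term) annotations."""
    return [(item, index[term]) for item, term in annotations]


# Hypothetical annotations made against the two source systems.
annotations = [("order-1", "X1"), ("order-2", "Y1"), ("order-3", "Y4")]
merged_annotations = translate(annotations)
```

After translation, the orders annotated with X1 and Y1 both land on Z1, so a single query against Z covers both source systems. The duplicate check in `build_source_index` is one cheap way to catch a source term that has accidentally been mapped to two merged terms.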

Advantages:

Disadvantages:

Examples of this strategy:

The neuroanatomical atlases example is an informative case study. An ontology can never recapitulate the precision of a spatial atlas, nor should it. Annotations can still be done at the level of the atlas (either using region identifiers from the atlas, or the coordinate system). The ontology does its job of reconciling different atlases, providing a unified view, and allowing queries to transparently cover all systems.

It is not perfect. Compromises need to be made about whether region R uses boundary B1 or boundary B2 or intentionally conflates. But it is better than the alternatives.

The Mondo scenario is also a good case study because there are multiple complex decisions that go into how different systems are unified, with different lumping and splitting decisions. Terminological choices for which label is primary (e.g. system X uses type 1, type 2, etc., while system Y uses gene nomenclature) can have massive ramifications for the primary stakeholders, i.e. patients.

Flat preservation non-reconciliation strategy

Z1 = X1 primary label: tall
Z2 = X2 primary label: venti
Z3 = X3 primary label: grande
Z4 = Y1 primary label: small
Z5 = Y2 primary label: medium
Z6 = Y3 primary label: large
Z7 = Y4 primary label: extra-large

All Zn terms may be grouped under a parent class Z "size"
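To make the flat layout concrete, here is a minimal Python sketch under the same toy assumptions. Every source term becomes its own Z term under a single parent. The parent ID `Z_size`, the `find_items` helper, and the `order-N` annotations are hypothetical names introduced for illustration.

```python
SIZE_PARENT = "Z_size"  # hypothetical ID for the parent class "size"

# One Z term per source term; nothing is merged.
FLAT_TERMS = {
    "Z1": {"label": "tall", "source": "X1"},
    "Z2": {"label": "venti", "source": "X2"},
    "Z3": {"label": "grande", "source": "X3"},
    "Z4": {"label": "small", "source": "Y1"},
    "Z5": {"label": "medium", "source": "Y2"},
    "Z6": {"label": "large", "source": "Y3"},
    "Z7": {"label": "extra-large", "source": "Y4"},
}

# Every flat term is a direct child of the single parent class.
IS_A = {z_id: SIZE_PARENT for z_id in FLAT_TERMS}


def find_items(annotations, query_id):
    """Return items annotated with query_id or with one of its direct children."""
    return [
        item
        for item, z_id in annotations
        if z_id == query_id or IS_A.get(z_id) == query_id
    ]


# Hypothetical annotations: one per source vocabulary, plus an extra-large.
annotations = [("order-1", "Z1"), ("order-2", "Z4"), ("order-3", "Z7")]

tall_only = find_items(annotations, "Z1")         # finds only order-1
all_sizes = find_items(annotations, SIZE_PARENT)  # finds all three orders
```

In this sketch a query for Z1 (tall) misses the order annotated with Z4 (small), even though the two sizes may be nearly the same in practice. Only the parent-level query sees both, and it also pulls in every other size.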

Advantages:

Disadvantages:

Examples of use:

Preservation with groupings

This is a variant of the previous one where we introduce grouping classes

Z0: size

This largely inherits advantages and disadvantages of the previous system

It somewhat alleviates the query problems of the previous system in that someone can query with the grouping class. But it is still confusing to users as the leaf nodes have so much overlap.

It arguably worsens the annotation problem as the curator now has 3 (largely indistinguishable, in the case of Z10 or Z11) classes to choose from.

It can also be hard to maintain and can lead to ragged lattices.

Deciding on a strategy

It may be the case that ENVO does not dictate an overall strategy. I tend to favor the reconciliation approach which has been successful in other OBO ontologies.

However, we should at least be intentional and clear about which strategy we use in each individual scenario, and have a vocabulary for expressing our strategy.

wdduncan commented 1 year ago

Yes. This situation can be quite difficult. I don't have any great insight either. Sometimes you can find or create a "neutral label". E.g.:

For this example, this seems really clunky though. You can use numerical values, but now you have to make a choice about which unit to use. E.g.:

dr-shorthair commented 1 year ago

Silly word 'venti' was invented by one vendor (Starbucks) and to my knowledge isn't even used by anyone else. It should definitely NOT be a preferred label ;-)

wdduncan commented 1 year ago

@dr-shorthair I totally AGREE! :)

dr-shorthair commented 1 year ago

Thanks to @cmungall for thoughtful discussion (as usual). Yes, I think this captures the essence of the task that I have triggered. And yes, a desirable outcome is to at least have documented the strategy in use for each classification scope.

It may not be possible to use the same approach for all scopes, because of both technical mismatches and community sensitivities.

dr-shorthair commented 1 year ago

My colleagues seem to think that reconciliation is not possible in the case of soil classifications - see https://github.com/EnvironmentOntology/envo/issues/825#issuecomment-1289801787