Document design pattern for bringing in alternative classifications

cmungall commented 1 year ago

Consider two systems (vocabularies, authoritative taxonomies, etc) called X and Y. They both have terms X1..Xn, Y1..Ym:

System X:

X1
X2
X3

System Y:

Y1
Y2
Y3
Y4

To make it more concrete, but still abstracted away from ENVO, consider aligning the Starbucks ontology with the Joe's coffee ontology:

System X:

X1: tall
X2: venti
X3: grande

System Y:

Y1: small
Y2: medium
Y3: large
Y4: extra-large

There are variants of this problem. In one scenario, both Joe and Starbucks have fuzzy qualitative notions of their categories. In another, they have precise measurements, and these measurements are close but not exact. E.g. you get 10ml more in a Joe's small than a Starbucks tall. There may be other nuances, e.g. the Starbucks measures include milk, Joe's do not.

The task is to bring these into a unified ontology, Z (e.g. ENVO). This unification must balance different goals that are sometimes in opposition.

The ultimate strategy used will be highly dependent on the domain, use cases, various sociotechnological aspects of adoption of systems X, Y, etc. Sometimes this may even be entangled with legal or geopolitical aspects.

However, we can still abstract some very general patterns for how we address this.

Eager merge and reconciliation strategy

Merge strategy, using X as terminological authority:

Z1 = X1, Y1 primary label: tall
Z2 = X2, Y2 primary label: venti
Z3 = X3, Y3 primary label: grande
Z4 = Y4 primary label: extra-large

(here = denotes skos exactMatch)

All Zn terms may be grouped under a parent class Z "size"

This process may involve manually unifying definitions and making them coherent with other parts of Z (favoring consistency with Z over precise wording choices in X and Y)

During the process of reconciliation it may be decided to slice and dice things differently

For example, Z4 may be labeled "extra-grande" to be more consistent with the rest of Z
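As a rough illustration of the merge, the unified terms and their exact matches could be held in a small lookup structure. This is only a sketch: the `UnifiedTerm` class and `unified_id` function are hypothetical names, and it assumes Z2 is meant to map to X2 and Y2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UnifiedTerm:
    """A term in Z (hypothetical structure, not an ENVO or SSSOM API)."""
    label: str  # primary label, taken from X where X has a term
    exact_matches: List[str] = field(default_factory=list)  # skos:exactMatch targets


# The merged system Z from the example, using X as terminological authority.
Z = {
    "Z1": UnifiedTerm("tall", ["X1", "Y1"]),
    "Z2": UnifiedTerm("venti", ["X2", "Y2"]),  # assumed mapping, see lead-in
    "Z3": UnifiedTerm("grande", ["X3", "Y3"]),
    "Z4": UnifiedTerm("extra-large", ["Y4"]),
}


def unified_id(source_id: str) -> Optional[str]:
    """Return the Z term that a source term (from X or Y) maps to, if any."""
    for z_id, term in Z.items():
        if source_id in term.exact_matches:
            return z_id
    return None


print(unified_id("Y2"))  # Z2
```

Data annotated with either X or Y identifiers can then be rolled up to a single Z term, while the source identifiers stay available through the mappings.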

Advantages:

harmonization and data integration
mappings using SSSOM/SKOS allow precisely relating the unified class to the source term (exact, broad, close, narrow)
subcategories are largely disjoint, no need for paired annotation
combines best aspects of each
internal coherency
terminological preferences can be recapitulated in specific exports of Z (e.g. a Y flavor of Z using Y's labels)
If X and Y have quantitative aspects these are still retained if X and Y are mapped to. We are not replacing X and Y, just providing a consensus view.
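The point about source-specific exports can be sketched as follows. The `LABELS` table and `export_flavor` function are hypothetical; they assume each unified term keeps its source labels as synonyms tagged with the system they came from.

```python
# Hypothetical tagged synonyms: each unified term keeps every source label.
LABELS = {
    "Z1": {"X": "tall", "Y": "small"},
    "Z2": {"X": "venti", "Y": "medium"},
    "Z3": {"X": "grande", "Y": "large"},
    "Z4": {"Y": "extra-large"},  # X has no corresponding term
}


def export_flavor(labels, preferred, fallback="X"):
    """Pick one display label per term, preferring the given source system."""
    out = {}
    for z_id, by_source in labels.items():
        if preferred in by_source:
            out[z_id] = by_source[preferred]
        elif fallback in by_source:
            out[z_id] = by_source[fallback]
        else:
            # Neither system has a label; take whatever label exists.
            out[z_id] = next(iter(by_source.values()))
    return out


y_flavor = export_flavor(LABELS, preferred="Y")
x_flavor = export_flavor(LABELS, preferred="X", fallback="Y")
print(y_flavor["Z1"], x_flavor["Z1"])  # small tall
```

The identifiers and hierarchy are the same in both exports; only the displayed labels differ.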

Disadvantages:

may involve some loss of specificity
difficulty of getting consensus on reconciliation
danger of making "hodge-podges" that are neither fish nor fowl
Y may be unhappy their primary labels are relegated to synonyms (albeit tagged)

Examples of this strategy:

Uberon bringing in multiple different authoritative yet conflicting brain atlases
Uberon bringing in different species
Mondo bringing in different genetic disease classifications
GO in providing a unified system for functional genomics

The neuroanatomical atlases one is an informative case study. An ontology can never recapitulate the precision of a spatial atlas, nor should it. Annotations can still be done at the level of the atlas (either using region identifiers from the atlas, or the coordinate system). The ontology does its job of reconciling different atlases, providing a unified view, allowing queries to transparently cover all systems.

It is not perfect. Compromises need to be made about whether region R uses boundary B1 or boundary B2 or intentionally conflates. But it is better than the alternatives.

The Mondo scenario is also a good case study because there are multiple complex decisions that go into how different systems are unified, with different lumping and splitting decisions. Terminological choices for which label is primary (e.g. system X uses type 1, type 2, etc. with system Y using gene nomenclature) can have massive ramifications for the primary stakeholders, i.e. patients.

Flat preservation non-reconciliation strategy

Z1 = X1 primary label: tall
Z2 = X2 primary label: venti
Z3 = X3 primary label: grande
Z4 = Y1 primary label: small
Z5 = Y2 primary label: medium
Z6 = Y3 primary label: large
Z7 = Y4 primary label: extra-large

All Zn terms may be grouped under a parent class Z "size"

Advantages

politically neutral - we do not favor any one system

Disadvantages

confusing for users
- if I want to annotate a coffee cup, do I use Z1 or Z4? Do I annotate with both?
- if I want to query for coffee cups, do I query with Z1 or Z4? Do I have to query with both and do a union?
leads to inconsistent gappy annotation
does not fulfill requirements for a unified ontology, namely providing unification
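The query problem can be made concrete with a toy annotation set. Everything here is hypothetical (the cup identifiers, the `ANNOTATIONS` table, the `query` helper), and it assumes the person querying already knows that Z1 (tall) and Z4 (small) roughly correspond, which nothing in the flat system tells them.

```python
# Hypothetical annotations: each cup was tagged with whichever term the curator picked.
ANNOTATIONS = {
    "cup_a": {"Z1"},        # curated with the X-derived term (tall)
    "cup_b": {"Z4"},        # curated with the Y-derived term (small)
    "cup_c": {"Z1", "Z4"},  # curated with both
    "cup_d": {"Z2"},        # a different size altogether
}


def query(annotations, terms):
    """Return the items annotated with at least one of the given terms."""
    wanted = set(terms)
    return sorted(item for item, tags in annotations.items() if tags & wanted)


only_z1 = query(ANNOTATIONS, ["Z1"])       # silently misses cup_b
union = query(ANNOTATIONS, ["Z1", "Z4"])   # the union the user has to build by hand
print(only_z1, union)
```

Querying with Z1 alone returns an incomplete result without any warning, which is where the gappy annotation becomes a practical problem.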

Examples of use:

NCIT for some branches? (need to preserve original terminological systems)
Generally this is not common in OBO

Preservation with groupings

This is a variant of the previous one where we introduce grouping classes

Z0: size

Z10 primary label: tall or small
- Z1 = X1 primary label: tall
- Z4 = Y1 primary label: small
Z11 primary label: medium or venti
- Z2 = X2 primary label: venti
- Z5 = Y2 primary label: medium
Z12 primary label: grande, large or extra large
- Z3 = X3 primary label: grande
- Z6 = Y3 primary label: large
- Z7 = Y4 primary label: extra-large

This largely inherits advantages and disadvantages of the previous system

It somewhat alleviates the query problems of the previous system in that someone can query with the grouping class. But it is still confusing to users as the leaf nodes have so much overlap.
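A minimal sketch of querying through a grouping class, assuming the hierarchy listed above is stored as child-to-parent links. The `PARENT` table and `descendants` function are illustrative names only.

```python
# Child -> parent links for the grouped hierarchy in the example.
PARENT = {
    "Z10": "Z0", "Z11": "Z0", "Z12": "Z0",
    "Z1": "Z10", "Z4": "Z10",
    "Z2": "Z11", "Z5": "Z11",
    "Z3": "Z12", "Z6": "Z12", "Z7": "Z12",
}


def descendants(term):
    """Return the term plus everything below it in the hierarchy."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


print(sorted(descendants("Z10")))  # ['Z1', 'Z10', 'Z4']
```

A query on Z10 now retrieves data annotated with Z1, Z4, or Z10 itself, so the user no longer has to build the union by hand. The curator's choice between the three is not made any easier.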

It arguably worsens the annotation problem as the curator now has 3 (largely indistinguishable, in the case of Z10 or Z11) classes to choose from.

It can also be hard to maintain and can lead to ragged lattices.

Deciding on a strategy

It may be the case that ENVO does not dictate an overall strategy. I tend to favor the reconciliation approach which has been successful in other OBO ontologies.

However, we should at least be intentional and clear about which strategy we use in each individual scenario, and have a vocabulary for expressing our strategy.

wdduncan commented 1 year ago

Yes. This situation can be quite difficult. I don't have any great insight either. Sometimes you can find or create a "neutral label". E.g.:

Size 4 coffee
- large coffee
- venti coffee
Size 3 coffee
- medium coffee
- grande coffee
Size 2 coffee
- small coffee
Size 1 coffee
- tall coffee

For this example, this seems really clunky though. You can use numerical values, but now you have to make a choice about which unit to use. E.g.:

20-24 fl oz coffee
- large coffee
- venti coffee
16-20 fl oz coffee ...
But now you've imposed the use of "oz" instead of "ml" :(
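One possible way to soften the unit problem is to store each range once and generate the label in whichever unit a given export wants. This is only a sketch: `range_label` is a hypothetical helper, it assumes US fluid ounces, and a canonical storage unit (ml here) still has to be chosen internally.

```python
ML_PER_US_FL_OZ = 29.5735295625  # US customary fluid ounce (assumed; the imperial one differs)


def range_label(low_ml, high_ml, unit="ml"):
    """Render a volume range, stored in ml, as a label in the requested unit."""
    if unit == "ml":
        return f"{round(low_ml)}-{round(high_ml)} ml coffee"
    if unit == "fl oz":
        low = round(low_ml / ML_PER_US_FL_OZ)
        high = round(high_ml / ML_PER_US_FL_OZ)
        return f"{low}-{high} fl oz coffee"
    raise ValueError(f"unsupported unit: {unit}")


print(range_label(591.47, 709.76, unit="fl oz"))  # 20-24 fl oz coffee
print(range_label(591.47, 709.76, unit="ml"))     # 591-710 ml coffee
```

The rounded ml label is not very readable, so this moves the choice of unit into the export step without making the labels any less clunky.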

dr-shorthair commented 1 year ago

Silly word 'venti' was invented by one vendor (Starbucks) and to my knowledge isn't even used by anyone else. It should definitely NOT be a preferred label ;-)

wdduncan commented 1 year ago

@dr-shorthair I totally AGREE! :)

dr-shorthair commented 1 year ago

Thanks to @cmungall for thoughtful discussion (as usual). Yes, I think this captures the essence of the task that I have triggered. And yes, a desirable outcome is to at least have documented the strategy in use for each classification scope.

It may not be possible to use the same approach for all scopes, because of both technical mismatches and community sensitivities.

dr-shorthair commented 1 year ago

My colleagues seem to think that reconciliation is not possible in the case of soil classifications - see https://github.com/EnvironmentOntology/envo/issues/825#issuecomment-1289801787
