edmcouncil / idmp

This repository stores the OWL ontology built on the basis of the ISO standards for identification of medicinal products.
https://spec.edmcouncil.org/idmp/
MIT License

Labels not unique under OWL entailment #331

Open tw-osthus opened 1 year ago

tw-osthus commented 1 year ago

If we use simple OWL entailment that infers sub-properties, then we do not have a unique label for a class or named individual.

This is a problem when we use queries under that entailment.

find substances in a medicinal product

?product a idmp-mrd:MedicinalProduct .
?product (cmns-col:comprises)+ ?comprised .
?comprised a idmp-sub:Substance .
?comprised rdfs:label ?substance .

In our Amlodipine example, this returns two substance labels for the "Amlodipine EMC" medicinal product:

amlodipine mesylate monohydrate and amlodipine mesilate monohydrate

I first expected a typo, but amlodipine mesilate monohydrate is a defined synonym! Because cmns-av:synonym is a sub-property of rdfs:label, sub-property entailment returns both labels. We cannot easily get the preferred label unless we add an expensive check that the label is not a synonym: filter not exists { ?comprised cmns-av:synonym ?substance }

If no sub-property entailment is used, we get the "correct" label: amlodipine mesylate monohydrate.

So can we disable sub-property inference? No, because we need this when we traverse along "cmns-col:comprises" or any sub-property of it.

I suggest that we generate skos:prefLabel in addition to rdfs:label, so that if uniqueness is needed, then we can use skos:prefLabel instead of rdfs:label
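The effect can be reproduced on a toy graph. The following is a minimal sketch, not the behaviour of any particular triple store: triples are plain Python tuples, prefixed names are plain strings, `ex:substance1` is a hypothetical individual, and the sub-property chain (cmns-av:synonym under skos:altLabel under rdfs:label, skos:prefLabel under rdfs:label) follows what is described in this thread and in SKOS.

```python
# Each property maps to its direct super-property (single parent assumed for this sketch).
SUBPROPERTY_OF = {
    "cmns-av:synonym": "skos:altLabel",
    "skos:altLabel": "rdfs:label",
    "skos:prefLabel": "rdfs:label",
}

# Asserted triples for one hypothetical substance individual.
asserted = {
    ("ex:substance1", "rdfs:label", "amlodipine mesylate monohydrate"),
    ("ex:substance1", "skos:prefLabel", "amlodipine mesylate monohydrate"),
    ("ex:substance1", "cmns-av:synonym", "amlodipine mesilate monohydrate"),
}


def entail_subproperties(triples, subproperty_of):
    """Return the triples plus (s, super, o) for every super-property of each predicate."""
    inferred = set(triples)
    for s, p, o in triples:
        while p in subproperty_of:
            p = subproperty_of[p]
            inferred.add((s, p, o))
    return inferred


def values(triples, subject, predicate):
    """Sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


inferred = entail_subproperties(asserted, SUBPROPERTY_OF)

labels_plain = values(asserted, "ex:substance1", "rdfs:label")
labels_entailed = values(inferred, "ex:substance1", "rdfs:label")
pref_entailed = values(inferred, "ex:substance1", "skos:prefLabel")

print(labels_plain)     # one label without entailment
print(labels_entailed)  # two labels once the synonym is inferred as rdfs:label
print(pref_entailed)    # still one value under entailment
```

Nothing is a sub-property of skos:prefLabel in this sketch, so it keeps a single value even after the closure is applied.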

tw-osthus commented 1 year ago

rdfs:label, like cmns-av:synonym, is an annotation property and can be excluded from inferencing, but the problem remains that we still need some sub-property inferencing.

mereolog commented 1 year ago

IMHO we should take it as a feature of a representation that allows multiple names for a single entity.

An alternative may be that we divorce cmns-av:synonym from rdfs:label, but this feels arbitrary as synonyms may be seen as alternative labels. Another option is to 'reify' synonyms as named individuals.

Both options are expensive process-wise, as they require a change to a Commons ontology.

This also shows, in my view, a risk related to running SPARQL queries with some reasoning support. In this particular case we could rewrite the query, as there are just 3 sub-properties of cmns-col:comprises.

tw-osthus commented 1 year ago

There is nothing wrong with having multiple rdfs:labels per entity, and cmns-av:synonym (via skos:altLabel) is a legitimate sub-property of it. It is just that our guidelines expect a single asserted rdfs:label, so that anyone who wants to implement the IDMP-O standard knows which label to use.

However under simple sub-property inference, an asserted rdfs:label cannot be distinguished from an inferred rdfs:label, and the guideline breaks in our SPARQL queries. People can expect to find the single rdfs:label to use in a standard on IDMP vocabulary.

If we run a SPARQL query and get more results than expected, it is hard to debug, because it might be the consequence of a nested select query relying on labels. We can use GROUP BY ?iri together with SAMPLE() to catch it, as Heiner has done, but this is not stable either: SAMPLE() can return any of the labels, and GROUP BY makes the queries much more difficult to understand and hides the important graph pattern.

IMHO it is not simply a feature of representation. Competency questions are a very important part of the ontology, and we should be able to use them without specialized tooling support. Accurids can replace the IRIs by labels, Protege can do it, but both features are non-standard and tool dependent. A standard SPARQL query runs on any RDF API and triple store. Our queries should be minimal, so that experienced people can understand them, but they should also be usable by non-RDF experienced SMEs. They will expect to read labels, not IRIs, and rightfully so, because that is what labels are intended for.

We can add a simple requirement to our guidelines: every entity gets a unique skos:prefLabel annotation with the same value as its rdfs:label. We can even generate that automatically. That gives us a reliable predicate for a unique label, one that works even when sub-property inference is enabled. People can then use rdfs:label to collect all possible labels, e.g. in a mapping use case.
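Generating the annotation could look roughly like this. It is a sketch under assumptions, not the project's tooling: triples are plain tuples with prefixed names as strings, language tags are ignored, and `add_pref_labels` and the `ex:` individuals are hypothetical names. It has to run on the asserted triples, before any sub-property inference, and it fails loudly when an entity breaks the single-label guideline.

```python
def add_pref_labels(asserted_triples):
    """Return the asserted triples plus one skos:prefLabel per asserted rdfs:label."""
    labels = {}
    for s, p, o in asserted_triples:
        if p == "rdfs:label":
            labels.setdefault(s, set()).add(o)

    result = set(asserted_triples)
    for s, found in labels.items():
        if len(found) != 1:
            # The guideline expects exactly one asserted rdfs:label per entity.
            raise ValueError(f"{s} has {len(found)} asserted rdfs:label values")
        result.add((s, "skos:prefLabel", next(iter(found))))
    return result


asserted = {
    ("ex:substance1", "rdfs:label", "amlodipine mesylate monohydrate"),
    ("ex:substance1", "cmns-av:synonym", "amlodipine mesilate monohydrate"),
}

enriched = add_pref_labels(asserted)
pref = sorted(o for s, p, o in enriched if p == "skos:prefLabel")
print(pref)
```

The synonym is left untouched, so rdfs:label under inference still collects every label, while skos:prefLabel carries only the asserted one.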

Using skos:prefLabel does not introduce another external dependency: we already use skos:altLabel as a super-property of cmns-av:synonym, so CMNS already depends on SKOS.

Sure, we can go further and use reified labels, such as SKOS-XL, and in some contexts this is the way to go, because it can be hard to agree on a stable cross-domain unique label. After all, this has been and still is my main critique of using human-readable IRIs. I can also understand that introducing reified labels makes the queries more complicated, and if we can work without them, that is good for simplicity. In Allotrope, the community also decided on using simple SKOS instead of SKOS-XL.

Some tools and services might not understand SKOS and rely only on rdfs:label, so I do not recommend deleting the asserted rdfs:label. However, these tools will have to disable any sub-property inference anyway if they expect uniqueness, and they have to accept that some standard queries, which rely on this inference, will not work.

Not using sub-property inference is a big limitation for SPARQL-based competency questions. You can emulate sub-class inference with the property path rdf:type/rdfs:subClassOf*, but you cannot do the same with properties. A property path such as cmns-col:comprises+ that should include the sub-properties of cmns-col:comprises must list every sub-property in the path itself, e.g. (cmns-col:comprises|cmns-col:hasMember|cmns-col:hasConstituent|...)+, which is not practical to write and maintain by hand. We also cannot use variables in a property path. Any SPARQL query that needs to drill down recursively will have to use property paths. In our use cases, this is for example the nested packaging.
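If the query text is generated outside SPARQL, the alternation can at least be derived from the rdfs:subPropertyOf assertions instead of being typed by hand. This is only a sketch of that workaround: `subproperty_closure` and `property_path` are hypothetical helpers, and the two pairs below are the sub-properties named in this comment, not a complete list.

```python
def subproperty_closure(root, subproperty_pairs):
    """Return root plus every direct or indirect sub-property of root."""
    children = {}
    for sub, sup in subproperty_pairs:
        children.setdefault(sup, set()).add(sub)

    found, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def property_path(root, subproperty_pairs):
    """Build a one-or-more SPARQL property path over root and its sub-properties."""
    names = sorted(subproperty_closure(root, subproperty_pairs))
    return "(" + "|".join(names) + ")+"


# (sub-property, super-property) pairs; illustrative, taken from the example above.
pairs = [
    ("cmns-col:hasMember", "cmns-col:comprises"),
    ("cmns-col:hasConstituent", "cmns-col:comprises"),
]

path = property_path("cmns-col:comprises", pairs)
print(f"?product {path} ?comprised .")
```

This moves the problem into a preprocessing step rather than solving it: the resulting query is no longer a single static text, which is the drawback for competency questions that should run unchanged on any triple store.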

ElisaKendall commented 1 year ago

@mereolog @tw-osthus There is evidence in Protege of some of the challenges with labels, even when one has used multilingual labels that are all rdfs:label -- the Protege display doesn't know which label to use, so it doesn't use any of them. But if we say that the US label is preferred, that may not be the right solution either. I think there are some interesting issues buried in this discussion, though, not necessarily limited to labels / annotations. Perhaps we should have a discussion about this?

mereolog commented 7 months ago

@tw-osthus is this still an issue? The ticket is almost a year old - can we close it?

tw-osthus commented 7 months ago

The issue still exists and is unresolved, but it seems that it is not a pain point for the IDMP community, which surprises me a bit.

All external vocabularies have a notion of preferred label, so the people concerned with them seem to agree that it is useful to have that discussion. Bayer mentioned that they want to have a single preferred label, so inferring multiple rdfs:labels is bad.

It is a bit of a political topic whether US English should be the preferred language, but it is de facto the lingua franca in the scientific domain. Around 1900 it could have been German, but we all know the reasons why it did not turn out that way.

You can configure Protege to use any property to display labels, even multiple ones in a preferential order.