Handling of language tags in KGCL

cmungall commented 8 months ago

Currently handling of language tags is under-specified in KGCL, both in terms of

matching (e.g. change label from X[@en] to Y)
applying (e.g. change label from X to Y[@en])

Recall also that most OBO ontologies use a mixture of uncommitted literals, xsd:string, and @en to denote English-language labels.

As a general principle, the KGCL DSL is intended to be user-friendly. The user shouldn't have to know detailed implementation knowledge about each ontology. In fact it is very hard for them to know these details. As a case in point, for the following two terms in ENVO it's impossible to know from OLS that the first uses an explicit @en and the second does not:

At the most recent OMO meeting there was heated discussion about whether we should expect cardinality=1 of rdfs:label given that some ontologies may want to be international. It's not up to KGCL to adjudicate here. However, we can make things easy for users:

matching should be liberal; if a language tag is not specified this should not be interpreted as "must match untyped literal", it should instead be interpreted as "match this at the string level"
application should be configurable at the ontology level
- if the user does not specify a language tag, and the ontology is configured to always use language tags then the configured default language should be applied
- if the user does specify a language tag then this should be used (it is up to the ontology to configure GH actions to reject any or all language tags if their policy is always untyped literals)

This does place more of a burden on implementors, as there needs to be some configuration mechanism, but having this default to untyped literals will work for pretty much all OBO ontologies for now.
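
The application rules above can be sketched as a small resolution function. This is only an illustration under stated assumptions: `OntologyConfig`, `always_tag`, `default_lang`, and `resolve_new_lang` are hypothetical names, not part of any existing KGCL implementation, and the default configuration corresponds to untyped literals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyConfig:
    """Hypothetical per-ontology settings (assumed for illustration)."""
    always_tag: bool = False            # ontology policy: always use language tags?
    default_lang: Optional[str] = None  # e.g. "en"; None means untyped literals


def resolve_new_lang(user_lang: Optional[str], config: OntologyConfig) -> Optional[str]:
    """Pick the language tag to put on the newly written literal."""
    if user_lang is not None:
        # An explicit tag from the user always wins; rejecting it is left
        # to the ontology's own checks (e.g. GH actions).
        return user_lang
    if config.always_tag:
        # No tag given, but the ontology is configured to always tag.
        return config.default_lang
    # Default: write an untyped literal.
    return None
```

With the default configuration, `resolve_new_lang(None, OntologyConfig())` yields no tag, which matches the "default to untyped literals" behaviour suggested above.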

gouttegd commented 8 months ago

> At the most recent OMO meeting there was heated discussion about whether we should expect cardinality=1 of rdfs:label given that some ontologies may want to be international. It's not up to KGCL to adjudicate here.

Actually, even if we decided that there can only be one label (and so that we can ignore all cases where there is more than one label as being invalid and “not-our-problem”), that wouldn’t solve the general issue: KGCL supports modifying annotation properties other than rdfs:label, including properties for which there is no doubt (or at least I hope there is no doubt!) that it is perfectly legitimate to have more than one annotation per term. All properties pertaining to synonyms, for example.

> matching should be liberal; if a language tag is not specified this should not be interpreted as "must match untyped literal", it should instead be interpreted as "match this at the string level"

Given a case like this:

AnnotationAssertion(rdfs:label EX:0001 "the label")
AnnotationAssertion(rdfs:label EX:0001 "the label"@en)

What should be the behaviour of rename EX:0001 from "the label" to "the new label" ? Should it rename both the language-neutral label and the English label? What if I want to specifically edit the language-neutral label?

How about:

No language tag means that we look for a value that does not have a language tag (so, "the label" would match a tag-less label only);
We accept a @* language tag that would mean “any language tag (including no language tag) will do” (so, "the label"@* would match any literal value that is exactly "the label", regardless of any language tag).

This way the decision to ignore the language tags when matching would be an explicit decision. (Of course we could also do the opposite: no language tag means “ignore the language tags when matching”, and a @NONE or similar would mean “only match literals that do not have a language tag” – though that would seem much less natural to me.)

Then at the level of the Ontobot, individual ontologies can configure the bot to pass to the KGCL engine a --default-old-language-tag option, that would be used when no language tag is explicitly given in the KGCL command(s). By setting this option to @*, this would give the same behaviour as the one you propose, the difference being that this behaviour would not be hardcoded in the KGCL engines.
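
A minimal sketch of this explicit-wildcard matching, assuming a simplified literal model. `Lit`, `ANY_LANG`, and `matches_explicit` are hypothetical names introduced here, and the `default_old_lang` parameter only stands in for the proposed `--default-old-language-tag` option. Language tags are compared as exact strings, which is a simplification.

```python
from typing import NamedTuple, Optional


class Lit(NamedTuple):
    """Simplified literal: a string value plus an optional language tag."""
    value: str
    lang: Optional[str] = None


ANY_LANG = "*"  # stands for the proposed "@*" wildcard


def matches_explicit(old: Lit, candidate: Lit,
                     default_old_lang: Optional[str] = None) -> bool:
    """Match `candidate` against the old value of a KGCL command.

    No tag on `old` means "tag-less literal only", unless a default old
    language tag has been configured; "*" means "any tag or no tag".
    """
    lang = old.lang if old.lang is not None else default_old_lang
    if old.value != candidate.value:
        return False
    if lang == ANY_LANG:
        return True
    return lang == candidate.lang
```

Passing `default_old_lang="*"` reproduces the liberal "match at the string level" behaviour without hardcoding it in the matching function.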

cmungall commented 8 months ago

Thanks, I think this makes sense, but I'd like to reverse it a little

No language tag ("...") means that any language tag or no language tag matches
A specific language tag ("..."@en) means that only a literal with that language tag matches
If a user wants to match plain literals, and plain literals only, they say "..."^^rdf:PlainLiteral

Here there is a slight impedance mismatch with semantic web standards where there is always a literal type commitment (people get caught by this all the time with sparql queries, a string match "works" on one ontology but not another, without inefficient coercion to strings). However, there is less of an impedance mismatch with user expectations.

balhoff commented 8 months ago

I wouldn't hinge anything on rdf:PlainLiteral, which was dropped in RDF 1.1. In the newer standard a simple literal is short for a literal with datatype xsd:string: https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

You'll see this behavior when RDF passes through Jena, although confusingly the OWL spec was not updated to keep up.
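
To make the RDF 1.1 rule concrete, here is a small sketch of how a literal's effective datatype could be derived. The function name `effective_datatype` is hypothetical; the two IRIs are the standard xsd:string and rdf:langString identifiers.

```python
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def effective_datatype(datatype: Optional[str], lang: Optional[str]) -> str:
    """Return the datatype IRI a literal has under RDF 1.1."""
    if lang is not None:
        # Language-tagged strings always have datatype rdf:langString.
        return RDF_LANGSTRING
    if datatype is None:
        # A simple literal is shorthand for an xsd:string literal.
        return XSD_STRING
    return datatype
```

Under this rule, "the label" and "the label"^^xsd:string are the same literal, so there is no separate "plain" type left to match against.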

gouttegd commented 8 months ago

> No language tag ("...") means that any language tag or no language tag matches

OK.

Regardless of whether ignoring the language tags is the default behaviour (your proposition) or must be explicitly requested (mine, @*), we must also decide what the expected behaviour is when more than one label matches, as in my example:

AnnotationAssertion(rdfs:label EX:0001 "the label")
AnnotationAssertion(rdfs:label EX:0001 "the label"@en)

If the command is rename EX:0001 from "the label" to "the new label", I don’t think there is any problem. It seems clear to me that the result should be:

AnnotationAssertion(rdfs:label EX:0001 "the new label")
AnnotationAssertion(rdfs:label EX:0001 "the new label"@en)

Because:

No language tag on the old value, so we match both existing labels.
No language tag on the new value, so we preserve the existing tags (including the absence of a language tag).

This is also, I believe, what most users would expect, so this is fine.
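
The untagged-to-untagged case can be sketched in a few lines. `Lit` and `rename_untagged` are hypothetical names used only for illustration, with labels modelled as simple (value, language tag) pairs.

```python
from typing import List, NamedTuple, Optional


class Lit(NamedTuple):
    """Simplified literal: a string value plus an optional language tag."""
    value: str
    lang: Optional[str] = None


def rename_untagged(labels: List[Lit], old: str, new: str) -> List[Lit]:
    """Rename every label whose string equals `old`, keeping each label's
    existing language tag (or absence of one)."""
    return [Lit(new, lit.lang) if lit.value == old else lit for lit in labels]


labels = [Lit("the label"), Lit("the label", "en")]
renamed = rename_untagged(labels, "the label", "the new label")
# renamed == [Lit("the new label"), Lit("the new label", "en")]
```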

But what if the command is rename EX:0001 from "the label" to "the new label"@en – that is, with a language tag on the new value (whether it has been explicitly specified by the user, or automatically added because the ontology is configured to do so – as envisioned in your first message)?

The “logical” (but not necessarily sensible) output would be:

AnnotationAssertion(rdfs:label EX:0001 "the new label"@en)
AnnotationAssertion(rdfs:label EX:0001 "the new label"@en)

Because:

No language tag on the old value, so we match both labels;
Language tag on the new value, so we set the language tag as specified.

Here I don’t think this is a desirable behaviour.

Even worse, let’s imagine a term that has labels in several languages, and that in two languages the labels are actually the same string (this won’t be frequent but it may happen; words that are identical across languages are not unheard of). For example, say we have:

AnnotationAssertion(rdfs:label EX:0001 "lion"@en)
AnnotationAssertion(rdfs:label EX:0001 "lion"@fr)

A command like rename EX:0001 from "lion" to "sea lion"@en should not rename the French label to "sea lion"@en! (Arguably ontologies that have multi-language labels should simply forbid the use of untagged values in KGCL commands, and force users to always be explicit.)

I propose something like:

If the new value has a language tag and the old value does not, then we do not look blindly for any matching label regardless of the language tag (as we do in the general case). Instead, we first look for a matching label with the same language tag as the new value, and then (if we don’t find such a label) we look for a matching label without a language tag. We never look for a matching label with a different language tag.

Admittedly this is a bit complicated, but I think that should cover all cases reasonably. For example, given the command rename EX:0001 from "lion" to "sea lion"@en:

if the term has only a matching English label ("lion"@en), it would be renamed into another English label ("sea lion"@en);
if the term has only a matching language-neutral label ("lion"), it would both be renamed and given an English language tag ("sea lion"@en);
if the term has both a matching English label and a matching language-neutral label, only the English label would be renamed into another English-tagged label;
if the term has both a matching English label and another matching label in another language, likewise: only the English label would be renamed.

Overall this should work just fine for ontologies that have a mixture of untagged and tagged labels.
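
The proposed selection rule, together with the four cases above, can be sketched as follows. All names (`Lit`, `select_targets`, `rename`) are hypothetical. The handling of a tagged old value (match that tag only, and keep the tag when the new value has none) is an assumption that goes slightly beyond what is spelled out in this comment.

```python
from typing import List, NamedTuple, Optional


class Lit(NamedTuple):
    """Simplified literal: a string value plus an optional language tag."""
    value: str
    lang: Optional[str] = None


def select_targets(labels: List[Lit], old: Lit, new: Lit) -> List[Lit]:
    """Choose which existing labels a rename command should touch."""
    same_value = [lit for lit in labels if lit.value == old.value]
    if old.lang is not None:
        # Assumption: an explicit old tag matches that tag only.
        return [lit for lit in same_value if lit.lang == old.lang]
    if new.lang is None:
        # General case: no tags anywhere, match at the string level.
        return same_value
    # Old value untagged, new value tagged: prefer the same language...
    same_lang = [lit for lit in same_value if lit.lang == new.lang]
    if same_lang:
        return same_lang
    # ...then fall back to untagged labels; never touch other languages.
    return [lit for lit in same_value if lit.lang is None]


def rename(labels: List[Lit], old: Lit, new: Lit) -> List[Lit]:
    """Apply the rename, preserving existing tags when `new` has none."""
    targets = select_targets(labels, old, new)
    result = []
    for lit in labels:
        if lit in targets:
            lang = new.lang if new.lang is not None else lit.lang
            result.append(Lit(new.value, lang))
        else:
            result.append(lit)
    return result


old, new = Lit("lion"), Lit("sea lion", "en")
both = rename([Lit("lion", "en"), Lit("lion", "fr")], old, new)
# both == [Lit("sea lion", "en"), Lit("lion", "fr")]
```

In this sketch a term that only has "lion"@fr is left untouched by the command, since the French label is neither English nor untagged.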

Aside:

> If a user wants to match plain literals, and plain literals only, they say "..."^^rdf:PlainLiteral

Note that with recent versions of the OWL API, a literal without an explicit datatype (as in "the label") is interpreted as an xsd:string and so is still a typed string, not an rdf:PlainLiteral.

cmungall commented 5 months ago

Sorry for the delay. Thank you for the analysis. I agree with your proposed solution.

INCATools / kgcl

Handling of language tags in KGCL #60