FoodOntology / joint-food-ontology-wg

This is a repository for documents and issues related to the development of interoperable food related ontologies.

Use of language labels and synonyms among the food-related ontologies #25

Open ddooley opened 2 years ago

ddooley commented 2 years ago

Bernd Krieg-Brückner has examined the challenge of semantic web related language tagging. Feedback is appreciated on his research!

He says:

"I have analyzed the problem of translation into other languages (and regional languages/dialects) further, propose a pragmatic solution below, and will investigate further with my collaborators here (notably Michaela), how to interrelate with (and possibly pick up translation data from) WikiData.

The ISO Standardization situation

Language Codes are standardized under

For Country Codes and geographic regions there is ISO 3166-1 [https://en.wikipedia.org/wiki/ISO_3166-1].

Standardization on the Internet: IETF language tags

The above situation is a little confusing at first, but becomes more realistic when considering the de facto standardization on the Internet. HTML/XML, W3C and notably Wikipedia/WikiData support the IETF BCP 47 language tag: "a standardized code or tag that is used to identify human languages in the Internet", cf.

Although tags may be long with a defined syntax, they may be abbreviated; the recommendation is to keep them as short as possible. The different kinds of subtags can be distinguished by their length (number of characters).
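To illustrate how subtags can be told apart by length, here is a minimal Python sketch. The function name `classify_subtags` is hypothetical, the rules are a simplified reading of the BCP 47 syntax, and nothing is checked against the IANA subtag registry, so a result only says which position a subtag occupies, not whether the tag is registered.

```python
def classify_subtags(tag):
    """Label each subtag of a simple language tag by its length and position.

    Simplified assumption: the first subtag is the primary language; extensions,
    private-use subtags and grandfathered tags are not handled.
    """
    parts = tag.split("-")
    result = [("language", parts[0].lower())]
    for sub in parts[1:]:
        if len(sub) == 4 and sub.isalpha():
            # four letters: script (ISO 15924), conventionally title case
            result.append(("script", sub.title()))
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            # two letters or three digits: region, conventionally upper case
            result.append(("region", sub.upper()))
        elif len(sub) == 3 and sub.isalpha():
            # three letters after the primary language: extended-language position
            result.append(("extlang", sub.lower()))
        elif (5 <= len(sub) <= 8 and sub.isalnum()) or (len(sub) == 4 and sub[0].isdigit()):
            # five to eight characters, or four starting with a digit: variant
            result.append(("variant", sub.lower()))
        else:
            result.append(("other", sub))
    return result


print(classify_subtags("de-CH-1996"))
# [('language', 'de'), ('region', 'CH'), ('variant', '1996')]
```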

Terminology and examples:

Apparently, Wikipedia/WikiData uses IETF language tags [the only deviation I found is the tag "simple" meaning "en-simple", while "simple" seems to be documented in IETF to be applicable to any primary language prefix]. Protégé uses/recommends an early version (the present documentation is hopelessly outdated).

BKB's recommendations for language annotations in FoodOn

  1. use label/synonym annotations etc. with IETF language tags; examples:

    • en, en-US, en-CA, de-CH [do not use an ISO 639-3 code if an equivalent ISO 639-1 code exists; e.g. use la instead of lat, de instead of deu]
  2. abbreviate IETF language tags (as in e.g. WikiData)
    !! except for regional languages (with ISO 639-3 code) that are sublanguages of a primary language, which should be kept as prefix (ISO 639-1 code); examples:
    • de-bar [NOT bar] [Rationale: the structuring of the prefix "de" as a quasi-macrolanguage is maintained; it is easily stripped off, but more complex to reconstruct; Contra: Wiki(Data) uses the abbreviation only]
  3. use Country Code subtags (ISO 3166-1) possibly plus (regional) Language Code subtags (ISO 639-3) for "official" written language vs. regional dialect terms; examples:
    • currant[en]: Johannisbeere[de], Ribisel[de-AT]
    • potato pancake[en]: Kartoffelpuffer[de], Reibekuchen[de], Platzki[de-AT], Reiberdatschi[de-bar] [Rationale: de-AT indicates that the term is only used in Austria, not Bavaria etc. de-bar indicates that the term is used throughout the Austro-Bavarian language; this may not always be true, but there is no way to restrict to Bavaria only] [cf. Note below; similarly de-CH, if the term exists there, otherwise de-gsw (or both)]
  4. keep label/synonym annotations for the same primary language together in a separate file; examples:
    • de, de-CH, de-AT, de-bar, … [Rationale: regional languages or dialects are then directly accessible. In Protégé a View/CustomRendering set to "de-AT, de-bar, de" will select the appropriate label, if present, in that order] [it makes sense to keep a regional hierarchy of files, e.g. to keep a folder of all (regional) languages in India]
  5. different spellings should probably be synonyms, not labels, e.g.
    • Platzka[de-AT] for Platzki[de-AT] [the synonym issue is separate from the regional language issue]
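The selection order described in recommendation 4 ("de-AT, de-bar, de") and the prefix stripping mentioned in recommendation 2 can be sketched in Python as follows. This is only an illustration over plain (text, tag) pairs, not how Protégé implements its rendering; the names `pick_label` and `strip_primary_prefix` are hypothetical, and one label per tag is assumed.

```python
def pick_label(labels, preferences):
    """Return the first label whose tag matches a preference, in the given order."""
    # Language tags are case-insensitive, so compare in lower case.
    by_tag = {tag.lower(): text for text, tag in labels}
    for wanted in preferences:
        text = by_tag.get(wanted.lower())
        if text is not None:
            return text
    return None


def strip_primary_prefix(tag):
    """Turn 'de-bar' into 'bar' (the abbreviated form); leave 'de-AT' untouched."""
    parts = tag.split("-")
    if len(parts) >= 2 and len(parts[1]) == 3 and parts[1].isalpha():
        return "-".join(parts[1:])
    return tag


# Example data taken from the recommendations above.
potato_pancake = [
    ("Kartoffelpuffer", "de"),
    ("Platzki", "de-AT"),
    ("Reiberdatschi", "de-bar"),
]
currant = [("Johannisbeere", "de"), ("Ribisel", "de-AT")]

print(pick_label(potato_pancake, ["de-AT", "de-bar", "de"]))  # Platzki
print(pick_label(currant, ["de-bar", "de"]))                  # Johannisbeere
print(strip_primary_prefix("de-bar"))                         # bar
```

Going the other way (from "bar" back to "de-bar") would need a lookup table of regional languages and their primary language, which is the asymmetry the rationale for recommendation 2 points at.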

Note (mainly, but not only, for Germans)

There is a confusion between the written and spoken language Bavarian (Bairisch/Boarisch) in Wikipedia. There are special pages for Bavarian (and WikiData has special entries under https://bar.wikipedia.org/), where the language is "written" in a (to my knowledge) non-standard transliteration of spoken "Boarisch". It includes a relevant page on "kitchen vocabulary" - https://bar.wikipedia.org/wiki/Austro-Boarischa_Kuchlwoatschotz - where the transliterated/spoken language "Boarisch" is contrasted against a "Schriftsproch" meaning a written form of "Bairisch" intelligible by Germans (whereas "Boarisch" is quite unintelligible unless one knows it from years of experience). Terms used only in Austria are marked "Ö" and those only in Bavaria are marked "B". [Actually, that article is written in "Ostmiddlboarisch" or "Weanarisch" [Ostmittelbairisch], a dialect spoken in Vienna; whether this is appropriate for all "Boarisch" is another matter.] My recommendations above refer to "Bairisch" (with code de-bar [or bar, resp.]).

ddooley commented 2 years ago

Note that OBOFoundry now officially recommends that labels etc. have language tags rather than "string" datatype. https://github.com/OBOFoundry/OBOFoundry.github.io/issues/479 . I'm checking if there are any other restrictions on language tag content with respect to protege or processing tools. I recall there might be a curve ball about what is typically accepted by some software.

ddooley commented 2 years ago

One Protégé update relevant to this is https://github.com/protegeproject/protege/issues/784 . It requires downloading Protégé 5.6.0; however, Stanford's default download for macOS is still 5.5.

oldskeptic commented 2 years ago

Most of this is regulated by the rdfs W3C Recommendation, which points to BCP 47, which references RFC 5646, which "recommends" ISO 639-1 (Lang Code), ISO 3166-1 (Region / Dialect Codes) and ISO 15924 (Script Codes).

At issue is that the RFCs are recommendations and people tend to implement them "their way". I note the following in RFC 5646: "de-CH-1996" represents German as used in Switzerland and as written using the spelling reform beginning in the year 1996 C.E., which is great for hardcore language nerds, but this has no support that I know of. Same for ISO 8601 time periods.

As of this ticket, librdf/raptor correctly parses ISO3166-1 regions but users of Virtuoso may encounter bumps in some circumstances (openlink/virtuoso-opensource#710). I've seen some toolchains die on anything but two letter language codes.
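For toolchains that only accept shorter tags, one possible workaround is to try progressively truncated tags, in the spirit of the lookup scheme in RFC 4647. The sketch below is a simplified illustration, not something any of the tools above is known to do; `fallback_chain` is a hypothetical name.

```python
def fallback_chain(tag):
    """List a tag followed by its progressively shorter prefixes."""
    parts = tag.split("-")
    chain = []
    while parts:
        chain.append("-".join(parts))
        parts = parts[:-1]
        # A single-character subtag (such as "x") only introduces what follows it,
        # so it is dropped together with the subtag that was just removed.
        if parts and len(parts[-1]) == 1:
            parts = parts[:-1]
    return chain


print(fallback_chain("de-CH-1996"))  # ['de-CH-1996', 'de-CH', 'de']
```

The last entry is always the bare primary language subtag, which is the form a two-letter-only toolchain would accept when an ISO 639-1 code exists.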

With respect to https://github.com/OBOFoundry/OBOFoundry.github.io/issues/479, rdfs W3C Recommendation states that language-typed literals with xml:lang tags are implied to be langString typed. Most parsers are very loose about this situation. A blank xml:lang remains useful for cases where a term is language (but not script) agnostic, such as some personal names.

I agree with Bernd that there is a difference between a translation and a synonym. I would also add context to the mix: most crops / varietals / cultivars have common names, trade (commercial) names and scientific names, and occasionally the registration / patent number is used. Common names are locale specific, trade names and registration numbers are jurisdiction specific, and scientific names could be Latin / Greek if someone has gotten around to it.

I personally prefer a single rdfs:label or skos:prefLabel per term per language tag as it allows simple SPARQL queries without unintended projection snafus. Putting a single localized label on the screen for the user to read is the 99% use case and should be as easy as possible.
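A quick way to check the "single label per term per language tag" convention is to group labels by (term, tag) and report any group with more than one entry. The Python sketch below works on plain tuples rather than a real RDF graph; the function name `duplicate_labels` and the term identifier `ex:potato_pancake` are made up for illustration.

```python
from collections import defaultdict


def duplicate_labels(rows):
    """Map each (term, tag) pair that carries more than one label to its labels."""
    seen = defaultdict(list)
    for term, text, tag in rows:
        # Language tags are case-insensitive, so normalise before grouping.
        seen[(term, tag.lower())].append(text)
    return {key: texts for key, texts in seen.items() if len(texts) > 1}


# Hypothetical rows: (term identifier, label text, language tag)
rows = [
    ("ex:potato_pancake", "Kartoffelpuffer", "de"),
    ("ex:potato_pancake", "Reibekuchen", "de"),
    ("ex:potato_pancake", "Platzki", "de-AT"),
]

print(duplicate_labels(rows))
# {('ex:potato_pancake', 'de'): ['Kartoffelpuffer', 'Reibekuchen']}
```

A term flagged this way would need one of its labels demoted to a synonym before a simple "one row per term per language" query behaves as expected.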

SKOS treats synonyms as skos:altLabels, which is a great way of handling the above constraints. My concern is that in most cases synonyms belong to a specific context, which is impossible to record with a Literal.

SKOS-XL is attractive because skos-xl:Label can be used to assign provenance and / or context to the term. Unfortunately, a few disjointness axioms and restrictions on it make it hard to use for recording a full multilingual vocabulary. FIBO solves this type of problem with the use of tags and a node within a scheme for the identifier. Again, your mileage may vary.

As a general rule when handling nomenclature: The thing and the name of the thing are two different things.

ddooley commented 2 years ago

Alan Ruttenberg has just commented about https://www.w3.org/International/questions/qa-choosing-language-tags , mentioning it recommends RFC 5646 which you link to above, Rob!

oldskeptic commented 2 years ago

I'd also like to recommend http://www.lexvo.org/ which provides a full graph of languages and scripts along with labels for each language in... every other language. This makes it incredibly useful for multilingual UI work.