Improved tagging of language for literals (includes script and transliteration)

michelleif commented 2 years ago

Background

Currently in Sinopia, a user can specify an RDF language tag for a literal from ISO-639-2 or choose "No language specified".

This ticket describes improvements to meet the user community's need to specify script and transliteration in addition to language, and to be compliant with BCP-47. (This ticket is based on previous discussions in #2026 and #2101; the PCC's BIBFRAME And MARC Bibliographic Encoding for Languages (BABEL) Final Report; and the Non-Latin Script Affinity Group: Discussions and Recommendations for Common Practices.)

Feature request

For literal and label fields, change the language modal to the below design to allow catalogers to tag script and transliteration in addition to language. Script and transliteration can only be selected if a language has been selected. (Note: "Languages" should read "Language")
For choice of language, use the IANA Language subtag registry tags with type=language (Note: this is a change from our current Language dropdown which offers ISO-639-2). See question below about how to display.
For choice of script, use the IANA Language subtag registry tags with type=script. Default to "No script specified". See question below about how to display.
For choice of transliteration, offer the following choices: alaloc, ewts, buckwalt, mns, satts, iso, iast, and pinyin. Default to "No transliteration specified".
After selections are made, Sinopia adds language tags and subtags to the RDF for the field following https://datatracker.ietf.org/doc/html/rfc6497#section-2.2.
- Examples using the literal Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu:
- no language, script, or transliteration specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”
- language specified (Japanese (ja)); script and transliteration not specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”@ja
- language and script (Latin (Latn)) specified, transliteration not specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”@ja-Latn
- language, script, and transliteration (American Library Association-Library of Congress (alaloc)) specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”@ja-Latn-t-ja-m0-alaloc (per https://datatracker.ietf.org/doc/html/rfc6497#section-2.2: 't' specifies transformed content, then follows the source language tag, m0 is a field separator for transliteration, then follows the transliteration scheme tag)
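
A minimal sketch of the tag construction described above, written in Python for illustration only (Sinopia itself is not Python). The function names `build_language_tag` and `format_literal` are hypothetical, and the sketch assumes that a transliteration may be chosen without a script, which the examples above do not cover. In line with the out-of-scope list, it does not validate the selections.

```python
from typing import Optional


def build_language_tag(language: Optional[str] = None,
                       script: Optional[str] = None,
                       transliteration: Optional[str] = None) -> Optional[str]:
    """Combine the modal's selections into a BCP 47 tag (hypothetical helper).

    Returns None when no language is selected.
    """
    if not language:
        # The ticket only allows script/transliteration once a language is chosen.
        if script or transliteration:
            raise ValueError("script and transliteration require a language")
        return None

    tag = language
    if script:
        tag += f"-{script}"
    if transliteration:
        # RFC 6497 section 2.2: 't' marks transformed content, followed by the
        # source language tag, then 'm0' and the transliteration scheme.
        tag += f"-t-{language}-m0-{transliteration}"
    return tag


def format_literal(text: str, tag: Optional[str]) -> str:
    """Render a literal with an optional language tag, Turtle-style."""
    quoted = f'"{text}"'
    return f"{quoted}@{tag}" if tag else quoted


title = "Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu"
print(format_literal(title, build_language_tag()))
print(format_literal(title, build_language_tag("ja")))
print(format_literal(title, build_language_tag("ja", "Latn")))
print(format_literal(title, build_language_tag("ja", "Latn", "alaloc")))
```

The four calls correspond to the four example cases listed above, in the same order.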

Question

How to display the choices for language and script:
- Display only the Description from the IANA Registry? Example: Japanese
- Display only the Subtag from the IANA Registry? Example: ja
- Display both? Example Japanese (ja) or ja (Japanese).
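
To make the three display options concrete, here is a rough Python sketch that reads records in the IANA registry's record-jar layout and renders each option. The excerpt in `SAMPLE_REGISTRY` is heavily abridged (real records carry more fields, may repeat `Description`, and may wrap long values onto continuation lines, none of which this sketch handles), and `parse_registry` and `choices` are hypothetical names.

```python
# Abridged excerpt in the registry's record-jar layout (records separated by %%).
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: ja
Description: Japanese
%%
Type: script
Subtag: Latn
Description: Latin
%%
Type: region
Subtag: JP
Description: Japan
"""


def parse_registry(text):
    """Split the registry text into a list of {field: value} records."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                # Keep only the first value for a repeated field (simplification).
                record.setdefault(key.strip(), value.strip())
        if record:
            records.append(record)
    return records


def choices(records, subtag_type, style="both"):
    """Build dropdown labels for one record type ('language' or 'script')."""
    labels = []
    for record in records:
        if record.get("Type") != subtag_type:
            continue
        subtag, description = record["Subtag"], record["Description"]
        if style == "description":
            labels.append(description)
        elif style == "subtag":
            labels.append(subtag)
        else:
            labels.append(f"{description} ({subtag})")
    return labels


records = parse_registry(SAMPLE_REGISTRY)
for style in ("description", "subtag", "both"):
    print(style, choices(records, "language", style), choices(records, "script", style))
```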

Out of scope for this ticket

any checking that the cataloger's selection of language, script, and transliteration is valid
automatic detection of language and script of a literal (though we are doing a short exploration of whether this is feasible for Sinopia, see #2876)
default tag for script or transliteration (the request to have default transliteration tag be alaloc is noted, but if we set that default, it will be present even when there is no script specified)
handling of strings with mixed languages or scripts
bulk updating of ISO 639-2 language tags in existing Sinopia resources, when differing from IANA tags
conditionally narrowing the list of languages, scripts or transliteration schemes based on selections made in any of those fields
omitting the script tag from the RDF for those language/script combinations that have "suppress-script" specified in the IANA registry

justinlittman commented 2 years ago

Additional idea: Display the constructed language tag as the user is making selections. So, for example, if the user selects Japanese (ja) and Latin (Latn), ja-Latn would be displayed.

larkot567 commented 2 years ago

The proposed improvements look great, thank you very much!

Comment: In two different places in the text, the LC transliteration scheme appears as "alalc" and "alaloc". I think "alalc" is preferable (shorter, anyway).

For the question - how to display the choices of the language and the script in RDF: humans would appreciate more explicit data, but machines can work with short codes. Since humans will be looking at the RDF for some time during this project, maybe the third option "Display both?" could be implemented?

I liked Justin's idea of combined language and script after a user makes selections.

Larisa Walsh

michelleif commented 2 years ago

thank you @larkot567 , the alalc is an error, per https://github.com/unicode-org/cldr/blob/main/common/bcp47/transform.xml#L14 it is alaloc. i'll edit the ticket description above.

justinlittman commented 2 years ago

IANA uses "en" for English, not "eng". How do we want to handle all of the existing resources?

larkot567 commented 2 years ago

That was my point too. Also, Russian is "ru" ("rus" in ISO and MARC), Chinese is "zh" (not "chi"), Spanish is "es" (not "spa"), and German is "de" (not "ger"; related varieties have several codes of their own, such as "gml" for Middle Low German). Can we do an exact mapping of MARC codes to IANA? MARC is not that granular, so in many cases it will not be possible. If the future linked data editor will be using IANA language subtags, I wonder whether introducing ISO-639-3 instead of MARC is the right thing to do (referring to the PCC BABEL group and the testing of codes by the PCC community that is currently happening).
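
For discussion, a small Python sketch of what a lookup from existing three-letter codes to IANA language subtags could look like. The table `ISO_639_2_TO_IANA` is a hand-picked subset for illustration, not an authoritative or complete mapping, and `to_iana_subtag` is a hypothetical helper. Codes missing from the table come back as `None` so they can be reviewed by hand rather than guessed.

```python
from typing import Optional

# Illustrative subset only: three-letter codes mentioned in this thread,
# paired with the two-letter subtags the IANA registry uses for them.
ISO_639_2_TO_IANA = {
    "eng": "en",
    "rus": "ru",
    "chi": "zh",
    "spa": "es",
    "ger": "de",
    "jpn": "ja",
}


def to_iana_subtag(code: str) -> Optional[str]:
    """Return the IANA language subtag for a three-letter code, if known."""
    return ISO_639_2_TO_IANA.get(code.strip().lower())


def remap_tags(existing_tags):
    """Split existing tags into (mapped, unmapped) for manual follow-up."""
    mapped, unmapped = {}, []
    for tag in existing_tags:
        subtag = to_iana_subtag(tag)
        if subtag is None:
            unmapped.append(tag)
        else:
            mapped[tag] = subtag
    return mapped, unmapped


mapped, unmapped = remap_tags(["eng", "ger", "xxx"])
print(mapped)
print(unmapped)
```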

LD4P / sinopia_editor

Improved tagging of language for literals (includes script and transliteration) #3318