LD4P / sinopia_editor

Sinopia Linked Data Editor
https://sinopia.io/
Apache License 2.0

Improved tagging of language for literals (includes script and transliteration) #3318

Closed michelleif closed 2 years ago

michelleif commented 2 years ago

Background

Currently in Sinopia a user can specify an RDF language tag for a literal from ISO-639-2 or choose "No language specified".

This ticket describes improvements to meet the user community's need to specify script and transliteration in addition to language, and to be compliant with BCP-47. (This ticket is based on previous discussions in #2026 and #2101; the PCC's BIBFRAME And MARC Bibliographic Encoding for Languages (BABEL) Final Report; and the Non-Latin Script Affinity Group: Discussions and Recommendations for Common Practices.)

Feature request

  1. For literal and label fields, change the language modal to the below design to allow catalogers to tag script and transliteration in addition to language. Script and transliteration can only be selected if a language has been selected. (Note: "Languages" should read "Language".) [Image: mockup of the proposed language modal]
  2. For choice of language, use the IANA Language subtag registry tags with type=language (Note: this is a change from our current Language dropdown which offers ISO-639-2). See question below about how to display.
  3. For choice of script, use the IANA Language subtag registry tags with type=script. Default to "No script specified". See question below about how to display.
  4. For choice of transliteration, offer the following choices: alaloc, ewts, buckwalt, mns, satts, iso, iast, and pinyin. Default to "No transliteration specified".
  5. After selections are made, Sinopia adds language tags and subtags to the RDF for the field following https://datatracker.ietf.org/doc/html/rfc6497#section-2.2.
    • Examples using the literal Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu:
    • no language, script, or transliteration specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”
    • language specified (Japanese (ja)); script and transliteration not specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”@ja
    • language and script (Latin (Latn)) specified, transliteration not specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”@ja-Latn
    • language, script, and transliteration (American Library Association-Library of Congress (alaloc)) specified: “Nihon Chizu Kabushiki Kaisha. Ōsaka-shi annaizu”@ja-Latn-t-ja-m0-alaloc (per https://datatracker.ietf.org/doc/html/rfc6497#section-2.2: 't' specifies transformed content, then follows the source language tag; 'm0' is a field separator for transliteration, then follows the transliteration scheme tag)
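
The tag construction in item 5 could be sketched roughly as follows. This is a minimal Python illustration, not Sinopia's actual code: the function name `build_language_tag` and the constant `TRANSLITERATION_SCHEMES` are hypothetical, and it assumes (following the example above) that the source language inside the 't' extension is the same as the selected language.

```python
from typing import Optional

# Transliteration scheme codes listed in item 4 of the feature request.
TRANSLITERATION_SCHEMES = {
    "alaloc", "ewts", "buckwalt", "mns", "satts", "iso", "iast", "pinyin",
}


def build_language_tag(
    language: Optional[str] = None,
    script: Optional[str] = None,
    transliteration: Optional[str] = None,
) -> Optional[str]:
    """Return the language tag for the selections, or None if no language is chosen.

    Hypothetical helper: mirrors the four examples in the ticket.
    """
    if not language:
        # Script and transliteration can only be selected once a language is selected.
        if script or transliteration:
            raise ValueError("script and transliteration require a language")
        return None
    if transliteration and transliteration not in TRANSLITERATION_SCHEMES:
        raise ValueError(f"unknown transliteration scheme: {transliteration}")

    tag = language
    if script:
        tag += f"-{script}"
    if transliteration:
        # RFC 6497: 't' introduces transformed content, followed by the source
        # language tag; 'm0' introduces the transform mechanism (the scheme).
        # Assumption: the source language equals the selected language.
        tag += f"-t-{language}-m0-{transliteration}"
    return tag


print(build_language_tag("ja", "Latn", "alaloc"))
```

Because the function is pure, the same call could also drive a preview of the tag while the user is still making selections.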

Question

Out of scope for this ticket

justinlittman commented 2 years ago

Additional idea: Display the constructed language tag as the user is making selections. So, for example, if the user selects Japanese (ja) and Latin (Latn), ja-Latn would be displayed.

larkot567 commented 2 years ago

The proposed improvements look great, thank you very much!

Comment: In two different places in the text the LC transliteration scheme appears as "alalc" and "alaloc". I think "alalc" is preferable (shorter, anyway).

For the question - how to display the choices of the language and the script in RDF: humans would appreciate more explicit data, but machines can work with short codes. Since humans will be looking at the RDF for some time during this project, maybe the third option "Display both?" could be implemented?

I liked Justin's idea of combined language and script after a user makes selections.

Larisa Walsh

michelleif commented 2 years ago

Thank you @larkot567, the "alalc" is an error; per https://github.com/unicode-org/cldr/blob/main/common/bcp47/transform.xml#L14 it is "alaloc". I'll edit the ticket description above.

justinlittman commented 2 years ago

IANA uses "en" for English, not "eng". How do we want to handle all of the existing resources?

larkot567 commented 2 years ago

That was my point too. Also: Russian is "ru" ("rus" in ISO and MARC), Chinese is "zh" (not "chi"), Spanish is "es" (not "spa"), and German is "de" (German has several related codes, actually), not "ger". Can we do an exact mapping of MARC codes to IANA? MARC is not that granular, so in many cases it will not be possible. If the future linked data editor will be using IANA language subtags, I wonder whether introducing ISO-639-3 instead of MARC is the right thing to do (referring to the PCC BABEL group and the testing of codes by the PCC community that is currently happening).
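
One possible way to approach existing resources is a lookup table that converts the codes that map cleanly and flags the rest for review. The Python sketch below is only an illustration: the names `MARC_TO_IANA`, `migrate_language_code`, and `split_by_mapping` are hypothetical, and the table covers just a handful of codes (mostly those mentioned in this thread) rather than a complete or authoritative mapping. The German entry uses "de", the IANA registry's subtag for German.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Illustrative subset only: three-letter ISO 639-2/B (MARC-style) codes
# mapped to IANA language subtags. A real migration would need a full table.
MARC_TO_IANA: Dict[str, str] = {
    "eng": "en",
    "rus": "ru",
    "chi": "zh",
    "spa": "es",
    "ger": "de",
    "jpn": "ja",
}


def migrate_language_code(code: str) -> Optional[str]:
    """Return the IANA subtag for a three-letter code, or None if unmapped."""
    return MARC_TO_IANA.get(code.lower())


def split_by_mapping(codes: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Separate codes that can be converted from those needing manual review."""
    migrated: Dict[str, str] = {}
    unmapped: List[str] = []
    for code in codes:
        target = migrate_language_code(code)
        if target is None:
            unmapped.append(code)
        else:
            migrated[code] = target
    return migrated, unmapped


migrated, unmapped = split_by_mapping(["eng", "chi", "xxx"])
print(migrated, unmapped)
```

Keeping the unmapped codes in a separate list reflects the concern above that an exact mapping will not always be possible.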