NeTEx-CEN / NeTEx

NeTEx is a CEN Technical Standard for exchanging Public Transport schedules and related data.
http://netex-cen.eu
GNU General Public License v3.0

A discussion on MultilingualString #558

Open skinkie opened 7 months ago

skinkie commented 7 months ago

At this moment a MultilingualString is actually not that. It is obviously a string, but there is nothing multilingual about it. It does not contain any translations for example. @JohanEntur wrote he would expect something like this to happen in NeTEx:

<names>
   <name lang="no">Oslo</name>
   <name lang="sv">Oslo</name>
   <name lang="en">Oslo</name>
   <name lang="nl">Oslo</name>
   <name lang="es">Oslo</name>
</names>

I think this is not what we should do. I would be in favor of the following: making MultilingualString a standard structure that directly facilitates translations. AlternativeText would still be useful for aliases and variants, but a single variant would host its own translated variations:

<Name>
   <Text lang="no">Oslo</Text>
   <Text lang="sv">Oslo</Text>
   <Text lang="en">Oslo</Text>
   <Text lang="nl">Oslo</Text>
   <Text lang="es">Oslo</Text>
</Name>

JohanEntur commented 7 months ago

This is for everything, really. I'm coming from the perspective of StopPlace here. We use name and description (StopPlace and Quay). But I'm also thinking about countries with other scripts, like Cyrillic, where it is completely natural to have all fields in both scripts plus translations.

I guess the problem is endemic. How about Notice?

Aurige commented 7 months ago

The MultilingualString was initially designed to carry the lang (any lang) information associated with a text, but not translations. Translations are expected to be provided by AlternativeName and AlternativeText (AlternativeName is also used where multiple names are possible, but the NameType attribute clearly states when it is a translation). @JohanEntur you can have an AlternativeText for a Notice, and for languages like Japanese, the following should give you "畑 はたけ":

<AlternativeText attributeName="Text" id="myDataSpace:Notice:657899" order="1">
   <Text lang="jp">&#x7551; &#x306F;&#x305F;&#x3051;</Text>
</AlternativeText>

JohanEntur commented 7 months ago

But in Belgium. Is French the language, or the Alternative Language?

Aurige commented 7 months ago

It depends on where in Belgium you are ;-) But in any case, this is not the "official" language, just the one used as default (you need a default one), all others being alternatives.

nick-knowles commented 7 months ago

It is certainly more modular to have the translations inlined within the term they translate, rather than kept separately.

However, I suspect the majority of text usage is monolingual in the default language, so I don't think we want to complicate 95% of usage just to cover the edge case. Therefore, as long as we do not require a Text wrapper tag within each MultilingualString, it would be okay to also allow Text children, but it would be clearer to wrap the translations within their own tag, as we generally do for repeated children.

<Name>Copenhagen
   <translations>
      <Text lang="de">Kopenhagen</Text>
      <Text lang="dk">København</Text>
      <Text lang="en">Copenhagen</Text>
      <Text lang="fr">Copenhague</Text>
      <Text lang="ru">Копенгаген</Text>
      <Text lang="se">Köpenhamn</Text>
   </translations>
</Name>

and also

<Name lang="dk">København
   <translations>
      <Text lang="de">Kopenhagen</Text>
      <Text lang="en">Copenhagen</Text>
      <Text lang="fr">Copenhague</Text>
      <Text lang="ru">Копенгаген</Text>
      <Text lang="se">Köpenhamn</Text>
   </translations>
</Name>

JohanEntur commented 7 months ago

I would counter that most usage is monolingual because the alternative model is not appealing to use.

It's hardly an edge case to have multiple text strings, and often there is no default language. In Norway, we have 2 official languages (Norwegian and Sami) where Norwegian has 2 official subsets, and the Sami have I believe 4 or 6. Belgium has I believe 3 official languages, Spain, Ireland, Wales, and Scotland also have widespread minority languages. Then we have the problem with Bulgaria which needs to make data available in two scripts, Latin and Cyrillic. Most likely several countries will want to add English translations to their data to cater to tourists.

I suspect that requirements to publish data in all official languages + languages that are useful internationally will grow as data become better and better in the future, and support for minority languages in minority regions will be persistent.

I have an example as well. Our stops database (NSR) is national, so the dataset as a whole would have a "default" language - naturally. This is currently Norwegian bokmål (nob). But some stops in the north have only Sami names. Since I can't leave the name field empty, I have to place the Sami name in the name field and leave the Sami translation (which other stops have) empty.

For this reason, I had to concoct this rather convoluted rule: If a Norwegian name exists, it should be recorded in the main name field. If no Norwegian name exists, the local name in any language shall apply to the main name field. We also have stops in neighbouring countries. For Sweden, it's not really a problem, but for stops in Finland, our default-Norwegian registry has to either put our Norwegian translation in the name field, or write the correct name (Rovaniemi autoasema, Му́рманск) in the name field and use the translation fields to add Norwegian translations.

This is the complete ruleset I've established to cover the situation for now (lovingly translated by ChatGPT):


4.4.11 Stop names in other languages

In NSR, all stops have a main name field. This is a monolingual field and must always be filled in accordance with the rules for naming stops. If a stop has a name in another language, the following rules shall apply:

If a Norwegian name exists, it should be recorded in the main name field. If no Norwegian name exists, the local name in any language shall apply to the main name field.

Exceptions can be made if the local name makes the stop unusable for the general public where understanding the text is important. This could typically be due to a non-Latin character set or words that Norwegian audiences cannot be expected to understand.

Example: Му́рманск, Rovaniemi linja-autoasema

Names in other languages where a Norwegian name also exists should be recorded in the ALTERNATIVE NAMES field, where the name can be coded as TRANSLATION along with the respective language.

Translations in many languages are supported, but only one per language.

Both Bokmål and Nynorsk are considered Norwegian.

If unofficial non-Norwegian names of stops are to be registered, the coding ALIAS is used along with the relevant language following the same principles as aliases in Norwegian.

The spelling of alternative names (aliases, translations, etc.) primarily follows the guidelines of the stop's name field. Local spellings should be adhered to as much as possible. The main meaning or reference of the original name should be retained.

4.4.11.1 Sami and Kven names in administrative areas

Stop names in Sami or Kven should always be registered as alternative names if the stop also has a Norwegian name. If the stop only has a non-Norwegian name, alternative names should not be used.
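
As a rough illustration of the rules above, here is a sketch of how a stop with both a Norwegian and a Sami name might be encoded with the existing AlternativeName mechanism. The id and version values are made up for the example, the language codes are ISO 639-3 (nob for Bokmål, sme for Northern Sami), and the element layout is an assumption based on the AlternativeName/NameType structure mentioned earlier in this thread, not a fragment validated against the schema.

```xml
<!-- Hypothetical StopPlace; id and version are invented for illustration -->
<StopPlace id="EXAMPLE:StopPlace:1" version="1">
   <!-- Rule: a Norwegian name exists, so it goes in the main name field -->
   <Name lang="nob">Kautokeino</Name>
   <alternativeNames>
      <!-- The Sami name is recorded as a translation of the main name -->
      <AlternativeName>
         <NameType>translation</NameType>
         <Name lang="sme">Guovdageaidnu</Name>
      </AlternativeName>
   </alternativeNames>
</StopPlace>
```

A stop that has only a Sami name would instead carry that name directly in Name (with lang="sme") and have no alternativeNames entry, so the same language ends up in different places depending on the stop.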


The easiest way would be to let all text strings accept multiple inputs, where each input has either language or script as an attribute. I know it's a lot to ask, but if NeTEx is to grow up to be popular in Europe, this kind of thing cannot be neglected.

ue71603 commented 6 months ago

@JohanEntur To consider: This is "A" way of doing things, and it might be easier in some ways. However, string replacement done the way it is currently done is neither new nor special. This is exactly the way things were done in all C programs for simple multi-language support (https://localazy.com/blog/make-multi-language-application-in-c-gettext-localazy). The problem you face is that it is not easily readable. But NeTEx is not meant for human-readable consumption. And once AlternativeText and AlternativeName are implemented, one does not have to care.

JohanEntur commented 6 months ago

Thanks for clarifying

But it's less about easy reading and more about the idea that something is default and others are alternative, which I believe adds fictitious roles to each instance of translation.

To illustrate with a hypothetical default Swedish dataset:

Mattias Gynter Johann Buchthein

Now our Norwegian dataset:

Kautokeino Mattias Günter Mattias Gynter Kautokeino Guovdageaidnu

Also, `ISO 639-3` is required to code the Sami languages.

I have to say I feel quite uncomfortable trying to argue points of data structure against you guys, because I have no expertise at all, as you know, while you have immense knowledge about these kinds of things. I just hope you see my point in the matter. The model should be straightforward and generic and not force users to jump through hoops.

nick-knowles commented 6 months ago

Note that semantically we consider AlternativeName to be distinct from AlternativeText

  1. ALTERNATIVE TEXT is a simple string that can be used to provide a translation of any text attribute, including descriptions, notes, etc. It is a single normalized text string (i.e. no line feeds, carriage returns, etc.). There are some edge cases, for example when a name is to come into use on a certain date, so a validity condition can be specified.

There is a particular subtlety, though. The ALTERNATIVE TEXT is typically keyed within its parent element on a "use for language", rather than on the actual language of the alternative text, which is given on the multilingual string on the TEXT attribute within ALTERNATIVE TEXT (the two will usually be the same but may differ). This is because the overall use case is "If I am assembling the texts to present to the user in the user interface, which language should I use?" In particular: what should I do if some elements are translated into my target language but others are not? This can be politically sensitive in bilingual countries (the NeTEx mechanism was taken from a very general UN model for the use of alternative texts...). So you can, for example, have the case (a): the default language is Flemish, the base text of a given element is French, and there are translations in English, French and German. If you are presenting the UI in Flemish and there is no Flemish translation, use the English rather than the French. That said, in most cases the "language to use" is the same as the language of the text.

  2. ALTERNATIVE NAME is used for significant named elements, such as stops and places, that may have aliases, e.g. Kings Cross Station, London | London Kings Cross | London Kings X. It is only available for a limited number of first-class entities. It can have a set of further structured properties, such as a qualifier and an abbreviation, that can be translated together as a coherent set of text values.
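
Case (a) above can be sketched as follows. This assumes the useForLanguage and attributeName attributes used elsewhere in this thread; the Notice id, version, and texts are invented for illustration, and nl is used here as the language code for Flemish (Dutch).

```xml
<!-- Hypothetical Notice whose base text is French -->
<Notice id="EXAMPLE:Notice:1" version="1">
   <alternativeTexts>
      <!-- Language of the text and "use for" language agree -->
      <AlternativeText useForLanguage="en" attributeName="Text">
         <Text lang="en">Lift out of service</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="de" attributeName="Text">
         <Text lang="de">Aufzug außer Betrieb</Text>
      </AlternativeText>
      <!-- No Flemish translation exists: a Flemish UI is pointed at the English text -->
      <AlternativeText useForLanguage="nl" attributeName="Text">
         <Text lang="en">Lift out of service</Text>
      </AlternativeText>
   </alternativeTexts>
   <Text lang="fr">Ascenseur hors service</Text>
</Notice>
```

The last AlternativeText is the interesting one: its useForLanguage and the lang of its Text differ, which is how the data producer, not the consuming application, decides the fallback.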

So to be clear, as I understand it the discussion here is just about better ways of encoding AlternativeText so that the translations can be included inline with each text attribute rather than as a set at the beginning of the parent element - this is just for syntactic prettiness (and more concise)

I.e., instead of the following, as at present, where all the translations for all the different attributes have to be in a pool at the beginning:

<FareScheduledStopPoint id="uic:6050a" version="01">
   <alternativeTexts>
      <!-- Translations for FareScheduledStopPoint.Name -->
      <AlternativeText useForLanguage="en" attributeName="Name">
         <Text lang="en" textIdType="translation">Copenhagen</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="en" attributeName="Name">
         <Text lang="en" textIdType="alias">Central Copenhagen</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="dk" attributeName="Name">
         <Text textIdType="translation" lang="dk">København</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="fr" attributeName="Name">
         <Text textIdType="translation" lang="fr">Copenhague</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="ru" attributeName="Name">
         <Text textIdType="translation" lang="ru">Копенгаген</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="se" attributeName="Name">
         <Text textIdType="translation" lang="se">Köpenhamn</Text>
      </AlternativeText>
      <!-- Translations for FareScheduledStopPoint.Description -->
      <AlternativeText useForLanguage="en" attributeName="Description">
         <Text textIdType="translation" lang="en">Capital of Denmark</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="fr" attributeName="Description">
         <Text textIdType="translation" lang="fr">Capitale du Danemark</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="it" attributeName="Description">
         <Text textIdType="translation" lang="fr">Capitale du Danemark</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="de" attributeName="Description">
         <Text textIdType="translation" lang="de">Hauptstadt von Dänemark</Text>
      </AlternativeText>
      <!-- Translations for FareScheduledStopPoint.NameOnRouting -->
      <AlternativeText useForLanguage="de" attributeName="NameOnRouting">
         <Text textIdType="translation" lang="de">Kpnhgn</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="fr" attributeName="NameOnRouting">
         <Text textIdType="translation" lang="fr">Cpnhge</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="en" attributeName="NameOnRouting">
         <Text textIdType="translation" lang="en">Kbnhvn</Text>
      </AlternativeText>
   </alternativeTexts>
   <Name lang="dk">København</Name>
   <ShortName>CPH</ShortName>
   <Description lang="dk">Danmarks hovedstad</Description>
   <VehicleModes>rail coach bus</VehicleModes>
   <CountryRef ref="dk"/>
   <NameOnRouting lang="dk">Kophag</NameOnRouting>
</FareScheduledStopPoint>

We would instead make MultilingualString into a mixed data structure that allowed the following:

<FareScheduledStopPoint id="uic:6050a" version="01">
   <Name lang="dk">København
      <alternativeTexts>
         <Text lang="en">Copenhagen</Text>
         <Text lang="de">Kopenhagen</Text>
         <Text lang="fr">Copenhague</Text>
         <Text lang="ru">Копенгаген</Text>
         <Text lang="se">Köpenhamn</Text>
      </alternativeTexts>
   </Name>
   <ShortName>CPH</ShortName>
   <Description lang="dk">Danmarks hovedstad
      <alternativeTexts>
         <Text lang="en">Capital of Denmark</Text>
         <Text lang="fr">Capitale du Danemark</Text>
         <Text lang="de">Hauptstadt von Dänemark</Text>
      </alternativeTexts>
   </Description>
   <VehicleModes>rail coach bus</VehicleModes>
   <CountryRef ref="dk"/>
   <NameOnRouting lang="dk">Kophag
      <alternativeTexts>
         <Text lang="de">Kpnhgn</Text>
         <Text lang="fr">Cpnhge</Text>
         <Text lang="en">Kbnhvn</Text>
      </alternativeTexts>
   </NameOnRouting>
</FareScheduledStopPoint>

We could also add a list of "use for" languages as an attribute on Text.
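
One possible shape for that last idea, as a sketch only: the useForLanguages attribute name and its space-separated value are hypothetical and are not taken from the schema.

```xml
<Description lang="dk">Danmarks hovedstad
   <alternativeTexts>
      <Text lang="en">Capital of Denmark</Text>
      <!-- Hypothetical attribute: also serve the French text to Italian-language UIs -->
      <Text lang="fr" useForLanguages="fr it">Capitale du Danemark</Text>
      <Text lang="de">Hauptstadt von Dänemark</Text>
   </alternativeTexts>
</Description>
```

This would cover the case in the pooled example where the French description is reused for Italian, without repeating the text in a second element.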