NeTEx-CEN / NeTEx

NeTEx is a CEN Technical Standard for exchanging Public Transport schedules and related data.
http://netex-cen.eu
GNU General Public License v3.0

A discussion on MultilingualString #558

Open skinkie opened 1 year ago

skinkie commented 1 year ago

At this moment a MultilingualString is actually not that. It is obviously a string, but there is nothing multilingual about it. It does not contain any translations for example. @JohanEntur wrote he would expect something like this to happen in NeTEx:

<names>
   <name lang="no">Oslo</name>
   <name lang="sv">Oslo</name>
   <name lang="en">Oslo</name>
   <name lang="nl">Oslo</name>
   <name lang="es">Oslo</name>
</names>

I think this is not what we should do. I would be in favor of the following, making the MultilingualString a standard structure which would directly facilitate translations. AlternativeText would still be useful for aliases and variants, but a single variant would host its own translated variations:

<Name>
   <Text lang="no">Oslo</Text>
   <Text lang="sv">Oslo</Text>
   <Text lang="en">Oslo</Text>
   <Text lang="nl">Oslo</Text>
   <Text lang="es">Oslo</Text>
</Name>
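
A minimal sketch of how a consumer might read the structure proposed above. It assumes the un-namespaced element names from the example (real NeTEx documents use an XML namespace, ignored here), and `texts_by_language` is a hypothetical helper name, not part of any NeTEx tooling:

```python
import xml.etree.ElementTree as ET

# Sample shortened from the proposal above; namespaces are omitted for brevity.
SAMPLE = """
<Name>
   <Text lang="no">Oslo</Text>
   <Text lang="sv">Oslo</Text>
   <Text lang="en">Oslo</Text>
</Name>
""".strip()

def texts_by_language(element):
    """Map each language code to its text (hypothetical helper)."""
    return {
        text.get("lang"): (text.text or "").strip()
        for text in element.findall("Text")
    }

names = texts_by_language(ET.fromstring(SAMPLE))
print(names["en"])
```

With this shape no entry is privileged: every language is looked up the same way, which is the point of the proposal.
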
JohanEntur commented 1 year ago

this is for everything really. I'm coming from the perspective of StopPlace here. We use name, and description (stopplace and quay). But I'm also thinking about countries with other scripts, like Cyrillic where it is completely natural to have all fields in both scripts + translations.

I guess the problem is endemic. How about Notice?

Aurige commented 1 year ago

The MultilingualString was initially designed to carry the lang (any lang) information associated with a text, but not for translations. The translations are expected to be provided by AlternativeName and AlternativeText (AlternativeName is also used where multiple names are possible, but the NameType attribute clearly states when it is a translation). @JohanEntur you can have an AlternativeText for a Notice, and for languages like Japanese, the following should give you "畑 はたけ":

<AlternativeText attributeName="Text" id="myDataSpace:Notice:657899" order="1">
   <Text lang="jp">&#x7551; &#x306F;&#x305F;&#x3051;</Text>
</AlternativeText>

JohanEntur commented 1 year ago

But in Belgium. Is French the language, or the Alternative Language?

Aurige commented 1 year ago

it depends on where in Belgium you are ;-) but in any case, this is not the "official" language, just the one used as default (you need a default one), all other being alternatives

nick-knowles commented 11 months ago

It is certainly more modular to have the translations inlined within the term they translate, rather than kept separate.

However, I suspect the majority of text usage is monolingual in the default language, so I don't think we want to complicate 95% of usage just to cover the edge case. Therefore, as long as we do not require a Text wrapper tag within each MultilingualString, it would be okay to also allow a Text element, but it would be clearer to wrap the translations within their own tag, as we generally do for repeated children.

<Name>Copenhagen
   <translations>
      <Text lang="de">Kopenhagen</Text>
      <Text lang="dk">København</Text>
      <Text lang="en">Copenhagen</Text>
      <Text lang="fr">Copenhague</Text>
      <Text lang="ru">Копенгаген</Text>
      <Text lang="se">köpenhamn</Text>
   </translations>
</Name>

and also

<Name lang="dk">København
   <translations>
      <Text lang="de">Kopenhagen</Text>
      <Text lang="en">Copenhagen</Text>
      <Text lang="fr">Copenhague</Text>
      <Text lang="ru">Копенгаген</Text>
      <Text lang="se">köpenhamn</Text>
   </translations>
</Name>

JohanEntur commented 11 months ago

I would counter that most usage is monolingual because the alternative model is not appealing to use.

It's hardly an edge case to have multiple text strings, and often there is no default language. In Norway, we have 2 official languages (Norwegian and Sami) where Norwegian has 2 official subsets, and the Sami have I believe 4 or 6. Belgium has I believe 3 official languages, Spain, Ireland, Wales, and Scotland also have widespread minority languages. Then we have the problem with Bulgaria which needs to make data available in two scripts, Latin and Cyrillic. Most likely several countries will want to add English translations to their data to cater to tourists.

I suspect that requirements to publish data in all official languages + languages that are useful internationally will grow as data become better and better in the future, and support for minority languages in minority regions will be persistent.

I have an example as well. Our stops database (NSR) is national, so the dataset as a whole would have a "default" language - naturally. This is currently Norwegian bokmål (nob). But some stops in the north have only Sami names. Since I can't leave the name field empty, I have to place the Sami name in the name field and leave the Sami translation (which other stops have) empty.

For this reason, I had to concoct this rather convoluted rule: If a Norwegian name exists, it should be recorded in the main name field. If no Norwegian name exists, the local name in any language shall apply to the main name field. We also have stops in neighbouring countries. For Sweden, it's not really a problem, but for stops in Finland, our default-Norwegian registry has to either put our Norwegian translation in the name field, or write the correct name (Rovaniemi autoasema, Му́рманск) in the name field and use the translation fields to add Norwegian translations.

This is the complete ruleset I've established to cover the situation for now (lovingly translated by ChatGPT):


4.4.11 Stop names in other languages

In NSR, all stops have a main name field. This is a monolingual field and must always be filled in accordance with the rules for naming stops. If a stop has a name in another language, the following rules shall apply:

If a Norwegian name exists, it should be recorded in the main name field. If no Norwegian name exists, the local name in any language shall apply to the main name field.

Exceptions can be made if the local name makes the stop unusable for the general public where understanding the text is important. This could typically be due to a non-Latin character set or words that Norwegian audiences cannot be expected to understand.

Example: Му́рманск, Rovaniemi linja-autoasema

Names in other languages where a Norwegian name also exists should be recorded in the ALTERNATIVE NAMES field, where the name can be coded as TRANSLATION along with the respective language.

Translations in many languages are supported, but only one per language.

Both Bokmål and Nynorsk are considered Norwegian.

If unofficial non-Norwegian names of stops are to be registered, the coding ALIAS is used along with the relevant language following the same principles as aliases in Norwegian.

The spelling of alternative names (aliases, translations, etc.) primarily follows the guidelines of the stop's name field. Local spellings should be adhered to as much as possible. The main meaning or reference of the original name should be retained.

4.4.11.1 Sami and Kven names in administrative areas

Stop names in Sami or Kven should always be registered as alternative names if the stop also has a Norwegian name. If the stop only has a non-Norwegian name, alternative names should not be used.
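
The core of the main-name rule above can be sketched as follows. This is only an illustration under stated assumptions: `choose_main_name` is a hypothetical helper, the preference order among the Norwegian codes is not specified by the ruleset, and the exception for names that are unusable for the general public is not modelled:

```python
# ISO 639-3 codes treated as "Norwegian" here. The preference order is an
# assumption of this sketch; the ruleset only says both count as Norwegian.
NORWEGIAN_CODES = ("nob", "nno", "nor")

def choose_main_name(names):
    """Pick the (language, name) pair that goes in the main name field.

    `names` maps language codes to the stop's name in that language and is
    assumed to be non-empty.
    """
    for code in NORWEGIAN_CODES:
        if code in names:
            return code, names[code]
    # No Norwegian name: the local name in any language fills the main field.
    code = next(iter(names))
    return code, names[code]

print(choose_main_name({"sme": "Guovdageaidnu", "nob": "Kautokeino"}))
print(choose_main_name({"sme": "Guovdageaidnu"}))
```

The second call shows the awkward case described earlier: a Sami-only stop ends up with its Sami name in a field that is otherwise assumed to be Norwegian.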


The easiest way would be to let all text strings accept multiple inputs, where each input has either language or script as an attribute. I know it's a lot to ask, but if NeTEx is to grow up to be popular in Europe, this kind of thing cannot be neglected.

ue71603 commented 11 months ago

@JohanEntur To consider: This is "A" way of doing things. And it might be easier in some ways. However, having string replacement done the way it is currently done is neither new nor special. This is exactly the way things were done in all C programs for simple multi-language support (https://localazy.com/blog/make-multi-language-application-in-c-gettext-localazy). The problem you face is that it is not easily readable. But NeTEx is not for human-readable consumption. And once AlternativeText and AlternativeName are implemented, one does not have to care.

JohanEntur commented 11 months ago

Thanks for clarifying

But it's less about easy reading and more about the idea that something is default and others are alternative, which I believe adds fictitious roles to each instance of translation.

To illustrate with a hypothetical default Swedish dataset:

Mattias Gynter Johann Buchthein

Now our Norwegian dataset:

Kautokeino Mattias Günter Mattias Gynter Kautokeino Guovdageaidnu

Also, `ISO 639-3` is required to code the Sami languages.

I have to say I feel quite uncomfortable trying to argue points of data structure against you guys because I have no expertise at all - as you know, while you have immense knowledge about these kinds of things. I just hope you see my point in the matter. The model should be straightforward and generic and not force users to jump through hoops.

nick-knowles commented 11 months ago

Note that semantically we consider AlternativeName to be distinct from AlternativeText

  1. ALTERNATIVE TEXT is a simple string that can be used to provide a translation of any text attribute, including descriptions, notes, etc. It is a single normalized text string (i.e. no line feeds, carriage returns, etc.). There are some edge cases, for example when a name is to come into use on a certain date, so a validity condition can be specified.

There is a particular subtlety though. The ALTERNATIVE TEXT is typically keyed within its parent element on a "use for language", rather than on the actual language of the alternative text, which is given on the multilingual string on the TEXT attribute within ALTERNATIVE TEXT (the two will usually be the same, but may differ). This is because the overall use case is "If I am assembling the texts to present to the user in the user interface, which language should I use?" In particular: what should I do if some elements are translated into my target language but others are not? This can be politically sensitive in bilingual countries (the NeTEx mechanism was taken from a very general UN model for the use of alternative texts...). So you can, for example, have the following case: the default language is Flemish, the base text of a given element is French, and there are translations in English, French and German. If you are presenting the UI in Flemish and there is no Flemish translation, use the English rather than the French.... That said, in most cases the "language to use" is the same as the language of the text.

[image]

  2. ALTERNATIVE NAME is used for significant named elements such as stops and places that may have aliases, e.g. Kings Cross Station, London | London Kings Cross | London Kings X. It is only available for a limited number of first-class entities. It can have a set of further structured properties, such as a qualifier and an abbreviation, that can be translated together as a coherent set of text values. [image]
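
The "use for language" keying described under ALTERNATIVE TEXT can be sketched as a lookup. This is a rough illustration, not a NeTEx API: `text_for` is a hypothetical helper, namespaces are omitted, and the sample values are borrowed from the Copenhagen example used in this thread, including one entry whose `useForLanguage` differs from the language of its text:

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<FareScheduledStopPoint id="uic:6050a" version="01">
  <alternativeTexts>
    <AlternativeText useForLanguage="en" attributeName="Description">
      <Text lang="en">Capital of Denmark</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="it" attributeName="Description">
      <Text lang="fr">Capitale du Danemark</Text>
    </AlternativeText>
  </alternativeTexts>
  <Description lang="dk">Danmarks hovedstad</Description>
</FareScheduledStopPoint>
""".strip()

def text_for(parent, attribute_name, ui_language):
    """Return (text, actual language) to present for one attribute.

    Matching is on useForLanguage, not on the lang of the text itself;
    if nothing matches, the base attribute value is returned.
    """
    for alt in parent.findall("alternativeTexts/AlternativeText"):
        if (alt.get("attributeName") == attribute_name
                and alt.get("useForLanguage") == ui_language):
            text = alt.find("Text")
            return text.text, text.get("lang")
    base = parent.find(attribute_name)
    return base.text, base.get("lang")

stop = ET.fromstring(SAMPLE)
print(text_for(stop, "Description", "it"))  # French text served to an Italian UI
print(text_for(stop, "Description", "sv"))  # no match, so the base text is used
```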

So to be clear, as I understand it the discussion here is just about better ways of encoding AlternativeText so that the translations can be included inline with each text attribute rather than as a set at the beginning of the parent element - this is just for syntactic prettiness (and more concise)

I.e., instead of the following, as at present, where all the translations for all the different attributes have to be in a pool at the beginning:

<FareScheduledStopPoint id="uic:6050a" version="01">
   <alternativeTexts>
      <!-- Translations for FareScheduledStopPoint.Name -->
      <AlternativeText useForLanguage="en" attributeName="Name">
         <Text lang="en" textIdType="translation">Copenhagen</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="en" attributeName="Name">
         <Text lang="en" textIdType="alias">Central Copenhagen</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="dk" attributeName="Name">
         <Text textIdType="translation" lang="dk">København</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="fr" attributeName="Name">
         <Text textIdType="translation" lang="fr">Copenhague</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="ru" attributeName="Name">
         <Text textIdType="translation" lang="ru">Копенгаген</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="se" attributeName="Name">
         <Text textIdType="translation" lang="se">köpenhamn</Text>
      </AlternativeText>
      <!-- Translations for FareScheduledStopPoint.Description -->
      <AlternativeText useForLanguage="en" attributeName="Description">
         <Text textIdType="translation" lang="en">Capital of Denmark</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="fr" attributeName="Description">
         <Text textIdType="translation" lang="fr">Capitale du Danemark</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="it" attributeName="Description">
         <Text textIdType="translation" lang="fr">Capitale du Danemark</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="de" attributeName="Description">
         <Text textIdType="translation" lang="de">Hauptstadt von Dänemark</Text>
      </AlternativeText>
      <!-- Translations for FareScheduledStopPoint.NameOnRouting -->
      <AlternativeText useForLanguage="de" attributeName="NameOnRouting">
         <Text textIdType="translation" lang="de">Kpnhgn</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="fr" attributeName="NameOnRouting">
         <Text textIdType="translation" lang="fr">Cpnhge</Text>
      </AlternativeText>
      <AlternativeText useForLanguage="en" attributeName="NameOnRouting">
         <Text textIdType="translation" lang="en">Kbnhvn</Text>
      </AlternativeText>
   </alternativeTexts>
   <Name lang="dk">København</Name>
   <ShortName>CPH</ShortName>
   <Description lang="dk">Danmarks hovedstad</Description>
   <VehicleModes>rail coach bus</VehicleModes>
   <CountryRef ref="dk"/>
   <NameOnRouting lang="dk">Kophag</NameOnRouting>
</FareScheduledStopPoint>

We would instead make MultilingualString into a mixed data structure that allowed the following:

<FareScheduledStopPoint id="uic:6050a" version="01">
   <Name lang="dk">København
      <alternativeTexts>
         <Text lang="en">Copenhagen</Text>
         <Text lang="de">Kopenhagen</Text>
         <Text lang="fr">Copenhague</Text>
         <Text lang="ru">Копенгаген</Text>
         <Text lang="se">köpenhamn</Text>
      </alternativeTexts>
   </Name>
   <ShortName>CPH</ShortName>
   <Description lang="dk">Danmarks hovedstad
      <alternativeTexts>
         <Text lang="en">Capital of Denmark</Text>
         <Text lang="fr">Capitale du Danemark</Text>
         <Text lang="de">Hauptstadt von Dänemark</Text>
      </alternativeTexts>
   </Description>
   <VehicleModes>rail coach bus</VehicleModes>
   <CountryRef ref="dk"/>
   <NameOnRouting lang="dk">Kophag
      <alternativeTexts>
         <Text lang="de">Kpnhgn</Text>
         <Text lang="fr">Cpnhge</Text>
         <Text lang="en">Kbnhvn</Text>
      </alternativeTexts>
   </NameOnRouting>
</FareScheduledStopPoint>

Could also add a list of "use for" languages as an attribute on Text
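
As a rough sketch of what reading such a mixed element could look like for a consumer, assuming the un-namespaced element names from the example above (`read_multilingual` is a hypothetical helper, not an existing NeTEx tool):

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Name lang="dk">København
  <alternativeTexts>
    <Text lang="en">Copenhagen</Text>
    <Text lang="de">Kopenhagen</Text>
  </alternativeTexts>
</Name>
""".strip()

def read_multilingual(element):
    """Return (base text, base language, {language: translation})."""
    # In a mixed element the base text is whatever precedes the first child.
    base = (element.text or "").strip()
    translations = {
        text.get("lang"): (text.text or "").strip()
        for text in element.findall("alternativeTexts/Text")
    }
    return base, element.get("lang"), translations

result = read_multilingual(ET.fromstring(SAMPLE))
print(result)
```

A purely monolingual `<Name lang="dk">København</Name>` goes through the same function and simply yields an empty translation map, which matches the goal of not complicating the common case.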

JohanEntur commented 3 months ago

Btw, this is not just the Name field. There are other fields such as LANDMARK which has the same current structure as the NAME, and it too is meant to carry the same kind of data (public facing text).

<Landmark lang="NO">A tree</Landmark>

By the logic of the structure, an ALTERNATIVELANDMARKS would be needed here to carry the translations.

This type of issue would propagate to any field where a user is expected to enter text in a specific language, and I would suggest that the issue of multilingual input be solved in the same way for any such field.

Other examples: CrossRoad, NameSuffix, Label, Comment (in AccessibilityAssessment), Description, and possibly ShortName. These are the things I dug up when looking at the XML structure for StopPlace/Quay.

Aurige commented 3 months ago

@JohanEntur Doesn't Nick's answer/example provide what you are looking for? Every single DataManagedObject (so most NeTEx entities) can carry alternativeTexts (for translations, but also all other possible uses). [image]

JohanEntur commented 3 months ago

How does the data know that <Text lang="en">Capital of Denmark</Text> is for Description and not Name? Does it depend on the order alone? And doesn't this mean AlternativeNames should be removed? It would not make sense that everything except Name uses this method.

And AlternativeText doesn't have this: [image]

Aurige commented 3 months ago

you need to use AlternativeText as in Nick's example

<FareScheduledStopPoint id="uic:6050a" version="01">
    <alternativeTexts>
    ....
        <!-- Translations for FareScheduledStopPoint.Description -->
        <AlternativeText useForLanguage="en" attributeName="Description">
            <Text textIdType="translation" lang="en">Capital of Denmark</Text>
        </AlternativeText>
 ....
skinkie commented 3 months ago

@Aurige I still find the semantics of this suboptimal. I would suggest changing this for NeTEx 3.0 so that all MultilingualStrings have the option to directly use <Text>.

JohanEntur commented 3 months ago

I support the suggestion of @skinkie.

  1. I think there should be one solution applied to all issues of this type.
  2. I think it should be simple and understandable (not in terms of human readable, but understandable for new adopters of NeTEx). I think the solution proposed by Nick would lead to confusion and a lot of strange XML solutions around Europe. Clean, simple, approachable is important in a cooperative exchange format.

Also, @Aurige, you can't just give me "you need to use AlternativeText as in Nick's example" when my comment was a critique of Nick's example :)

JohanEntur commented 3 months ago

So, I'm eager to get some resolution here. @skinkie has suggested this:

<Name>
   <Text lang="no">Oslo</Text>
   <Text lang="sv">Oslo</Text>
   <Text lang="en">Oslo</Text>
   <Text lang="nl">Oslo</Text>
   <Text lang="es">Oslo</Text>
</Name>

Is there anything specific that speaks against this? I found this from Nick: "However I suspect the majority of usage of text is monolingual in the default language so don't think we want to complicate 95% of usage just to cover the edge case". I agree that the majority is monolingual, but this is machine data. Surely the computer reading and writing this doesn't care either way.

This seems to me a straightforward way to list all relevant texts per language for any multilingual object, and it works for everything. Also easy to scale up.

Especially compared to the current feature of the main text being linked to a default value of the dataset, and then stop names have AlternativeNames and everything else has AlternativeText.

PS: it has to be ISO 639-3 or it won't support all languages.

Aurige commented 2 months ago

Experiment with the proposal from @nick-knowles on Dec 8, 2023 (see above in the discussion), by extending the MultilingualString.

Check whether it works with tools (@duexw).

PR to be done (@skinkie @nick-knowles).

Use appropriate ISO 639-x to be able to cover all languages

JohanEntur commented 1 month ago

According to the current way of doing things, is this correct?


<FlexibleLine id="example_xml_1" version="1">
    <alternativeTexts>
        <AlternativeText useForLanguage="eng" attributeName="TOMATO">
            <Text>The House</Text>
        </AlternativeText>
        <AlternativeText useForLanguage="eng" attributeName="BANANA">
            <Text>The Castle</Text>
        </AlternativeText>
    </alternativeTexts>
    <Name lang="nor" textIdType="TOMATO">Huset</Name>
    <BookingNote lang="nor" textIdType="BANANA">Slottet</BookingNote>
</FlexibleLine>

testower commented 1 month ago

To me, @skinkie's suggestion seems like the most sensible. Any other alternative seems overcomplicated for what I expect to be a very common use case in 2024.

duexw commented 1 month ago

I tried out the "mixed" element type in Microsoft .NET 8. Basically it works, but one would have to adapt existing code. I made a simplified example. We try to do something like this:

[image]

An element which can contain text and optional translations, for example

[image]

The Microsoft tool xsd.exe generates classes from the xsd in some languages (C++, C# etc). For C# it generates

[image]

for the mixed element MultilingualString. You get a collection property for the contained child elements and you get a string array property for the plain text. This is because plain text and child elements can alternate. You get each bit of plain text in an array member. Using the generated classes is quite simple. This is the code for reading a file like the above:

[image]

So basically it works well. You do not have to worry about separating plain text and child elements, the framework does that for you. BUT if you had code using the existing "un-mixed" type MultilingualString, you would have to adapt all places where you access these properties, and that would be quite a lot. e.g. where I could write var name = stopPlace.Name before, I would have to write something like var name = string.Join("", stopPlace.Name.Text) now.
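
The same effect can be illustrated outside .NET. The sketch below uses Python's standard `xml.etree.ElementTree` as an analogue (it is not the generated C# code from the screenshots), with a deliberately contrived sample in which the plain text is split around the child element; `own_text_fragments` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Contrived on purpose: the plain text is interrupted by a child element,
# which is what "mixed" content permits.
SAMPLE = (
    '<Name lang="dk">Køben'
    '<alternativeTexts><Text lang="en">Copenhagen</Text></alternativeTexts>'
    'havn</Name>'
)

def own_text_fragments(element):
    """Collect the element's own text pieces, skipping text inside children."""
    fragments = [element.text or ""]                       # text before the first child
    fragments += [child.tail or "" for child in element]  # text after each child
    return [fragment for fragment in fragments if fragment.strip()]

name = ET.fromstring(SAMPLE)
fragments = own_text_fragments(name)
print(fragments)
print("".join(fragments))
```

As with the string array in the generated classes, the caller has to join the fragments to get the plain value back, which is the kind of adaptation existing code would need.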

JohanEntur commented 6 days ago

Changes in code are always inevitable, but when NeTEx becomes more logical and predictable the code can become more generic and simple.

My key points: