Open skinkie opened 1 year ago
this is for everything really. I'm coming from the perspective of StopPlace here. We use name, and description (stopplace and quay). But I'm also thinking about countries with other scripts, like Cyrillic where it is completely natural to have all fields in both scripts + translations.
I guess the problem is endemic. How about Notice?
The MultilingualString was initially designed to carry the language (any language) associated with a text, not translations. Translations are expected to be provided by AlternativeName and AlternativeText (AlternativeName is also used where multiple names are possible, but the NameType attribute clearly states when it is a translation).
@JohanEntur you can have an AlternativeText for a Notice, and for languages like Japanese, the following should give you "畑 はたけ"
<AlternativeText attributeName="Text" id="myDataSpace:Notice:657899" order="1">
  <Text lang="ja">畑 はたけ</Text>
</AlternativeText>
But in Belgium. Is French the language, or the Alternative Language?
it depends on where in Belgium you are ;-) but in any case, this is not the "official" language, just the one used as default (you need a default one), all other being alternatives
It is certainly more modular to have the translations inlined within the term they translate rather than kept separate.
However, I suspect the majority of text usage is monolingual in the default language, so I don't think we want to complicate 95% of usage just to cover the edge case. Therefore, as long as we are not requiring the use of a Text wrapper tag within each MultilingualString, it would be okay to also allow a text, but it would be clearer to wrap the translations within their own tag, as we generally do for repeated children.
<Name>Copenhagen
  <translations>
    <Text lang="de">Kopenhagen</Text>
    <Text lang="dk">København</Text>
    <Text lang="en">Copenhagen</Text>
    <Text lang="fr">Copenhague</Text>
    <Text lang="ru">Копенгаген</Text>
    <Text lang="se">köpenhamn</Text>
  </translations>
</Name>
and also
<Name lang="dk">København
  <translations>
    <Text lang="de">Kopenhagen</Text>
    <Text lang="en">Copenhagen</Text>
    <Text lang="fr">Copenhague</Text>
    <Text lang="ru">Копенгаген</Text>
    <Text lang="se">köpenhamn</Text>
  </translations>
</Name>
I would counter that most usage is monolingual because the alternative model is not appealing to use.
It's hardly an edge case to have multiple text strings, and often there is no default language. In Norway, we have 2 official languages (Norwegian and Sami) where Norwegian has 2 official subsets, and the Sami have I believe 4 or 6. Belgium has I believe 3 official languages, Spain, Ireland, Wales, and Scotland also have widespread minority languages. Then we have the problem with Bulgaria which needs to make data available in two scripts, Latin and Cyrillic. Most likely several countries will want to add English translations to their data to cater to tourists.
I suspect that requirements to publish data in all official languages + languages that are useful internationally will grow as data become better and better in the future, and support for minority languages in minority regions will be persistent.
I have an example as well. Our stops database (NSR) is national, so the dataset as a whole would have a "default" language - naturally. This is currently Norwegian bokmål (nob). But some stops in the north have only Sami names. Since I can't leave the name field empty, I have to place the Sami name in the name field and leave the Sami translation (which other stops have) empty.
For this reason, I had to concoct this rather convoluted rule:
If a Norwegian name exists, it should be recorded in the main name field. If no Norwegian name exists, the local name in any language shall apply to the main name field.
We also have stops in neighbouring countries. For Sweden, it's not really a problem, but for stops in Finland, our default-Norwegian registry has to either put our Norwegian translation in the name field, or write the correct name (Rovaniemi autoasema, Му́рманск) in the name field and use the translation fields to add Norwegian translations.
This is the complete ruleset I've established to cover the situation for now (lovingly translated by ChatGPT):
4.4.11 Stop names in other languages
In NSR, all stops have a main name field. This is a monolingual field and must always be filled in accordance with the rules for naming stops. If a stop has a name in another language, the following rules shall apply:
If a Norwegian name exists, it should be recorded in the main name field. If no Norwegian name exists, the local name in any language shall apply to the main name field.
Exceptions can be made if the local name makes the stop unusable for the general public where understanding the text is important. This could typically be due to a non-Latin character set or words that Norwegian audiences cannot be expected to understand.
Example: Му́рманск, Rovaniemi linja-autoasema
Names in other languages where a Norwegian name also exists should be recorded in the ALTERNATIVE NAMES field, where the name can be coded as TRANSLATION along with the respective language.
Translations in many languages are supported, but only one per language.
Both Bokmål and Nynorsk are considered Norwegian.
If unofficial non-Norwegian names of stops are to be registered, the coding ALIAS is used along with the relevant language following the same principles as aliases in Norwegian.
The spelling of alternative names (aliases, translations, etc.) primarily follows the guidelines of the stop's name field. Local spellings should be adhered to as much as possible. The main meaning or reference of the original name should be retained.
4.4.11.1 Sami and Kven names in administrative areas
Stop names in Sami or Kven should always be registered as alternative names if the stop also has a Norwegian name. If the stop only has a non-Norwegian name, alternative names should not be used.
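To make the rule above concrete, here is a minimal Python sketch of how a registry might apply it. Everything in it is an assumption for illustration: the `StopNames` class, the `assign_names` function, the use of ISO 639-3 codes (`nob`, `nno`, `nor`, `sme`), and the choice to code every non-main name as a translation. It is not how NSR is actually implemented.

```python
from dataclasses import dataclass, field

# Assumption: Bokmål, Nynorsk and the macrolanguage code all count as Norwegian.
NORWEGIAN = {"nob", "nno", "nor"}


@dataclass
class StopNames:
    main_name: str
    # Each entry is (name, language code, name type); hypothetical structure.
    alternative_names: list = field(default_factory=list)


def assign_names(names_by_lang: dict) -> StopNames:
    """Norwegian name goes in the main field; otherwise any local name does."""
    if not names_by_lang:
        raise ValueError("a stop must have at least one name")
    norwegian = [lang for lang in names_by_lang if lang in NORWEGIAN]
    # Fall back to the first name supplied when no Norwegian name exists.
    main_lang = norwegian[0] if norwegian else next(iter(names_by_lang))
    alternatives = [
        (name, lang, "translation")
        for lang, name in names_by_lang.items()
        if lang != main_lang
    ]
    return StopNames(names_by_lang[main_lang], alternatives)


both = assign_names({"sme": "Guovdageaidnu", "nob": "Kautokeino"})
sami_only = assign_names({"sme": "Guovdageaidnu"})
```

Note how the Sami name changes role between the two calls: it is a "translation" in the first and the main name in the second, which is exactly the asymmetry the rule has to work around.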
The easiest way would be to allow all text strings to allow multiple inputs, where each input has either language or script as an attribute. I know it's a lot to ask, but if NeTEx is to grow up to be popular in Europe - this kind of thing cannot be neglected.
@JohanEntur To consider: this is "A" way of doing things, and it might be easier in some ways. However, string replacement the way it is currently done is neither new nor special. This is exactly how simple multi-language support was done in all C programs (https://localazy.com/blog/make-multi-language-application-in-c-gettext-localazy). The problem you face is that it is not easily readable. But NeTEx is not meant for human-readable consumption, and once AlternativeText and AlternativeName are implemented, one does not have to care.
Thanks for clarifying
But it's less about easy reading and more about the idea that something is default and others are alternative, which I believe adds fictitious roles to each instance of translation.
To illustrate with a hypothetical default Swedish dataset:
Now our Norwegian dataset:
Note that semantically we consider AlternativeName to be distinct from AlternativeText
There is a particular subtlety, though. The ALTERNATIVE TEXT is typically keyed within its parent element on a "use for language" rather than on the actual language of the alternative text, which is given on the multilingual string of the TEXT attribute within ALTERNATIVE TEXT (the two will usually be the same but may differ). This is because the overall use case is "If I am assembling the texts to present to the user in the user interface, which language should I use?" In particular: what should I do if some elements are translated into my target language but others are not? This can be politically sensitive in bilingual countries (the NeTEx mechanism was taken from a very general UN model for the use of alternative texts...). So you can, for example, have the case where the default language is Flemish, the base text of a given element is French, and there are translations in English, French and German. If you are presenting the UI in Flemish and there is no Flemish translation, use the English rather than the French. That said, in most cases the "language to use" is the same as the language of the text.
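As a rough illustration of that lookup, here is a small Python sketch. The sample document, the `text_for_ui` helper and the use of `nl` for Flemish are assumptions made for the example, and XML namespaces are left out for brevity; it is not taken from any NeTEx tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical Notice: base text in French, English text to be used for a Dutch UI.
SAMPLE = """
<Notice id="example:Notice:1" version="1">
  <alternativeTexts>
    <AlternativeText useForLanguage="nl" attributeName="Text">
      <Text lang="en">Platform closed</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="de" attributeName="Text">
      <Text lang="de">Bahnsteig gesperrt</Text>
    </AlternativeText>
  </alternativeTexts>
  <Text lang="fr">Quai fermé</Text>
</Notice>
"""


def text_for_ui(parent, attribute_name, ui_lang):
    """Return (text, actual language) to show for one attribute in a given UI language."""
    for alt in parent.findall("alternativeTexts/AlternativeText"):
        # Match on the "use for" language, not on the language of the text itself.
        if (alt.get("attributeName") == attribute_name
                and alt.get("useForLanguage") == ui_lang):
            text = alt.find("Text")
            return text.text, text.get("lang")
    # No alternative registered for this UI language: fall back to the base text.
    base = parent.find(attribute_name)
    return base.text, base.get("lang")


notice = ET.fromstring(SAMPLE)
dutch_ui = text_for_ui(notice, "Text", "nl")    # English text, keyed for the Dutch UI
italian_ui = text_for_ui(notice, "Text", "it")  # falls back to the French base text
```

The key point is that the match is made on `useForLanguage`, so the returned text may legitimately be in a different language than the one requested.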
So to be clear, as I understand it the discussion here is just about better ways of encoding AlternativeText so that the translations can be included inline with each text attribute rather than as a set at the beginning of the parent element - this is just for syntactic prettiness (and more concise)
I.e., instead of the following, as at present, where all the translations for all the different attributes have to be in a pool at the beginning:
<FareScheduledStopPoint id="uic:6050a" version="01">
  <alternativeTexts>
    <!-- Translations for FareScheduledStopPoint.Name -->
    <AlternativeText useForLanguage="en" attributeName="Name">
      <Text lang="en" textIdType="translation">Copenhagen</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="en" attributeName="Name">
      <Text lang="en" textIdType="alias">Central Copenhagen</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="dk" attributeName="Name">
      <Text textIdType="translation" lang="dk">København</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="fr" attributeName="Name">
      <Text textIdType="translation" lang="fr">Copenhague</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="ru" attributeName="Name">
      <Text textIdType="translation" lang="ru">Копенгаген</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="se" attributeName="Name">
      <Text textIdType="translation" lang="se">köpenhamn</Text>
    </AlternativeText>
    <!-- Translations for FareScheduledStopPoint.Description -->
    <AlternativeText useForLanguage="en" attributeName="Description">
      <Text textIdType="translation" lang="en">Capital of Denmark</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="fr" attributeName="Description">
      <Text textIdType="translation" lang="fr">Capitale du Danemark</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="it" attributeName="Description">
      <Text textIdType="translation" lang="fr">Capitale du Danemark</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="de" attributeName="Description">
      <Text textIdType="translation" lang="de">Hauptstadt von Dänemark</Text>
    </AlternativeText>
    <!-- Translations for FareScheduledStopPoint.NameOnRouting -->
    <AlternativeText useForLanguage="de" attributeName="NameOnRouting">
      <Text textIdType="translation" lang="de">Kpnhgn</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="fr" attributeName="NameOnRouting">
      <Text textIdType="translation" lang="fr">Cpnhge</Text>
    </AlternativeText>
    <AlternativeText useForLanguage="en" attributeName="NameOnRouting">
      <Text textIdType="translation" lang="en">Kbnhvn</Text>
    </AlternativeText>
  </alternativeTexts>
  <Name lang="dk">København</Name>
  <ShortName>CPH</ShortName>
  <Description lang="dk">Danmarks hovedstad</Description>
  <VehicleModes>rail coach bus</VehicleModes>
  <CountryRef ref="dk"/>
  <NameOnRouting lang="dk">Kophag</NameOnRouting>
</FareScheduledStopPoint>
We would instead make MultilingualString into a mixed data structure that allowed the following:
<FareScheduledStopPoint id="uic:6050a" version="01">
  <Name lang="dk">København
    <alternativeTexts>
      <Text lang="en">Copenhagen</Text>
      <Text lang="de">Kopenhagen</Text>
      <Text lang="fr">Copenhague</Text>
      <Text lang="ru">Копенгаген</Text>
      <Text lang="se">köpenhamn</Text>
    </alternativeTexts>
  </Name>
  <ShortName>CPH</ShortName>
  <Description lang="dk">Danmarks hovedstad
    <alternativeTexts>
      <Text lang="en">Capital of Denmark</Text>
      <Text lang="fr">Capitale du Danemark</Text>
      <Text lang="de">Hauptstadt von Dänemark</Text>
    </alternativeTexts>
  </Description>
  <VehicleModes>rail coach bus</VehicleModes>
  <CountryRef ref="dk"/>
  <NameOnRouting lang="dk">Kophag
    <alternativeTexts>
      <Text lang="de">Kpnhgn</Text>
      <Text lang="fr">Cpnhge</Text>
      <Text lang="en">Kbnhvn</Text>
    </alternativeTexts>
  </NameOnRouting>
</FareScheduledStopPoint>
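For anyone wondering what reading this mixed structure might look like on the consumer side, here is a short Python sketch using the standard library. The `read_multilingual` helper and the trimmed-down sample are assumptions for illustration only, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the proposed inline structure.
PROPOSED = """
<FareScheduledStopPoint id="uic:6050a" version="01">
  <Name lang="dk">København
    <alternativeTexts>
      <Text lang="en">Copenhagen</Text>
      <Text lang="fr">Copenhague</Text>
    </alternativeTexts>
  </Name>
  <ShortName>CPH</ShortName>
</FareScheduledStopPoint>
"""


def read_multilingual(element):
    """Collect the base text and any inline translations, keyed by language."""
    texts = {}
    # In mixed content, the base text sits before the first child element
    # and carries the surrounding whitespace, so it has to be stripped.
    base = (element.text or "").strip()
    if base:
        texts[element.get("lang")] = base
    for text in element.findall("alternativeTexts/Text"):
        texts[text.get("lang")] = (text.text or "").strip()
    return texts


stop = ET.fromstring(PROPOSED)
name_texts = read_multilingual(stop.find("Name"))
short_name_texts = read_multilingual(stop.find("ShortName"))
```

A monolingual element such as ShortName goes through the same helper unchanged; its single text is keyed under `None` because it has no `lang` attribute.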
Could also add a list of "use for" languages as an attribute on Text
Btw, this is not just the Name field. There are other fields such as LANDMARK which has the same current structure as the NAME, and it too is meant to carry the same kind of data (public facing text).
<Landmark lang="NO">A tree</Landmark>
By the logic of the structure, an ALTERNATIVELANDMARKS would be needed here to carry the translations.
This type of issue would propagate to any field where a user is expected to enter a text in a specific language, and I would suggest that the issue of multilingual input be solved in the same way for any such field.
Other examples: CrossRoad, NameSuffix, Label, Comment (in AccessibilityAssessment), Description, and possibly ShortName. These are the things I dug up when looking at the XML structure for StopPlace/Quay.
@JohanEntur Doesn't Nick's answer/example provide what you are looking for ? Every single DataManagedObject (so most of NeTEx entities) can carry alternativeTexts (for translations, but also all other possible uses)
How does the data know that <Text lang="en">Capital of Denmark</Text> is for Description and not Name? Does it depend on the order alone? And doesn't this mean AlternativeNames should be removed? It would not make sense that everything except Name uses this method.
And AlternativeText doesn't have this:
you need to use AlternativeText as in Nick's example
<FareScheduledStopPoint id="uic:6050a" version="01">
  <alternativeTexts>
    ....
    <!-- Translations for FareScheduledStopPoint.Description -->
    <AlternativeText useForLanguage="en" attributeName="Description">
      <Text textIdType="translation" lang="en">Capital of Denmark</Text>
    </AlternativeText>
    ....
@Aurige I still find the semantics of this suboptimal. I would suggest changing this for NeTEx 3.0 so that all MultilingualStrings have the option to directly use <Text>.
I support the suggestion of @skinkie.
Also, @Aurige, you can't just give me "you need to use AlternativeText as in Nick's example" when my comment was a critique of Nick's example :)
So, I'm eager to get some resolution here. @skinkie has suggested this:
<Name>
<Text lang="no">Oslo</Text>
<Text lang="sv">Oslo</Text>
<Text lang="en">Oslo</Text>
<Text lang="nl">Oslo</Text>
<Text lang="es">Oslo</Text>
</Name>
Is there anything specific that speaks against this? I found this from Nick: "However I suspect the majority of usage of text is monolingual in the default language so don't think we want to complicate 95% of usage just to cover the edge case". I agree that the majority is monolingual, but this is machine data. Surely the computer reading and writing this doesn't care either way.
This seems to me a straightforward way to list all relevant texts per language for any multilingual object, and it works for everything. Also easy to scale up.
Especially compared to the current feature of the main text being linked to a default value of the dataset, and then stop names have AlternativeNames and everything else has AlternativeText.
PS: it has to be ISO 639-3 or it won't support all languages.
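A consumer of that structure could then pick a text by preference order, with no text playing the "default" role. The Python sketch below is only an illustration under assumptions: the `pick_text` helper is hypothetical, the sample uses ISO 639-3 codes (`nob`, `sme`), and falling back to the first listed text is just one possible policy.

```python
import xml.etree.ElementTree as ET

# Hypothetical Name element in the suggested form, using ISO 639-3 codes.
NAME = """
<Name>
  <Text lang="nob">Kautokeino</Text>
  <Text lang="sme">Guovdageaidnu</Text>
</Name>
"""


def pick_text(element, preferred_langs):
    """Return the first text matching the caller's language preference order."""
    by_lang = {text.get("lang"): text.text for text in element.findall("Text")}
    for lang in preferred_langs:
        if lang in by_lang:
            return by_lang[lang]
    # Assumed policy: nothing matched, so use the first text in document order.
    return next(iter(by_lang.values()), None)


name = ET.fromstring(NAME)
sami_first = pick_text(name, ["sme", "nob"])
english_first = pick_text(name, ["eng", "nob"])
```

A stop with only a Sami name would simply carry a single Text child, so no special rule about which field it belongs in would be needed.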
Experiment with the proposal from @nick-knowles on Dec 8, 2023 (see above in the discussion) by extending the MultilingualString.
Check whether it works with tools (@duexw).
PR to be done (@skinkie, @nick-knowles).
Use the appropriate ISO 639-x to be able to cover all languages.
According to current way of doing things, is this correct?
<FlexibleLine id="example_xml_1" version="1">
<alternativeTexts>
<AlternativeText useForLanguage="eng" attributeName="TOMATO">
<Text>The House</Text>
</AlternativeText>
<AlternativeText useForLanguage="eng" attributeName="BANANA">
<Text>The Castle</Text>
</AlternativeText>
</alternativeTexts>
<Name lang="nor" textIdType="TOMATO">Huset</Name>
<BookingNote lang="nor" textIdType="BANANA">Slottet</BookingNote>
</FlexibleLine>
To me, @skinkie's suggestion seems like the most sensible. Any other alternative seems overcomplicated for what I expect to be a very common use case in 2024.
I tried out the "mixed" element type in Microsoft .NET 8. Basically it works, but one would have to adapt existing code. I made a simplified example. We try to do something like this:
An element which can contain text and optional translations, for example
The Microsoft tool xsd.exe generates classes from the XSD in some languages (C++, C# etc). For C# it generates
for the mixed element MultilingualString. You get a collection property for the contained child elements and you get a string array property for the plain text. This is because plain text and child elements can alternate. You get each bit of plain text in an array member. Using the generated classes is quite simple. This is the code for reading a file like the above:
So basically it works well. You do not have to worry about separating plain text and child elements, the framework does that for you. BUT if you had code using the existing "un-mixed" type MultilingualString, you would have to adapt all places where you access these properties, and that would be quite a lot. e.g. where I could write var name = stopPlace.Name before, I would have to write something like var name = string.Join("", stopPlace.Name.Text) now.
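The same "text can alternate with child elements" behaviour shows up in other XML libraries too. As a comparison only (this is not the .NET code discussed above), here is a small Python sketch with the standard library, where the `text_fragments` helper and the sample element are assumptions for illustration:

```python
import xml.etree.ElementTree as ET


def text_fragments(element):
    """Return the plain-text pieces of a mixed-content element, in document order."""
    fragments = []
    if element.text:
        fragments.append(element.text)  # text before the first child element
    for child in element:
        if child.tail:
            fragments.append(child.tail)  # text following each child element
    return fragments


# Hypothetical mixed element: plain text interleaved with <Text> children.
name = ET.fromstring(
    '<Name lang="dk">København'
    '<Text lang="en">Copenhagen</Text> H'
    '<Text lang="de">Kopenhagen</Text></Name>'
)
fragments = text_fragments(name)
joined = "".join(fragments)  # plays the role of string.Join("", stopPlace.Name.Text)
```

As in the C# case, code that used to read the name as a single string has to join the fragments itself.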
Changes in code are always inevitable, but when NeTEx becomes more logical and predictable the code can become more generic and simple.
My key points:
At this moment a MultilingualString is actually not that. It is obviously a string, but there is nothing multilingual about it. It does not contain any translations, for example. @JohanEntur wrote he would expect something like this to happen in NeTEx:
I think this is not what we should do. I would be in favor of the following, making the MultilingualString a standard structure which would directly facilitate translations. AlternativeText would still be useful for aliases and variants, but a single variant would host its own translated variations: