elixir-gettext / gettext

Internationalization and localization support for Elixir.
https://hexdocs.pm/gettext

Choose the right locale based on the user-provided locale #333

Closed whatyouhide closed 1 year ago

whatyouhide commented 1 year ago

This issue is a spin-off of https://github.com/elixir-gettext/gettext/issues/318 to reduce the scope of that.

maennchen commented 1 year ago

Some related standards for this issue:

I guess we're looking for a solution that applies a Filtering Matching Scheme to all available Gettext languages and then tries to look up messages according to the distance and priority of the match.

I think you are already familiar with that @kipcole9. Does that sound about correct?

kipcole9 commented 1 year ago

There are two concepts that could be applied:

I think in the first instance the most useful implementation would be matching (aka lookup). Fallbacks (filtering) could be considered in a second round.

Matching requires defining a language tag syntax

The current Gettext locale name has no formal syntax (it is just a simple string comparison). Performing a locale lookup will require a language tag definition. I suggest a limited subset of the RFC 5646 language tag.

The basic structure of a language tag is: language-extlang-script-region-variant-extension-privateuse. I suggest the following limitations be defined to simplify the matching requirements, simplify backwards compatibility and simplify the implementation.

Proposed language tag syntax

Based upon the proposal above, the ABNF of the supported language tag would be:

   langtag       = language
                   ["-" script]
                   ["-" region]

   language      = 2*3ALPHA            ; shortest ISO 639 code
                 / 4ALPHA              ; or reserved for future use
                 / 5*8ALPHA            ; or registered language subtag

   script        = 4ALPHA              ; ISO 15924 code

   region        = 2ALPHA              ; ISO 3166-1 code
                 / 3DIGIT              ; UN M.49 code

Since validation of language tags is not in scope of Gettext, the format of language, script and region could be relaxed to be simply 1*ALPHA.
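As an illustration of the grammar above, here is a minimal sketch of a parser for the restricted `language[-script][-region]` form. The module name `LangTagSketch` and the function `parse/1` are hypothetical and not part of Gettext or ex_cldr. The sketch assumes the stricter ABNF shapes (not the relaxed `1*ALPHA` form), because subtag length is the only way to tell a script from a region.

```elixir
defmodule LangTagSketch do
  # Hypothetical helper for illustration only; not a Gettext API.
  # Accepts language ["-" script] ["-" region] and rejects everything else
  # (extlang, variants, extensions and private use are out of scope here).

  def parse(tag) when is_binary(tag) do
    case String.split(tag, "-") do
      [language] ->
        build(language, nil, nil)

      [language, second] ->
        cond do
          script?(second) -> build(language, second, nil)
          region?(second) -> build(language, nil, second)
          true -> :error
        end

      [language, script, region] ->
        if script?(script) and region?(region) do
          build(language, script, region)
        else
          :error
        end

      _other ->
        :error
    end
  end

  # language = 2*3ALPHA / 4ALPHA / 5*8ALPHA, i.e. two to eight letters.
  defp build(language, script, region) do
    if language =~ ~r/\A[A-Za-z]{2,8}\z/ do
      {:ok,
       %{
         language: String.downcase(language),
         script: script && String.capitalize(script),
         region: region && String.upcase(region)
       }}
    else
      :error
    end
  end

  # script = 4ALPHA
  defp script?(subtag), do: subtag =~ ~r/\A[A-Za-z]{4}\z/

  # region = 2ALPHA / 3DIGIT
  defp region?(subtag), do: subtag =~ ~r/\A([A-Za-z]{2}|[0-9]{3})\z/
end
```

Normalizing case (lowercase language, title-case script, uppercase region) follows the conventional way of writing these subtags; it is a choice of this sketch, not something the proposal specifies.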

Lookup process

josevalim commented 1 year ago

Awesome @kipcole9 !

Perhaps this is a bit out of scope for Gettext, especially because it can be used outside of Gettext?

maennchen commented 1 year ago

@kipcole9 Thanks for those details. I have a few follow-up questions:

In German we have formal / informal language. It is quite common to have a corresponding gettext translation for each. How would you represent that in a language tag?


Why would you not support extlang?


Does a lookup normally try to find the closest match even if none match strictly or do you compare strictly?

Example:

Would we identify the de-DE language in this scenario, or would we ask the user to provide a translation for “de” itself?


Currently, gettext calls languages “locales” internally. I believe this to be an incorrect term since there’s not necessarily a region involved which would make it a locale. Would you rename where possible to “language”?


Do you think BCP-47 is a good fit (including supporting underscores) even though the Gettext manual specifies ISO 639 as the standard for the Language header?

https://www.gnu.org/software/gettext/manual/html_node/Header-Entry.html

kipcole9 commented 1 year ago

@josevalim ex_cldr implements this proposal (more completely than the proposal), so it's available outside of Gettext now. The reason for including this capability in Gettext would be to more easily support cases where the user's locale is en-AU but the site only implements translations for en. Currently no translation would be found. There are 108 (!) en-* locales alone, 6 de, etc. Given that it's not uncommon to set the locale based upon what the browser sends, being able to match to an available Gettext locale is helpful, I believe.

This could of course be implemented in Gettext.handle_missing_translation/5 so one approach might be to provide a function that a user could delegate to in their own MyApp.Gettext.handle_missing_translation/5 that implements the proposal but otherwise doesn't interfere with default Gettext behaviour?

kipcole9 commented 1 year ago

@maennchen good questions as always:

Why would you not support extlang?

extlang is there to support legacy language tag formats. From the spec:

Language+extlang combinations are provided to accommodate legacy language tag forms

In German we have formal / informal language. It is quite common to have a corresponding gettext translation for each. How would you represent that in a language tag?

In BCP47 that would be handled with a private-use extension. For example, de-CH-x-informal. Or you could try to have a variant subtag added to the IETF subtag registry :-)

Does a lookup normally try to find the closest match even if none match strictly or do you compare strictly?

In this proposal it's a strict match. So in your example de-CH would not resolve to de-DE. Fallback chains (filtering) would cater for this requirement but are more complex and likely outside the scope of Gettext. ex_cldr does this when processing Accept-Language headers.
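To make the strict-match behaviour concrete, here is a minimal sketch of a lookup that only ever truncates the requested tag, in the spirit of the RFC 4647 lookup scheme. The details of the proposed lookup process are not preserved in this thread, so this is an assumption about its shape; `LookupSketch`, `candidates/1` and `lookup/3` are hypothetical names, not Gettext or ex_cldr functions.

```elixir
defmodule LookupSketch do
  # Hypothetical illustration; not a Gettext or ex_cldr API.

  # Progressively drop the last subtag:
  # "en-Latn-AU" -> ["en-Latn-AU", "en-Latn", "en"]
  def candidates(tag) when is_binary(tag) do
    parts = String.split(tag, "-")

    for n <- length(parts)..1//-1 do
      parts |> Enum.take(n) |> Enum.join("-")
    end
  end

  # Return the first candidate that is a known locale, else the default.
  # Comparison is a plain string match, so callers are assumed to have
  # normalized case beforehand.
  def lookup(tag, known_locales, default) do
    Enum.find(candidates(tag), default, &(&1 in known_locales))
  end
end
```

With known locales `["en", "de-DE"]` and default `"fr"`, a request for `en-AU` resolves to `en`, while `de-CH` falls through to the default because `de-DE` is a sibling rather than a parent of `de-CH`.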

Do you think BCP-47 is a good fit

Based upon your reference, maybe not. But the principles can still apply. de-CH@Latn could be considered the same as de-Latn-CH.

Currently, gettext calls languages “locales” internally.

I believe the appropriate ways to reference these are:

In the Gettext context I think the right descriptions are:

maennchen commented 1 year ago

@kipcole9 There are some people who have thought about ISO 639 & BCP47 conversion: https://wiki.openoffice.org/wiki/LocaleMapping

@josevalim I believe based on this that this whole thing is too complicated to handle in gettext as well.

Maybe it would be a good idea to provide a behaviour for the language selection and provide a default implementation.

The default implementation could be just: If the search language starts with the user language, it matches.


@kipcole9

This could of course be implemented in Gettext.handle_missing_translation/5 so one approach might be to provide a function

Why would you implement this functionality at that level instead of earlier when selecting the language to read?

kipcole9 commented 1 year ago

I believe based on this that this whole thing is too complicated to handle in gettext as well

I can extract some code from ex_cldr into a separate library that does the matching (and maybe filtering). If that's preferred, then closing the issue is fine.

whatyouhide commented 1 year ago

Agreed that at this point this is out of scope for Gettext. Let's focus on making sure that Gettext is extensible enough so that users who wish to do so can use more complex locale-discovery logic.

maennchen commented 1 year ago

What do you all think about an approach like this?

defmodule Gettext.LanguageSelection do
  @callback language_parents(locale :: String.t()) :: [String.t()]
end

Our default (based on ISO 639) would look something like this:

Gettext.LanguageSelection.Default.language_parents("de_CH@informal")
# => ["de_CH@informal", "de_CH", "de"]

Gettext would just use the first entry for which a .po file exists, or the default language if none match.
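For illustration, a default along those lines could be sketched as follows. This is not an existing Gettext module: `LanguageParentsSketch`, `language_parents/1` and `select/3` are hypothetical, and the sketch assumes locale names of the form `language[_TERRITORY][@modifier]` with no codeset part (such as `.UTF-8`).

```elixir
defmodule LanguageParentsSketch do
  # Hypothetical default implementation of the proposed callback;
  # not part of Gettext.

  # "de_CH@informal" -> ["de_CH@informal", "de_CH", "de"]
  def language_parents(locale) when is_binary(locale) do
    # Strip the "@modifier" suffix, if any.
    [base | _modifier] = String.split(locale, "@", parts: 2)
    # Strip the "_TERRITORY" suffix, if any.
    [language | _territory] = String.split(base, "_", parts: 2)

    # Enum.uniq/1 keeps the first occurrence, so the most specific
    # name stays first and "de" alone yields just ["de"].
    Enum.uniq([locale, base, language])
  end

  # Pick the first parent that is in the list of known locales,
  # falling back to the default locale when none match.
  def select(locale, known_locales, default) do
    Enum.find(language_parents(locale), default, &(&1 in known_locales))
  end
end
```

Here `known_locales` is passed in as a plain list to keep the sketch self-contained; in an application it could presumably come from `Gettext.known_locales/1`.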

Libraries like CLDR could then provide their own implementation that works more along the lines of this:

Cldr.GettextLanguageSelection.language_parents("de-CH-latin-x-informal")
# => ["de-CH-latin-x-informal", "de-CH-latin", "de-CH-x-informal", "de-CH", "...", "de"]

This way, the default behaviour should be rather simple to implement and would allow for extension.

josevalim commented 1 year ago

I think the job of breaking "de_CH@informal" into multiple locales should be decoupled from Gettext. What Gettext can help with is matching that against the locales it knows.

maennchen commented 1 year ago

@josevalim Ok, fair.

I think we decided to not take any action then and should close this issue. Is that correct or do you still see something that we want to do?

josevalim commented 1 year ago

It depends if there is something to do on this part: "What Gettext can help with is matching that against the locales it knows."

maennchen commented 1 year ago

@josevalim I think we are set then.

That should allow an external library to choose a locale from the list of known ones and then select it.