1Password / sys-locale

A small and lightweight Rust library to obtain the active locale on the system.
Apache License 2.0
81 stars 15 forks source link

Wrong locale evaluation on Unix-like systems without codeset but with modifier #32

Open pasabanov opened 1 month ago

pasabanov commented 1 month ago

According to this specification, the POSIX locale is defined as:

language[_territory][.codeset][@modifier]

For example, the locale De_DE@dict is valid.

However, the current implementation of the library does not check for the @ character, leading to an invalid locale detection when the codeset is not present but the modifier is.

Example:

let fr_FR_euro = "fr_FR@euro".to_owned();
let mut env = MockEnv::new();
env.insert("LANGUAGE".into(), fr_FR_euro);
env.insert("LC_ALL".into(), fr_FR_euro);
env.insert("LC_MESSAGES".into(), fr_FR_euro);
env.insert("LANG".into(), fr_FR_euro);
env.insert("LC_CTYPE".into(), fr_FR_euro);
let locale = _get(&env);
// locale is equal to "fr-FR@euro" which is not a valid BCP 47 locale

The simplest possible solution would be:

However, since some POSIX modifiers might be convertible to BCP 47, a more complex solution would be:

complexspaces commented 1 month ago

Implement full support for POSIX modifier

Do you have a rough idea what this would entail? This is an area I'm not familiar with. The level 2 canonicalization described here look close to what you are mentioning, but I'm not 100% sure!

Generally I'd be fine making this handling more robust at the cost of complexity as long as we don't need to drag in any large ICU crates to handle bundling various chunks of locale/region data used for mapping correctly.

pasabanov commented 1 month ago

Do you have a rough idea what this would entail? This is an area I'm not familiar with. The level 2 canonicalization described here look close to what you are mentioning, but I'm not 100% sure!

I'm not an expert it this field either. The link that you attached is about ICU locales. As I understand it, this is the third locale type along with POSIX and BCP 47. Your library is working with BCP 47 locales, as far as I know, so the conversion algorithm should be different.

For now I'm unsure, what the algorithm should be exactly.

Some useful links for further investigation:

complexspaces commented 1 month ago

Your library is working with BCP 47 locales, as far as I know, so the conversion algorithm should be different.

The reason I mentioned it is because the ICU one appears incredibly similar to the BCP47 format and places like MSDN say "This format is used by Windows and many other environments, including ... ICU, ...". The ICU formalization seems to have, at minimum, cribbed the region code variants and handling.

If the above holds up (I could try doing a more detailed comparison), then this statement is valid for sys-locale's considerations because we are parsing POSIX locales:

Level 2 canonicalization is designed to translate POSIX and .NET IDs, as well as nonstandard ICU locale IDs.

pasabanov commented 1 month ago

This should be reopened due to the unresolved conversation about POSIX to BCP 47 modifiers conversion.

complexspaces commented 1 month ago

Whoops, you're right. The GitHub autoclose syntax grabbed it by mistake in your PR.

pasabanov commented 1 month ago

That's because I wrote "partially resolves ..." there. GitHub didn't recognize the word "partially".