TeX-Live / texdoc

Find and view documentation in TeX Live
https://tug.org/texdoc/
GNU General Public License v3.0

Better locale guessing #77

Closed wtsnjp closed 1 year ago

wtsnjp commented 2 years ago

To calculate a better score for the found documents, Texdoc tries to get the system locale. At the moment, this feature relies entirely on the os.setlocale() function (which internally calls the C setlocale() function). In addition, I am now thinking of checking the LANG variable and setting the lang configuration from its value, but only if Texdoc fails to get the locale information from os.setlocale().

With the current implementation, Texdoc sometimes fails to get the "expected" locale. We have received multiple reports that Texdoc does not recognize any locale even though the user has set the LANG variable (see https://github.com/TeX-Live/texdoc/issues/76#issuecomment-1072306460 and the mailing list).

Not surprisingly, the behavior of the setlocale() function seems to depend heavily on the platform. The exact precedence of the related variables, and the values they take, can differ among platforms. Notably, setlocale() sometimes returns a value that Texdoc cannot interpret, e.g., Japanese_Japan.932 on Windows; Texdoc only supports values starting with a two-letter language code, like ja_JP.UTF-8.
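To make the "starts with a two-letter language code" rule concrete, here is a minimal sketch. It is written in Python purely for illustration (Texdoc itself is written in Lua), and the function name and the exact pattern are assumptions, not Texdoc's actual code.

```python
import re

# Assumption: a value is "interpretable" when it starts with a two-letter
# lowercase language code, optionally followed by a territory, codeset,
# or modifier part (e.g. ja_JP.UTF-8, de_DE@euro).
_LOCALE_PATTERN = re.compile(r"([a-z]{2})(?:[_.@-].*)?")


def language_code(locale_value):
    """Return the two-letter language code, or None if the value is unusable."""
    if not locale_value:
        return None
    match = _LOCALE_PATTERN.fullmatch(locale_value)
    return match.group(1) if match else None
```

Under this rule, ja_JP.UTF-8 yields ja, while Japanese_Japan.932, C, and POSIX are all rejected.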

I would rather follow the convention of Unix tools for this locale setting, so I checked IEEE Std 1003.1, 2004 Edition:

LANG: This variable shall determine the locale category for native language, local customs, and coded character set in the absence of the LC_ALL and other LC_* (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) environment variables. This can be used by applications to determine the language to use for error messages and instructions, collating sequences, date formats, and so on.

LC_ALL: This variable shall determine the values for all locale categories. The value of the LC_ALL environment variable has precedence over any of the other environment variables starting with LC_ (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) and the LANG environment variable.

It says we can consider LANG if the LC_* variables are absent. But what if LC_ALL exists and its value is invalid (in the Texdoc sense)? Should we consider the value of the LANG variable when we cannot get a valid language code from the os.setlocale() function?
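The precedence in the quoted text can be sketched as follows. This is an illustrative Python sketch, not Texdoc code; the function name is hypothetical, and the environment is passed in as a plain dict so that the example does not depend on the real system settings.

```python
def posix_locale_for(category, env):
    """Pick the locale value for one category following the POSIX precedence:
    LC_ALL first, then the category's own variable, then LANG.
    Empty values are treated as unset."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return None  # the default is implementation-defined; not modeled here


# The case in question: LC_ALL is set to a value without a language code,
# so it takes precedence over an otherwise usable LANG.
env = {"LC_ALL": "C", "LANG": "ja_JP.UTF-8"}
selected = posix_locale_for("LC_MESSAGES", env)  # "C"
```

With strict precedence, the ja in LANG is never seen here, which is exactly the situation where an extra fallback to LANG would help.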

@kberry @norbusan do you have any suggestions on this?

kberry commented 2 years ago

It sounds completely sensible to me to fall back to looking at LANG if LC_* values are unusable, whether absent or invalid. I see nothing to be lost, and it seems far better for users if texdoc tries as best it can to guess their intended locale.

By the way, GNU gettext also supports a LANGUAGE envvar, overriding all the others, but I rarely see this used nowadays. https://www.gnu.org/software/gettext/manual/gettext.html#Locale-Environment-Variables

Best, Karl

wtsnjp commented 2 years ago

Ok, then I will go for that. Thanks!

lemzwerg commented 1 year ago

As this thread shows, there is much more broken regarding language selection than what the OP reports...

wtsnjp commented 1 year ago

The system locale obtained with os.setlocale() seems more static than the values obtained from environment variables. I will change Texdoc to check the environment variables first and then use os.setlocale() as a fallback. Borrowing from the specification of GNU gettext, the priority list will be:

  1. LANGUAGE_texdoc
  2. LANGUAGE
  3. LC_ALL
  4. LANG
  5. os.setlocale()
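
The priority list above could be implemented roughly like this. It is a sketch in Python for illustration only, not the actual Texdoc change. Several details are assumptions: the helper names, the rule that an unusable value falls through to the next entry, the handling of only the first entry of a colon-separated LANGUAGE value, and the system_locale parameter, which stands in for whatever os.setlocale() would return.

```python
import re

# Environment variables in the order given in the list above.
ENV_PRIORITY = ("LANGUAGE_texdoc", "LANGUAGE", "LC_ALL", "LANG")


def _language_code(value):
    """Return a two-letter language code, or None if the value is unusable."""
    if not value:
        return None
    # Assumption: for a colon-separated list (as GNU gettext allows in
    # LANGUAGE), only the first entry is considered.
    first = value.split(":", 1)[0]
    match = re.fullmatch(r"([a-z]{2})(?:[_.@-].*)?", first)
    return match.group(1) if match else None


def guess_language(env, system_locale=None):
    """Walk the priority list; fall back to the system locale string last."""
    for name in ENV_PRIORITY:
        code = _language_code(env.get(name))
        if code:
            return code
    # system_locale is a stand-in for the os.setlocale() result.
    return _language_code(system_locale)
```

For example, guess_language({"LC_ALL": "C", "LANG": "ja_JP.UTF-8"}) skips the unusable LC_ALL and returns ja, while an empty environment with system_locale="Japanese_Japan.932" yields no language at all.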