apple / cups

Apple CUPS Sources
https://www.cups.org
Apache License 2.0

CUPS uses LC_MESSAGES to determine the charmap of the current locale #1194

Closed michaelrsweet closed 19 years ago

michaelrsweet commented 19 years ago

Version: 1.1.23 CUPS.org User: mfabian

To show that CUPS behaves as stated in the summary of the bug, I use the following locale setting for testing:

mfabian@magellan:~/test-texts$ locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
mfabian@magellan:~/test-texts$

Now, when setting LC_CTYPE to a UTF-8 locale and printing a UTF-8 encoded test file:

mfabian@magellan:~/test-texts$ LC_CTYPE=en_GB.UTF-8 lp < german.utf-8 
request id is test-110 (1 file(s))
mfabian@magellan:~/test-texts$

it is not printed correctly.

But when LC_MESSAGES is set instead of LC_CTYPE:

mfabian@magellan:~/test-texts$ LC_MESSAGES=en_GB.UTF-8 lp < german.utf-8
request id is test-111 (1 file(s))
mfabian@magellan:~/test-texts$

the file is printed correctly.

The test file looks like this:

mfabian@magellan:~/test-texts$ cat german.utf-8
-*- coding: utf-8 -*-
Grüß Gott! €
mfabian@magellan:~/test-texts$

I'll attach it as well.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

(Recycling most of a comment by Markus Kuhn from Bug #41006 on http://bugzilla.novell.com):

LC_MESSAGES is not the variable which determines the charmap of the current locale. Instead it is determined from the effective value of LC_CTYPE (I wrote "effective" because LC_CTYPE may be overridden by LC_ALL, or it may be unset, in which case it inherits the value from LANG).

See the Open Group's Single Unix Specification, which has since 2001 been identical to the IEEE/ISO POSIX standard, available freely on

http://www.opengroup.org/onlinepubs/007904975/

under Base Definitions/Environment Variables you can read:

LC_CTYPE This environment variable determines the interpretation of sequences of bytes of text data as characters (for example, single as opposed to multi-byte characters), the classification of characters (for example, alpha, digit, graph), and the behavior of character classes.

Further down the same page, this environment variable (like all of LC_*) inherits a default value from LANG and can be overridden with LC_ALL. Therefore, to read LC_CTYPE correctly, you need to use something like

if (((s = getenv("LC_ALL")) && *s) ||
    ((s = getenv("LC_CTYPE")) && *s) ||
    ((s = getenv("LANG")) && *s)) {
  printf("LC_CTYPE = %s\n", s);
}

The "locale" command line tool does that for example.

The proper way to find out the encoding used is to call the function nl_langinfo(CODESET), which is also what the command-line "locale charmap" does, because the name of the character set in use is actually defined in the locale definition file that is identified by the LANG or LC_* variables.

Until about two years ago, FreeBSD was the last widely used Unix variant that still lacked nl_langinfo(), therefore people had to use workarounds that tried to guess the encoding name from the LC_CTYPE locale name, which is problematic. Two such workaround hacks are linked on

http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

Fortunately, in 2003 this practice is no longer needed, because nl_langinfo() is now a proper universally implemented POSIX API call.

michaelrsweet commented 19 years ago

CUPS.org User: mkuhn

In a nutshell, the portable way for any application on a POSIX system today to determine the character set selected by the locale is:


#include <stdio.h>
#include <locale.h>
#include <langinfo.h>

int main()
{
  if (!setlocale(LC_CTYPE, "")) {
    fprintf(stderr, "Can't set the specified locale! "
            "Check LANG, LC_CTYPE, LC_ALL.\n");
    return 1;
  }
  puts(nl_langinfo(CODESET));
  return 0;
}

This is formally guaranteed by the POSIX spec to work on any system where

_POSIX_VERSION >= 200112L

but it will work in practice almost anywhere else, too.

Unfortunately, the output syntax of nl_langinfo(CODESET) is not standardized properly. In practice, UTF-8 is always signalled as "UTF-8", but ISO 8859-15 can come as "ISO8859-15", "ISO_8859-15", "ISO-8859-15", etc.

Therefore, it is a good idea to normalize the output of nl_langinfo(CODESET), and the simple public-domain function

http://www.cl.cam.ac.uk/~mgk25/ucs/norm_charmap.c

can be used to do exactly that.
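
To illustrate the kind of normalization meant here, the following is a minimal sketch only. It is not the norm_charmap() function linked above; the helper names normalize_codeset() and same_codeset() are hypothetical, and the approach (strip everything except letters and digits, then upper-case) is a simplifying assumption that is good enough to make "ISO8859-15", "ISO_8859-15" and "ISO-8859-15" compare equal.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not Kuhn's norm_charmap()): reduce a codeset name to
 * its upper-cased letters and digits, so "ISO_8859-15" becomes "ISO885915".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int normalize_codeset(const char *name, char *out, size_t outsize)
{
  size_t n = 0;

  if (outsize == 0)
    return -1;

  for (; *name; name++)
  {
    if (!isalnum((unsigned char)*name))
      continue;                     /* drop '-', '_', '.', spaces, ... */

    if (n + 1 >= outsize)
    {
      out[0] = '\0';
      return -1;
    }

    out[n++] = (char)toupper((unsigned char)*name);
  }

  out[n] = '\0';
  return 0;
}

/* Compare two codeset names, ignoring case and punctuation. */
static int same_codeset(const char *a, const char *b)
{
  char na[64], nb[64];

  if (normalize_codeset(a, na, sizeof(na)) ||
      normalize_codeset(b, nb, sizeof(nb)))
    return 0;

  return strcmp(na, nb) == 0;
}
```

A caller could then write something like same_codeset(nl_langinfo(CODESET), "ISO-8859-15") instead of comparing the raw strings. A real implementation would probably map known aliases to canonical names rather than just stripping punctuation.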

michaelrsweet commented 19 years ago

CUPS.org User: mkuhn

Should there be concern about pre-2001 POSIX systems that do not implement nl_langinfo(CODESET), then a public-domain workaround emulator for it, which guesses the character set based on the locale name from the environment variables, is available on:

http://www.cl.cam.ac.uk/~mgk25/ucs/langinfo.c

That routine was widely used before FreeBSD finally added nl_langinfo(CODESET) support with version 4.6 in mid 2002 (the last widely-used POSIX system that was still missing it). I doubt it is still necessary today.
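
For readers who have not seen such a workaround, here is a rough sketch of the idea, under stated assumptions. It is not the langinfo.c emulator linked above, which does considerably more guessing; the function names effective_ctype_locale() and guess_codeset() are hypothetical. It only handles locale names that carry an explicit codeset, such as "en_GB.UTF-8" or "de_DE.ISO-8859-15@euro".

```c
#include <stdlib.h>
#include <string.h>

/*
 * Effective LC_CTYPE locale name: LC_ALL overrides LC_CTYPE, which
 * overrides LANG. Empty values count as unset. Returns NULL if none is set.
 */
static const char *effective_ctype_locale(void)
{
  static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
  size_t i;

  for (i = 0; i < sizeof(vars) / sizeof(vars[0]); i++)
  {
    const char *s = getenv(vars[i]);

    if (s && *s)
      return s;
  }

  return NULL;
}

/*
 * Hypothetical fallback: copy the codeset part of a locale name of the form
 * language_TERRITORY.CODESET@modifier into out. Returns 0 on success, -1 if
 * the name has no codeset part or the buffer is too small.
 */
static int guess_codeset(const char *locale, char *out, size_t outsize)
{
  const char *start, *end;
  size_t len;

  if (!locale || (start = strchr(locale, '.')) == NULL)
    return -1;

  start++;                                  /* skip the '.' */
  end = strchr(start, '@');                 /* optional modifier */
  len = end ? (size_t)(end - start) : strlen(start);

  if (len == 0 || len >= outsize)
    return -1;

  memcpy(out, start, len);
  out[len] = '\0';
  return 0;
}
```

Locale names without a codeset suffix ("POSIX", "de_DE") are exactly where this kind of guessing becomes unreliable, which is why nl_langinfo(CODESET) is preferable wherever it exists.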

michaelrsweet commented 19 years ago

CUPS.org User: mike

CUPS already uses nl_langinfo(CODESET) when it is available. See the cups/language.c source file.

The current code tests for both nl_langinfo() and a definition of the CODESET constant - if either is missing, the code falls back on environment variables.
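
The shape of such a compile-time test might look roughly like the sketch below. This is an illustration, not the code from cups/language.c: the function name codeset_or_locale_name() is hypothetical, and HAVE_LANGINFO_H is assumed to be defined by a configure-generated header that has already been included. If that header is not included first, the guarded branch is silently skipped and the environment fallback is used.

```c
#include <stdlib.h>

/*
 * HAVE_LANGINFO_H is assumed to come from a configure-generated header that
 * must be included before this point; it is deliberately not defined here.
 */
#ifdef HAVE_LANGINFO_H
#  include <langinfo.h>
#endif

/*
 * Hypothetical helper: return the codeset name when nl_langinfo(CODESET) is
 * usable at compile time, otherwise the effective LC_CTYPE locale name taken
 * from the environment ("C" if nothing is set).
 */
static const char *codeset_or_locale_name(void)
{
#if defined(HAVE_LANGINFO_H) && defined(CODESET)
  /* The caller is expected to have called setlocale(LC_CTYPE, "") first. */
  return nl_langinfo(CODESET);
#else
  const char *s;

  if (((s = getenv("LC_ALL")) && *s) ||
      ((s = getenv("LC_CTYPE")) && *s) ||
      ((s = getenv("LANG")) && *s))
    return s;

  return "C";
#endif
}
```

Note that the fallback branch returns a locale name rather than a codeset name, so the caller still has to extract or guess the character set from it.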

Any fix for this will be delayed until 1.2. However, if you can look at the current cups/language.c source file and see why it is not working on your OS of choice, we'll be happy to make the necessary changes.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

nl_langinfo(CODESET) is not used because

#include <config.h>

is missing in cups-1.1.23/cups/language.c.

Without that,

#ifdef HAVE_LANGINFO_H
#include <langinfo.h>
#endif /* HAVE_LANGINFO_H */

will of course not include langinfo.h and then CODESET will be undefined.

That's not the only bug though, even with that fix it still doesn't seem to work right.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

#ifdef LC_MESSAGES

etc. didn't work because

#include <locale.h>

was missing.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

I attached a patch "locale.patch" which hopefully fixes the problem.

michaelrsweet commented 19 years ago

CUPS.org User: mike

OK, first, we'd need a patch against CUPS 1.2. 1.1.x is closed for all but security bugs.

As for <locale.h>, it is included by "language.h". By including "string.h" (which includes <config.h>) before checking for langinfo, all of the right headers should now be included...

Please look at the current 1.2 sources; here is a direct link:

http://svn.easysw.com/public/cups/trunk/cups/language.c

The current code seems to work the "right" way using nl_langinfo() when available...

michaelrsweet commented 19 years ago

CUPS.org User: mike

This STR has not been updated by the submitter for two or more weeks and has been closed as required by the CUPS Configuration Management Plan. If the issue still requires resolution, please re-submit a new STR.

michaelrsweet commented 19 years ago

"locale.patch":

diff -ru cups-1.1.23.orig/cups/language.c cups-1.1.23/cups/language.c
--- cups-1.1.23.orig/cups/language.c 2005-01-03 20:29:45.000000000 +0100
+++ cups-1.1.23/cups/language.c 2005-06-15 17:56:09.000000000 +0200
@@ -40,9 +40,11 @@

+#include

include

include

include

+#include

 #ifdef HAVE_LANGINFO_H
 #include <langinfo.h>
 #endif /* HAVE_LANGINFO_H */

@@ -114,6 +116,116 @@
 };

+#ifndef HAVE_LANGINFO_H
+/*

- */

-# ifdef LC_MESSAGES

- ptr ? ptr : "(null)"));

-#ifdef CODESET /*

- */

- *ptr++ = toupper(*language & 255);

- }

/*
 * Force a POSIX locale for an invalid language name...
 */

@@ -410,7 +486,6 @@
 {
 strcpy(langname, "C");
 country[0] = '\0';

diff -ru cups-1.1.23.orig/scheduler/type.c cups-1.1.23/scheduler/type.c
--- cups-1.1.23.orig/scheduler/type.c 2005-01-03 20:29:59.000000000 +0100
+++ cups-1.1.23/scheduler/type.c 2005-06-15 14:31:05.000000000 +0200
@@ -942,6 +942,8 @@
 case MIME_MAGIC_LOCALE :
 #if defined(WIN32) || defined(__EMX__) || defined(__APPLE__)
        result = (strcmp(rules->value.localev, setlocale(LC_ALL, "")) == 0);
+#elif defined(__GLIBC__) && defined(LC_CTYPE)