michaelrsweet commented 19 years ago

Version: 1.1.23 CUPS.org User: mfabian

To show that CUPS behaves as stated in the summary of the bug, I use the following locale setting for testing:

mfabian@magellan:~/test-texts$ locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
mfabian@magellan:~/test-texts$

Now, when setting LC_CTYPE to a UTF-8 locale and printing a UTF-8 encoded test file:

mfabian@magellan:~/test-texts$ LC_CTYPE=en_GB.UTF-8 lp < german.utf-8 
request id is test-110 (1 file(s))
mfabian@magellan:~/test-texts$

it is not printed correctly.

But when LC_MESSAGES is set instead of LC_CTYPE:

mfabian@magellan:~/test-texts$ LC_MESSAGES=en_GB.UTF-8 lp < german.utf-8
request id is test-111 (1 file(s))
mfabian@magellan:~/test-texts$

the file is printed correctly.

The test file looks like this:

mfabian@magellan:~/test-texts$ cat german.utf-8
-*- coding: utf-8 -*-
Grüß Gott! €
mfabian@magellan:~/test-texts$

I'll attach it as well.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

(Recycling most of a comment by Markus Kuhn from Bug #41006 on http://bugzilla.novell.com):

LC_MESSAGES is not the variable which determines the charmap of the current locale. Instead it is determined from the effective value of LC_CTYPE (I wrote "effective" because LC_CTYPE may be overridden by LC_ALL, or it may be unset, in which case it inherits the value from LANG).

See the Open Group's Single Unix Specification, which has since 2001 been identical to the IEEE/ISO POSIX standard, available freely on

http://www.opengroup.org/onlinepubs/007904975/

under Base Definitions/Environment Variables you can read:

LC_CTYPE This environment variable determines the interpretation of sequences of bytes of text data as characters (for example, single as opposed to multi-byte characters), the classification of characters (for example, alpha, digit, graph), and the behavior of character classes.

Further down the same page, you can read that this environment variable (like all of LC_*) inherits its default value from LANG and can be overridden by LC_ALL. Therefore, to read LC_CTYPE correctly, you need to use something like

    if (((s = getenv("LC_ALL")) && *s) ||
        ((s = getenv("LC_CTYPE")) && *s) ||
        ((s = getenv("LANG")) && *s)) {
      printf("LC_CTYPE = %s\n", s);
    }

The "locale" command line tool does that for example.
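The precedence described above can also be written as a small helper. This is only a sketch: the function name `effective_lc_ctype` is hypothetical and not part of CUPS, and returning NULL when nothing is set is an assumption made here, since POSIX leaves the default locale implementation-defined in that case.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of CUPS): return the locale name that
 * applies to LC_CTYPE, using the precedence LC_ALL > LC_CTYPE > LANG.
 * A variable that is set to the empty string counts as unset.
 * Returns NULL if none of the three is set (assumption: the caller
 * then treats the locale as the implementation's default). */
static const char *effective_lc_ctype(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
        const char *s = getenv(vars[i]);

        if (s && *s)
            return s;
    }
    return NULL;
}
```

With LC_ALL empty, LC_CTYPE=en_GB.UTF-8 and LANG=C, the helper returns the LC_CTYPE value; as soon as LC_ALL is non-empty, that value wins instead.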

The proper way to find out the encoding in use is to call the function nl_langinfo(CODESET), which is also what the command-line "locale charmap" does, because the name of the character set is actually defined in the locale definition file that is identified by the LANG or LC_* variables.

Until about two years ago, FreeBSD was the last widely used Unix variant that still lacked nl_langinfo(), therefore people had to use workarounds that tried to guess the encoding name from the LC_CTYPE locale name, which is problematic. Two such workaround hacks are linked on

http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate

Fortunately, in 2003 this practice is no longer needed, because nl_langinfo() is now a proper universally implemented POSIX API call.

michaelrsweet commented 19 years ago

CUPS.org User: mkuhn

In a nutshell, the portable way for any application on a POSIX system today to determine the character set selected by the locale is:

    #include <stdio.h>
    #include <locale.h>
    #include <langinfo.h>

    int main()
    {
      if (!setlocale(LC_CTYPE, "")) {
        fprintf(stderr, "Can't set the specified locale! "
                "Check LANG, LC_CTYPE, LC_ALL.\n");
        return 1;
      }
      puts(nl_langinfo(CODESET));
      return 0;
    }

This is formally guaranteed by the POSIX spec to work on any system where

_POSIX_VERSION >= 200112L

but it will work in practice almost anywhere else, too.

Unfortunately, the output syntax of nl_langinfo(CODESET) is not standardized properly. In practice, UTF-8 is always signalled as "UTF-8", but ISO 8859-15 can come as "ISO8859-15", "ISO_8859-15", "ISO-8859-15", etc.

Therefore, it is a good idea to normalize the output of nl_langinfo(CODESET), and the simple public-domain function

http://www.cl.cam.ac.uk/~mgk25/ucs/norm_charmap.c

can be used to do exactly that.
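To illustrate what such a normalization involves, here is a deliberately small sketch. It is not the code from norm_charmap.c; the function name `normalize_codeset` and the two rules it applies (UTF-8 and the ISO 8859 family) are assumptions chosen to match the spellings mentioned above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified normalizer (NOT norm_charmap.c): maps the
 * common spelling variants of UTF-8 and ISO 8859-x to one canonical
 * form and copies every other name through unchanged. */
static const char *normalize_codeset(const char *name, char *out, size_t outlen)
{
    char squeezed[64];
    const char *p;
    size_t n = 0;

    /* Keep only letters and digits, uppercased:
     * "iso_8859-15" becomes "ISO885915". */
    for (p = name; *p && n < sizeof(squeezed) - 1; p++)
        if (isalnum((unsigned char)*p))
            squeezed[n++] = (char)toupper((unsigned char)*p);
    squeezed[n] = '\0';

    if (strcmp(squeezed, "UTF8") == 0)
        snprintf(out, outlen, "UTF-8");
    else if (strncmp(squeezed, "ISO8859", 7) == 0 && squeezed[7] != '\0')
        snprintf(out, outlen, "ISO-8859-%s", squeezed + 7);
    else
        snprintf(out, outlen, "%s", name);  /* unknown name: leave as-is */

    return out;
}
```

A caller would pass the string returned by nl_langinfo(CODESET) together with a buffer of its own, so that "ISO8859-15", "ISO_8859-15" and "ISO-8859-15" all compare equal afterwards.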

michaelrsweet commented 19 years ago

CUPS.org User: mkuhn

Should there be concern about pre-2001 POSIX systems that do not implement nl_langinfo(CODESET), then a public-domain workaround emulator for it, which guesses the character set based on the locale name from the environment variables, is available on:

http://www.cl.cam.ac.uk/~mgk25/ucs/langinfo.c

That routine was widely used before FreeBSD finally added nl_langinfo(CODESET) support with version 4.6 in mid 2002 (the last widely-used POSIX system that was still missing it). I doubt it is still necessary today.

michaelrsweet commented 19 years ago

CUPS.org User: mike

CUPS already uses nl_langinfo(CODESET) when it is available. See the cups/language.c source file.

The current code tests for both nl_langinfo() and a definition of the CODESET constant - if both are not found, the code falls back on environment variables.
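The general shape of such a check might look like the sketch below. It is not the code from cups/language.c: the function names are hypothetical, HAVE_LANGINFO_H is assumed to be provided by the build system (for example via a generated config.h), and the fallback simply takes the text between '.' and '@' in the locale name.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef HAVE_LANGINFO_H          /* assumed to be set by the build system */
#  include <langinfo.h>
#endif /* HAVE_LANGINFO_H */

/* Hypothetical fallback: copy the part between '.' and '@' of a locale
 * name such as "de_DE.ISO-8859-15@euro" into out; out is left empty if
 * the name carries no encoding. */
static const char *codeset_from_name(const char *locale, char *out, size_t outlen)
{
    const char *dot = strchr(locale, '.');
    size_t n = 0;

    if (outlen == 0)
        return out;
    if (dot)
        for (dot++; *dot && *dot != '@' && n < outlen - 1; dot++)
            out[n++] = *dot;
    out[n] = '\0';
    return out;
}

/* Hypothetical wrapper: prefer nl_langinfo(CODESET) when the header was
 * found and defines CODESET, otherwise guess from the environment.
 * Note that it calls setlocale() and so changes process-wide state. */
static const char *current_codeset(char *out, size_t outlen)
{
    const char *name;

#ifdef CODESET
    if (setlocale(LC_CTYPE, "")) {
        snprintf(out, outlen, "%s", nl_langinfo(CODESET));
        return out;
    }
#endif /* CODESET */

    /* Fallback: effective LC_CTYPE value, LC_ALL > LC_CTYPE > LANG. */
    if (!((name = getenv("LC_ALL")) && *name) &&
        !((name = getenv("LC_CTYPE")) && *name) &&
        !((name = getenv("LANG")) && *name))
        name = "";

    return codeset_from_name(name, out, outlen);
}
```

One consequence of this layout is that if HAVE_LANGINFO_H is never defined, the nl_langinfo() branch is skipped at compile time without any warning and only the environment-based guess remains.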

Any fix for this will be delayed until 1.2; however, if you can look at the current cups/language.c source file and see why it is not working on your OS of choice, we'll be happy to make the necessary changes.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

nl_langinfo(CODESET) is not used because

    #include <config.h>

is missing in cups-1.1.23/cups/language.c.

Without that,

    #ifdef HAVE_LANGINFO_H
    #  include <langinfo.h>
    #endif /* HAVE_LANGINFO_H */

will of course not include langinfo.h and then CODESET will be undefined.

That's not the only bug though, even with that fix it still doesn't seem to work right.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

#ifdef LC_MESSAGES

etc. didn't work because

#include <locale.h>

was missing.

michaelrsweet commented 19 years ago

CUPS.org User: mfabian

I attached a patch "locale.patch" which hopefully fixes the problem.

michaelrsweet commented 19 years ago

CUPS.org User: mike

OK, first, we'd need a patch against CUPS 1.2. 1.1.x is closed for all but security bugs.

As for <locale.h>, it is included by "language.h". By including "string.h" (which includes <config.h>) before checking for langinfo, all of the right headers should now be included...

Please look at the current 1.2 sources; here is a direct link:

http://svn.easysw.com/public/cups/trunk/cups/language.c

The current code seems to work the "right" way using nl_langinfo() when available...

michaelrsweet commented 19 years ago

CUPS.org User: mike

This STR has not been updated by the submitter for two or more weeks and has been closed as required by the CUPS Configuration Management Plan. If the issue still requires resolution, please re-submit a new STR.

michaelrsweet commented 19 years ago

"locale.patch":

diff -ru cups-1.1.23.orig/cups/language.c cups-1.1.23/cups/language.c
--- cups-1.1.23.orig/cups/language.c 2005-01-03 20:29:45.000000000 +0100
+++ cups-1.1.23/cups/language.c 2005-06-15 17:56:09.000000000 +0200
@@ -40,9 +40,11 @@
  * Include necessary headers...
  */

+#include
 #include
+#include
 #ifdef HAVE_LANGINFO_H
 #  include <langinfo.h>
 #endif /* HAVE_LANGINFO_H */

@@ -114,6 +116,116 @@
 };

+#ifndef HAVE_LANGINFO_H
+/*
+ * This is a quick-and-dirty emulator of the nl_langinfo(CODESET)
+ * function defined in the Single Unix Specification for those systems
+ * (FreeBSD, etc.) that don't have one yet. It behaves as if it had
+ * been called after setlocale(LC_CTYPE, ""), that is it looks at
+ * the locale environment variables.
+ *
+ * http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html
+ *
+ * Please extend it as needed and suggest improvements to the author.
+ * This emulator will hopefully become redundant soon as
+ * nl_langinfo(CODESET) becomes more widely implemented.
+ *
+ * Since the proposed Li18nux encoding name registry is still not mature,
+ * the output follows the MIME registry where possible:
+ *
+ *   http://www.iana.org/assignments/character-sets
+ *
+ * A possible autoconf test for the availability of nl_langinfo(CODESET)
+ * can be found in
+ *
+ *   http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate
+ *
+ * Markus.Kuhn@cl.cam.ac.uk -- 2002-03-11
+ * Permission to use, copy, modify, and distribute this software
+ * for any purpose and without fee is hereby granted. The author
+ * disclaims all warranties with regard to this software.
+ *
+ * Latest version:
+ *
+ *   http://www.cl.cam.ac.uk/~mgk25/ucs/langinfo.c
+ */
+
+#define C_CODESET "US-ASCII"     /* Return this as the encoding of the
+                                  * C/POSIX locale. Could as well one day
+                                  * become "UTF-8". */
+
+#define digit(x) ((x) >= '0' && (x) <= '9')
+
+static char buf[16];
+
+char *nl_langinfo(nl_item item)
+{
+  char *l, *p;
+
+  DEBUG_printf(("cupsLangGet: using emulator for nl_langinfo(CODESET)\n"));
+
+  if (item != CODESET)
+    return NULL;
+  if (((l = getenv("LC_ALL"))   && *l) ||
+      ((l = getenv("LC_CTYPE")) && *l) ||
+      ((l = getenv("LANG"))     && *l)) {
+    /* check standardized locales */
+    if (!strcmp(l, "C") || !strcmp(l, "POSIX"))
+      return C_CODESET;
+    /* check for encoding name fragment */
+    if (strstr(l, "UTF") || strstr(l, "utf"))
+      return "UTF-8";
+    if ((p = strstr(l, "8859-"))) {
+      memcpy(buf, "ISO-8859-\0\0", 12);
+      p += 5;
+      if (digit(*p)) {
+        buf[9] = *p++;
+        if (digit(*p)) buf[10] = *p++;
+        return buf;
+      }
+    }
+    if (strstr(l, "KOI8-R")) return "KOI8-R";
+    if (strstr(l, "KOI8-U")) return "KOI8-U";
+    if (strstr(l, "620")) return "TIS-620";
+    if (strstr(l, "2312")) return "GB2312";
+    if (strstr(l, "HKSCS")) return "Big5HKSCS";   /* no MIME charset */
+    if (strstr(l, "Big5") || strstr(l, "BIG5")) return "Big5";
+    if (strstr(l, "GBK")) return "GBK";           /* no MIME charset */
+    if (strstr(l, "18030")) return "GB18030";     /* no MIME charset */
+    if (strstr(l, "Shift_JIS") || strstr(l, "SJIS")) return "Shift_JIS";
+    /* check for conclusive modifier */
+    if (strstr(l, "euro")) return "ISO-8859-15";
+    /* check for language (and perhaps country) codes */
+    if (strstr(l, "zh_TW")) return "Big5";
+    if (strstr(l, "zh_HK")) return "Big5HKSCS";   /* no MIME charset */
+    if (strstr(l, "zh")) return "GB2312";
+    if (strstr(l, "ja")) return "EUC-JP";
+    if (strstr(l, "ko")) return "EUC-KR";
+    if (strstr(l, "ru")) return "KOI8-R";
+    if (strstr(l, "uk")) return "KOI8-U";
+    if (strstr(l, "pl") || strstr(l, "hr") ||
+        strstr(l, "hu") || strstr(l, "cs") ||
+        strstr(l, "sk") || strstr(l, "sl")) return "ISO-8859-2";
+    if (strstr(l, "eo") || strstr(l, "mt")) return "ISO-8859-3";
+    if (strstr(l, "el")) return "ISO-8859-7";
+    if (strstr(l, "he")) return "ISO-8859-8";
+    if (strstr(l, "tr")) return "ISO-8859-9";
+    if (strstr(l, "th")) return "TIS-620";      /* or ISO-8859-11 */
+    if (strstr(l, "lt")) return "ISO-8859-13";
+    if (strstr(l, "cy")) return "ISO-8859-14";
+    if (strstr(l, "ro")) return "ISO-8859-2";   /* or ISO-8859-16 */
+    if (strstr(l, "am") || strstr(l, "vi")) return "UTF-8";
+    /* Send me further rules if you like, but don't forget that we are
+     * only interested in locale naming conventions on platforms
+     * that do not already provide an nl_langinfo(CODESET) implementation. */
+    return "ISO-8859-1"; /* should perhaps be "UTF-8" instead */
+  }
+  return C_CODESET;
+}
+
+#endif /* not HAVE_LANGINFO_H */
+
 /*
  * 'cupsLangEncoding()' - Return the character encoding (us-ascii, etc.)
  *                        for the given language.
@@ -250,31 +362,10 @@
   if (language == NULL)
     language = appleLangDefault();
 #else
+  setlocale(LC_ALL, "");
   if (language == NULL)
   {
-   /*
-    * First see if the locale has been set; if it is still "C" or
-    * "POSIX", set the locale to the default...
-    */
-
-#  ifdef LC_MESSAGES
-    ptr = setlocale(LC_MESSAGES, NULL);
-#  else
-    ptr = setlocale(LC_ALL, NULL);
-#  endif /* LC_MESSAGES */
-
-    DEBUG_printf(("cupsLangGet: current locale is \"%s\"\n",
-                  ptr ? ptr : "(null)"));
-
-    if (!ptr || !strcmp(ptr, "C") || !strcmp(ptr, "POSIX"))
-#  ifdef LC_MESSAGES
-    {
-      ptr = setlocale(LC_MESSAGES, "");
-      setlocale(LC_CTYPE, "");
-    }
-#  else
-      ptr = setlocale(LC_ALL, "");
-#  endif /* LC_MESSAGES */
+    ptr = setlocale(LC_MESSAGES, "");

     if (ptr)
     {
@@ -309,7 +400,6 @@

     charset[0] = '\0';

-#ifdef CODESET
    /*
     * On systems that support the nl_langinfo(CODESET) call, use
     * this value as the character set...
@@ -330,7 +420,6 @@
       DEBUG_printf(("cupsLangGet: charset set to \"%s\" via nl_langinfo(CODESET)...\n",
                     charset));
     }
-#endif /* CODESET */

    /*
     * Set the locale back to POSIX while we do string ops, since
@@ -389,19 +478,6 @@
       *ptr = '\0';
     }

-    if (*language == '.' && !charset[0])
-    {
-     /*
-      * Copy the encoding...
-      */
-
-      for (language ++, ptr = charset; *language; language ++)
-        if (isalnum(*language & 255) && ptr < (charset + sizeof(charset) - 1))
-          *ptr++ = toupper(*language & 255);
-
-      *ptr = '\0';
-    }
-
    /*
     * Force a POSIX locale for an invalid language name...
     */
@@ -410,7 +486,6 @@
     {
       strcpy(langname, "C");
       country[0] = '\0';
-      charset[0] = '\0';
     }
   }


diff -ru cups-1.1.23.orig/scheduler/type.c cups-1.1.23/scheduler/type.c
--- cups-1.1.23.orig/scheduler/type.c 2005-01-03 20:29:59.000000000 +0100
+++ cups-1.1.23/scheduler/type.c 2005-06-15 14:31:05.000000000 +0200
@@ -942,6 +942,8 @@
           case MIME_MAGIC_LOCALE :
 #if defined(WIN32) || defined(__EMX__) || defined(__APPLE__)
               result = (strcmp(rules->value.localev, setlocale(LC_ALL, "")) == 0);
+#elif defined(__GLIBC__) && defined(LC_CTYPE)
+              result = (strcmp(rules->value.localev, setlocale(LC_CTYPE, "")) == 0);
 #else
               result = (strcmp(rules->value.localev, setlocale(LC_MESSAGES, "")) == 0);
 #endif /* __APPLE__ */

CUPS uses LC_MESSAGES to determine the charmap of the current locale #1194
