att / ast

AST - AT&T Software Technology
Eclipse Public License 1.0

Locale support is hard (en_US.ISO-8859-15 versus en_US.ISO8859-15 versus en_US.ISO885915) #412

Open krader1961 opened 6 years ago

krader1961 commented 6 years ago

I noticed the wchar test reports this error on macOS:

ksh.wchar.XXXXXXXX.AC0rc9TN[53]: warning: LC_ALL=en_US.ISO-8859-15 not supported

That is because on macOS the canonical spelling is en_US.ISO8859-15. Other platforms differ in similar ways: some include hyphens and some don't, and some use uppercase letters where others use lowercase. In some cases the native locale subsystem normalizes the value so that hyphens and letter case don't matter. In other cases it doesn't, which means the unit test is likely to either fail or not test what it is meant to test.
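To illustrate the kind of normalization described above, here is a minimal sketch that treats two locale strings as equivalent when their codeset parts differ only in hyphens, other punctuation, or letter case. The function names are hypothetical, and this is not what any particular platform's setlocale() does; it only shows the comparison a tolerant implementation might make.

```python
def split_locale(name):
    """Split 'lang_TERRITORY.codeset@modifier' into (prefix, codeset, modifier)."""
    base, _, modifier = name.partition("@")
    prefix, _, codeset = base.partition(".")
    return prefix, codeset, modifier


def codeset_key(codeset):
    """Lowercase the codeset and drop everything except letters and digits."""
    return "".join(ch for ch in codeset.lower() if ch.isalnum())


def same_locale(a, b):
    """Compare two locale strings, ignoring hyphens and case in the codeset only."""
    prefix_a, codeset_a, modifier_a = split_locale(a)
    prefix_b, codeset_b, modifier_b = split_locale(b)
    return (
        prefix_a == prefix_b
        and modifier_a == modifier_b
        and codeset_key(codeset_a) == codeset_key(codeset_b)
    )
```

With this comparison, the three spellings in the issue title are considered the same locale, while en_US.UTF-8 is not.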

This makes me wonder why our industry can't manage to agree on something so basic as the interpretation of a locale string.

kernigh commented 6 years ago

IANA keeps a list of names for character sets. IANA's name is ISO-8859-15. I suspect that some platforms use a different name because their locales are older than IANA's list.

A few platforms (at least illumos, DragonFly BSD, FreeBSD) get their locales from the Common Locale Data Repository, but CLDR doesn't define encodings like ISO-8859-15. CLDR comes with a Java tool to make locales. This tool uses java.nio.charset.Charset, so it uses Java's character sets. Java provides ISO-8859-15 with several aliases including ISO8859-15, IBM923, LATIN9 and many others. The CLDR tool might be able to make any of en_US.ISO-8859-15, en_US.ISO8859-15, en_US.IBM923, en_US.LATIN9, and so on. This allows each platform to have a different name for the same locale!
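The alias idea is not specific to Java. As an analogous sketch (using Python's codec registry, not the CLDR tool or Java's Charset), the helpers below resolve a charset name to the registry's canonical name and compare two names on that basis. The helper names are hypothetical, and Python's alias table is not guaranteed to match Java's.

```python
import codecs


def canonical_charset(name):
    """Return the canonical codec name Python registers for `name`, or None."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # The name is neither a codec nor a known alias in this registry.
        return None


def same_charset(a, b):
    """True when both names resolve to the same registered codec."""
    canon_a, canon_b = canonical_charset(a), canonical_charset(b)
    return canon_a is not None and canon_a == canon_b
```

A locale subsystem that resolved the codeset part this way would accept several spellings for one character set, which is the behavior some platforms provide and others don't.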

illumos used the CLDR tool to make the ISO8859-15 character map. DragonFly and FreeBSD didn't use the CLDR tool for character maps; they got maps from ftp://ftp.unicode.org/Public/MAPPINGS/ and applied the name ISO8859-15 to the map from ISO8859/8859-15.TXT. The name ISO8859-15 might have come from Solaris, because illumos is a Solaris fork, and Citrus wanted BSD to have "the same level of functionality as Solaris7 supports."

In OpenBSD 6.3, the manual for setlocale(3) will say,

The syntax and semantics of the locale argument are not standardized and vary among operating systems. On OpenBSD, if the locale string ends with “.UTF-8”, the UTF-8 locale is selected; otherwise, the “C” locale is selected, which uses the ASCII character set. If the locale contains a dot but does not end with “.UTF-8”, setlocale() fails.
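Read literally, the quoted manual text describes a three-way rule. The sketch below models only that rule as quoted; the function name and return values are hypothetical, and it ignores special cases the quote does not mention (for example an empty string, which setlocale() normally resolves from the environment).

```python
def openbsd_63_locale(locale_string):
    """Model the rule quoted from the OpenBSD 6.3 setlocale(3) manual.

    Returns "UTF-8" or "C" for the selected locale, or None where the
    manual says setlocale() fails.
    """
    if locale_string.endswith(".UTF-8"):
        return "UTF-8"
    if "." in locale_string:
        # Contains a dot but does not end with ".UTF-8": setlocale() fails.
        return None
    return "C"
```

Under this rule none of the ISO-8859-15 spellings from this issue would be accepted, regardless of hyphenation.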

siteshwar commented 6 years ago

This is just a warning coming from here. en_US.ISO8859-15 should be added to this list.
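One way to picture the "add it to the list" fix is a fallback loop that tries each known spelling until the native setlocale() accepts one. This is a sketch in Python rather than the project's C code, the helper names are hypothetical, and it probes LC_CTYPE instead of LC_ALL to keep the save and restore simple.

```python
import locale


def try_setlocale(name, category=locale.LC_CTYPE):
    """Return True if the native setlocale() accepts `name`.

    The previous setting for `category` is restored before returning.
    """
    saved = locale.setlocale(category)  # query only, changes nothing
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, saved)


def first_supported(candidates, accepts=try_setlocale):
    """Return the first candidate spelling that `accepts` approves, else None."""
    for name in candidates:
        if accepts(name):
            return name
    return None


# Spellings taken from this issue, in the order they would be tried.
SPELLINGS = ["en_US.ISO-8859-15", "en_US.ISO8859-15", "en_US.ISO885915"]
```

Calling first_supported(SPELLINGS) returns whichever spelling the current platform accepts, or None if the locale is not installed at all; the result depends on the system.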