NixOS / nixpkgs

Nix Packages collection & NixOS

`ghc` does not support Unicode, reports locale encoding ASCII. #64603

Closed kindaro closed 3 years ago

kindaro commented 5 years ago

Issue description

I installed ghc via Nix and also via Stack. With the ghc binaries provided by Stack, I can easily do Unicode IO. But the Nix ghc errors out and pretends to know nothing about Unicode.

Steps to reproduce

% ghc -e 'putStrLn "x"'
x
% ghc -e 'putStrLn "λ"'

<interactive>:0:11: error:
    lexical error in string/character literal at character '\56526'
% stack ghc -- -e 'putStrLn "λ"'
λ

% ghci
GHCi, version 8.6.5: http://www.haskell.org/ghc/  :? for help
Prelude> import System.IO
Prelude System.IO> localeEncoding  
ASCII

% stack repl
...
Configuring GHCi with the following packages: 
GHCi, version 8.6.5: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /run/user/1000/haskell-stack-ghci/2a3bbd58/ghci-script
Prelude> import System.IO
Prelude System.IO> localeEncoding 
UTF-8

% echo $LANG
en_US.UTF-8
% which stack
/home/kindaro/.nix-profile/bin/stack
% which ghci
/nix/store/67g6vwr5mx26h5mickgw17k2irdx1c0d-ghc-8.6.5/bin/ghci

Technical details

matthewbauer commented 5 years ago

Does /usr/lib/locale/locale-archive exist? I wonder if Arch has some custom locale location. The difference between how Nix and Stack work here is that Nix uses a Nix-built libc while Stack uses the system's libc.

matthewbauer commented 5 years ago

You can also do export LANG=C.UTF-8 to get non-localized Unicode support, though. See https://github.com/NixOS/nixpkgs/pull/58009 and https://github.com/NixOS/nixpkgs/pull/61202 for info on that.

kindaro commented 5 years ago

Yes, this file does exist.

If I set LANG as you say, something strange happens. I cannot quite explain it, so let me show it instead.

  1. % export LANG=C.UTF-8
    
    % ghc -e 'putStrLn "<ce><bb>"'
    λ
    % stack ghc -- -e 'putStrLn "<ce><bb>"'
    <interactive>:0:11: error:
        lexical error in string/character literal at character '\56526'
  2. % export LANG=en_US.UTF-8
    
    % ghc -e 'putStrLn "λ"'
    <interactive>:0:11: error:
        lexical error in string/character literal at character '\56526'
    % stack ghc -- -e 'putStrLn "λ"'
    λ

So, whenever neither Zsh nor Stack can do anything with Unicode, ghc can. And the other way around. (I have no idea why Zsh cannot deal with C.UTF-8, but that is a whole other question.)

To clarify:

I can live with one terminal set to C.UTF-8 and another to en_US.UTF-8, but it can hardly be called life.
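
One possible way around the two-terminal setup is to override the locale for a single command only, so that the shell and Stack keep en_US.UTF-8. This is a minimal sketch, assuming the Nix glibc can resolve C.UTF-8 as it did in the transcript above; the helper name with_c_utf8 is made up for illustration.

```shell
# Run one command with LC_ALL=C.UTF-8 without touching the calling shell.
# `env` applies the variable only to the child process it starts.
with_c_utf8() {
  env LC_ALL=C.UTF-8 "$@"
}
```

Usage would look like: with_c_utf8 ghc -e 'putStrLn "λ"'. The surrounding shell keeps its own LANG and LC_ALL, so Zsh and stack are not affected.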

kindaro commented 5 years ago

So I gather the bug is in the libc? How can I diagnose it further?

cdepillabout commented 4 years ago

We are running into this at work.

When running ghc on NixOS, it correctly determines that the encoding should be UTF-8. However, when running ghc on Ubuntu, it incorrectly thinks the encoding should be ASCII.

Here is an example of running it on NixOS:

$ which locale
/run/current-system/sw/bin/locale
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
$ nix-shell -p ghc --command 'which locale'
/nix/store/22h3f311fjymkvp683kb657jycs7i5pn-glibc-2.27-bin/bin/locale
$ nix-shell -p ghc --command 'locale'
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
$ nix-shell -p ghc --command ghci
> import System.IO
> System.IO.localeEncoding
UTF-8
> import GHC.IO.Encoding.Iconv
> GHC.IO.Encoding.Iconv.localeEncodingName 
"UTF-8"

Here is what happens on Ubuntu:

$ which locale
/usr/bin/locale
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=ja_JP.UTF-8
LC_TIME=ja_JP.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=ja_JP.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=ja_JP.UTF-8
LC_NAME=ja_JP.UTF-8
LC_ADDRESS=ja_JP.UTF-8
LC_TELEPHONE=ja_JP.UTF-8
LC_MEASUREMENT=ja_JP.UTF-8
LC_IDENTIFICATION=ja_JP.UTF-8
LC_ALL=
$ nix-shell -p ghc --command 'which locale'
/nix/store/rjsymbdxlwmfbpasn0jik1w97wgfk3qj-glibc-2.27-bin/bin/locale
$ nix-shell -p ghc --command 'locale'
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=ja_JP.UTF-8
LC_TIME=ja_JP.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=ja_JP.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=ja_JP.UTF-8
LC_NAME=ja_JP.UTF-8
LC_ADDRESS=ja_JP.UTF-8
LC_TELEPHONE=ja_JP.UTF-8
LC_MEASUREMENT=ja_JP.UTF-8
LC_IDENTIFICATION=ja_JP.UTF-8
LC_ALL=
$ nix-shell -p ghc --command 'ghci'
> import System.IO
> System.IO.localeEncoding 
ASCII
$ nix-shell -p ghc --command 'env LC_ALL=C.UTF-8 ghci'
> import System.IO
> System.IO.localeEncoding 
UTF-8

As you can see above, when LC_ALL=C.UTF-8 is set explicitly, GHC picks up the encoding correctly. However, be aware that there seems to be some weirdness with locales: locales you may think exist do not actually exist. On Ubuntu again:

$ nix-shell -p ghc --command 'env LC_ALL=C.UTF-8 ghci'
/nix/store/cinw572b38aln37glr0zb8lxwrgaffl4-bash-4.4-p23/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
> import System.IO
> System.IO.localeEncoding 
ASCII

I haven't done a whole lot of testing of this, but this problem appears to have come about recently.

Here's an old GHC (from 19.03) on Ubuntu again. You can see this appears to be working correctly:

$ cat /nix/var/nix/profiles/per-user/root/channels/nixpkgs/.version
19.03
$ cat /nix/var/nix/profiles/per-user/root/channels/nixpkgs/.git-revision
c2b8270fb8789af290da3f11bd6174a0ba7698f1
$ NIX_PATH=nixpkgs=/nix/var/nix/profiles/per-user/root/channels/nixpkgs nix-shell -p ghc --command 'ghci --version'
The Glorious Glasgow Haskell Compilation System, version 8.6.3
$ NIX_PATH=nixpkgs=/nix/var/nix/profiles/per-user/root/channels/nixpkgs nix-shell -p ghc --command 'ghci'
> import System.IO
> System.IO.localeEncoding 
UTF-8

Here's a new GHC from the current nixpkgs-unstable channel, again on Ubuntu. This appears to be not working without explicitly setting LC_ALL=C.UTF-8:

$ cat ~/.nix-defexpr/channels/nixpkgs/.version
20.03
$ cat ~/.nix-defexpr/channels/nixpkgs/.git-revision
895874d2145862249df3f78335f4dcf62ef01626
$ NIX_PATH=nixpkgs=$HOME/.nix-defexpr/channels/nixpkgs nix-shell -p ghc --command 'ghci --version'
The Glorious Glasgow Haskell Compilation System, version 8.6.5
$ NIX_PATH=nixpkgs=$HOME/.nix-defexpr/channels/nixpkgs nix-shell -p ghc --command 'ghci'
> import System.IO
> System.IO.localeEncoding 
ASCII
$ NIX_PATH=nixpkgs=$HOME/.nix-defexpr/channels/nixpkgs nix-shell -p ghc --command 'env LC_ALL=C.UTF-8 ghci'
> import System.IO
> System.IO.localeEncoding 
UTF-8

Basically, if someone is willing to do a git bisect between c2b8270fb8789af290da3f11bd6174a0ba7698f1 (known-working) and 895874d2145862249df3f78335f4dcf62ef01626 (known-failing), we might be able to figure out what the problem is here.

I might do this.
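
For anyone attempting that bisect, here is a rough sketch of a predicate that git bisect run could call. It assumes the command is run from the root of a nixpkgs checkout, on a non-NixOS machine with LANG=en_US.UTF-8 and LC_ALL unset, and that ghc builds or substitutes at each revision. The function names and the helper file path in the comments are hypothetical.

```shell
# Map GHC's reported locale encoding to a `git bisect run` exit status:
# 0 = good, 1 = bad, 125 = this revision cannot be tested (skip).
classify_encoding() {
  case "$1" in
    UTF-8) return 0 ;;
    "")    return 125 ;;  # no output: build or evaluation failed
    *)     return 1 ;;    # ASCII or anything else counts as bad
  esac
}

# Ask the ghc from the checked-out nixpkgs revision for its locale encoding.
check_revision() {
  encoding=$(nix-shell -I nixpkgs=. -p ghc \
    --run "ghc -e 'System.IO.localeEncoding'" 2>/dev/null) || true
  classify_encoding "$encoding"
}

# Intended use, with this file saved outside the checkout
# (the path below is a placeholder):
#   git bisect start 895874d2145862249df3f78335f4dcf62ef01626 \
#                    c2b8270fb8789af290da3f11bd6174a0ba7698f1
#   git bisect run sh -c '. /path/to/bisect-helpers.sh; check_revision'
```

Revisions where ghc fails to build are skipped rather than marked bad, so a broken intermediate commit does not mislead the search.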

cdepillabout commented 4 years ago

Also, just in case you're curious, here is an explanation of text encoding stuff for GHC:

https://www.stackage.org/haddock/lts-7.14/base-4.9.0.0/System-IO.html#g:23

Here's the localeEncoding function I use above:

https://www.stackage.org/haddock/lts-7.14/base-4.9.0.0/System-IO.html#v:localeEncoding

Under the hood, this appears to be using iconv:

https://www.stackage.org/haddock/lts-7.14/base-4.9.0.0/src/GHC-IO-Encoding-Iconv.html


If I were to try to bisect this, I'd look for some change in how glibc or iconv is being handled that has occurred in the past couple months. Or maybe even some direct change to ghc.

cdepillabout commented 4 years ago

I think I figured out what is going on here.

Here's the explanation from the manual:

https://nixos.org/nixpkgs/manual/#locales

To allow simultaneous use of packages linked against different versions of glibc with different locale archive formats Nixpkgs patches glibc to rely on LOCALE_ARCHIVE environment variable.

On non-NixOS distributions this variable is obviously not set. This can cause regressions in language support or even crashes in some Nixpkgs-provided programs. The simplest way to mitigate this problem is exporting the LOCALE_ARCHIVE variable pointing to ${glibcLocales}/lib/locale/locale-archive. The drawback (and the reason this is not the default) is the relatively large (a hundred MiB) size of the full set of locales. It is possible to build a custom set of locales by overriding parameters allLocales and locales of the package.

My guess as to what is happening is as follows:

On Ubuntu, with older versions of nixpkgs, there was no locale archive provided by default, so GHC (really, iconv) fell back to the system locale archive in /usr/lib/locale/locale-archive. The system locale archive supports many different locales by default. With newer versions of nixpkgs, a locale archive is provided by default, so GHC (really, iconv) uses it. However, it is very small and only supports the C.UTF-8 locale.

On NixOS, with older versions of nixpkgs, there is a locale archive hardcoded somewhere with a bunch of locales provided by default. With new versions of nixpkgs, NixOS explicitly sets the LOCALE_ARCHIVE env var pointing to somewhere with a bunch of locales available.

(I figured this out by running locale under strace, so it is possible it is not quite correct.)
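
To see which of these cases applies on a given machine, it may help to look at LOCALE_ARCHIVE directly, both in the normal shell and inside nix-shell. This is a small sketch; the function name is hypothetical, and it only inspects the environment variable, not what glibc actually opens.

```shell
# Report whether LOCALE_ARCHIVE is set and whether it points at a readable file.
describe_locale_archive() {
  if [ -z "${LOCALE_ARCHIVE:-}" ]; then
    echo "LOCALE_ARCHIVE is unset"
  elif [ -r "$LOCALE_ARCHIVE" ]; then
    echo "LOCALE_ARCHIVE points at a readable file: $LOCALE_ARCHIVE"
  else
    echo "LOCALE_ARCHIVE is set but not readable: $LOCALE_ARCHIVE"
  fi
}
```

If the variable is unset on a non-NixOS system, that matches the situation the manual describes.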

So @kindaro, the solution to your problem is to do one of the following things:

  1. Set LOCALE_ARCHIVE to point to either ${glibcLocales}/lib/locale/locale-archive or your system locale archive at /usr/lib/locale/locale-archive (if you want to live dangerously).
  2. Make sure you set LC_ALL=C.UTF-8 before you run GHC.
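
The first option could be sketched in a shell profile roughly as follows. The function name and the GLIBC_LOCALES variable are made up for illustration; GLIBC_LOCALES is assumed to hold the store path of a built glibcLocales package, and the system path is the one mentioned above.

```shell
# Print the first readable file among the candidate paths given as arguments.
pick_locale_archive() {
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# GLIBC_LOCALES is a hypothetical variable; one way to fill it would be
#   GLIBC_LOCALES=$(nix-build '<nixpkgs>' -A glibcLocales --no-out-link)
# Prefer the Nix-built archive, else fall back to the system one.
if archive=$(pick_locale_archive \
    "${GLIBC_LOCALES:-/nonexistent}/lib/locale/locale-archive" \
    /usr/lib/locale/locale-archive); then
  export LOCALE_ARCHIVE="$archive"
fi
```

Preferring the Nix-built archive avoids the format mismatch that the manual warns about; the system archive is only the "live dangerously" fallback.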

@matthewbauer Is this explanation about right?

kindaro commented 4 years ago

@cdepillabout Awesome research, thank you. Setting LOCALE_ARCHIVE works, and it is a better solution than resetting LC_ALL because it does not affect other installations of ghc, such as stack's.

stale[bot] commented 4 years ago

Thank you for your contributions. This has been automatically marked as stale because it has had no activity for 180 days. If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity. Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

cdepillabout commented 3 years ago

I'm going to go ahead and close this, since it seems to be "working as intended", and I listed some workarounds in https://github.com/NixOS/nixpkgs/issues/64603#issuecomment-551419489.

maralorn commented 8 months ago

FYI: I think one can make the program locale independent by calling setLocaleEncoding utf8. (compare https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-IO-Encoding.html#v:setLocaleEncoding)