ES-Nix / locale

I am having trouble using locales in Nix, so I am creating this to ask for help and to collect working examples.
MIT License

Notes #1

Open PedroRegisPOAR opened 3 years ago

PedroRegisPOAR commented 3 years ago

Abstract

At least four packages in nixpkgs are involved:

nix build nixpkgs#glibcLocales
nix build nixpkgs#glibc.bin
nix build nixpkgs#locale
nix build nixpkgs#glibc

Some notes about locale/locale-archive

Old, but really great: https://github.com/NixOS/nix/issues/599#issuecomment-153885553

TODO: take a look at this; it has lots of troubleshooting commands: https://github.com/NixOS/nix/issues/599

What is the locale-archive

locale-archive is a memory-mapped file which is generated by locale-gen(8) invoking localedef(1). Memory-mapped means that once it is created and called by a program it is only loaded once into memory. https://unix.stackexchange.com/a/331706
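To make "memory-mapped" concrete, here is a minimal Python sketch. It is only an illustration of the mechanism, not how glibc itself reads the archive: the temporary file is a hypothetical stand-in for a real locale-archive, and `read_via_mmap` and `demo` are names made up for this note.

```python
import mmap
import os
import tempfile


def read_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes at `offset` by mapping the file instead of read()."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing only touches the pages that back this range.
            return mapped[offset:offset + length]


def demo() -> bytes:
    # Hypothetical stand-in for a locale-archive: a header, padding, one name.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"header" + b"\0" * 4096 + b"en_US.UTF-8")
        path = tmp.name
    try:
        return read_via_mmap(path, 6 + 4096, 11)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The point of the design is that a program asking for one locale does not have to read the whole archive; it only touches the part it needs.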

Difference between locale-archive and Machine Object files in /usr/share/locale//LC_MESSAGES/ directory?

TODO: add real updated values and sha256sum

In the above case deleting all locales except the en_* ones my locale-archive went from 102MB down to 3.4MB https://unix.stackexchange.com/a/498523

find . ! -name 'file.txt' -type f -exec rm -f {} +

From: https://unix.stackexchange.com/a/153863

glibc

glibc = super.glibc.overrideAttrs (_: {
      # Warning: MASSIVE rebuild since you'll break ABI
      version = "2.26";
    });

From: https://gurkan.in/wiki/nix.html#override-example-optional-args

TODO: convert this to a flake

nix-shell cannot change locale warning

Some troubleshooting commands:

Saving all this for now:

I think the proper fix would be to include en_US.utf-8 in our glibc version by default (can be done via overrideAttrs) and rebuild python/bash/perl against that.
From: Build a yocto rootfs inside nix

# Force install locale from "glibcLocales" since there are collisions
extraBuildCommands = ''
  ln -sf ${glibcLocales}/lib/locale/locale-archive $out/usr/lib/locale
'';
From: stammw/yocto.nix

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
From: Unable to invoke localedef to UTF-8: cannot read character map directory `/usr/share/i18n/charmaps'

TODO: https://unix.stackexchange.com/questions/187402/nix-package-manager-perl-warning-setting-locale-failed/243189#243189

Maybe useful: https://github.com/davidtwco/veritas/blob/6f2c676a76ef2885c9102aeaea874c361dbcaf61/home/profiles/common.nix#L197-L198

TODO: document where this comes from, including the Python PEPs about it

nix run nixpkgs#python3 -- <<<'print("ℙƴ☂ℌøἤ")'

At the moment, setting "LANG=C" on a Linux system fundamentally breaks Python 3, and that's not OK.

ENV LANG C.UTF-8

From: https://github.com/docker-library/python/blob/56cea612ab370f3d05b29e97466d418a0f07e463/3.10/slim-bullseye/Dockerfile#L12-L14

https://click.palletsprojects.com/en/5.x/python3/#python-3-surrogate-handling
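A small sketch of why an ASCII-only locale hurts Python 3 text output: the sample string from the `nix run nixpkgs#python3` line above simply has no ASCII representation. `can_encode` is a helper name made up for this note. As far as I know, PEP 538 and PEP 540 (both Python 3.7) are the PEPs that soften this behaviour, but treat that as a pointer to verify rather than a full explanation.

```python
SAMPLE = "ℙƴ☂ℌøἤ"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # ASCII is what a plain C/POSIX locale traditionally implies for stdio.
    for encoding in ("ascii", "utf-8"):
        print(encoding, can_encode(SAMPLE, encoding))
```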

PedroRegisPOAR commented 3 years ago

Abstract

To explain, the default glibc package doesn't contain locale data – that's separate and --pure isn't supposed to see OS data. Source: https://github.com/NixOS/nixpkgs/issues/32848#issuecomment-352996633

Adding glibcLocales to the shell indeed fixes the issue. Though this raises the question as to whether that dependency needs to be explicitly added to any package that depends on glibc and does some kind of text processing. Locales are a non-optional part of the C standard and while it’s great to be able to drop the heavyweight dependency where you know it’s irrelevant, it should not be absent in the default context. From:

MWE

MWE 1:
LC_ALL=en_GB.utf8 date '+%c'
LC_ALL=en_US.utf8 date '+%c' 
nix \
run \
nixpkgs#python39 \
-- \
-c \
'
import locale
locale.setlocale(locale.LC_ALL, "pt_BR.utf8")
print(locale.currency(12345.67, grouping=True, symbol=True))
'
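When the requested locale is not available to the C library (for example because no locale-archive containing it can be found), `locale.setlocale` raises `locale.Error`. The sketch below wraps that in a probe; `try_setlocale` is a hypothetical helper, and which names succeed depends entirely on the machine and on `LOCALE_ARCHIVE`.

```python
import locale


def try_setlocale(name: str) -> bool:
    """Return True if the C library accepts `name` for LC_ALL, else False."""
    previous = locale.setlocale(locale.LC_ALL)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return False
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_ALL, previous)
    return True


if __name__ == "__main__":
    for candidate in ("C", "en_US.utf8", "pt_BR.utf8"):
        print(candidate, try_setlocale(candidate))
```

Running it before and after pointing `LOCALE_ARCHIVE` at a built archive is a quick way to see whether the archive is actually being picked up.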

copy/pastes

Code 1:
# 
# nix flake metadata nixpkgs --json | jq -r .url
nix \
store \
ls \
--store https://cache.nixos.org/ \
--long \
--recursive \
"$(nix eval --raw github:NixOS/nixpkgs/9eb60f25aff0d2218c848dd4574a0ab5e296cabe#glibcLocales)"
Code 2:
# 
# nix flake metadata nixpkgs --json | jq -r .url
nix \
store \
ls \
--store https://cache.nixos.org/ \
--long \
--recursive \
"$(nix eval --raw github:NixOS/nixpkgs/9eb60f25aff0d2218c848dd4574a0ab5e296cabe#glibc)"
Code 3:
# 
# nix flake metadata nixpkgs --json | jq -r .url
nix \
store \
ls \
--store https://cache.nixos.org/ \
--long \
--recursive \
"$(nix eval --raw github:NixOS/nixpkgs/9eb60f25aff0d2218c848dd4574a0ab5e296cabe#locale)"
Code 4:
# 
# nix flake metadata nixpkgs --json | jq -r .url
nix \
store \
ls \
--store https://cache.nixos.org/ \
--long \
--recursive \
"$(nix eval --raw github:NixOS/nixpkgs/9eb60f25aff0d2218c848dd4574a0ab5e296cabe#glibc.bin)"
# nix flake metadata github:NixOS/nixpkgs/release-22.05 --json
command -v jq >/dev/null || nix profile install github:NixOS/nixpkgs/4aceab3cadf9fef6f70b9f6a9df964218650db0a#jq \
&& nix \
build \
--impure \
--expr \
'(
  with builtins.getFlake "nixpkgs";
  with legacyPackages.${builtins.currentSystem};
  (
    glibcLocales.override {
        allLocales = false;
        locales = [ 
                           "en_GB.UTF-8/UTF-8" 
                           "ru_RU.UTF-8/UTF-8" 
                           "en_US.UTF-8/UTF-8" 
                           "pt_BR.UTF-8/UTF-8"
                           "ja_JP.UTF-8/UTF-8"
                           "en_IE.UTF-8/UTF-8"
                     ];
      }
  )
)'

Refs.:

LOCALE_ARCHIVE=result/lib/locale/locale-archive \
LC_ALL=pt_BR.UTF-8 \
nix \
run \
nixpkgs#python39 \
-- \
-c \
'
import locale

locale.setlocale(locale.LC_ALL, "pt_BR.utf8")
print(locale.currency(12345.67, grouping=True, symbol=True))

locale.setlocale(locale.LC_ALL, "en_US.utf8")
print(locale.currency(12345.67, grouping=True, symbol=True))

locale.setlocale(locale.LC_ALL, "ru_RU.utf8")
print(locale.currency(12345.67, grouping=True, symbol=True))

locale.setlocale(locale.LC_ALL, "ja_JP.utf8")
print(locale.currency(12345.67, grouping=True, symbol=True))

locale.setlocale(locale.LC_ALL, "en_IE.utf8")
print(locale.currency(12345.67, grouping=True, symbol=True))
'
{ cat <<'WRAP' > foo.c
#include <stdio.h>
#include <locale.h>

int main()
{
    char *locale = setlocale(LC_ALL, "");
    printf("\n locale =%s\n", locale);
    printf("test\n \u263a\u263b Hello from C\n");

    return 0;
}
WRAP
} && gcc foo.c \
&& ./a.out
rm -f a.out foo.c

Refs.:

TODO:

nix run nixpkgs#gcc -- -xc -E -v /dev/null

+

printf '#include <locale.h>\nLC_COLLATE\n' | gcc -E -x c - | tail -n 1

Refs.:

nix run nixpkgs#python39 -- -c "assert '\N{snake}' == '🐍'"

TODO:

printf %b\\n \\u04{51,{3,4}{{0..9},{a..f}}}|sort|sed 's/./\u&&/'|tr -d \\n

Refs.:

cowsay

nix \
shell \
--ignore-environment \
--impure \
--expr \
'
  (
    let
      nixpkgs = (builtins.getFlake "github:NixOS/nixpkgs/0938d73bb143f4ae037143572f11f4338c7b2d1c");
      pkgs = import nixpkgs { };    
    in
      with pkgs; [
        cowsay
      ]
    )
' \
--command cowsay "Hello"

nix \
shell \
--ignore-environment \
--impure \
--expr \
'
  (
    let
      nixpkgs = (builtins.getFlake "github:NixOS/nixpkgs/0938d73bb143f4ae037143572f11f4338c7b2d1c");
      pkgs = import nixpkgs { };    
    in
      with pkgs; [
        (
          glibcLocales.override {
              allLocales = false;
              locales = [
                          "en_US.UTF-8/UTF-8" 
                          "pt_BR.UTF-8/UTF-8"
                        ];
            }
        )
        cowsay
      ]
    )
' \
--command cowsay "Hello"

Refs.:

Others

export LOCALE_ARCHIVE=result/lib/locale/locale-archive

LC_ALL=pt_BR.UTF-8 date '+%c'
LC_ALL=en_US.UTF-8 date '+%c'
LC_ALL=ru_RU.UTF-8 date '+%c'
LC_ALL=ja_JP.UTF-8 date '+%c'
LC_ALL=en_IE.UTF-8 date '+%c'

export LOCALE_ARCHIVE=result/lib/locale/locale-archive

LC_ALL=en_GB.UTF-8 nix run nixpkgs#uutils-coreutils -- date '+%c'
LC_ALL=en_US.UTF-8 nix run nixpkgs#uutils-coreutils -- date '+%c'

export LOCALE_ARCHIVE=result/lib/locale/locale-archive

LC_ALL=en_GB.UTF-8 nix run nixpkgs#busybox -- date '+%c'
LC_ALL=en_US.UTF-8 nix run nixpkgs#busybox -- date '+%c'

nix \
run \
nixpkgs#python39 -- \
-c \
"
v=32
while v:print('Ёё'*(v==26),end='%c%c'%(1072-v,1104-v));v-=1
"
export LC_ALL=en_US.utf8
nix run nixpkgs#python39 -- -c '
import locale
defaultlocale = locale.getdefaultlocale()
locale.setlocale(locale.LC_ALL, defaultlocale[0] + "." + defaultlocale[1])
print(locale.currency(12345.67, grouping=True, symbol=True))
'

export LC_ALL=pt_BR.utf8
nix run nixpkgs#python39 -- -c '
import locale
defaultlocale = locale.getdefaultlocale()
locale.setlocale(locale.LC_ALL, defaultlocale[0] + "." + defaultlocale[1])
print(locale.currency(12345.67, grouping=True, symbol=True))
'

Refs.:

perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }' <<<'文字化け'

Refs.:

TODO:

Faker("cellphone_number", locale="pt-BR")

TODO:

PedroRegisPOAR commented 3 years ago

TODO:

At 31:50 you talk about environment variables. However, there are some mistakes worth correcting for future viewers. First, although the environment variables are stored in the process' memory, they are stored as zero-terminated strings and not as one big string separated by new-line characters. They are also not stored on the heap, nor is there a global variable in the data section pointing to them. The environment is actually stored entirely on the stack and is a part of the initial process stack that is set up before the program starts running.

The first value on the stack is the argument count, followed by an array of the addresses of the different arguments, then address 0 marking the end of the argument array. Right after that there is a second array of addresses, each pointing to a zero-terminated string; these are the environment variables, and this array is also terminated by address 0. There is actually a third array of auxiliary vectors, and after that there is an unspecified number of bytes before the information block starts. It is generally inside this block that the command-line arguments and environment variables are stored, as in the actual string values.

You can confirm this by dumping the stack of pretty much any program; you typically find all the environment variables at the very end (highest memory address). If you are on Linux you can do this by first reading the '/proc/<PID>/maps' file for any process; just replace <PID> with that process' PID. This file contains the ranges of memory mapped to the process and what they are mapped to. Near the bottom you'll see one line with the range mapped to [stack]. Take note of the start address and calculate how big it is in bytes. Then run 'sudo xxd -s <start> -l <size> /proc/<PID>/mem', for example 'sudo xxd -s 0x7fff182bd000 -l 0x22000 /proc/14950/mem'. The environment variables should get printed out together with their hex values and address locations.

To illustrate this further I've written a small c program that prints all the environment variables using the argv array pointer. As you can see the environment variable pointers are stored pretty much right after argv.

#include <stdio.h>
int main(int argc, char **argv)
{
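 /* argv[argc] is the NULL terminator, so the environment pointers start at argv[argc + 1] (Linux initial stack layout). */
 for (int i = argc + 1; argv[i] != NULL; i++)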
 {
  printf("%s\n", argv[i]);
 }
 return 0;
}

You can of course make it less stupid by using the full version of main which includes a pointer to the first element in the environment pointer array.

#include <stdio.h>
int main(int argc, char **argv, char **envp)
{
 for (int i = 0; envp[i] != NULL; i++)
 {
  printf("%s\n", envp[i]);
 }
 return 0;
}

This is all defined as a part of the ABI (application binary interface) for both the x86 and x86_64 architecture, so 32 and 64 bit desktop computers.

tl;dr: The environment is not a single long string separated by new-line characters. The environment variables and the pointers to them are both stored on the stack or just before it. https://www.youtube.com/watch?v=xHu7qI1gDPA&lc=UgwvbQ7HZFUEZ2EGQ7V4AaABAg
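The "zero-terminated strings, not new-line separated" point can be checked from Python as well. This is a minimal sketch: `parse_environ_block` and `read_own_environ` are names invented for this note, and `/proc/self/environ` only exists on Linux (it reflects the environment the process was started with).

```python
import os
from typing import Dict, Optional


def parse_environ_block(block: bytes) -> Dict[bytes, bytes]:
    """Split a NUL-separated block of NAME=value entries into a dict."""
    env: Dict[bytes, bytes] = {}
    for entry in block.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after the final NUL
        name, _, value = entry.partition(b"=")  # the value may itself contain '='
        env[name] = value
    return env


def read_own_environ() -> Optional[Dict[bytes, bytes]]:
    """Parse /proc/self/environ if it exists, otherwise return None."""
    path = "/proc/self/environ"
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return parse_environ_block(f.read())


if __name__ == "__main__":
    env = read_own_environ()
    if env is not None:
        for name in (b"LANG", b"LC_ALL", b"LOCALE_ARCHIVE"):
            print(name.decode(), "=", env.get(name))
```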

Maybe related: https://serverfault.com/a/792136

Related: Dangerous Code Hidden in Plain Sight for 12 years

PedroRegisPOAR commented 1 year ago

Fonts, yes they are related too

Here I am trying to summarize all this mess.

FC_LANG is used to specify the default language as the weak binding in the query. if this isn't set, the default language will be determined from current locale. https://www.freedesktop.org/software/fontconfig/fontconfig-user.html

export FONTCONFIG_PATH=/etc/fonts

Refs.:

TODOs:

man fonts-conf
ls -al $(nix build --no-link --print-build-logs --print-out-paths github:NixOS/nixpkgs/0938d73bb143f4ae037143572f11f4338c7b2d1c#xorg.fontalias)/share/fonts/X11

Some really good videos

Minimal working example (on my NixOS machine with zsh)

echo '\u2603'

LC_CTYPE=C echo '\u2603'

Refs.:

Errors out:

zsh: character not in range

Also, UTF-8 is not a valid POSIX locale. It may work on some systems, but Arch Linux might not like it. en_US.UTF-8 is valid. Try putting that at the beginning of the line, and using LC_ALL instead of LC_CTYPE. https://github.com/ohmyzsh/ohmyzsh/issues/4065#issuecomment-129913471

We ended up just giving up on trying to fix broken locales, thanks for your contribution and your patience. https://github.com/ohmyzsh/ohmyzsh/pull/4696#issuecomment-537132617

Must read

Main ones:

Tables:

Fonts:

More python focused:

Even Julia:

Of course LaTeX:

Linux:

Troubleshooting

fc-cache -fv
echo "\ue0b0 \ue0a0 \u2b80 \u00b1 \u27a6 \u2718 \u26a1 \u2699"
echo ##1##
echo '\ue0b0 \ue0a0 \u2b80 \u00b1 \u27a6 \u2718 \u26a1 \u2699'
echo ##2##
echo -e '\ue0b0 \ue0a0 \u2b80 \u00b1 \u27a6 \u2718 \u26a1 \u2699'
echo ##3##
echo -e "\ue0b0 \ue0a0 \u2b80 \u00b1 \u27a6 \u2718 \u26a1 \u2699"
fc-list : family
fc-match -s emoji
localectl status

Outputs:

   System Locale: LANG=en_US.UTF-8
                  LC_MONETARY=pt_BR.UTF-8
       VC Keymap: us
      X11 Layout: br
       X11 Model: pc104
     X11 Variant: abnt2
     X11 Options: terminate:ctrl_alt_bksp

TODO: https://ostechnix.com/install-nerd-fonts-to-add-glyphs-in-your-code-on-linux/ https://ostechnix.com/find-installed-fonts-commandline-linux/ https://github.com/ryanoasis/nerd-fonts/issues/485#issuecomment-1417572779 https://github.com/ryanoasis/nerd-fonts/issues/485#issuecomment-1417328572

pango-view

https://github.com/NixOS/nixpkgs/issues/86601#issuecomment-686243898

The echo/print stuff

echo -e "\U1f3f4\Ue0067\Ue0062\Ue0077\Ue006c\Ue0073\Ue007f"
echo #####
echo -e "\U1f9df\U200d\U2640\Ufe0f"
EMOJIS=(🥯  🦆 🦉 🥓 🦄 🦀 🖕 🍣 🍤 🍥 🍡 🥃 🥞 🤯 🤪 🤬 🤮 🤫 🤭 🧐 🐕 🦖 👾 🐉 🐓 🐋 🐌 🐢)
echo $EMOJIS
UNICORN='\U1F984'; THUMBS_UP='\U1F44D'; echo -e "Riding an ${UNICORN} (${THUMBS_UP})"

Refs.:

It should output 🍁:

echo -e '\xF0\x9F\x8D\x81'

Refs.:

toon=$'\U1F479'
print -r ${(l:${(m)#toon}:: :)}$'XYZ\n'$toon' ^-- must point to Y'

Refs.:

echo "a\uf240 abc"

Refs.:

text="Éé"; echo ${#text}

LC_CTYPE=C text="Éé"; echo ${#text}
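The `${#text}` comparison comes down to counting characters versus counting bytes, and the earlier `echo -e '\xF0\x9F\x8D\x81'` line is the same idea seen from the byte side. A small Python sketch, assuming the text is UTF-8 and that "Éé" uses the precomposed code points U+00C9 and U+00E9; `describe` is a made-up helper name.

```python
from typing import List, Tuple


def describe(raw: bytes) -> Tuple[str, List[str], int, int]:
    """Decode UTF-8 bytes and report (text, code points, byte count, char count)."""
    text = raw.decode("utf-8")
    code_points = [f"U+{ord(ch):04X}" for ch in text]
    return text, code_points, len(raw), len(text)


if __name__ == "__main__":
    # The four bytes from the echo example, then the two-letter string.
    print(describe(b"\xF0\x9F\x8D\x81"))
    print(describe("\u00c9\u00e9".encode("utf-8")))
```

A shell running with a UTF-8 `LC_CTYPE` reports the character count, while a shell that treats the text as single-byte characters reports the byte count.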

Refs.:

echo \
'\U1F479' \
'\xF0\x9F\x8D\x81' \
'\U1f9df\U200d' \
'\U1F984' \
'\U1F44D' \
'\U1F9DA' \
'\U1F426' \
'\U1F99C' \
'\U1F996' \
'\U1F420' \
'\U1F41E' \
'\U1F340' \
'\U1F308' \
'\U1F965' \
'\U1F37F' \
'\U1F991' \
'\U1F37A' \
'\U1F692' \
'\U1F6F3' \
'\U26A1' \
'\U1F4A7' \
'\U1F537'

Outputs: 👹 🍁 🧟‍ 🦄 👍 🧚 🐦 🦜 🦖 🐠 🐞 🍀 🌈 🥥 🍿 🦑 🍺 🚒 🛳 ⚡ 💧 🔷

Some flags show up correctly:

echo \
'\U1F3F4\UE0067\UE0062\UE0065\UE006E\UE0067\UE007F' \
'\U1F3F4\UE0067\UE0062\UE0073\UE0063\UE0074\UE007F' \
'\U1f3f4\Ue0067\Ue0062\Ue0077\Ue006c\Ue0073\Ue007f'

Outputs: 🏴󠁧󠁢󠁥󠁮󠁧󠁿 🏴󠁧󠁢󠁳󠁣󠁴󠁿 🏴󠁧󠁢󠁷󠁬󠁳󠁿

Some flags are broken:

echo \
'\U1F3F3\UFE0F\U200D\U1F308' \
'\U1F1E7\U1F1F7'

Outputs:

(screenshot of the terminal output, not reproduced here)

But copying it directly from terminal to here in browser it renders correctly: 🏳️‍🌈 🇧🇷

The Japan flag, for example:

echo -e '\U1f1ef\U1f1f5' | hexdump -C
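Country flags such as the ones above are pairs of "regional indicator" code points, one per letter of the two-letter country code, starting at U+1F1E6 for A. A minimal sketch of that mapping follows; `flag` is a hypothetical helper, and whether the pair renders as a flag still depends on the terminal and font.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Map a two-letter country code (e.g. "JP") to its regional indicator pair."""
    code = country_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"expected two ASCII letters, got {country_code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


if __name__ == "__main__":
    for cc in ("JP", "BR"):
        pair = flag(cc)
        # Show the code points and the UTF-8 bytes, similar to piping into hexdump.
        print(cc, [f"U+{ord(ch):X}" for ch in pair], pair.encode("utf-8").hex())
```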

Refs.:

TODO: test it

env \
FONTCONFIG_FILE=$PWD/etc-fonts/fonts.conf \
FC_DEBUG=1024 \
pango-view --text="Příliš 😂" --font='"Noto Color Emoji" 20'

Refs.:

xterm -fa 'Dank Mono' -fs 11

Refs.:

TODO: impressive awk-fu https://unix.stackexchange.com/a/526681

TODO: it is python code: https://stackoverflow.com/a/37362046

TODO: curl and the '\U0001F514' https://stackoverflow.com/a/55863437

TODO: test Fira Code + JetBrains Mono