NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License
17.49k stars 13.67k forks source link

environment.memoryAllocator may cause crashes #100799

Closed TredwellGit closed 3 years ago

TredwellGit commented 3 years ago

Describe the bug llvmPackages_11 produces some executables that segmentation fault on use and other binaries that appear to work but compile executables that segmentation fault themselves. Can someone please test if they can reproduce this? The current build running on https://hydra.nixos.org/job/nixos/trunk-combined/tested appears to have this issue. Programs built with this will build successfully but only fail upon being executed. I have not tested if this is limited to llvmPackages_11 or if other versions are affected.

To Reproduce

$ git checkout master
$ nix-shell --pure -A llvmPackages_11.clang-unwrapped --run ld
/tmp/nix-shell-22937-0/rc: line 1: 22984 Segmentation fault      (core dumped) ld
$ git checkout upstream/nixos-unstable
$ nix-shell --pure -A llvmPackages_11.clang-unwrapped --run ld
/nix/store/p792j5f44l3f0xi7ai5jllwnxqwnka88-binutils-2.31.1/bin/ld: no input files

master is currently 662d6a9cb1aaeb4081d75485b218f80b7b8c8727 upstream/nixos-unstable is currently 24c9b05ac53e422f1af81a156f1fd58499eb27fb

Additional context I have been trying to get Chromium built with LLD to run for the past two days wondering why my changes always cause it to segmentation fault despite the fact they should work.

Notify maintainers @lovek323 @7c6f434c @dtzWill @primeos @ggreif @DieGoldeneEnte

Metadata

TredwellGit commented 3 years ago

git bisect yields: 30286ebcc178aec1fb3797ba3ec88cf75feb282b is the first bad commit https://github.com/NixOS/nixpkgs/pull/95089 @edolstra @ma27

TredwellGit commented 3 years ago

Current guess: https://sourceware.org/bugzilla/show_bug.cgi?id=26534

https://en.wikipedia.org/wiki/FMA_instruction_set:

FMA4 is supported in AMD processors starting with the Bulldozer architecture. FMA4 was performed in hardware before FMA3 was. Support for FMA4 was removed since Zen 1.

TredwellGit commented 3 years ago

What CPUs does https://hydra.nixos.org/ use? Does it automatically deploy to nixos-unstable? This might get past testing.

TredwellGit commented 3 years ago

https://nixos.wiki/wiki/Nix_channels:

The nixos.org server has a cronjob for which nixos-channel-scripts are executed and poll for the newest jobset that satisfies the above two conditions and trigger a channel update.

I am going to risk being a little obnoxious to prevent nixos-unstable from being broken.

Ma27 commented 3 years ago

First of all, I think you've meant @edolstra not @eelco, please make sure you're using the right github handles for maintainers (those can be found in maintainers/maintainer-list.nix.

What CPUs does https://hydra.nixos.org/ use? Does it automatically deploy to nixos-unstable? This might get past testing.

nixos-unstable gets bumped as soon as a critical set of tests is successfully built on the CI infrastructure[1]. Unfortunately it's a known issue that we sometimes stumble upon problems related to the required instruction sets.

Current guess: https://sourceware.org/bugzilla/show_bug.cgi?id=26534

Will prepare a patch this weekend which fixes the package, thanks for the link!

I am going to risk being a little obnoxious to prevent nixos-unstable from being broken.

Sorry, I don't understand what you mean.

[1] Explanation can be found at https://howoldis.herokuapp.com/

TredwellGit commented 3 years ago

https://github.com/NixOS/nixpkgs/blob/c06b4077745fe1afa3a7e3e3d4504e0bd35b414c/pkgs/development/libraries/glibc/common.nix#L237 https://github.com/NixOS/nixpkgs/blob/c06b4077745fe1afa3a7e3e3d4504e0bd35b414c/maintainers/maintainer-list.nix#L27 Sorry, handle != github is not obvious.

If I understand correctly, https://hydra.nixos.org/build/128667841 contains this issue and will cause nixos-unstable to be broken. I changed the title to hopefully get someone to cancel the build before it hits production and breaks many systems.

DieGoldeneEnte commented 3 years ago

Can't reproduce (see output below, I get the same ld). What path does your failing ld have (the command I gave doesn't really return what I wanted :/)?

Also note, I have a CPU with fma and no fma4:

$ grep fma /proc/cpuinfo | uniq 
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
$ git status
HEAD detached at 662d6a9cb1a
$ nix-build . -A llvmPackages_11.bintools
these derivations will be built:
  /nix/store/bv37icm9s803q7farwcr7l03wqm54dpw-llvm-binutils-11.0.0.drv
building '/nix/store/bv37icm9s803q7farwcr7l03wqm54dpw-llvm-binutils-11.0.0.drv'...
/nix/store/08flkad738j47qp9bz2gvvaj0j364nc8-llvm-binutils-11.0.0
$ result/bin/ld
ld: error: no input files
$ nix-shell --pure -A llvmPackages_11.clang-unwrapped --run ld
/nix/store/hiwz81i1g3fn3p0acjs042a4h5fri6dh-binutils-2.31.1/bin/ld: no input files

Metadata:

EDIT: the command I gave doesn't really return what I wanted :/

Ma27 commented 3 years ago

@DieGoldeneEnte well, it's not reproducible because your system is still on glibc 2.31.

TredwellGit commented 3 years ago
$ git checkout 662d6a9cb1aaeb4081d75485b218f80b7b8c8727
$ nix-shell --pure -A llvmPackages_11.clang-unwrapped --run 'realpath "$(command -v ld)"'
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
TredwellGit commented 3 years ago

My system is still on glibc 2.31:

nix-store --query --requisites /run/current-system | rg glibc
/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31
/nix/store/fgn3sih5vi7543jcw389a7qqax8nwkhz-glibc-2.31-bin
/nix/store/1xpr86xg998h5acn4zqrx58xjdjqnnds-glibc-locales-2.31
/nix/store/4wy9j24psf9ny4di3anjs7yk2fvfb0gq-glibc-2.31-dev
/nix/store/62m4vwphy3d6ayrs7jb5lw0v6bx45p0r-glibc-iconv-2.31
/nix/store/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31
/nix/store/0nn11qvv1y9agkhiaaw5fhy2dpakp7km-glibc-multi-2.31
/nix/store/5gw82lwj14y3dara9i10x0r6yy43v3xp-glibc-2.31-bin
/nix/store/7isv255s5h9r4gbny93cgsl5h1msk02s-glibc-2.31-dev
/nix/store/f0brxzf4wdgcl54d1calpxj7k2yfmib3-glibc-multi-2.31-dev
/nix/store/skn0nnan2jgkcr0kkpn6k3nc2zfrakmd-glibc-multi-2.31-bin
vcunat commented 3 years ago

So we use the 2.31 upstream branch? It's a bit ugly patching but OK I guess; precedent+explanation: 51cf42ad0.

vcunat commented 3 years ago

⬆️ PR instead of pushing directly, so that people have time to react. With current Hydra speed it will take very long anyway.

TredwellGit commented 3 years ago

I just tested:

$ git checkout master
$ git pull upstream master
$ curl https://patch-diff.githubusercontent.com/raw/NixOS/nixpkgs/pull/100813.patch | git apply -
$ nix-shell --pure -A llvmPackages_11.clang-unwrapped --run ld
/tmp/nix-shell-25687-0/rc: line 1: 25732 Segmentation fault      (core dumped) ld
$ nix-store --query --requisites "$(nix-shell --pure -A llvmPackages_11.clang-unwrapped --run 'realpath "$(command -v ld)"')" | rg glibc
/nix/store/n047qmjrrrpkdv6lxl5vqja1dmg29na4-glibc-2.32-10
/nix/store/6qyklfxyn9kq22q0fi48iy8lxc1rn1s5-glibc-2.32-10-bin
/nix/store/2q9r0mdwjqkqqn1imy3l65h73dzgqr8p-glibc-2.32-10-dev

So maybe the FMA issue is not causing this?

vcunat commented 3 years ago

Well, SIGILL should not be reported as "Segmentation fault".

DieGoldeneEnte commented 3 years ago

I get the same output with the patch except for the segfault (also on an older Intel-based machine). Can you also try:

nix-shell --pure -A busybox --run ld

This reduces the amount of packages that need to be rebuild (and at least for me leads to executing the same binary):

$ git status
HEAD detached at upstream/master
$ curl https://patch-diff.githubusercontent.com/raw/NixOS/nixpkgs/pull/100813.patch | git apply -
$ nix-shell --pure -A busybox --run ld
/nix/store/xd7753hij54b5qvlpahagkn3801fsbqk-binutils-2.31.1/bin/ld: no input files
$ nix-store --query --requisites "$(nix-shell --pure -A busybox --run 'realpath "$(command -v ld)"')" | grep glibc
/nix/store/n047qmjrrrpkdv6lxl5vqja1dmg29na4-glibc-2.32-10
/nix/store/6qyklfxyn9kq22q0fi48iy8lxc1rn1s5-glibc-2.32-10-bin
/nix/store/2q9r0mdwjqkqqn1imy3l65h73dzgqr8p-glibc-2.32-10-dev
$ 
$ nix-shell --pure -A llvmPackages_11.clang-unwrapped --run ld
/nix/store/xd7753hij54b5qvlpahagkn3801fsbqk-binutils-2.31.1/bin/ld: no input files
$ nix-store --query --requisites "$(nix-shell --pure -A llvmPackages_11.clang-unwrapped --run 'realpath "$(command -v ld)"')" | grep glibc
/nix/store/n047qmjrrrpkdv6lxl5vqja1dmg29na4-glibc-2.32-10
/nix/store/6qyklfxyn9kq22q0fi48iy8lxc1rn1s5-glibc-2.32-10-bin
/nix/store/2q9r0mdwjqkqqn1imy3l65h73dzgqr8p-glibc-2.32-10-dev
TredwellGit commented 3 years ago

Both llvmPackages_10 and llvmPackages on master segmentation fault as well.

@DieGoldeneEnte:

DieGoldeneEnte commented 3 years ago

CPU microarchitecture is Zen2. I did the upgrade yesterday after I started trying to reproduce. I omitted the rest of git status, since I had some notes (pure text-files) in the nixpkgs-root, I moved them for now.

$ uname -r
5.8.14
$ nixos-option hardware.cpu.amd.updateMicrocode
Value:
true

Default:
false

Type:
"boolean"

Description:
''
  Update the CPU microcode for AMD processors.
''

Declared by:
[ "/nix/var/nix/profiles/per-user/root/channels/nixos/nixos/modules/hardware/cpu/amd-microcode.nix" ]

Defined by:
[ "/etc/nixos/machine/IDEFIX-NIX.nix" ]
$ git status
HEAD detached at upstream/master
nothing to commit, working tree clean
$ curl https://patch-diff.githubusercontent.com/raw/NixOS/nixpkgs/pull/100813.patch | git apply -
$ nix-shell --pure -A llvmPackages_11.clang-unwrapped --run ld
/nix/store/xd7753hij54b5qvlpahagkn3801fsbqk-binutils-2.31.1/bin/ld: no input files
$ nix-store --query --requisites "$(nix-shell --pure -A llvmPackages_11.clang-unwrapped --run 'realpath "$(command -v ld)"')" | grep glibc
/nix/store/n047qmjrrrpkdv6lxl5vqja1dmg29na4-glibc-2.32-10
/nix/store/6qyklfxyn9kq22q0fi48iy8lxc1rn1s5-glibc-2.32-10-bin
/nix/store/2q9r0mdwjqkqqn1imy3l65h73dzgqr8p-glibc-2.32-10-dev
TredwellGit commented 3 years ago

I ran nix-store --verify --check-contents --repair without issue and still segmentation fault. Can some other people please test and list their CPU microarchitecture?

DieGoldeneEnte commented 3 years ago

Both llvmPackages_10 and llvmPackages on master segmentation fault as well.

I am pretty sure the problem is not with llvm, since ld is not in clang-unwrapped, only in wrapped clang and llvm-binutils (and only llvm-binutils uses lld, wrapped-clang uses ld from binutils which doesn't depend on llvm).

That's why think this should be enough to test:

nix-shell --pure -A busybox --run ld

It executes the same binary (for me) as the llvmPackages_11.clang-unwrappedone.

EDIT: llvm-version doesn't matter (getting the executable for versions 5 to 11 on master):

$ for v in {5..11} ; do nix-shell --pure -A llvmPackages_$v.clang-unwrapped --run 'realpath "$(command -v ld)"'; done
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
/nix/store/h770niqkf08ksgm0n6a271kc21w95b13-binutils-wrapper-2.31.1/bin/ld
TredwellGit commented 3 years ago

Since a build with glibc 2.32 got past Hydra I am now no longer able to upgrade my system:

$ sudo nixos-rebuild boot --upgrade
unpacking channels...
Segmentation fault
TredwellGit commented 3 years ago

I can't even rebuild anymore!

$ sudo nixos-rebuild boot
Segmentation fault
$ nix-store --query --requisites /run/current-system | rg glibc
/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31
/nix/store/fgn3sih5vi7543jcw389a7qqax8nwkhz-glibc-2.31-bin
/nix/store/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31
/nix/store/1xpr86xg998h5acn4zqrx58xjdjqnnds-glibc-locales-2.31
/nix/store/4wy9j24psf9ny4di3anjs7yk2fvfb0gq-glibc-2.31-dev
/nix/store/62m4vwphy3d6ayrs7jb5lw0v6bx45p0r-glibc-iconv-2.31
/nix/store/0nn11qvv1y9agkhiaaw5fhy2dpakp7km-glibc-multi-2.31
/nix/store/5gw82lwj14y3dara9i10x0r6yy43v3xp-glibc-2.31-bin
/nix/store/7isv255s5h9r4gbny93cgsl5h1msk02s-glibc-2.31-dev
/nix/store/f0brxzf4wdgcl54d1calpxj7k2yfmib3-glibc-multi-2.31-dev
/nix/store/skn0nnan2jgkcr0kkpn6k3nc2zfrakmd-glibc-multi-2.31-bin
DieGoldeneEnte commented 3 years ago

You should be able to rollback the channel with:

# nix-channel --rollback

You could maybe also try to start ld in gdb and look at the backtrace (don't know how much info you can get from it).

EDIT: Not the executable you get from realpath above just points you to the wrapper, you need the executable at the bottom (/nix/store/p792j5f44l3f0xi7ai5jllwnxqwnka88-binutils-2.31.1/bin/ld for me).

TredwellGit commented 3 years ago

I figured it out. Setting environment.memoryAllocator.provider = "scudo"; breaks with glibc 2.32. I did not think that it would matter since I was using --pure.

vcunat commented 3 years ago

Well... I don't think this preloading we have in environment.memoryAllocator is done very well. Looking at ldd result/lib/linux/libclang_rt.scudo-x86_64.so, there's even libstdc++, so I suspect you force-override all these libraries in the packages you run. Of course, in usual cases they will be the same but not always – Nix people are used to mixing versions sometimes. IMHO memory allocator overrides are meant to just replace a few particular functions, not whole libc, libstdc++, etc.

EDIT: still, I also don't know why runs in nix-shell --pure could be affected.

S-NA commented 3 years ago

This is a known issue I believe, see #93116. For now the easiest work around is when you run into it is mv /etc/ld-nix.so.preload /etc/ld-nix.so.preload.tmp. I am not sure if it is an issue with preloading in environment.memoryAllocator being done badly. Firefox will segfault and so will Chromium when used with any memory allocator other than what they bundle. (Unless you select build flags which disable their memory allocators they use.) I think dynamic loading is naturally fragile and you should expect weird things.

As for why nix-shell --pure is affected, doing nix-shell --pure -p hello --run 'env | grep glibc' shows it loads glibc which has the patch that looks for /etc/ld-nix.so.preload even in nix-shell --pure.

https://github.com/NixOS/nixpkgs/blob/b3aed163d551a7c8829132022531f7526fa8a2ac/pkgs/development/libraries/glibc/common.nix#L69

One could possible expand the patch to check IN_NIX_SHELL and if it is set to pure do not load /etc/ld-nix.so.preload.

vcunat commented 3 years ago

One could possible expand the patch to check IN_NIX_SHELL and if it is set to pure do not load /etc/ld-nix.so.preload.

possible but I'm not really convinced. It would add even more magic and would it be a significant improvement?

If "pure" shells preserve $HOME, why not global preloads? If someone wants a hardened memory allocator system-wide, should their development environments be an exception? I don't know for sure. Perhaps nix-shell is trying to serve too different purposes by the same approach 🤷🏽 and in that situation I'd personally prefer more simplicity.

nixos-discourse commented 3 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/warning-unknown-setting-extra-sandbox-paths/9836/4

Nadrieril commented 3 years ago

Oh god, just spent 4 hours wrestling with half the binaries on my system segfaulting because of this after an upgrade to 20.09 (and the hardened profile). Removing /etc/ld-nix.so.preload fixed it, thanks so much @S-NA! The really bad thing that happened however is that if I tried to boot back into a previous system version, init itself segfaulted! I don't understand how adding a boot entry for a new boot entry corrupted an existing boot entry but it did. This made it extremely hard to recover anything, does someone understand how this happened?

Nadrieril commented 3 years ago

I suspect that during boot the init bash script read /etc/ld-nix.so.preload that was leftover from the previous boot, even though the system I was booting was not supposed to have that file. And that made bash segfault. So this prevented me from booting any system at all, except the one that actually expected /etc/ld-nix.so.preload to be there. But that one had plenty of other segfaults everywhere, because my userspace programs are from nixpkgs-unstable (currently glibc 2.32), whereas my system was on nixos-20.09 (currently glibc 2.31).

danderson commented 3 years ago

Adding a report: both bash and tailscaled (from pkg tailscale) pulled from unstable segfault on a 20.09 hardened NixOS. Setting environment.memoryAllocator.provider = "libc" (undoing the use of scudo) fixes it. This occurs on an Intel system, a Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz DigitalOcean VM. CPUID reports support for fma, but from googling that seems to be FMA3, with FMA4 not supported.

ju1m commented 3 years ago

Just to let you know that environment.memoryAllocator.provider = "scudo" is known to completely break NixOS when the glibc version is bumped, due to /etc/ld-nix.so.preload, as discussed in this old PR of mine trying to mitigate this problem: https://github.com/NixOS/nixpkgs/pull/96289

edolstra commented 3 years ago

Why does this option even exist? A global LD_PRELOAD sounds like a good way to break your system...

vcunat commented 3 years ago

There's a fat warning in the description at least 🤷🏽

danderson commented 3 years ago

Maybe it should be removed from the hardened profile in that case. As it stands, it's encouraging people to run an unsafe config, afaict?

7c6f434c commented 3 years ago

@danderson well, depends on your definition of safety. Programs built just right work, and programs where applying a memory protection measure fails, fail to start — depending on whether you threat model weights lack of service or potentially exploitable software as a larger problem, it might be good or bad I guess?

Nadrieril commented 3 years ago

What really bothers me about this option is that it broke NixOS's promise that rollbacks can always undo mistakes. In my case I ended up with every boot entry except one segfaulting during init. So I would really prefer this got removed from the hardened profile because I didn't even see any warning.

Ma27 commented 3 years ago

cc @emilazy @joachifm who are listed as maintainers in the hardened profile.

TredwellGit commented 3 years ago

Do people actually use the hardened profile? I see it more as a guide on security options available rather than something that can be used practically. I will submit a pull request to remove Scudo from it, but it would be better just to make the environment.memoryAllocator.provider = "scudo"; option do nothing until it works.

joachifm commented 3 years ago

Many of the measures in the hardened profile were never intended for normal deployments, especially not for desktop or development machines. If a measure improves security "for free" then it should be on by default, the hardened profile was intended for those things were there are real trade-offs. I had in mind that potential users would carefully consider each measure for their particular deployment scenario and tune the toggles accordingly. Simply including the profile wholesale without modification will break many useful things and trash performance, so you shouldn't use it unless that degree of feature and performance degradation is actually worth it (not the case for most users, I reckon).

TredwellGit commented 3 years ago

I will submit a pull request to remove Scudo from it

Never mind, I fixed the segmentation faults; pull request incoming.