NixOS / nixpkgs

(complicated) GUI applications running through Rosetta segfault #209242

Open flokli opened 1 year ago

flokli commented 1 year ago

I set up an aarch64-linux graphical NixOS system (nixpkgs master) inside UTM.

Rosetta is enabled, and I can successfully run an x86_64-linux xclock.

Most of the system is already aarch64-linux, but some applications are available for x86_64-linux only (Electron apps mostly).

I created a "forced x86_64-linux overlay" in my overlay.nix:

  pkgsx86_64 = import sources.nixpkgs {
    system = "x86_64-linux";
    config = {
      allowUnfree = true;
    };
    overlays = [];
  };

… and then referred to all x86_64-only applications via pkgsx86_64.$packageName.

Unfortunately, all these applications segfault :-/

❯ spotify
[1]    3205 segmentation fault (core dumped)  spotify

Unsurprisingly, gdb isn't very helpful:

❯ coredumpctl debug
           PID: 3205 (.spotify-wrappe)
           UID: 1000 (flokli)
           GID: 100 (users)
        Signal: 11 (SEGV)
     Timestamp: Thu 2023-01-05 22:45:05 UTC (27s ago)
  Command Line: /run/binfmt/rosetta /nix/store/zi2pql3pizz139b6pqag5glq8c2qd7hb-spotify-1.1.84.716.gc5f8b819/share/spotify/.spotify-wrapped
    Executable: /run/rosetta/rosetta
 Control Group: /user.slice/user-1000.slice/session-7.scope
          Unit: session-7.scope
         Slice: user-1000.slice
       Session: 7
     Owner UID: 1000 (flokli)
       Boot ID: 22594eeeb2624102b7bb2d3490081ccb
    Machine ID: 4bd940c09fc24a90b5be5ebcabd2634c
      Hostname: utm
       Storage: /var/lib/systemd/coredump/core.\x2espotify-wrappe.1000.22594eeeb2624102b7bb2d3490081ccb.3205.1672958705000000.zst (present)
  Size on Disk: 8.2K
       Message: Process 3205 (.spotify-wrappe) of user 1000 dumped core.

                Stack trace of thread 3205:
                #0  0x0000800000022800 n/a (n/a + 0x0)
                #1  0x000080000002c914 n/a (n/a + 0x0)
                #2  0x000080000002c914 n/a (n/a + 0x0)
                #3  0x000080000002a248 n/a (n/a + 0x0)
                #4  0x0000800000022070 n/a (n/a + 0x0)
                ELF object binary architecture: AARCH64

GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /run/rosetta/rosetta...
(No debugging symbols found in /run/rosetta/rosetta)

warning: core file may not match specified executable file.
[New LWP 3205]
Core was generated by `/run/binfmt/rosetta /nix/store/zi2pql3pizz139b6pqag5glq8c2qd7hb-spotify-1.1.84.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000800000022800 in ?? ()
(gdb) bt
#0  0x0000800000022800 in ?? ()
#1  0x00008000000766bc in ?? ()
#2  0x000000000000020b in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

I suspect some weird cross-arch graphics driver interaction, but I'm a bit lost. Does anyone have ideas?

cc @toonn @alyssais @sandydoo

flokli commented 1 year ago

Instead of virtualisation.rosetta.enable = true;, I tried boot.binfmt.emulatedSystems = [ "x86_64-linux" ];.

I could get saleae-logic to run, but the others (mostly Electron apps) still segfault.

Chrome itself seems to also be very angry:

❯ google-chrome-stable --no-sandbox
[0105/230925.555828:WARNING:crashpad_client_linux.cc(362)] prctl: Invalid argument (22)
[13183:13183:0105/230926.830157:ERROR:nacl_fork_delegate_linux.cc(313)] Bad NaCl helper startup ack (0 bytes)
/nix/store/r17ihqafckhr6ykz4xjr1wz4nhi338ya-gvfs-1.50.2/lib/gio/modules/libgvfsdbus.so: cannot open shared object file: No such file or directory
Failed to load module: /nix/store/r17ihqafckhr6ykz4xjr1wz4nhi338ya-gvfs-1.50.2/lib/gio/modules/libgvfsdbus.so

(google-chrome:13148): Gtk-WARNING **: 23:09:29.610: Could not load a pixbuf from icon theme.
This may indicate that pixbuf loaders or the mime database could not be found.
[13148:13148:0105/230931.202880:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.431947:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.535923:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.592585:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.643566:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.666660:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.682175:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.902809:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.949431:ERROR:gpu_process_host.cc(984)] GPU process launch failed: error_code=1002
[13148:13148:0105/230931.949532:FATAL:gpu_data_manager_impl_private.cc(440)] GPU process isn't usable. Goodbye.
**
ERROR:../accel/tcg/cpu-exec.c:954:cpu_exec: assertion failed: (cpu == current_cpu)
Bail out! ERROR:../accel/tcg/cpu-exec.c:954:cpu_exec: assertion failed: (cpu == current_cpu)
[13239:13245:0105/230938.368949:ERROR:ssl_client_socket_impl.cc(982)] handshake failed; returned -1, SSL error code 1, net_error -3
[1]    13148 trace trap (core dumped)  google-chrome-stable --no-sandbox
[13239:13245:0105/230938.374599:ERROR:ssl_client_socket_impl.cc(982)] handshake failed; returned -1, SSL error code 1, net_error -3
[13239:13245:0105/230938.376284:ERROR:ssl_client_socket_impl.cc(982)] handshake failed; returned -1, SSL error code 1, net_error -3
[13239:13245:0105/230938.376514:ERROR:ssl_client_socket_impl.cc(982)] handshake failed; returned -1, SSL error code 1, net_error -3
[13239:13245:0105/230938.376841:ERROR:ssl_client_socket_impl.cc(982)] handshake failed; returned -1, SSL error code 1, net_error -3
[13239:13245:0105/230938.377012:ERROR:ssl_client_socket_impl.cc(982)] handshake failed; returned -1, SSL error code 1, net_error -3

flokli commented 1 year ago

Okay, that crash seems to be a qemu bug: https://gitlab.com/qemu-project/qemu/-/issues/1147

bouk commented 1 year ago

@flokli I found this thread by googling '0x0000800000022800' 😄

I'm getting a very similar stack trace when doing this:

$ nix shell github:oxalica/rust-overlay#packages.x86_64-linux.rust
$ cargo --version
Segmentation fault (core dumped)

$ gdb cargo
(gdb) r
Starting program: /nix/store/qz8gvkxcyiidg4rrrlgif65ca9r8xka9-rust-default-1.67.0/bin/cargo
warning: Selected architecture i386:x86-64 is not compatible with reported target architecture aarch64
warning: Architecture rejected target-supplied description

Program received signal SIGSEGV, Segmentation fault.
0x0000800000022800 in ?? ()
(gdb) b
Breakpoint 1 at 0x800000022800
(gdb) bt
#0  0x0000800000022800 in ?? ()
#1  0x00008000000766bc in ?? ()
#2  0x0000ffffffffd440 in ?? ()
#3  0x3000702d2d720030 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Weirdly, this doesn't happen when I do nix run nixpkgs#legacyPackages.x86_64-linux.cargo -- --version.

Also, if I run the program under valgrind (nix shell nixpkgs#legacyPackages.x86_64-linux.valgrind, then valgrind -v cargo), it runs just fine...

I'm also using the rosetta nixos module.

My hypothesis is some sort of impurity that leads to an incorrect binary...

bouk commented 1 year ago

Discovered something interesting:

$ nix build nixpkgs#legacyPackages.x86_64-linux.rust.packages.prebuilt.cargo
$ /run/rosetta/rosetta $(patchelf --print-interpreter result/bin/.cargo-wrapped) result/bin/.cargo-wrapped --version
cargo 1.65.0 (4bc8f24d3 2022-10-20)

$ /run/rosetta/rosetta result/bin/.cargo-wrapped --version
Segmentation fault (core dumped)

It seems rosetta can't handle the interpreter being patched for dynamic libraries. Perhaps it doesn't use the PT_INTERP at all?

We could work around this by changing the binfmt. @flokli can you try the above commands for your programs and see if that resolves things?
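To check the PT_INTERP hypothesis on a particular binary, it helps to see where that program header points inside the file. The following is a minimal Python sketch, not something used in this thread: the function name find_interp is made up for illustration, and it only handles little-endian ELF64 images such as x86_64-linux binaries. It walks the program header table and returns the file offset, size, and path stored in PT_INTERP.

```python
import struct

PT_INTERP = 3  # program header type holding the interpreter path (ELF spec)


def find_interp(data: bytes):
    """Return (file_offset, size, path) of PT_INTERP, or None if absent.

    Hypothetical helper; expects the raw bytes of a little-endian ELF64 file.
    """
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    # ELF64 header fields: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # Elf64_Phdr starts with: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz.
        p_type, _flags, p_offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", data, base
        )
        if p_type == PT_INTERP:
            raw = data[p_offset : p_offset + p_filesz]
            return p_offset, p_filesz, raw.rstrip(b"\0").decode()
    return None
```

Calling find_interp(open(path, "rb").read()) on a patched and an unpatched copy of the same binary should show whether the interpreter string has moved away from the start of the file; readelf -l reports the same information.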

flokli commented 1 year ago

@bouk what exactly should I try? I don't have a differently linked signal-desktop binary...

bouk commented 1 year ago

Try running this:

nix shell nixpkgs#patchelf # Or try installing patchelf into your systemPackages
$(patchelf --print-interpreter $(which spotify)) spotify

flokli commented 1 year ago

Ah, you mean manually invoking the interpreter from the interpreter field... Interesting, I'll try and report back.

bouk commented 1 year ago

Doing some stracing reveals more information:

strace ./cargo2
execve("./cargo2", ["./cargo2"], 0xffffec8f0eb0 /* 45 vars */) = 0
openat(AT_FDCWD, "/proc/self/exe", O_RDONLY) = 4
ioctl(4, _IOC(_IOC_READ, 0x61, 0x22, 0x45), 0xffffe95ee350) = 1
close(4)                                = 0
gettid()                                = 7323
getpid()                                = 7323
openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 4
pread64(4, "800000000000-800000022000 r--p 0"..., 4170, 0) = 523
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff988b4000
pread64(4, "", 4170, 523)               = 0
close(4)                                = 0
openat(AT_FDCWD, "/proc/sys/vm/mmap_min_addr", O_RDONLY) = 4
read(4, "4096\n", 1023)                 = 5
close(4)                                = 0
readlinkat(AT_FDCWD, "/proc/self/fd/3", "/home/nix/cargo2", 4095) = 16
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0C\0\0\0\0\0"..., 64) = 64
mmap(NULL, 792, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffff988b3000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffff9a13b000} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)

Only the first 792 bytes of the binary are mmapped, while patchelf has moved the .interp section to the end of the file, as the output of patchelf --debug shows:

patching ELF file 'cargo2'
replacing section '.interp' with size 28
this is a dynamic library
last page is 0xf85000
first page is 0x0
needed space is 6472
shifting new PT_LOAD segment by 9449472 bytes to work around a Linux kernel bug
rewriting section '.interp' from offset 0x2e0 (size 28) to offset 0x1888000 (size 28)
rewriting section '.note.ABI-tag' from offset 0x2fc (size 32) to offset 0x1888020 (size 32)
rewriting section '.dynsym' from offset 0x320 (size 6408) to offset 0x1888040 (size 6408)
rewriting symbol table section 36
rewriting symbol table section 41
writing cargo2

So it seems that rosetta tries to read .interp and fails because it hasn't memory-mapped that part of the file. Notice that the faulting address minus the base of the mapping, 0xffff9a13b000 - 0xffff988b3000 = 0x1888000, is exactly the new offset of .interp. This gives us something to work with! I can file a bug with Apple.
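The subtraction is easy to re-check for any strace run. Here is a small Python sketch using only numbers that appear in this thread; the helper name fault_offset is made up for illustration, and the second address pair comes from the strace -i run quoted in the Apple bug report further down.

```python
# (mmap return value, si_addr of the SIGSEGV) copied from the strace output.
RUNS = {
    "strace ./cargo2": (0xFFFF988B3000, 0xFFFF9A13B000),
    "strace -i ./cargo2": (0xFFFF82347000, 0xFFFF83BCF000),
}
INTERP_OFFSET = 0x1888000  # new .interp offset reported by patchelf --debug
MAPPED_LEN = 792  # length passed to the mmap call on the binary


def fault_offset(map_base: int, fault_addr: int) -> int:
    """Offset of the faulting access relative to the start of the mapping."""
    return fault_addr - map_base


for name, (base, fault) in RUNS.items():
    off = fault_offset(base, fault)
    print(
        f"{name}: offset {off:#x}, "
        f"equals .interp offset: {off == INTERP_OFFSET}, "
        f"inside the {MAPPED_LEN} mapped bytes: {off < MAPPED_LEN}"
    )
```

Although the mapping lands at a different address in each run, the distance between the mapping and the fault is the same, which is what points at a fixed file offset rather than a random bad pointer.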

bouk commented 1 year ago

I've submitted the following bug report to Apple under FB11984253:

Hello, I'm trying out Rosetta for Linux in NixOS using UTM.app. I'm running into a segmentation fault inside Rosetta when trying to execute a binary that has an .interp section that's not close to the beginning of the binary. To reproduce the exact binary I'm using, please do the following (I've also attached a copy):

  1. Download and unpack https://static.rust-lang.org/dist/rust-1.66.0-x86_64-unknown-linux-gnu.tar.gz
  2. cp rust-1.66.0-x86_64-unknown-linux-gnu/cargo/bin/cargo cargo2
  3. Execute https://github.com/NixOS/patchelf (I'm using version 0.17.2) as follows: patchelf --debug --set-interpreter /lib64/ld-linux-x86-64.so.2 cargo2
  4. rosetta ./cargo2

Here's what I get when I run strace -i ./cargo2 (note the instruction address is in the rosetta program space):

strace -i ./cargo2
[0000ffff93ff504c] execve("./cargo2", ["./cargo2"], 0xffffc8c8c658 /* 45 vars */) = 0
[000080000002306c] openat(AT_FDCWD, "/proc/self/exe", O_RDONLY) = 4
[0000800000022e04] ioctl(4, _IOC(_IOC_READ, 0x61, 0x22, 0x45), 0xfffff6244340) = 1
[0000800000022a80] close(4)             = 0
[0000800000022d6c] gettid()             = 8473
[0000800000023580] getpid()             = 8473
[000080000002306c] openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 4
[00008000000230f0] pread64(4, "800000000000-800000022000 r--p 0"..., 4170, 0) = 523
[0000800000022f64] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff82348000
[00008000000230f0] pread64(4, "", 4170, 523) = 0
[0000800000022a94] close(4)             = 0
[000080000002306c] openat(AT_FDCWD, "/proc/sys/vm/mmap_min_addr", O_RDONLY) = 4
[00008000000231cc] read(4, "4096\n", 1023) = 5
[0000800000022a94] close(4)             = 0
[00008000000231f8] readlinkat(AT_FDCWD, "/proc/self/fd/3", "/home/nix/cargo2", 4095) = 16
[00008000000231cc] read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0C\0\0\0\0\0"..., 64) = 64
[0000800000022f64] mmap(NULL, 792, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffff82347000
[0000800000022878] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffff83bcf000} ---
[????????????????] +++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)

As you can see it segfaults because it tries to access a value 0x1888000 bytes into the binary while only 792 bytes have been mmapped. This makes sense when you look at the debug log of patchelf:

patching ELF file 'cargo2'
replacing section '.interp' with size 28
this is a dynamic library
last page is 0xf85000
first page is 0x0
needed space is 6472
shifting new PT_LOAD segment by 9449472 bytes to work around a Linux kernel bug
rewriting section '.interp' from offset 0x2e0 (size 28) to offset 0x1888000 (size 28)
rewriting section '.note.ABI-tag' from offset 0x2fc (size 32) to offset 0x1888020 (size 32)
rewriting section '.dynsym' from offset 0x320 (size 6408) to offset 0x1888040 (size 6408)
rewriting symbol table section 36
rewriting symbol table section 41
writing cargo2

Running readelf -e cargo2 also provides useful information about the structure of the binary. I've attached its output as cargo2.elf.txt.

This binary was produced using https://github.com/NixOS/patchelf which is a tool that NixOS uses to modify dynamically linked binaries. It moves the .interp section to the back of the binary to safely modify the sections.

Using UTM Version 4.1.5 (74)

Output of /run/rosetta/rosetta:

Usage: rosetta <x86_64 ELF to run>

Optional environment variables:
ROSETTA_DEBUGSERVER_PORT    wait for a debugger connection on given port

version: Rosetta-289.7

Output of uname -a:

Linux nixos-builder 5.15.89 #1-NixOS SMP Wed Jan 18 10:48:59 UTC 2023 aarch64 GNU/Linux

Some discussion is also at the following GitHub issue: https://github.com/NixOS/nixpkgs/issues/209242

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/running-nixos-on-macos-with-rosetta-segfaults/25351/1

norbertwnuk commented 12 months ago

@bouk - any progress with FB11984253 on Apple's side?

bouk commented 12 months ago

Nope, haven't heard anything from Apple.

zhaofengli commented 11 months ago

I gave it a try and made https://github.com/zhaofengli/rosetta-spice to patch Rosetta to fix the problem, and there is a NixOS module that will configure everything. It hooks sys_mmap to map enough of the binary to cover PT_INTERP. Hopefully this will all become obsolete soon - I want things to work now so I got my hands dirty 😛
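As a rough illustration of the idea of mapping enough of the binary, and explicitly not rosetta-spice's actual code, the Python sketch below computes how long a mapping would have to be so that it covers both the originally requested range and the PT_INTERP data. The function names and the 4096-byte page size are assumptions made for this example.

```python
PAGE_SIZE = 4096  # assumed page size; the guest kernel may use a different one


def round_up(n: int, align: int = PAGE_SIZE) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def extended_map_length(requested_len: int, interp_offset: int, interp_size: int) -> int:
    """Length that covers the requested range and all of PT_INTERP.

    Hypothetical helper: a hook around mmap could pass this length instead of
    the original one when the file being mapped is the guest binary.
    """
    needed = interp_offset + interp_size
    return max(round_up(requested_len), round_up(needed))


# Values from this thread: 792 bytes requested, .interp moved to 0x1888000 (28 bytes).
print(hex(extended_map_length(792, 0x1888000, 28)))
```

With the numbers from the patched cargo2 binary, the mapping has to grow from under one page to roughly 24.5 MiB, which shows how far from the start of the file patchelf placed the section.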

As a bonus, it also allows you to use AOT without needing the host to configure it. This requires either macOS Sonoma or setting virtualisation.rosetta-spice.rosettaPkg to packages.aarch64-linux.rosetta from the flake. However, AOT appears to be buggy at the moment and complex programs either segfault when running or OOM during translation.

With AOT enabled: [screenshot]

zhaofengli commented 10 months ago

Looks like the segfault no longer occurs on Sonoma Beta 5 (23A5312d)! If you don't want to upgrade to the beta or want to try AOT, you can use rosetta-spice to get the version (the segfault fix no longer has an effect).

cor commented 8 months ago

Can we confirm that this issue is indeed fixed in the released version of Sonoma, and close this issue?

astr0n8t commented 2 weeks ago

I just set up a VM running on UTM with rosetta, and after installing ida-free it just works via X11 forwarding. Not sure how that affects it, but it seems to work just fine.