erlang / otp


VM segfaults on NetBSD 10 amd64 #8550

Closed xorrvin closed 3 months ago

xorrvin commented 3 months ago

Describe the bug

Trying to install vix (an Elixir wrapper for the libvips graphics library) crashes the VM with a segfault. The package uses native code, so a compile/make phase is involved; however, the crash occurs after all native code has been linked.

To Reproduce

This may be quite tedious, but in essence you'll need to spin up a NetBSD 10 VM and bootstrap pkgsrc. Erlang and Elixir are in the main tree, and libvips can be found at https://github.com/NetBSD/pkgsrc-wip/tree/master/libvips. I understand this is a lot to ask, so I can provide access to my VM on request or run commands for you if needed.

Anyway, it goes like this.

This is needed to indicate that the target library is already provided by the system; otherwise the vix build script would try to compile it and fail on NetBSD:

netbsd$ export VIX_COMPILATION_MODE=PLATFORM_PROVIDED_LIBVIPS

Cloning the latest version (the exact version doesn't really matter at this point):

netbsd$ git clone https://github.com/akash-akya/vix
Cloning into 'vix'...
remote: Enumerating objects: 2849, done.
remote: Counting objects: 100% (901/901), done.
remote: Compressing objects: 100% (308/308), done.
Receiving objects: 100% (2849/2849), 1.03 MiB | 2.34 MiB/s, done.
remote: Total 2849 (delta 686), reused 637 (delta 578), pack-reused 1948
Resolving deltas: 100% (1880/1880), done.

Installing deps:

netbsd$ export MIX_ENV=prod
netbsd$ mix deps.get
Resolving Hex dependencies...
Resolution completed in 0.124s
Unchanged:
  bunt 0.2.1
  castore 1.0.4
  cc_precompiler 0.1.8
  credo 1.7.1
  dialyxir 1.3.0
  earmark_parser 1.4.37
  elixir_make 0.7.7
  erlex 0.2.6
  ex_doc 0.30.7
  excoveralls 0.18.0
  file_system 0.2.10
  fss 0.1.1
  jason 1.4.1
  kino 0.11.0
  makeup 1.1.0
  makeup_elixir 0.16.1
  makeup_erlang 0.1.2
  nimble_parsec 1.3.1
  table 0.1.2
  temp 0.4.7
* Getting kino (Hex package)
* Getting elixir_make (Hex package)
* Getting cc_precompiler (Hex package)
* Getting castore (Hex package)
* Getting credo (Hex package)
* Getting dialyxir (Hex package)
* Getting ex_doc (Hex package)
* Getting excoveralls (Hex package)
* Getting temp (Hex package)
* Getting jason (Hex package)
* Getting earmark_parser (Hex package)
* Getting makeup_elixir (Hex package)
* Getting makeup_erlang (Hex package)
* Getting makeup (Hex package)
* Getting nimble_parsec (Hex package)
* Getting erlex (Hex package)
* Getting bunt (Hex package)
* Getting file_system (Hex package)
* Getting fss (Hex package)
* Getting table (Hex package)

Compilation:

netbsd$ mix compile
==> table
Compiling 5 files (.ex)
Generated table app
==> fss
Compiling 4 files (.ex)
Generated fss app
==> kino
Compiling 46 files (.ex)
Generated kino app
==> castore
Compiling 1 file (.ex)
Generated castore app
==> elixir_make
Compiling 6 files (.ex)
Generated elixir_make app
==> cc_precompiler
Compiling 3 files (.ex)
Generated cc_precompiler app
==> vix
gmake[1]: Entering directory '/home/builder/elixirtest/vix/c_src'
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o g_object/g_boxed.o g_object/g_boxed.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o g_object/g_object.o g_object/g_object.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o g_object/g_param_spec.o g_object/g_param_spec.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o g_object/g_type.o g_object/g_type.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o g_object/g_value.o g_object/g_value.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o pipe.o pipe.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o utils.o utils.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o vips_boxed.o vips_boxed.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o vips_foreign.o vips_foreign.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o vips_image.o vips_image.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o vips_interpolate.o vips_interpolate.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o vips_operation.o vips_operation.c
cc -c -D_POSIX_C_SOURCE=200809L -fPIC -I /opt/pkg/lib/erlang/erts-14.2.5/include -I /opt/pkg/lib/erlang/usr/include `pkg-config vips --cflags` -o vix.o vix.c
cc g_object/g_boxed.o g_object/g_object.o g_object/g_param_spec.o g_object/g_type.o g_object/g_value.o pipe.o utils.o vips_boxed.o vips_foreign.o vips_image.o vips_interpolate.o vips_operation.o vix.o -o /home/builder/elixirtest/vix/_build/prod/lib/vix/priv/vix.so -shared -L /opt/pkg/lib/erlang/usr/lib `pkg-config vips --libs`
gmake[1]: Leaving directory '/home/builder/elixirtest/vix/c_src'
Compiling 28 files (.ex)
[1]   Segmentation fault (core dumped) mix compile

Trying again, to ensure it's not connected to the native code:

netbsd$ mix compile --verbose
Compiling with make: gmake all
gmake[1]: Entering directory '/home/builder/elixirtest/vix/c_src'
gmake[1]: Leaving directory '/home/builder/elixirtest/vix/c_src'
Compiling 28 files (.ex)
Compiled lib/vix/g_object/g_param_spec.ex
[1]   Segmentation fault (core dumped) mix compile --verbose

Expected behavior

No crash

Affected versions

Erlang v26.2.5, Elixir v1.14.5, Elixir v1.16.2

Additional context

netbsd$ uname -a
NetBSD netbsd 10.0 NetBSD 10.0 (GENERIC) #0: Thu Mar 28 08:33:33 UTC 2024  mkrepro@mkrepro.NetBSD.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64

netbsd$ erl
Erlang/OTP 26 [erts-14.2.5] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit:ns]

Eshell V14.2.5 (press Ctrl+G to abort, type help(). for help)

netbsd$ file /opt/pkg/lib/erlang/erts-14.2.5/bin/beam.smp
/opt/pkg/lib/erlang/erts-14.2.5/bin/beam.smp: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /usr/libexec/ld.elf_so, for NetBSD 10.0, with debug_info, not stripped

I thought that maybe it crashes due to some unresolved native dependency, but the .so looks all right:

netbsd$ ldd _build/prod/lib/vix/priv/vix.so 
vix.so:
    -lvips.42 => /opt/pkg/lib/libvips.so.42
    -lglib-2.0.0 => /opt/pkg/lib/libglib-2.0.so.0
    -liconv.2 => /opt/pkg/lib/libiconv.so.2
    -lc.12 => /usr/lib/libc.so.12
    -lintl.8 => /opt/pkg/lib/libintl.so.8
    -lm.0 => /usr/lib/libm.so.0
    -lpcre2-8.0 => /opt/pkg/lib/libpcre2-8.so.0
    -lpthread.1 => /usr/lib/libpthread.so.1
    -lgio-2.0.0 => /opt/pkg/lib/libgio-2.0.so.0
    -lgobject-2.0.0 => /opt/pkg/lib/libgobject-2.0.so.0
    -lffi.8 => /opt/pkg/lib/libffi.so.8
    -lgmodule-2.0.0 => /opt/pkg/lib/libgmodule-2.0.so.0
    -lz.1 => /opt/pkg/lib/libz.so.1
    -lexpat.1 => /opt/pkg/lib/libexpat.so.1
    -lexif.12 => /opt/pkg/lib/libexif.so.12
    -ljpeg.9 => /opt/pkg/lib/libjpeg.so.9
    -lpng16.16 => /opt/pkg/lib/libpng16.so.16

I'm really puzzled about what goes wrong, because opening the core dump reveals nothing informative. I'm attaching the core dump here rather than pasting its contents: coredump.tgz

xorrvin commented 3 months ago

Update: I've compiled Erlang 26.2.5 from source and additionally built the debug VM. I've changed the Elixir launcher script so that it executes cerl -debug, and now it crashes with a different core file (still unintelligible, though): coredump_debug.tgz

I tried running a shell under gdb, so that I could execute mix compile from within the debugger, and got this (I've removed most of the repeated "JITed symbol file is not an object file, ignoring it." messages):

(gdb) r -c "mix compile"
Starting program: /bin/sh -c "mix compile"
process 18498 is executing new program: /usr/bin/env
process 18498 is executing new program: /bin/sh
[New process 18498]
process 18498 is executing new program: /bin/sh
[New process 18498]
process 18498 is executing new program: /root/otp_src_26.2.5/bin/x86_64-unknown-netbsd10.0/erlexec
process 18498 is executing new program: /root/otp_src_26.2.5/bin/x86_64-unknown-netbsd10.0/beam.debug.smp
[New LWP 13717 of process 18498]
[New LWP 22609 of process 18498]
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
[New LWP 8870 of process 18498]
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
[New LWP 5249 of process 18498]
[New LWP 11159 of process 18498]
[New LWP 7127 of process 18498]
[New LWP 944 of process 18498]
[New LWP 7076 of process 18498]
[New LWP 28065 of process 18498]
[New LWP 16383 of process 18498]
[New LWP 17406 of process 18498]
[New LWP 16425 of process 18498]
[New LWP 2267 of process 18498]
[New LWP 17568 of process 18498]
[New LWP 28090 of process 18498]
[New LWP 6614 of process 18498]
[New LWP 29903 of process 18498]
[New LWP 147 of process 18498]
[New LWP 26091 of process 18498]
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
gmake[1]: Entering directory '/home/builder/elixirtest/vix/c_src'
gmake[1]: Leaving directory '/home/builder/elixirtest/vix/c_src'
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
Compiling 28 files (.ex)
JITed symbol file is not an object file, ignoring it.
JITed symbol file is not an object file, ignoring it.
--Type <RET> for more, q to quit, c to continue without paging--

Thread 7 "" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 5249 of process 18498]
_rtld_call_ifunc (obj=0x7044910ed400, mask=mask@entry=0x7044d29baa90, cur_objgen=cur_objgen@entry=3)
    at /usr/src/libexec/ld.elf_so/reloc.c:325
325             *where = target;
(gdb) bt
#0  _rtld_call_ifunc (obj=0x7044910ed400, mask=mask@entry=0x7044d29baa90, cur_objgen=cur_objgen@entry=3)
    at /usr/src/libexec/ld.elf_so/reloc.c:325
#1  0x00007f7fb22065ad in _rtld_call_ifunc_functions (cur_objgen=3, obj=<optimized out>, mask=0x7044d29baa90)
    at /usr/src/libexec/ld.elf_so/rtld.c:273
#2  _rtld_call_ifunc_functions (cur_objgen=3, obj=<optimized out>, mask=0x7044d29baa90)
    at /usr/src/libexec/ld.elf_so/rtld.c:266
#3  _rtld_call_init_functions (mask=mask@entry=0x7044d29baa90) at /usr/src/libexec/ld.elf_so/rtld.c:297
#4  0x00007f7fb2207698 in dlopen (name=<optimized out>, mode=2) at /usr/src/libexec/ld.elf_so/rtld.c:1082
#5  0x00000000007949f4 in erts_sys_ddll_open_noext (
    dlname=0x7044d414c1e8 "/home/builder/elixirtest/vix/_build/prod/lib/vix/priv/vix.so", handle=0x7044d29bac30, 
    err=0x7044d29babd0) at sys/unix/erl_unix_sys_ddll.c:131
#6  0x00000000007949a7 in erts_sys_ddll_open (
    full_name=0x7044d414c188 "/home/builder/elixirtest/vix/_build/prod/lib/vix/priv/vix", handle=0x7044d29bac30, 
    err=0x7044d29babd0) at sys/unix/erl_unix_sys_ddll.c:116
#7  0x000000000071a0c0 in erts_load_nif (c_p=0x70448afe9bb0, I=0x7f7ff6d26214, filename=123439695930610, args=15)
    at beam/erl_nif.c:4678
#8  0x000000000048db29 in beam_jit_load_nif (c_p=0x70448afe9bb0, I=0x7f7ff6d26214, reg=0x7044d29badc0)
    at beam/jit/beam_jit_common.cpp:683
#9  0x00007f7ff5ef03d8 in ?? ()
#10 0x0000000000000000 in ?? ()

So it looks like it crashes upon opening the compiled library. My current theory is that the library does another dlopen and something goes wrong there?
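One way to test that theory without the Erlang VM in the picture is to dlopen a library from a throwaway process. The following is only a sketch under assumptions, not something that was run in this thread: it uses Python's ctypes (which calls dlopen underneath) in a child process, so a crash inside the dynamic linker shows up as a signal exit status instead of killing the caller. The helper name probe_dlopen is hypothetical, and using RTLD_NOW to mirror mode=2 from the backtrace assumes that RTLD_NOW is 2 on this platform.

```python
import subprocess
import sys

# Code run in the child: ctypes.CDLL() calls dlopen() on the given path.
_CHILD = (
    "import ctypes, os, sys\n"
    "ctypes.CDLL(sys.argv[1], mode=os.RTLD_NOW)\n"
)


def probe_dlopen(path):
    """Load `path` in a child process and classify the outcome.

    Returns "loaded", "load-error" (dlopen reported a failure),
    or "signal N" if the child was killed by signal N.
    """
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode == 0:
        return "loaded"
    if proc.returncode < 0:
        # A negative return code means the child died from a signal.
        return "signal %d" % -proc.returncode
    return "load-error"


if __name__ == "__main__":
    # Paths to probe are passed on the command line.
    for target in sys.argv[1:]:
        print(target, probe_dlopen(target))
```

Note that vix.so itself refers to NIF API symbols that only the VM provides, so loading it outside the VM would probably just report a load error. Probing its dependencies from the ldd output (for example /opt/pkg/lib/libvips.so.42) is likely the more telling check.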

xorrvin commented 3 months ago

Another update: I've isolated it to the libvips version. The Git version works fine, while the latest stable release segfaults the VM.

xorrvin commented 3 months ago

Upon further debugging, it turns out that the culprit is how the native code is linked: removing -Wl,-z,relro from the library's linker arguments solves the issue. It seems to be NetBSD-specific:

https://mail-index.netbsd.org/netbsd-bugs/2023/12/26/msg080904.html
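To see which of the libraries in the ldd output were linked with -z relro, one can look for a PT_GNU_RELRO program header in each file. A minimal sketch, assuming 64-bit little-endian ELF files (as on amd64); has_gnu_relro is a hypothetical helper, and it only reports whether the segment is present, not whether a given library actually triggers the NetBSD bug.

```python
import struct
import sys

PT_GNU_RELRO = 0x6474E552  # program header type the linker emits for -z relro


def has_gnu_relro(data):
    """Return True if the ELF64 little-endian image `data` has a PT_GNU_RELRO segment."""
    if len(data) < 64 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    # ELF64 header layout: e_phoff at offset 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for i in range(e_phnum):
        # p_type is the first 4-byte field of every program header entry.
        (p_type,) = struct.unpack_from("<I", data, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_RELRO:
            return True
    return False


if __name__ == "__main__":
    # Library paths to inspect are passed on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, has_gnu_relro(f.read()))
```

Running it over vix.so and the /opt/pkg/lib entries listed by ldd would show which objects carry the segment, which may help narrow down where the flag was applied.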

garazdawi commented 3 months ago

Seems like this is not an issue with Erlang/OTP so I'm closing this issue.