Erlang/OTP
http://erlang.org

ERL-895: VM crash in cleanup of node entries during process terminate #3799

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 5 years ago

Original reporter: silviu
Affected version: OTP-21.1
Component: erts
Migrated from: https://bugs.erlang.org/browse/ERL-895


Hello,

We have a custom ejabberd cluster and recently upgraded from OTP 19.3 to OTP 21.1.

After 2 days of running in production, one of the machines suddenly crashed.

The core dump looks like:

#0  0x000000000051547f in ethr_native_atomic64_add_return_mb (incr=-1, var=0x11) at ../include/internal/x86_64/../i386/atomic.h:240
#1  ethr_atomic_add_read (val=-1, var=0x11) at ../include/internal/ethr_atomics.h:4219
#2  ethr_atomic_dec_read (var=0x11) at ../include/internal/ethr_atomics.h:4806
#3  erts_refc_dectest (min_val=0, refcp=0x11) at beam/sys.h:962
#4  erts_deref_node_entry (np=0x1) at beam/erl_node_tables.h:234
#5  erts_cleanup_offheap (offheap=offheap@entry=0x7f527153fdc0) at beam/erl_message.c:184
#6  0x00000000005157b5 in erts_cleanup_messages (msgp=<optimized out>) at beam/erl_message.c:227
#7  0x000000000046f5a5 in delete_process (p=0x7f527141d768) at beam/erl_process.c:11843
#8  erts_continue_exit_process (p=0x7f527141d768) at beam/erl_process.c:12487
#9  erts_do_exit_process () at beam/erl_process.c:12210
#10 0x00000000004681da in terminate_proc (Value=523, c_p=0x7f527141d768) at beam/beam_emu.c:1613
#11 handle_error (c_p=0x7f527141d768, pc=<optimized out>, reg=<optimized out>, bif_mfa=<optimized out>) at beam/beam_emu.c:1467
#12 0x000000000046424b in process_main () at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:349
#13 0x000000000044ef0b in sched_thread_func (vesdp=0x7f54446ac500) at beam/erl_process.c:8332
#14 0x00000000006880e9 in thr_wrapper (vtwd=0x7ffe80d6b8d0) at pthread/ethread.c:118
#15 0x00007f548c0b06ba in start_thread (arg=0x7f5440c39700) at pthread_create.c:333
#16 0x00007f548bbde41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Any hints on how we can isolate the problem?

Silviu
OTP-Maintainer commented 5 years ago

sverker said:

This looks like a process heap corruption.
Instead of finding a pointer to a node entry, it finds the value 1, which is then passed to erts_deref_node_entry().
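
To see why the faulting address in frame #0 is 0x11 rather than 0x1, it may help to subtract the two values reported in the backtrace. The sketch below only does that arithmetic on numbers copied from frames #4 and #3; reading the difference as the offset of the reference counter inside a node entry is an inference from the trace, not something stated in this thread.

```shell
# Values copied from the backtrace above.
np=0x1      # frame #4: the bogus "node entry pointer" found on the heap
refcp=0x11  # frame #3: the address the VM then tried to decrement

# The difference is how far the counter sits from the start of the entry.
offset=$(printf '0x%x' $(( $refcp - $np )))
echo "refc field offset: $offset"
```

A non-zero, small difference like this is consistent with a field access through a garbage pointer rather than a wild jump.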

If you have your own linked-in drivers or NIF modules, I would recommend running the application with a debug-compiled emulator and hoping for an earlier, more informative crash.
OTP-Maintainer commented 5 years ago

silviu said:

Hello,

OK, will do this. Currently we are building the VM using:

OTP_RELEASE=21.1
wget https://github.com/erlang/otp/archive/OTP-$OTP_RELEASE.zip
unzip OTP-$OTP_RELEASE.zip
cd otp-OTP-$OTP_RELEASE
./otp_build autoconf
./configure --with-dynamic-trace=lttng
export MAKEFLAGS=-j8
make
sudo rm -rf /usr/local/lib/erlang
sudo make install

We use multiple NIF libraries, but none of them was updated, and we never hit this crash on 19.3. The only thing we changed was building the VM with lttng (the 19.3 build did not use it).

Which flags should we use for a debug compile?
OTP-Maintainer commented 5 years ago

sverker said:

Between make and install, run:

(cd erts/emulator && make debug)

Then start with:

erl -emu_type debug

If that fails with "erlexec: The emulator '/.../bin/beam.debug.smp' does not exist." (which it did for me), or if you just want to skip doing a new install, copy bin/<target>/beam.debug.smp and erl_child_setup.debug into the install directory, next to beam.smp and erl_child_setup.
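
The manual copy described above could be wrapped in a small helper, sketched here under assumptions. The function name copy_debug_emulator and both directory arguments are hypothetical; the thread does not spell out the real build or install paths, so they are left for the caller to supply.

```shell
# Copy the debug emulator binaries next to the installed optimized ones.
# Both directory arguments are placeholders (assumptions): pass whatever
# matches your own build tree and install location.
copy_debug_emulator() {
    src_dir=$1   # build-tree directory holding beam.debug.smp
    dest_dir=$2  # install directory that already holds beam.smp

    for f in beam.debug.smp erl_child_setup.debug; do
        if [ ! -f "$src_dir/$f" ]; then
            echo "missing $src_dir/$f (did 'make debug' run?)" >&2
            return 1
        fi
        # -p keeps the executable bit and timestamps of the built files.
        cp -p "$src_dir/$f" "$dest_dir/$f" || return 1
    done
}
```

It would be called as copy_debug_emulator "$BUILD_BIN" "$INSTALL_BIN", where both variables are set by hand, and it stops at the first missing file instead of leaving a half-copied pair behind.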

Note that the debug VM can be several times slower than the default optimized VM.
OTP-Maintainer commented 5 years ago

silviu said:

Thanks,

Running the VM in debug mode led us to discover a critical issue in one of our NIFs, the Cassandra driver:

https://github.com/silviucpp/erlcass/commit/c1a8305f0687bd8a7957078855230e66a21ddb4e

It seems debug mode has a lot of memory checks. I'm not yet sure whether this NIF bug is the cause of the crash, because that function runs only when the server starts, to create all prepared statements. Also, the bug has been there for 2 years, and we never had this kind of crash until now.

I'll keep you posted if anything triggers.

Silviu