ERL-1160: Segfault on shutdown in rbt_delete

OTP-Maintainer commented 4 years ago

Original reporter: essen Affected version: OTP-23.0 Component: erts Migrated from: https://bugs.erlang.org/browse/ERL-1160

This happens after some distribution communication and then making the nodes shutdown normally, when using a custom distribution driver that uses 'socket'. The node getting the segfault is the one that is shutdown last, the first node shutting down does not get a segfault. In details:

Node A starts
Node B starts and connects to A
Node A registers the shell process
Node B sends a few big binary messages to A
Node B q()
Node A q() and segfaults

This commit: https://github.com/erlang/otp/commit/9ac25564be7586b06f057b59a3ca61fa90551614

Program terminated with signal SIGSEGV, Segmentation fault.
#0  rbt_delete (root=0x7f2593c401b0, del=<optimized out>) at beam/erl_ao_firstfit_alloc.c:675
675         if (IS_BLACK(y->left) && IS_BLACK(y->right)) {
[Current thread is 1 (Thread 0x7f25909ff700 (LWP 1654684))]
(gdb) backtrace
#0  rbt_delete (root=0x7f2593c401b0, del=<optimized out>) at beam/erl_ao_firstfit_alloc.c:675
#1  0x00005569e09399e3 in aoff_unlink_free_block (allctr=<optimized out>, blk=<optimized out>)
    at beam/erl_ao_firstfit_alloc.c:569
#2  0x00005569e07e1406 in mbc_free (allctr=0x5569e219fbc0, type=<optimized out>, p=<optimized out>, 
    busy_pcrr_pp=0x7f25909feb20) at beam/erl_alloc_util.c:2778
#3  0x00005569e07e9643 in erts_alcu_free_thr_pref (type=26604, extra=<optimized out>, p=0x7f2593c415c8)
    at beam/erl_alloc_util.c:6225
#4  0x00005569e0791ccd in handle_thr_prgr_later_op (awdp=awdp@entry=0x7f2592c77a60, aux_work=aux_work@entry=32, 
    waiting=0) at beam/erl_process.c:2238
#5  0x00005569e079782a in handle_aux_work (awdp=0x7f2592c77a60, orig_aux_work=<optimized out>, waiting=0)
    at beam/erl_process.c:2620
#6  0x00005569e07a6098 in erts_schedule (esdp=0x7f2592c77940, p=<optimized out>, calls=<optimized out>)
    at beam/erl_process.c:9467
#7  0x00005569e0782f28 in process_main (x_reg_array=0x7f2593c401b0, f_reg_array=0x67d8) at beam/beam_emu.c:596
#8  0x00005569e078fb6c in sched_thread_func (vesdp=0x7f2592c77940) at beam/erl_process.c:8497
#9  0x00005569e09eed6a in thr_wrapper (vtwd=0x7ffe24bd22a0) at pthread/ethread.c:118
#10 0x00007f25d64124cf in start_thread () from /usr/lib/libpthread.so.0
#11 0x00007f25d63342d3 in clone () from /usr/lib/libc.so.6

OTP-Maintainer commented 4 years ago

essen said:

A different backtrace I've obtained when changing my code and the commands I run, this time sending a larger binary in the message and stopping Node A first. I'll attach the relevant core as beam.core.2.gz.

#0  rbt_insert (order=order@entry=FF_BF, root=root@entry=0x7f610d5801b0, blk=blk@entry=0x7f610d580758)
    at beam/erl_ao_firstfit_alloc.c:813
#1  0x00005594c9ec9218 in aoff_link_free_block (allctr=<optimized out>, block=0x7f610d580758)
    at beam/erl_ao_firstfit_alloc.c:772
#2  0x00005594c9d704f1 in mbc_free (allctr=0x5594caf559c0, type=<optimized out>, p=<optimized out>, 
    busy_pcrr_pp=<optimized out>) at beam/erl_alloc_util.c:2837
#3  0x00005594c9d78643 in erts_alcu_free_thr_pref (type=27372, extra=<optimized out>, p=0x7f610d580760)
    at beam/erl_alloc_util.c:6225
#4  0x00005594c9eb8331 in run_resource_dtor (vbin=0x7f610dd00d38) at beam/erl_nif.c:2529
#5  0x00005594c9d20866 in handle_misc_aux_work (awdp=awdp@entry=0x7f610c325c60, aux_work=aux_work@entry=2048, 
    waiting=0) at beam/erl_process.c:1852
#6  0x00005594c9d26b42 in handle_aux_work (awdp=0x7f610c325c60, orig_aux_work=<optimized out>, waiting=0)
    at beam/erl_process.c:2637
#7  0x00005594c9d35098 in erts_schedule (esdp=0x7f610c325b40, p=<optimized out>, calls=<optimized out>)
    at beam/erl_process.c:9467
#8  0x00005594c9d11f28 in process_main (x_reg_array=0x3, f_reg_array=0x7f610d5801b0) at beam/beam_emu.c:596
#9  0x00005594c9d1eb6c in sched_thread_func (vesdp=0x7f610c325b40) at beam/erl_process.c:8497
#10 0x00005594c9f7dd6a in thr_wrapper (vtwd=0x7ffc4dcb11f0) at pthread/ethread.c:118
#11 0x00007f614fb3a4cf in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007f614fa5c2d3 in clone () from /usr/lib/libc.so.6

OTP-Maintainer commented 4 years ago

sverker said:

Both of these are some kind of memory corruption bug. The call stacks doesn't seem to point out the guilty ones but rather where the already corrupted memory structures happened to be detected.

I think the best way forward is to try reproduce this with a debug instrumented beam.smp, in the hope of getting an earlier crash closer to where things went wrong.

Build with
{noformat}
cd $ERL_TOP/erts/emulator && make debug 
{noformat}
run with
{noformat}
$ERL_TOP/bin/erl -emu_type debug
{noformat}

OTP-Maintainer commented 4 years ago

essen said:

Gotcha. I will try to do it next week.

OTP-Maintainer commented 4 years ago

zinid said:

Hello, we have the same core, Erlang (22.1) is compiled with -ggdb, here is the output of the first frame:
{noformat}
#0  0x0000562f7b9f5de5 in rbt_delete (root=0x7f0e608001c0, del=0x7f0e60801588) at beam/erl_ao_firstfit_alloc.c:630
        spliced_is_black = 1
        x = 0x7f0e60801258
        y = 0x7f0e60801588
        z = 0x7f0e60801588
        null_x = {hdr = {bhdr = 139698106269696, u = {carrier = 0x2630, udata__ = "0"}}, parent = 0x0, left = 0x7f0de4a03df8, right = 0x0, flags = 3835739808, max_sz = 32525, u = {next = 0x7f0e560bbdf0, birth_time = 139699549879792}}
{noformat}
 The full output of `bt full` is attached as well.

[^dump_smpp_smpp_service_140220.txt]

Hope this will help. Let us know if this is not enough, we will recompile Erlang via `make debug` as you mentioned above.

OTP-Maintainer commented 4 years ago

sverker said:

[~zinid] This crash is even more obvious a case of finding already corrupted memory. The fact that it crashes in rbt_delete doesn't say much of the root cause, just that the memory allocation structures have been corrupted.

OTP-Maintainer commented 4 years ago

essen said:

On 9ac25564be7586b06f057b59a3ca61fa90551614 I get this (see also the attached beam.debug.smp and core file number 3):

beam/erl_nif.c:604:enif_clear_env() Assertion failed: p == menv->env.proc

#0  0x00007f369749ef25 in raise () from /usr/lib/libc.so.6
#1  0x00007f3697488897 in abort () from /usr/lib/libc.so.6
#2  0x000056041cbcc526 in erl_assert_error (expr=0x56041cd882e1 "p == menv->env.proc", 
    func=0x56041cd8a528 <__func__.25306> "enif_clear_env", file=0x56041cd87e46 "beam/erl_nif.c", line=604)
    at sys/unix/sys.c:954
#3  0x000056041cb8b334 in enif_clear_env (env=0x7f3655140be8) at beam/erl_nif.c:604
#4  0x000056041cb8b24e in enif_free_env (env=0x7f3655140be8) at beam/erl_nif.c:576
#5  0x000056041cc3c04c in esock_free_env (slogan=0x56041cda133f "esoc_stop - meta-env", env=0x7f3655140be8)
    at nifs/common/socket_nif.c:3229
#6  0x000056041cc5726b in esock_stop (env=0x7f364edbb500, obj=0x7f36558846f8, fd=19, is_direct_call=1)
    at nifs/common/socket_nif.c:20682
#7  0x000056041cb90636 in erts_resource_stop (resource=0x7f36558846e0, e=19, is_direct_call=1) at beam/erl_nif.c:2548
#8  0x000056041cbd87df in enif_select_x (env=0x7f364edbbaf0, e=19, mode=ERL_NIF_SELECT_STOP, obj=0x7f36558846f8, 
    pid=0x0, msg=907, msg_env=0x0) at sys/common/erl_check_io.c:1265
#9  0x000056041cbd7ee8 in enif_select (env=0x7f364edbbaf0, e=19, mode=ERL_NIF_SELECT_STOP, obj=0x7f36558846f8, 
    pid=0x0, msg=907) at sys/common/erl_check_io.c:1042
#10 0x000056041cc557e7 in esock_select_stop (env=0x7f364edbbaf0, event=19, obj=0x7f36558846f8)
    at nifs/common/socket_nif.c:19912
#11 0x000056041cc577a7 in esock_down (env=0x7f364edbbaf0, obj=0x7f36558846f8, pid=0x7f364edbba90, mon=0x7f364edbbad0)
    at nifs/common/socket_nif.c:20847
#12 0x000056041cb90a91 in erts_fire_nif_monitor (tmon=0x7f3656b40480) at beam/erl_nif.c:2620
#13 0x000056041c9728f4 in erts_proc_exit_handle_monitor (mon=0x7f3656b40480, vctxt=0x7f3656f00240, reds=98)
    at beam/erl_process.c:12904
#14 0x000056041cb56d83 in ml_dl_list_foreach_delete_yielding (list=0x7f3656f00230, 
    func=0x56041c972734 <erts_proc_exit_handle_monitor>, arg=0x7f3656f00240, vyspp=0x7f3656f00288, reds=98)
    at beam/erl_monitor_link.c:511
#15 0x000056041cb57766 in erts_monitor_list_foreach_delete_yielding (list=0x7f3656f00230, 
    func=0x56041c972734 <erts_proc_exit_handle_monitor>, arg=0x7f3656f00240, vyspp=0x7f3656f00288, limit=99)
    at beam/erl_monitor_link.c:782
#16 0x000056041c97498a in erts_continue_exit_process (p=0x7f3654d836f0) at beam/erl_process.c:13574
#17 0x000056041c9440fc in process_main (x_reg_array=0x7f3655e00200, f_reg_array=0x7f3655e02280) at beam/beam_emu.c:785
#18 0x000056041c9656be in sched_thread_func (vesdp=0x7f364fc05f80) at beam/erl_process.c:8497
#19 0x000056041ccb7fae in thr_wrapper (vtwd=0x7ffdead63900) at pthread/ethread.c:118
#20 0x00007f36976404cf in start_thread () from /usr/lib/libpthread.so.0
#21 0x00007f36975622d3 in clone () from /usr/lib/libc.so.6

I also tried a few times on current master but failed to reproduce the crash there, so maybe the issue has been fixed.

OTP-Maintainer commented 4 years ago

bmk said:

Hi,

On Feb 7 a big commit to maint was made, that fixed a bunch of issues,
race and others, with regard to close/stop/down in the socket nif.

We *think* this is one of the issues solved by this commit, but if your 
problems persist, please report again.

Regards,
     /BMK

OTP-Maintainer commented 4 years ago

zinid said:

Hello, Micael

Could you please clarify which branch should we try? I see `maint` and `maint-22` in the repo.

OTP-Maintainer commented 4 years ago

zinid said:

Okay, found on maint 129361a04639482957db8082f658a7f8626a536e

Seems like this is the huge commit in quesstion.

OTP-Maintainer commented 4 years ago

bmk said:

Hi,

Yes, its on maint and yes its big. Lots of fixes.

/BMK

OTP-Maintainer commented 4 years ago

ingela said:

Please open a new ticket if the problem persists with newer versions. As Micael has suggested there is a fair chance that the problem might have been fixed.

erlang / otp

ERL-1160: Segfault on shutdown in rbt_delete #4241