Closed OTP-Maintainer closed 3 years ago
essen
said:
A different backtrace I've obtained when changing my code and the commands I run, this time sending a larger binary in the message and stopping Node A first. I'll attach the relevant core as beam.core.2.gz.
#0 rbt_insert (order=order@entry=FF_BF, root=root@entry=0x7f610d5801b0, blk=blk@entry=0x7f610d580758)
at beam/erl_ao_firstfit_alloc.c:813
#1 0x00005594c9ec9218 in aoff_link_free_block (allctr=<optimized out>, block=0x7f610d580758)
at beam/erl_ao_firstfit_alloc.c:772
#2 0x00005594c9d704f1 in mbc_free (allctr=0x5594caf559c0, type=<optimized out>, p=<optimized out>,
busy_pcrr_pp=<optimized out>) at beam/erl_alloc_util.c:2837
#3 0x00005594c9d78643 in erts_alcu_free_thr_pref (type=27372, extra=<optimized out>, p=0x7f610d580760)
at beam/erl_alloc_util.c:6225
#4 0x00005594c9eb8331 in run_resource_dtor (vbin=0x7f610dd00d38) at beam/erl_nif.c:2529
#5 0x00005594c9d20866 in handle_misc_aux_work (awdp=awdp@entry=0x7f610c325c60, aux_work=aux_work@entry=2048,
waiting=0) at beam/erl_process.c:1852
#6 0x00005594c9d26b42 in handle_aux_work (awdp=0x7f610c325c60, orig_aux_work=<optimized out>, waiting=0)
at beam/erl_process.c:2637
#7 0x00005594c9d35098 in erts_schedule (esdp=0x7f610c325b40, p=<optimized out>, calls=<optimized out>)
at beam/erl_process.c:9467
#8 0x00005594c9d11f28 in process_main (x_reg_array=0x3, f_reg_array=0x7f610d5801b0) at beam/beam_emu.c:596
#9 0x00005594c9d1eb6c in sched_thread_func (vesdp=0x7f610c325b40) at beam/erl_process.c:8497
#10 0x00005594c9f7dd6a in thr_wrapper (vtwd=0x7ffc4dcb11f0) at pthread/ethread.c:118
#11 0x00007f614fb3a4cf in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007f614fa5c2d3 in clone () from /usr/lib/libc.so.6
sverker
said:
Both of these are some kind of memory corruption bug. The call stacks doesn't seem to point out the guilty ones but rather where the already corrupted memory structures happened to be detected.
I think the best way forward is to try reproduce this with a debug instrumented beam.smp, in the hope of getting an earlier crash closer to where things went wrong.
Build with
{noformat}
cd $ERL_TOP/erts/emulator && make debug
{noformat}
run with
{noformat}
$ERL_TOP/bin/erl -emu_type debug
{noformat}
essen
said:
Gotcha. I will try to do it next week.
zinid
said:
Hello, we have the same core, Erlang (22.1) is compiled with -ggdb, here is the output of the first frame:
{noformat}
#0 0x0000562f7b9f5de5 in rbt_delete (root=0x7f0e608001c0, del=0x7f0e60801588) at beam/erl_ao_firstfit_alloc.c:630
spliced_is_black = 1
x = 0x7f0e60801258
y = 0x7f0e60801588
z = 0x7f0e60801588
null_x = {hdr = {bhdr = 139698106269696, u = {carrier = 0x2630, udata__ = "0"}}, parent = 0x0, left = 0x7f0de4a03df8, right = 0x0, flags = 3835739808, max_sz = 32525, u = {next = 0x7f0e560bbdf0, birth_time = 139699549879792}}
{noformat}
The full output of `bt full` is attached as well.
[^dump_smpp_smpp_service_140220.txt]
Hope this will help. Let us know if this is not enough, we will recompile Erlang via `make debug` as you mentioned above.
sverker
said:
[~zinid] This crash is even more obvious a case of finding already corrupted memory. The fact that it crashes in rbt_delete doesn't say much of the root cause, just that the memory allocation structures have been corrupted.
essen
said:
On 9ac25564be7586b06f057b59a3ca61fa90551614 I get this (see also the attached beam.debug.smp and core file number 3):
beam/erl_nif.c:604:enif_clear_env() Assertion failed: p == menv->env.proc
#0 0x00007f369749ef25 in raise () from /usr/lib/libc.so.6
#1 0x00007f3697488897 in abort () from /usr/lib/libc.so.6
#2 0x000056041cbcc526 in erl_assert_error (expr=0x56041cd882e1 "p == menv->env.proc",
func=0x56041cd8a528 <__func__.25306> "enif_clear_env", file=0x56041cd87e46 "beam/erl_nif.c", line=604)
at sys/unix/sys.c:954
#3 0x000056041cb8b334 in enif_clear_env (env=0x7f3655140be8) at beam/erl_nif.c:604
#4 0x000056041cb8b24e in enif_free_env (env=0x7f3655140be8) at beam/erl_nif.c:576
#5 0x000056041cc3c04c in esock_free_env (slogan=0x56041cda133f "esoc_stop - meta-env", env=0x7f3655140be8)
at nifs/common/socket_nif.c:3229
#6 0x000056041cc5726b in esock_stop (env=0x7f364edbb500, obj=0x7f36558846f8, fd=19, is_direct_call=1)
at nifs/common/socket_nif.c:20682
#7 0x000056041cb90636 in erts_resource_stop (resource=0x7f36558846e0, e=19, is_direct_call=1) at beam/erl_nif.c:2548
#8 0x000056041cbd87df in enif_select_x (env=0x7f364edbbaf0, e=19, mode=ERL_NIF_SELECT_STOP, obj=0x7f36558846f8,
pid=0x0, msg=907, msg_env=0x0) at sys/common/erl_check_io.c:1265
#9 0x000056041cbd7ee8 in enif_select (env=0x7f364edbbaf0, e=19, mode=ERL_NIF_SELECT_STOP, obj=0x7f36558846f8,
pid=0x0, msg=907) at sys/common/erl_check_io.c:1042
#10 0x000056041cc557e7 in esock_select_stop (env=0x7f364edbbaf0, event=19, obj=0x7f36558846f8)
at nifs/common/socket_nif.c:19912
#11 0x000056041cc577a7 in esock_down (env=0x7f364edbbaf0, obj=0x7f36558846f8, pid=0x7f364edbba90, mon=0x7f364edbbad0)
at nifs/common/socket_nif.c:20847
#12 0x000056041cb90a91 in erts_fire_nif_monitor (tmon=0x7f3656b40480) at beam/erl_nif.c:2620
#13 0x000056041c9728f4 in erts_proc_exit_handle_monitor (mon=0x7f3656b40480, vctxt=0x7f3656f00240, reds=98)
at beam/erl_process.c:12904
#14 0x000056041cb56d83 in ml_dl_list_foreach_delete_yielding (list=0x7f3656f00230,
func=0x56041c972734 <erts_proc_exit_handle_monitor>, arg=0x7f3656f00240, vyspp=0x7f3656f00288, reds=98)
at beam/erl_monitor_link.c:511
#15 0x000056041cb57766 in erts_monitor_list_foreach_delete_yielding (list=0x7f3656f00230,
func=0x56041c972734 <erts_proc_exit_handle_monitor>, arg=0x7f3656f00240, vyspp=0x7f3656f00288, limit=99)
at beam/erl_monitor_link.c:782
#16 0x000056041c97498a in erts_continue_exit_process (p=0x7f3654d836f0) at beam/erl_process.c:13574
#17 0x000056041c9440fc in process_main (x_reg_array=0x7f3655e00200, f_reg_array=0x7f3655e02280) at beam/beam_emu.c:785
#18 0x000056041c9656be in sched_thread_func (vesdp=0x7f364fc05f80) at beam/erl_process.c:8497
#19 0x000056041ccb7fae in thr_wrapper (vtwd=0x7ffdead63900) at pthread/ethread.c:118
#20 0x00007f36976404cf in start_thread () from /usr/lib/libpthread.so.0
#21 0x00007f36975622d3 in clone () from /usr/lib/libc.so.6
I also tried a few times on current master but failed to reproduce the crash there, so maybe the issue has been fixed.
bmk
said:
Hi,
On Feb 7 a big commit to maint was made, that fixed a bunch of issues,
race and others, with regard to close/stop/down in the socket nif.
We *think* this is one of the issues solved by this commit, but if your
problems persist, please report again.
Regards,
/BMK
zinid
said:
Hello, Micael
Could you please clarify which branch should we try? I see `maint` and `maint-22` in the repo.
zinid
said:
Okay, found on maint 129361a04639482957db8082f658a7f8626a536e
Seems like this is the huge commit in quesstion.
bmk
said:
Hi,
Yes, its on maint and yes its big. Lots of fixes.
/BMK
ingela
said:
Please open a new ticket if the problem persists with newer versions. As Micael has suggested there is a fair chance that the problem might have been fixed.
Original reporter:
essen
Affected version:OTP-23.0
Component:erts
Migrated from: https://bugs.erlang.org/browse/ERL-1160