erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.19k stars 2.93k forks source link

ERL-1044: Segmentation fault occur regularly in test #4166

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 4 years ago

Original reporter: siri Affected versions: OTP-22.1.3, OTP-21.1.1 Fixed in versions: OTP-22.1.7, OTP-21.3.8.11 Component: Not Specified Migrated from: https://bugs.erlang.org/browse/ERL-1044


Dear friends!

We got a segfault and a core dump from OTP-21.1.1 running on Red Hat RHEL 6.  We wonder if it's a known bug, and if it possibly has been corrected in a later release?

The problem occurs regularly in our test environment, and we have several core files, size 200-800 MB - thus to big to attach here. Below are the backtraces from two of them. Please let me know how to proceed or who to send the file(s) to. 

Thanks!

/siri

{noformat}
Core was generated by `/usr/lib64/erlang/erts-10.1.1/bin/beam.smp -K true -W w -Bc -- -root /usr/lib64'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000601412 in rbt_delete ()
[Current thread is 1 (process 3813)]
(gdb) bt
#0  0x0000000000601412 in rbt_delete ()
#1  0x0000000000601ee2 in aoff_get_free_block ()
#2  0x00000000004d8632 in mbc_alloc ()
#3  0x00000000004dc9f8 in erts_alcu_alloc_thr_pref ()
#4  0x00000000005cf23e in erts_monitor_create ()
#5  0x0000000000579902 in erts_net_message ()
#6  0x0000000000538919 in driver_output2 ()
#7  0x0000000000649764 in tcp_reply_data ()
#8  0x0000000000649c2a in tcp_deliver ()
#9  0x000000000064a1a4 in tcp_recv ()
#10 0x000000000064aeaf in tcp_inet_input ()
#11 0x000000000055709b in erts_port_task_execute ()
#12 0x00000000004a530f in erts_schedule ()
#13 0x0000000000489891 in process_main ()
#14 0x0000000000497f81 in sched_thread_func ()
#15 0x00000000008121fa in thr_wrapper ()
#16 0x00007fd2f1a35aa1 in ?? ()
#17 0x00007fd2a8a7d700 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb)
{noformat}

{noformat}
Core was generated by `/usr/lib64/erlang/erts-10.1.1/bin/beam.smp -K true -W w -Bc -- -root /usr/lib64'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000601481 in rbt_delete ()
[Current thread is 1 (process 1124)]
(gdb) bt
#0  0x0000000000601481 in rbt_delete ()
#1  0x0000000000601ee2 in aoff_get_free_block ()
#2  0x00000000004d8632 in mbc_alloc ()
#3  0x00000000004dc9f8 in erts_alcu_alloc_thr_pref ()
#4  0x00000000005e5939 in erts_erts_pcre_malloc ()
#5  0x0000000000809244 in erts_pcre_exec ()
#6  0x00000000005e6fb3 in re_run ()
#7  0x000000000048a547 in process_main ()
#8  0x0000000000497f81 in sched_thread_func ()
#9  0x00000000008121fa in thr_wrapper ()
#10 0x00007f624d18caa1 in ?? ()
#11 0x00007f61ffefd700 in ?? ()
#12 0x0000000000000000 in ?? ()
(gdb)
{noformat}
OTP-Maintainer commented 4 years ago

lukas said:

Hello, both crashes are faults in the memory allocators which means that somewhere someone mismanaged memory. I don't recognize any of these crashes.

Do you run any nifs and or linked-in driver? If you do, then most likely there is something wrong one of those.
OTP-Maintainer commented 4 years ago

siri said:

I've looked through all deps (a few open source things like lager, cowboy, goldrush, ranch and a few internal) but I can't see anything but erlang code :( :)
OTP-Maintainer commented 4 years ago

sverker said:

I think the next step on this matter is to try reproduce same kind of crash on latest OTP-21 (currently OTP-21.3.8.7).

That would be a better motivation to spend some serious post-mortem detective work on a core dump. :( :)
OTP-Maintainer commented 4 years ago

siri said:

Hi Sverker/VM - I finally got the same up and running with 22.1.3. Sadly, we still see segfaults :( 

Examples:

{code:java}
Core was generated by `/usr/lib/erlang/erts-10.5.2/bin/beam.smp -K true -W w -Bc -- -root /usr/lib/erl'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000005d4e18 in ?? ()
[Current thread is 1 (process 25080)]
(gdb) bt
#0  0x00000000005d4e18 in ?? ()
#1  0x00000000005d593f in ?? ()
#2  0x00000000004a4f02 in ?? ()
#3  0x00000000004a8428 in erts_alcu_alloc_thr_pref ()
#4  0x00000000005a1f6e in erts_monitor_create ()
#5  0x0000000000547afc in erts_net_message ()
#6  0x000000000050a292 in driver_output2 ()
#7  0x000000000062a00a in ?? ()
#8  0x000000000062cd0e in ?? ()
#9  0x000000000062d2a4 in ?? ()
#10 0x000000000062d65f in ?? ()
#11 0x0000000000528953 in erts_port_task_execute ()
#12 0x0000000000471bde in erts_schedule ()
#13 0x0000000000453b61 in process_main ()
#14 0x0000000000462944 in ?? ()
#15 0x00000000006879ea in ?? ()
#16 0x00007fde5ccf7aa1 in ?? ()
#17 0x00007fde0fdff700 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb)
{code}

{code:java}
Core was generated by `/usr/lib/erlang/erts-10.5.2/bin/beam.smp -K true -W w -Bc -- -root /usr/lib/erl'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000005d3cf9 in ?? ()
[Current thread is 1 (process 25946)]
(gdb) bt
#0  0x00000000005d3cf9 in ?? ()
#1  0x00000000005d44a0 in ?? ()
#2  0x00000000005d4b4f in ?? ()
#3  0x00000000004a2136 in ?? ()
#4  0x00000000004a5317 in ?? ()
#5  0x00000000004a7cf1 in ?? ()
#6  0x00000000004a860c in erts_alcu_check_delayed_dealloc ()
#7  0x000000000048f90b in erts_alloc_scheduler_handle_delayed_dealloc ()
#8  0x000000000046b856 in ?? ()
#9  0x000000000046c745 in ?? ()
#10 0x0000000000471587 in erts_schedule ()
#11 0x0000000000453b61 in process_main ()
#12 0x0000000000462944 in ?? ()
#13 0x00000000006879ea in ?? ()
#14 0x00007f8e80dccaa1 in ?? ()
#15 0x00007f8e33dff700 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb)
{code}
OTP-Maintainer commented 4 years ago

sverker said:

Ways forward:
* Share core dumps from these crashes with us. Either just attach here or more privately of they might contain sensitive information.
* Try reproduce with a debug built VM and hope for an earlier more informative crash (dump). Build with {{ERL_TOP=<otp-source-dir> && cd $ERL_TOP/erts/emulator && make debug}} and then you can do a quick and dirty install by copy $ERL_TOP/bin/*/beam.debug.smp over the installed beam.smp and copy erl_child_setup.debug as-is to same directory.
OTP-Maintainer commented 4 years ago

siri said:

Hi, I sent a mail to John earlier today - now I sent it to you as well, Sverker.

Will try to build a debug VM and let you know how it goes.
OTP-Maintainer commented 4 years ago

sverker said:

We need the beam.smp file as well, or point us to exactly which download you used.
OTP-Maintainer commented 4 years ago

siri said:

It's 22.1.3, downloaded from erlang solutions. For centos6.

[^beam.smp]
OTP-Maintainer commented 4 years ago

sverker said:

Do you have more core dumps from 22.1.3?

Erlang Solutions seems to strip their beam.smp files from debug info, which made things more difficult. But after some successful fumbling in the dark I managed to draw some conclusions.
OTP-Maintainer commented 4 years ago

siri said:

Yes, I will send you a couple by mail. I also have a debug compiled beam.smp running now, and waiting for it to core. It has been running since yesterday evening but still nothing - which is a bit disturbing since the "normal" beam goes down at least once during the night. But I'll keep it running and hope for the "best"....
OTP-Maintainer commented 4 years ago

siri said:

Attached [^beam.debug.smp] - used for core.3863 and core.15635 which are sent by mail to Sverker.
OTP-Maintainer commented 4 years ago

siri said:

And here is a snippet from erlang.log.*

{noformat}
===== ALIVE Fri Oct 25 00:14:51 CEST 2019

===== Fri Oct 25 00:20:18 CEST 2019
erts_refc_read(): Bad refc found (refc=0 < 1)!
heart: Fri Oct 25 00:20:18 2019: Erlang has closed.^M
heart: Fri Oct 25 00:20:18 2019: Wait 5 seconds for Erlang to terminate nicely^M
heart: Fri Oct 25 00:20:23 2019: Erlang still alive, kill it^M

=====
===== LOGGING STARTED Fri Oct 25 00:20:25 CEST 2019
=====
heart_beat_kill_pid = 15635
Erlang/OTP 22 [erts-10.5.2] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe] [type-assertions] [debug-compiled] [lock-c
hecking]

Eshell V10.5.2  (abort with ^G)
(cdb@service-cdb1.ci.nordnet.se)1> heart: Fri Oct 25 00:20:27 2019: Executed "/opt/service-cdb/bin/ctl start" -> 0. Terminating.^M

===== ALIVE Fri Oct 25 00:35:27 CEST 2019

===== ALIVE Fri Oct 25 00:50:27 CEST 2019
Performing operations on uninitialized lock!
Currently no locks are locked by the scheduler 1 thread.
heart: Fri Oct 25 00:53:16 2019: Erlang has closed.
heart: Fri Oct 25 00:53:16 2019: Wait 5 seconds for Erlang to terminate nicely
heart: Fri Oct 25 00:53:21 2019: Erlang still alive, kill it

=====
===== LOGGING STARTED Fri Oct 25 00:53:23 CEST 2019
=====
heart_beat_kill_pid = 16540
Erlang/OTP 22 [erts-10.5.2] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe] [type-assertions] [debug-compiled] [lock-c
hecking]

Eshell V10.5.2  (abort with ^G)
(cdb@service-cdb1.ci.nordnet.se)1> heart: Fri Oct 25 00:53:26 2019: Executed "/opt/service-cdb/bin/ctl start" -> 0. Terminating.
{noformat}
OTP-Maintainer commented 4 years ago

sverker said:

I think we found it. Proposed fix:
https://github.com/erlang/otp/pull/2432

Would be nice if you could give it a spin.
OTP-Maintainer commented 4 years ago

siri said:

Great! I'll do a rebuild. If I add the diff on top of 22.1.3, can I just replace the beam.smp again - as I did with the debug build?
OTP-Maintainer commented 4 years ago

sverker said:

Yes, that will work.
OTP-Maintainer commented 4 years ago

siri said:

Hi,

the fix has been running over the weekend and it's the only one of four nodes that did not have at least one core file this morning. I will leave it running for a couple of more days before finally concluding, but it definitely seems promising. :)
OTP-Maintainer commented 4 years ago

siri said:

Still running nicely - no cores since mid day Friday  :) I've now deployed this to all our 8 test machines and will leave it there for a while.

Could you please say something about what would trigger the fault that you've fixed? Just curious, since we have a few other systems still running on 21.1.3  where we haven't seen the crash but wondering if it is a potential problem.
OTP-Maintainer commented 4 years ago

sverker said:

The necessary scenario to trigger the fault is
 # a connection toward a remote node has to be established  (or at least attempted)
 # then disconnected
 # and then a connection attempt made again to the same node name

and the reconnect attempt has to happen at the same time as some process, holding the the last reference to distribution entry of the remote node, is either garbage collected or terminated.

I can't point to any specific reason why there should be a lower risk of happening on 21.1.3, but there are a lot of changes between 21 and 22 and the race for this to happen is quite narrow.

 
OTP-Maintainer commented 4 years ago

sverker said:

Fix released in 22.1.7.

 
OTP-Maintainer commented 4 years ago

sverker said:

and also fixed in 21.3.8.11.