Erlang/OTP
http://erlang.org
Apache License 2.0

ERL-827: repeated segfault & coredump during block deallocations #3964

Closed: OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 5 years ago

Original reporter: dch
Affected version: OTP-21.2
Fixed in version: OTP-21.2.4
Component: erts
Migrated from: https://bugs.erlang.org/browse/ERL-827


This happens repeatedly, roughly every 245 minutes, on OTP 21.2 on FreeBSD 12.0-RELEASE-p1 amd64. [further details|https://hackmd.io/elgRy4IWSR-FhViJDXHbDA?view]

{code:java}
(lldb) thr backtrace
* thread #1, name = 'beam.smp', stop reason = signal SIGSEGV
  * frame #0: 0x0000000000597db5 beam.smp`aoff_unlink_free_block + 37
    frame #1: 0x00000000003db0b6 beam.smp`mbc_free + 486
    frame #2: 0x00000000003dae7e beam.smp`dealloc_block + 558
    frame #3: 0x00000000003cfe81 beam.smp`handle_delayed_dealloc + 913
    frame #4: 0x00000000003cfae3 beam.smp`erts_alcu_check_delayed_dealloc + 35
    frame #5: 0x00000000003c5724 beam.smp`erts_alloc_scheduler_handle_delayed_dealloc + 404
    frame #6: 0x00000000003978e4 beam.smp`handle_aux_work + 964
    frame #7: 0x000000000039206e beam.smp`$dtrace8406820.erts_schedule + 12030
    frame #8: 0x000000000037044d beam.smp`$dtrace8406819.process_main + 221
    frame #9: 0x000000000038ce2f beam.smp`sched_thread_func + 415
    frame #10: 0x00000000006251bc beam.smp`thr_wrapper + 156
    frame #11: 0x0000000800755776 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 326

{code}

Another example in the allocator:

{code:java}
(lldb) thr backtrace
* thread #1, name = 'beam.smp', stop reason = signal SIGSEGV
  * frame #0: 0x00000000005982b0 beam.smp`rbt_insert + 64
    frame #1: 0x0000000000597d3a beam.smp`aoff_link_free_block + 42
    frame #2: 0x00000000003dc514 beam.smp`mbc_alloc + 324
    frame #3: 0x00000000003d4898 beam.smp`erts_alcu_alloc_thr_pref + 280
    frame #4: 0x00000000005dfbf6 beam.smp`tcp_inet_ctl + 6454
    frame #5: 0x0000000000435d73 beam.smp`call_driver_control + 963
    frame #6: 0x00000000004354e6 beam.smp`erts_port_control + 902
    frame #7: 0x00000000004c2edf beam.smp`erts_internal_port_control_3 + 159
    frame #8: 0x000000000037140c beam.smp`$dtrace8406819.process_main + 4252
    frame #9: 0x000000000038ce2f beam.smp`sched_thread_func + 415
    frame #10: 0x00000000006251bc beam.smp`thr_wrapper + 156
    frame #11: 0x0000000800755776 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 +
{code}

I will update to the latest OTP 21.x tag to see if that helps; there also seem to be a few relevant changes in maint.

Suggestions on any further options are welcome.

OTP-Maintainer commented 5 years ago

lukas said:

Is this a new fault in 21.2, or does the same thing happen in 21.1.x?

OTP-Maintainer commented 5 years ago

dch said:

I moved from 20.something directly to 21.2, so I can't really confirm. If it's helpful, I can re-test after checking whether the latest OTP release still has the issue.

OTP-Maintainer commented 5 years ago

dch said:

Lukas, this still seems to occur on OTP 21.2.2 :-( I'll run this over the weekend for a while and report back if the coredumps are any different from what's already posted. BTW, those are available privately if helpful.

OTP-Maintainer commented 5 years ago

lukas said:

Yes, the next step will be to look at the core files. This is some sort of memory corruption fault. Something is doing a double free or buffer overflow which causes the allocators to become very sad.
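To illustrate why a double free tends to crash later, inside allocator code, rather than at the faulty call, here is a minimal sketch in C. It is a toy fixed-size pool with an intrusive free list and is not the ERTS allocator; all names (`pool_alloc`, `pool_free`, `double_free_aliases`) are hypothetical. The point is only that freeing the same block twice corrupts the free list, so two later allocations receive the same memory.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 4

/* A free block stores the free-list link inside its own storage,
 * as many real allocators do. (Toy layout, assumed for illustration.) */
typedef union block {
    union block *next_free;
    char payload[32];
} block_t;

static block_t pool[POOL_BLOCKS];
static block_t *free_head;

/* Link every block into a singly linked LIFO free list. */
static void pool_init(void) {
    free_head = NULL;
    for (int i = POOL_BLOCKS - 1; i >= 0; i--) {
        pool[i].next_free = free_head;
        free_head = &pool[i];
    }
}

/* Pop the first free block, or return NULL if the pool is empty. */
static void *pool_alloc(void) {
    block_t *b = free_head;
    if (b != NULL) {
        free_head = b->next_free;
    }
    return b;
}

/* Push a block back onto the free list. There is no double-free check. */
static void pool_free(void *p) {
    block_t *b = p;
    b->next_free = free_head;
    free_head = b;
}

/* Returns 1 if a double free makes two later allocations alias. */
int double_free_aliases(void) {
    pool_init();
    void *a = pool_alloc();
    assert(a != NULL);
    pool_free(a);
    pool_free(a);           /* bug: the block now links to itself */
    void *x = pool_alloc(); /* returns the block */
    void *y = pool_alloc(); /* returns the same block again */
    return x == y;
}
```

In this sketch the second `pool_free` makes the block its own successor, so the two following allocations alias. In a real allocator with trees or doubly linked lists, the same kind of corruption typically shows up as a crash in link or unlink routines, far from the code that caused it.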

I noticed that one of the core files segfaulted in this nif: https://github.com/apache/couchdb-khash/blob/master/c_src/hash.c

Please make very, very sure that there is no problem in that nif.

OTP-Maintainer commented 5 years ago

davisp said:

@lukas While I've not done a formal proof on that NIF, I've not seen it cause segfaults in years of abuse on multiple VM versions, so I'd be fairly surprised if it were causing the issue. Granted, there could always be some change in undefined behavior for 21.x that it was relying on, but it's not doing anything fancy.

OTP-Maintainer commented 5 years ago

lukas said:

In OTP-21 we added some extra statistics to memory allocation as described here: http://blog.erlang.org/Memory-instrumentation-in-OTP-21/. That could very well expose bugs in nifs that would have gone undetected before.

If you could send me (lukas@erlang.org) links to the core + beam.smp executable I can take a look and see if something obvious pops out.

OTP-Maintainer commented 5 years ago

dch said:

erts is rebuilt with -g flags; waiting on more logs & updates via email.

OTP-Maintainer commented 5 years ago

lukas said:

Fault most likely found. Anyone who has encountered the same issue can try this patch: https://github.com/garazdawi/otp/tree/lukas/erts/fix_inet_multitimer_cleanup/OTP-15536

OTP-Maintainer commented 5 years ago

dch said:

Sweet, sweet patch. This has been running without issue (after rebasing off OTP-21.2.3 as well) for 5 days. Thanks Lukas!