Open dudalev opened 1 day ago
I was able to produce similar crashdump backtraces (stuck in crypto:supports/0) with this snippet (and -heart added to vm opts)
1> application:ensure_all_started(gun), lists:foreach(fun(_) -> spawn_link(fun() ->
TLSOpts = [{verify, verify_none}, {cacerts, public_key:cacerts_get()}],
WSOpts = #{ keepalive => 3000, silence_pings => false},
{ok, ConnPid} = gun:open("echo.websocket.org", 443, #{tls_opts => TLSOpts, ws_opts => WSOpts}),
{ok, _} = gun:await_up(ConnPid), StreamRef = gun:ws_upgrade(ConnPid, "/"),
{upgrade, [<<"websocket">>], _} = gun:await(ConnPid, StreamRef, 3000),
gun:ws_send(ConnPid, StreamRef, {text, <<"BOOM">>}),
gun:await(ConnPid, StreamRef, 5000)
end) end, lists:seq(1, 100000)).
heart: Tue Oct 22 15:16:49 2024: heart-beat time-out, no activity for 65 seconds
heart: Tue Oct 22 15:16:49 2024: Would reboot. Terminating.
Kernel pid terminated (heart) ({port_terminated,{heart,loop,[<0.0.0>,{state,#Port<0.1>,[],[],undefined}]}})
Crash dump is being written to: erl_crash.dump...done
erl crash dump for the snippet above is a bit chonky to attach here - https://www.dropbox.com/scl/fi/obguj5brf6f5rb2iy8gmg/erl_crash.dump.gz?rlkey=6zrb9ykek68ypmsjo7hgasbn8&st=wm4g1ke5&dl=0
The back back trace points to a C function inside OpenSSL. A back trace from a run where you have debug symbols for libcrypto.so.3
would probably provide a clue on why it blocks.
not yet able to get the same bt in gdb again when using the snippet (heart still getting triggered), but seeing a similar one with curve_algorithms nif getting stuck
#0 futex_wait (private=0, expected=2, futex_word=0x748805e9a9d0) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x748805e9a9d0, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007488780a00f1 in lll_mutex_lock_optimized (mutex=0x748805e9a9d0) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x748805e9a9d0) at ./nptl/pthread_mutex_lock.c:93
#4 0x000063c0c4fd0702 in erl_drv_mutex_lock ()
#5 0x000074882afaad7c in curve_algorithms () from /home/qwe/zadrot1/lib/crypto-5.4.2.2/priv/lib/crypto.so
#6 0x000063c0c4e1c6cd in beam_jit_call_nif(process*, void const*, unsigned long*, unsigned long (*)(enif_environment_t*, int, unsigned long*), erl_module_nif*) ()
#7 0x0000748833600757 in ?? ()
#8 0x0000000000000000 in ?? ()
(and it may be a red herring since some threads are paused by gdb)
from the first backtrace, the offset of the #0 frame address relateive to EVP_CipherInit_ex (#1 address) is 0x00007a82761f8ed1 - 0x00007a82761f8ed1 = -2610 so it should be somewhere around here (EVP_CIPHER_CTX_reset ??)
(gdb) info address EVP_CipherInit_ex
Symbol "EVP_CipherInit_ex" is at 0x74875d3f98f0 in a file compiled without debugging.
(gdb) x/20i 0x74875d3f98f0-2610 - 10
0x74875d3f8eb4: movl $0x0,0x78(%rbp)
0x74875d3f8ebb: je 0x74875d3f8ed1
0x74875d3f8ebd: mov 0x70(%r13),%rbx
0x74875d3f8ec1: mov %r13,%rdi
0x74875d3f8ec4: call 0x74875d3f8680 <EVP_CIPHER_CTX_reset>
0x74875d3f8ec9: mov %r14d,0x10(%r13)
0x74875d3f8ecd: mov %rbx,0x70(%r13)
0x74875d3f8ed1: mov %r15,%rdx
0x74875d3f8ed4: cmpq $0x0,0x78(%rdx)
0x74875d3f8ed9: je 0x74875d3f9310
0x74875d3f8edf: cmp %rdx,0xb0(%r13)
0x74875d3f8ee6: je 0x74875d3f8f27
0x74875d3f8ee8: mov %rdx,%rdi
0x74875d3f8eeb: mov %rdx,-0xc8(%rbp)
0x74875d3f8ef2: call 0x74875d3f85b0 <EVP_CIPHER_up_ref>
0x74875d3f8ef7: mov -0xc8(%rbp),%rdx
0x74875d3f8efe: test %eax,%eax
0x74875d3f8f00: je 0x74875d3f9458
0x74875d3f8f06: mov 0xb0(%r13),%rdi
0x74875d3f8f0d: mov %rdx,-0xc8(%rbp)
When you say "gets blocked", do you by that mean that: 1) no observable progress is made, but the the process is still using CPU, or 2) all threads are blocked (not using any CPU)?
Thread 33 in the original back trace could be the same futex you see in the second dump.
if it can help, i think i'm having the same (or related) issue where a big elixir application cannot complete the start phase and when running interactively with iex -S mix
, it never gets to the shell.
My erlang version is 26.2.5.4 and it seems one of the threads is stuck on a futex. Its backtrace is:
(gdb) bt
#0 futex_wait (futex_word=0x557b31e72c98, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x557b31e72c98, private=0) at lowlevellock.c:49
#2 0x00007fa064ca9cc1 in lll_mutex_lock_optimized (mutex=0x557b31e72c98) at pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x557b31e72c98) at pthread_mutex_lock.c:93
#4 0x0000557afd7d8502 in ethr_mutex_lock (mtx=<optimized out>) at ../include/internal/ethr_mutex.h:655
#5 erl_drv_mutex_lock (dmtx=<optimized out>) at beam/erl_drv_thread.c:228
#6 enif_mutex_lock (mtx=<optimized out>) at beam/erl_nif.c:2341
#7 0x00007fa0235c3ee6 in get_curve_cnt (env=0x7fa0101eec30, fips=0) at /usr/src/debug/erlang-26.2.5.4-1.fc40.x86_64/lib/crypto/c_src/algorithms.c:248
#8 curve_algorithms (env=0x7fa0101eec30, env@entry=<error reading variable: value has been optimized out>, argc=<error reading variable: value has been optimized out>, argv=<error reading variable: value has been optimized out>)
at /usr/src/debug/erlang-26.2.5.4-1.fc40.x86_64/lib/crypto/c_src/algorithms.c:226
#9 0x0000557afd59e86d in beam_jit_call_nif (c_p=0x7fa009eb0550, I=<optimized out>, reg=0x7fa0101eed40, fp=<optimized out>, NifMod=<optimized out>) at beam/jit/beam_jit_common.cpp:646
#10 0x00007fa012800757 in ?? ()
#11 0x0000000000000000 in ?? ()
I will provide more details in the morning
When you say "gets blocked", do you by that mean that: 1) no observable progress is made, but the the process is still using CPU, or 2) all threads are blocked (not using any CPU)?
Thread 33 in the original back trace could be the same futex you see in the second dump.
In the original message, I saw the whole release is slowly becoming dysfunctional - not initiating expected connections, I cannot ping the node, some proc messageboxes becoming huge with data from connections that are still alive. Only one or two scheduler threads are busy/stuck according to the crash dump.
In the repro snippet - heart watchdog kills the vm after a minute. In the crash dump all or almost all schedulers are running crypto:supports/0 and I assume are stuck in it. That is visible in the crash dump I attached.
In the crash dump all or almost all schedulers are running crypto:supports/0 and I assume are stuck in it.
Then I'm afraid that it's outside my area of expertise, as it no longer looks likely that this is related to one of my compiler optimizations. Someone in team:PS will have to look into this.
same snippet on arm64 mac (same erlang version):
1> application:ensure_all_started(gun), lists:foreach(fun(_) -> spawn_link(fun() ->
TLSOpts = [{verify, verify_none}, {cacerts, public_key:cacerts_get()}],
WSOpts = #{ keepalive => 3000, silence_pings => false},
{ok, ConnPid} = gun:open("echo.websocket.org", 443, #{tls_opts => TLSOpts, ws_opts => WSOpts}),
{ok, _} = gun:await_up(ConnPid), StreamRef = gun:ws_upgrade(ConnPid, "/"),
{upgrade, [<<"websocket">>], _} = gun:await(ConnPid, StreamRef, 3000),
gun:ws_send(ConnPid, StreamRef, {text, <<"BOOM">>}),
gun:await(ConnPid, StreamRef, 5000)
end) end, lists:seq(1, 100000)).
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
** exception exit: {badmatch,{error,{eopnotsupp,emfile}}}
in function pubkey_os_cacerts:get/0 (pubkey_os_cacerts.erl, line 39)
erl_child_setup: failed with error 32 on line 282
2024-10-23T20:53:29.246190+11:00 : error: Error in process <0.420.0> on node zadrot1@ziqqurah with exit value:, {{badmatch,{error,{eopnotsupp,emfile}}},[{pubkey_os_cacerts,get,0,[{file,"pubkey_os_cacerts.erl"},{line,39}]},{erl_eval,do_apply,7,[{file,"erl_eval.erl"},{line,750}]},{erl_eval,expr_list,7,[{file,"erl_eval.erl"},{line,1026}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,292}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,282}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,283}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,494}]},{erl_eval,exprs,6,[{file,"erl_eval.erl"},{line,136}]}]}
erl_child_setup: failed with error 32 on line 282
2024-10-23T20:53:29.246212+11:00 : error: Error in process <0.421.0> on node zadrot1@ziqqurah with exit value:, {{badmatch,{error,{eopnotsupp,emfile}}},[{pubkey_os_cacerts,get,0,[{file,"pubkey_os_cacerts.erl"},{line,39}]},{erl_eval,do_apply,7,[{file,"erl_eval.erl"},{line,750}]},{erl_eval,expr_list,7,[{file,"erl_eval.erl"},{line,1026}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,292}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,282}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,283}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,494}]},{erl_eval,exprs,6,[{file,"erl_eval.erl"},{line,136}]}]}
(zadrot1@ziqqurah)2> erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
judging by the pid values, it made around 400 connections before hitting some weird stuff in public_key:ca_certs().
running another time didn't work in the same way as it seems echo.websocket.org banned me now and resets the stream 😭
Just stubled upon this since I'm affected too.
Looking into https://github.com/erlang/otp/blob/OTP-26.2.5.4/lib/crypto/c_src/algorithms.c#L236
Is ok that ther're a couple of return in a enif_mutex_lock
'ed routine without a corresponding unlock before?
Not an expert in otp internals, so I can be totally wrong here.
** exception exit: {badmatch,{error,{eopnotsupp,emfile}}} in function pubkey_os_cacerts:get/0 (pubkey_os_cacerts.erl, line 39)
emfile is out file descriptors, so you are leaking those or use to many.
@dgud ah, probably, it's just a desktop mac without any ulimits tuned, not a real test
@xadhoom said
Looking into https://github.com/erlang/otp/blob/OTP-26.2.5.4/lib/crypto/c_src/algorithms.c#L236
Is ok that ther're a couple of return in a
enif_mutex_lock
'ed routine without a corresponding unlock before? Not an expert in otp internals, so I can be totally wrong here.
That's it. Function get_curve_cnt()
returns in two places without calling enif_mutex_unlock(mtx_init_curve_types)
. It is definitely a deadlock trap for any thread calling get_curve_cnt()
.
This bug exists only in OTP-26.2.5.4.
It's a race. It can only happen the first time crypto:supports()
or crypto:supports(curves)
is called and get called by two (or more) Erlang processes at the same time. So, it takes some "bad luck" to happen.
I will prepare a fix.
@sverker thank you! It seems I was sufficiently lucky to break Erlang with my first real project in it - switching from python because of some hard to fix multithreading/async issues in networking libraries there 😅
It's a race. It can only happen the first time
crypto:supports()
orcrypto:supports(curves)
is called and get called by two (or more) Erlang processes at the same time. So, it takes some "bad luck" to happen.
this seems plausible with my trace:
#0 futex_wait (futex_word=0x1807e830, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x1807e830, private=0) at lowlevellock.c:49
#2 0x00007f8d17f71cc1 in lll_mutex_lock_optimized (mutex=0x1807e830) at pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x1807e830) at pthread_mutex_lock.c:93
#4 0x00000000005dceae in ethr_mutex_lock (mtx=<optimized out>) at ../include/internal/ethr_mutex.h:655
#5 erl_drv_mutex_lock (dmtx=<optimized out>) at beam/erl_drv_thread.c:228
#6 0x00000000005f3895 in enif_mutex_lock (mtx=<optimized out>) at beam/erl_nif.c:2341
#7 0x00007f8cd6740e78 in get_curve_cnt (fips=0, env=0x7f8cc74f5db0) at algorithms.c:248
#8 curve_algorithms (env=0x7f8cc74f5db0, argc=<optimized out>, argv=<optimized out>) at algorithms.c:226
#9 0x00000000004625da in process_main (esdp=0x1807e830, esdp@entry=0x7f8cc8827a80) at x86_64-pc-linux-gnu/opt/emu/beam_cold.h:140
#10 0x0000000000438f52 in sched_thread_func (vesdp=0x7f8cc8827a80) at beam/erl_process.c:8670
#11 0x00000000006b4a94 in thr_wrapper (vtwd=0x7ffe36524740) at pthread/ethread.c:116
#12 0x00007f8d17f6e6d7 in start_thread (arg=<optimized out>) at pthread_create.c:447
#13 0x00007f8d17ff260c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb) print mtx_init_curve_types.name
$6 = 0x1807e860 "init_curve_types"
(gdb) print mtx_init_curve_types.mtx
$7 = {pt_mtx = {__data = {__lock = 2, __count = 0, __owner = 544785, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000\021P\b\000\001", '\000' <repeats 26 times>, __align = 2}}
(gdb) thread find 544785
Thread 41 has target id 'Thread 0x7f8cc73f36c0 (LWP 544785)'
(gdb) thread 41
[Switching to thread 41 (Thread 0x7f8cc73f36c0 (LWP 544785))]
#0 0x00007f8d17ff2a32 in epoll_wait (epfd=6, events=events@entry=0x7f8cd7bc6c10, maxevents=maxevents@entry=512, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0 0x00007f8d17ff2a32 in epoll_wait (epfd=6, events=events@entry=0x7f8cd7bc6c10, maxevents=maxevents@entry=512, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x0000000000643618 in check_fd_events (ps=<optimized out>, pr=0x7f8cd7bc6c10, max_res=<optimized out>, timeout_time=16384000000) at sys/common/erl_poll.c:1861
#2 erts_poll_wait (ps=0x7f8cd7bc1ec8, pr=0x7f8cd7bc6c10, len=0x7f8cc73f2b88, tpd=0x7f8cc8831df8, timeout_time=16384000000) at sys/common/erl_poll.c:1959
#3 0x0000000000649965 in erts_check_io (psi=0x7f8cd7bc53c0, timeout_time=timeout_time@entry=16384000000, poll_only_thread=poll_only_thread@entry=0) at sys/common/erl_check_io.c:1728
#4 0x0000000000452490 in scheduler_wait (fcalls=<synthetic pointer>, esdp=<optimized out>, rq=<optimized out>) at beam/erl_process.c:3545
#5 erts_schedule (esdp=<optimized out>, p=<optimized out>, calls=<optimized out>) at beam/erl_process.c:9803
#6 0x000000000045ab9b in process_main (esdp=0x6, esdp@entry=0x7f8cc8831dc0) at beam/emu/beam_emu.c:366
#7 0x0000000000438f52 in sched_thread_func (vesdp=0x7f8cc8831dc0) at beam/erl_process.c:8670
#8 0x00000000006b4a94 in thr_wrapper (vtwd=0x7ffe36524740) at pthread/ethread.c:116
#9 0x00007f8d17f6e6d7 in start_thread (arg=<optimized out>) at pthread_create.c:447
#10 0x00007f8d17ff260c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
the init_curve_types
mutex is held and never released by thread 41 and thread 42 is stuck in the enif_mutex_lock
call
In a simple websocket streams consumer app (ten workers under one supervisor), one of the vm schedulers occasionally (but not rarely) gets blocked in a nif called from crypto module. That in turn leads to tcp app getting into a bad state, node not responding (but seemingly running) and all that kind of trouble. Oddly enough, if it fails, it always happens with the last (10th) worker to reach the connecting code.
Still working on a minimal example that I can upload here, but it all looks like this:
In a SIGUSR1-generated crash dump, two processes active - the first one is blocked, the second one is the one that called halt().
The process stack always looks like this:
When running BEAM with debug, it's quite likely that gdb gives me the offending function:
Relevant version info:
The connecting code is pretty simple:
Not sure if the whole thing is caused by some ssl compat issues, but some workaround advice would be much appreciated.