erlang / otp

Scheduler blocked in a crypto NIF #8971

Open dudalev opened 1 day ago

dudalev commented 1 day ago

In a simple websocket streams consumer app (ten workers under one supervisor), one of the VM schedulers occasionally (but not rarely) gets blocked in a NIF called from the crypto module. That in turn leads to the TCP app getting into a bad state, the node not responding (but seemingly running), and all that kind of trouble. Oddly enough, when it fails, it always happens with the last (10th) worker to reach the connecting code.
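
Roughly (module names below are placeholders and details are simplified, the real code is a bit more involved), the supervision setup looks like this:

    %% placeholder sketch, not the actual app code
    -module(ws_consumer_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% ten copies of the worker that runs the connecting code shown further down
        Children = [#{id => {ws_worker, N},
                      start => {ws_worker, start_link, [N]}}
                    || N <- lists:seq(1, 10)],
        {ok, {#{strategy => one_for_one}, Children}}.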

I'm still working on a minimal example that I can upload here, but the failure looks like this:

In a SIGUSR1-generated crash dump, two processes are active: the first one is blocked, the second one is the one that called halt().

[screenshot: crash dump process overview, 2024-10-22]

The process stack always looks like this:

[screenshot: the blocked process's stack, 2024-10-22]

When running a debug build of BEAM, gdb will quite likely show me the offending function:

(gdb) info threads
...
  31   Thread 0x7a827d8f06c0 (LWP 693156) "erts_sched_7"   syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  32   Thread 0x7a827d9f36c0 (LWP 693154) "erts_sched_6"   syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  33   Thread 0x7a827daf66c0 (LWP 693153) "erts_sched_5"   futex_wait (private=0, expected=2, futex_word=0x587b1ab9a900) at ../sysdeps/nptl/futex-internal.h:146
  34   Thread 0x7a827dbf96c0 (LWP 693152) "erts_sched_4"   syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  35   Thread 0x7a827dcfc6c0 (LWP 693151) "erts_sched_3"   0x00007a82761f8ed1 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3  <-- the only sched in native code
  36   Thread 0x7a827ddff6c0 (LWP 693150) "erts_sched_2"   syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  37   Thread 0x7a827ef0f6c0 (LWP 693149) "erts_sched_1"   syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  38   Thread 0x7a827ef736c0 (LWP 693143) "erts_async_30"  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  39   Thread 0x7a827ef966c0 (LWP 693142) "erts_async_29"  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

(gdb) thread 35
[Switching to thread 35 (Thread 0x7a827dcfc6c0 (LWP 693151))]
#0  0x00007a82761f8ed1 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
(gdb) bt
#0  0x00007a82761f8ed1 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#1  0x00007a82761f9903 in EVP_CipherInit_ex () from /lib/x86_64-linux-gnu/libcrypto.so.3
#2  0x00007a827c0123fa in aead_cipher_nif () from /home/qwe/zadrot1/lib/crypto-5.4.2.2/priv/lib/crypto.so
#3  0x0000587b1a4486cd in beam_jit_call_nif(process*, void const*, unsigned long*, unsigned long (*)(enif_environment_t*, int, unsigned long*), erl_module_nif*) ()
#4  0x00007a8281000757 in ?? ()
#5  0x0000000000000000 in ?? ()

Relevant version info:

The connecting code is pretty simple:

    TLSOpts = [{verify, verify_none}, {cacerts, public_key:cacerts_get()}],
    WSOpts = #{ keepalive => 3000, silence_pings => false},
    {ok, ConnPid} = gun:open("api.alor.ru", 443, #{tls_opts => TLSOpts, ws_opts => WSOpts}),
    {ok, _} = gun:await_up(ConnPid),

Not sure if the whole thing is caused by some SSL compatibility issue, but any workaround advice would be much appreciated.

dudalev commented 1 day ago

I was able to produce similar crash dump backtraces (stuck in crypto:supports/0) with this snippet (and -heart added to the VM options):

1> application:ensure_all_started(gun), lists:foreach(fun(_) -> spawn_link(fun() -> 
TLSOpts = [{verify, verify_none}, {cacerts, public_key:cacerts_get()}], 
WSOpts = #{ keepalive => 3000, silence_pings => false}, 
{ok, ConnPid} = gun:open("echo.websocket.org", 443, #{tls_opts => TLSOpts, ws_opts => WSOpts}), 
{ok, _} = gun:await_up(ConnPid), StreamRef = gun:ws_upgrade(ConnPid, "/"), 
{upgrade, [<<"websocket">>], _} = gun:await(ConnPid, StreamRef, 3000), 
gun:ws_send(ConnPid, StreamRef, {text, <<"BOOM">>}), 
gun:await(ConnPid, StreamRef, 5000) 
end) end, lists:seq(1, 100000)).
heart: Tue Oct 22 15:16:49 2024: heart-beat time-out, no activity for 65 seconds
heart: Tue Oct 22 15:16:49 2024: Would reboot. Terminating.
Kernel pid terminated (heart) ({port_terminated,{heart,loop,[<0.0.0>,{state,#Port<0.1>,[],[],undefined}]}})

Crash dump is being written to: erl_crash.dump...done

dudalev commented 1 day ago

The erl crash dump for the snippet above is a bit too chonky to attach here: https://www.dropbox.com/scl/fi/obguj5brf6f5rb2iy8gmg/erl_crash.dump.gz?rlkey=6zrb9ykek68ypmsjo7hgasbn8&st=wm4g1ke5&dl=0

frej commented 1 day ago

The backtrace points to a C function inside OpenSSL. A backtrace from a run where you have debug symbols for libcrypto.so.3 would probably provide a clue as to why it blocks.

dudalev commented 1 day ago

I'm not yet able to get the same backtrace in gdb when using the snippet (heart still gets triggered), but I'm seeing a similar one with the curve_algorithms NIF getting stuck:

#0  futex_wait (private=0, expected=2, futex_word=0x748805e9a9d0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x748805e9a9d0, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007488780a00f1 in lll_mutex_lock_optimized (mutex=0x748805e9a9d0) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x748805e9a9d0) at ./nptl/pthread_mutex_lock.c:93
#4  0x000063c0c4fd0702 in erl_drv_mutex_lock ()
#5  0x000074882afaad7c in curve_algorithms () from /home/qwe/zadrot1/lib/crypto-5.4.2.2/priv/lib/crypto.so
#6  0x000063c0c4e1c6cd in beam_jit_call_nif(process*, void const*, unsigned long*, unsigned long (*)(enif_environment_t*, int, unsigned long*), erl_module_nif*) ()
#7  0x0000748833600757 in ?? ()
#8  0x0000000000000000 in ?? ()

(and it may be a red herring since some threads are paused by gdb)

dudalev commented 1 day ago

From the first backtrace, the offset of the #0 frame address relative to the EVP_CipherInit_ex return address in frame #1 is 0x00007a82761f8ed1 - 0x00007a82761f9903 = -2610, so it should be somewhere around here (EVP_CIPHER_CTX_reset??):

(gdb) info address EVP_CipherInit_ex
Symbol "EVP_CipherInit_ex" is at 0x74875d3f98f0 in a file compiled without debugging.
(gdb) x/20i 0x74875d3f98f0-2610 - 10
   0x74875d3f8eb4:      movl   $0x0,0x78(%rbp)
   0x74875d3f8ebb:      je     0x74875d3f8ed1
   0x74875d3f8ebd:      mov    0x70(%r13),%rbx
   0x74875d3f8ec1:      mov    %r13,%rdi
   0x74875d3f8ec4:      call   0x74875d3f8680 <EVP_CIPHER_CTX_reset>
   0x74875d3f8ec9:      mov    %r14d,0x10(%r13)
   0x74875d3f8ecd:      mov    %rbx,0x70(%r13)
   0x74875d3f8ed1:      mov    %r15,%rdx
   0x74875d3f8ed4:      cmpq   $0x0,0x78(%rdx)
   0x74875d3f8ed9:      je     0x74875d3f9310
   0x74875d3f8edf:      cmp    %rdx,0xb0(%r13)
   0x74875d3f8ee6:      je     0x74875d3f8f27
   0x74875d3f8ee8:      mov    %rdx,%rdi
   0x74875d3f8eeb:      mov    %rdx,-0xc8(%rbp)
   0x74875d3f8ef2:      call   0x74875d3f85b0 <EVP_CIPHER_up_ref>
   0x74875d3f8ef7:      mov    -0xc8(%rbp),%rdx
   0x74875d3f8efe:      test   %eax,%eax
   0x74875d3f8f00:      je     0x74875d3f9458
   0x74875d3f8f06:      mov    0xb0(%r13),%rdi
   0x74875d3f8f0d:      mov    %rdx,-0xc8(%rbp)

frej commented 1 day ago

When you say "gets blocked", do you mean that: 1) no observable progress is made, but the process is still using CPU, or 2) all threads are blocked (not using any CPU)?

Thread 33 in the original back trace could be the same futex you see in the second dump.

flaviogrossi commented 1 day ago

If it helps, I think I'm having the same (or a related) issue, where a big Elixir application cannot complete its start phase and, when run interactively with iex -S mix, never gets to the shell.

My Erlang version is 26.2.5.4, and it seems one of the threads is stuck on a futex. Its backtrace is:

(gdb) bt
#0  futex_wait (futex_word=0x557b31e72c98, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x557b31e72c98, private=0) at lowlevellock.c:49
#2  0x00007fa064ca9cc1 in lll_mutex_lock_optimized (mutex=0x557b31e72c98) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x557b31e72c98) at pthread_mutex_lock.c:93
#4  0x0000557afd7d8502 in ethr_mutex_lock (mtx=<optimized out>) at ../include/internal/ethr_mutex.h:655
#5  erl_drv_mutex_lock (dmtx=<optimized out>) at beam/erl_drv_thread.c:228
#6  enif_mutex_lock (mtx=<optimized out>) at beam/erl_nif.c:2341
#7  0x00007fa0235c3ee6 in get_curve_cnt (env=0x7fa0101eec30, fips=0) at /usr/src/debug/erlang-26.2.5.4-1.fc40.x86_64/lib/crypto/c_src/algorithms.c:248
#8  curve_algorithms (env=0x7fa0101eec30, env@entry=<error reading variable: value has been optimized out>, argc=<error reading variable: value has been optimized out>, argv=<error reading variable: value has been optimized out>)
    at /usr/src/debug/erlang-26.2.5.4-1.fc40.x86_64/lib/crypto/c_src/algorithms.c:226
#9  0x0000557afd59e86d in beam_jit_call_nif (c_p=0x7fa009eb0550, I=<optimized out>, reg=0x7fa0101eed40, fp=<optimized out>, NifMod=<optimized out>) at beam/jit/beam_jit_common.cpp:646
#10 0x00007fa012800757 in ?? ()
#11 0x0000000000000000 in ?? ()

I will provide more details in the morning

dudalev commented 19 hours ago

When you say "gets blocked", do you mean that: 1) no observable progress is made, but the process is still using CPU, or 2) all threads are blocked (not using any CPU)?

Thread 33 in the original back trace could be the same futex you see in the second dump.

In the original message, the whole release was slowly becoming dysfunctional - not initiating expected connections, the node not responding to ping, some process mailboxes growing huge with data from connections that were still alive. Only one or two scheduler threads are busy/stuck according to the crash dump.

With the repro snippet, the heart watchdog kills the VM after a minute. In the crash dump all or almost all schedulers are running crypto:supports/0 and I assume are stuck in it. That is visible in the crash dump I attached.

frej commented 19 hours ago

In the crash dump all or almost all schedulers are running crypto:supports/0 and I assume are stuck in it.

Then I'm afraid that it's outside my area of expertise, as it no longer looks likely that this is related to one of my compiler optimizations. Someone in team:PS will have to look into this.

dudalev commented 18 hours ago

The same snippet on an arm64 Mac (same Erlang version):

1> application:ensure_all_started(gun), lists:foreach(fun(_) -> spawn_link(fun() ->
                     TLSOpts = [{verify, verify_none}, {cacerts, public_key:cacerts_get()}],
                     WSOpts = #{ keepalive => 3000, silence_pings => false},
                     {ok, ConnPid} = gun:open("echo.websocket.org", 443, #{tls_opts => TLSOpts, ws_opts => WSOpts}),
                     {ok, _} = gun:await_up(ConnPid), StreamRef = gun:ws_upgrade(ConnPid, "/"),
                     {upgrade, [<<"websocket">>], _} = gun:await(ConnPid, StreamRef, 3000),
                     gun:ws_send(ConnPid, StreamRef, {text, <<"BOOM">>}),
                     gun:await(ConnPid, StreamRef, 5000)
                     end) end, lists:seq(1, 100000)).
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
** exception exit: {badmatch,{error,{eopnotsupp,emfile}}}
     in function  pubkey_os_cacerts:get/0 (pubkey_os_cacerts.erl, line 39)
erl_child_setup: failed with error 32 on line 282
2024-10-23T20:53:29.246190+11:00 : error: Error in process <0.420.0> on node zadrot1@ziqqurah with exit value:, {{badmatch,{error,{eopnotsupp,emfile}}},[{pubkey_os_cacerts,get,0,[{file,"pubkey_os_cacerts.erl"},{line,39}]},{erl_eval,do_apply,7,[{file,"erl_eval.erl"},{line,750}]},{erl_eval,expr_list,7,[{file,"erl_eval.erl"},{line,1026}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,292}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,282}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,283}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,494}]},{erl_eval,exprs,6,[{file,"erl_eval.erl"},{line,136}]}]}
erl_child_setup: failed with error 32 on line 282
2024-10-23T20:53:29.246212+11:00 : error: Error in process <0.421.0> on node zadrot1@ziqqurah with exit value:, {{badmatch,{error,{eopnotsupp,emfile}}},[{pubkey_os_cacerts,get,0,[{file,"pubkey_os_cacerts.erl"},{line,39}]},{erl_eval,do_apply,7,[{file,"erl_eval.erl"},{line,750}]},{erl_eval,expr_list,7,[{file,"erl_eval.erl"},{line,1026}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,292}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,282}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,283}]},{erl_eval,expr,6,[{file,"erl_eval.erl"},{line,494}]},{erl_eval,exprs,6,[{file,"erl_eval.erl"},{line,136}]}]}
(zadrot1@ziqqurah)2> erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282
erl_child_setup: failed with error 32 on line 282

Judging by the pid values, it made around 400 connections before hitting emfile errors in public_key:cacerts_get().

Running it another time didn't go the same way, as it seems echo.websocket.org has banned me now and resets the stream 😭

xadhoom commented 18 hours ago

Just stumbled upon this since I'm affected too.

Looking into https://github.com/erlang/otp/blob/OTP-26.2.5.4/lib/crypto/c_src/algorithms.c#L236

Is it OK that there are a couple of returns in an enif_mutex_lock'ed routine without a corresponding unlock before them? I'm not an expert in OTP internals, so I could be totally wrong here.

dgud commented 18 hours ago

** exception exit: {badmatch,{error,{eopnotsupp,emfile}}}
     in function  pubkey_os_cacerts:get/0 (pubkey_os_cacerts.erl, line 39)

emfile means the node is out of file descriptors, so you are either leaking them or using too many.
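
(For reference, two quick checks from a live shell on the node, nothing specific to this issue; the OS limit itself is ulimit -n:)

1> length(erlang:ports()).         % open ports; each TCP/TLS connection holds at least one
2> erlang:system_info(port_limit). % the VM's own port table limit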

dudalev commented 18 hours ago

@dgud ah, probably; it's just a desktop Mac without any ulimit tuning, not a real test.

sverker commented 18 hours ago

@xadhoom said

Looking into https://github.com/erlang/otp/blob/OTP-26.2.5.4/lib/crypto/c_src/algorithms.c#L236

Is it OK that there are a couple of returns in an enif_mutex_lock'ed routine without a corresponding unlock before them? I'm not an expert in OTP internals, so I could be totally wrong here.

That's it. Function get_curve_cnt() returns in two places without calling enif_mutex_unlock(mtx_init_curve_types). It is definitely a deadlock trap for any thread calling get_curve_cnt().

sverker commented 17 hours ago

This bug exists only in OTP-26.2.5.4.

It's a race. It can only happen the first time crypto:supports() or crypto:supports(curves) is called and it gets called by two (or more) Erlang processes at the same time. So, it takes some "bad luck" to happen.
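
For illustration, a minimal way to provoke it is to let a fresh node (where crypto:supports has never been called) make that first call from several processes at once; whether it actually hits the window is timing dependent:

1> application:ensure_all_started(crypto),
   [spawn(fun() -> crypto:supports(curves) end) || _ <- lists:seq(1, 8)],
   crypto:supports(curves).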

I will prepare a fix.
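
Until then, since it is only that very first call that can race, making sure it happens from a single process before anything else uses crypto concurrently should avoid it; a sketch (my_app_sup is a placeholder):

    %% hypothetical application start/2 callback
    start(_Type, _Args) ->
        {ok, _} = application:ensure_all_started(crypto),
        _ = crypto:supports(curves),   %% warm-up: first call made by a single process
        my_app_sup:start_link().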

dudalev commented 17 hours ago

@sverker thank you! It seems I was sufficiently lucky to break Erlang with my first real project in it - I switched from Python because of some hard-to-fix multithreading/async issues in networking libraries there 😅

flaviogrossi commented 16 hours ago

It's a race. It can only happen the first time crypto:supports() or crypto:supports(curves) is called and it gets called by two (or more) Erlang processes at the same time. So, it takes some "bad luck" to happen.

This seems plausible given my trace:

#0  futex_wait (futex_word=0x1807e830, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x1807e830, private=0) at lowlevellock.c:49
#2  0x00007f8d17f71cc1 in lll_mutex_lock_optimized (mutex=0x1807e830) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x1807e830) at pthread_mutex_lock.c:93
#4  0x00000000005dceae in ethr_mutex_lock (mtx=<optimized out>) at ../include/internal/ethr_mutex.h:655
#5  erl_drv_mutex_lock (dmtx=<optimized out>) at beam/erl_drv_thread.c:228
#6  0x00000000005f3895 in enif_mutex_lock (mtx=<optimized out>) at beam/erl_nif.c:2341
#7  0x00007f8cd6740e78 in get_curve_cnt (fips=0, env=0x7f8cc74f5db0) at algorithms.c:248
#8  curve_algorithms (env=0x7f8cc74f5db0, argc=<optimized out>, argv=<optimized out>) at algorithms.c:226
#9  0x00000000004625da in process_main (esdp=0x1807e830, esdp@entry=0x7f8cc8827a80) at x86_64-pc-linux-gnu/opt/emu/beam_cold.h:140
#10 0x0000000000438f52 in sched_thread_func (vesdp=0x7f8cc8827a80) at beam/erl_process.c:8670
#11 0x00000000006b4a94 in thr_wrapper (vtwd=0x7ffe36524740) at pthread/ethread.c:116
#12 0x00007f8d17f6e6d7 in start_thread (arg=<optimized out>) at pthread_create.c:447
#13 0x00007f8d17ff260c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb) print mtx_init_curve_types.name
$6 = 0x1807e860 "init_curve_types"
(gdb) print mtx_init_curve_types.mtx
$7 = {pt_mtx = {__data = {__lock = 2, __count = 0, __owner = 544785, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\002\000\000\000\000\000\000\000\021P\b\000\001", '\000' <repeats 26 times>, __align = 2}}
(gdb) thread find 544785
Thread 41 has target id 'Thread 0x7f8cc73f36c0 (LWP 544785)'
(gdb) thread 41
[Switching to thread 41 (Thread 0x7f8cc73f36c0 (LWP 544785))]
#0  0x00007f8d17ff2a32 in epoll_wait (epfd=6, events=events@entry=0x7f8cd7bc6c10, maxevents=maxevents@entry=512, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30        return SYSCALL_CANCEL (epoll_wait, epfd, events, maxevents, timeout);
(gdb) bt
#0  0x00007f8d17ff2a32 in epoll_wait (epfd=6, events=events@entry=0x7f8cd7bc6c10, maxevents=maxevents@entry=512, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x0000000000643618 in check_fd_events (ps=<optimized out>, pr=0x7f8cd7bc6c10, max_res=<optimized out>, timeout_time=16384000000) at sys/common/erl_poll.c:1861
#2  erts_poll_wait (ps=0x7f8cd7bc1ec8, pr=0x7f8cd7bc6c10, len=0x7f8cc73f2b88, tpd=0x7f8cc8831df8, timeout_time=16384000000) at sys/common/erl_poll.c:1959
#3  0x0000000000649965 in erts_check_io (psi=0x7f8cd7bc53c0, timeout_time=timeout_time@entry=16384000000, poll_only_thread=poll_only_thread@entry=0) at sys/common/erl_check_io.c:1728
#4  0x0000000000452490 in scheduler_wait (fcalls=<synthetic pointer>, esdp=<optimized out>, rq=<optimized out>) at beam/erl_process.c:3545
#5  erts_schedule (esdp=<optimized out>, p=<optimized out>, calls=<optimized out>) at beam/erl_process.c:9803
#6  0x000000000045ab9b in process_main (esdp=0x6, esdp@entry=0x7f8cc8831dc0) at beam/emu/beam_emu.c:366
#7  0x0000000000438f52 in sched_thread_func (vesdp=0x7f8cc8831dc0) at beam/erl_process.c:8670
#8  0x00000000006b4a94 in thr_wrapper (vtwd=0x7ffe36524740) at pthread/ethread.c:116
#9  0x00007f8d17f6e6d7 in start_thread (arg=<optimized out>) at pthread_create.c:447
#10 0x00007f8d17ff260c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

The init_curve_types mutex is held (and never released) by thread 41, and thread 42 is stuck in the enif_mutex_lock call.