
beam.smp crashes in strong_rand_bytes_nif (with openssl 3 & Erlang 25) #7098

Status: Open · hareen-kancharla opened this issue 1 year ago

hareen-kancharla commented 1 year ago

Describe the bug
beam.smp crashes in strong_rand_bytes_nif (OpenSSL 3 & Erlang 25) with the following stacktrace.

Erlang Version: 25.1.2
Crypto app version: crypto-5.1.2
OS: Ubuntu 18

Core was generated by `/home/couchbase/jenkins/workspace/ns-server-serverless-simple-test/install/lib/'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_rwlock_wrlock_full (abstime=0x0, rwlock=0x0) at pthread_rwlock_common.c:576
[Current thread is 1 (Thread 0x7fce886cb700 (LWP 125009))]
#0  __pthread_rwlock_wrlock_full (abstime=0x0, rwlock=0x0) at pthread_rwlock_common.c:576
#1  __GI___pthread_rwlock_wrlock (rwlock=0x0) at pthread_rwlock_wrlock.c:27
#2  0x00007fcd90ae39b9 in CRYPTO_THREAD_write_lock () from /home/couchbase/jenkins/workspace/ns-server-serverless-simple-test/ns_server/../install/lib/libcrypto.so.3
#3  0x00007fcd90b141f4 in RAND_get_rand_method () from /home/couchbase/jenkins/workspace/ns-server-serverless-simple-test/ns_server/../install/lib/libcrypto.so.3
#4  0x00007fcd90b14989 in RAND_bytes_ex () from /home/couchbase/jenkins/workspace/ns-server-serverless-simple-test/ns_server/../install/lib/libcrypto.so.3
#5  0x00007fcd91c4ff72 in strong_rand_bytes_nif (env=env@entry=0x7fce886c6c70, argc=argc@entry=1, argv=argv@entry=0x7fce886c6d80) at rand.c:39
#6  0x00000000004799a5 in beam_jit_call_nif (c_p=0x2331cf0, I=<optimized out>, reg=0x7fce886c6d80, fp=0x7fcd91c4ff30 <strong_rand_bytes_nif>, NifMod=<optimized out>) at beam/jit/beam_jit_common.cpp:559
#7  0x00007fce8a94b55e in ?? ()
#8  0x0000000000000000 in ?? ()

To Reproduce

Expected behavior
beam.smp shouldn't crash.

Affected versions
25.1.2

Additional context
archived_core_dumps.tar.bz2 has the coredump and the necessary .so files (corefile was too large to upload here)

gdb home/couchbase/jenkins/workspace/ns-server-serverless-simple-test/install/lib/erlang/erts-13.1.2/bin/beam.smp --core 'core.5_scheduler.124980.d031d9c9e084.1680117857' -ex 'set debug-file-directory usr/lib/debug' -ex 'set sysroot .'

(gdb) etp-process-info c_p
Pid: <0.31284.1>
State: running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
 
Flags:
Registered name: ns_node_disco
Current function: crypto:strong_rand_bytes_nif/1
I: #Cp<gen:do_call/4+0x2f8> @ "gen.erl":237
Heap size: 1598
Old-heap size: 1598
Mbuf size: 0
Msgq len: 0 (inner=0, outer=0)
Msgq Flags: on-heap
Parent: <0.31282.1>
Pointer: (Process*)0x2331cf0
 
(gdb) etp-stacktrace c_p -jit
%%% WARNING: The process is currently running, so c_p->stop will not be correct
%%%          Consider using -emu or -jit variant instead
% Stacktrace (31)
I: #Cp<gen:do_call/4+0x2f8> @ "gen.erl":237.
0: #Cp<crypto:strong_rand_bytes/1+0x7c> @ "crypto.erl":1140.
1: #Cp<misc:uuid_v4/0+0x8c> @ "src/misc.erl":3294.
2: #Cp<event_log:add_log/2+0x108> @ "src/event_log.erl":259.
8: #Cp<ns_node_disco:handle_info/2+0x6e0> @ "src/ns_node_disco.erl":188.
10: #Cp<gen_server:try_dispatch/4+0x12c> @ "gen_server.erl":1123.
19: #Cp<gen_server:handle_msg/6+0x59c> @ "gen_server.erl":1200.
26: #Cp<proc_lib:init_p_do_apply/3+0xec> @ "proc_lib.erl":240.
30: #Cp<terminate process normally>.
jhogberg commented 1 year ago

Thanks for your report! Something fishy is going on here. We appear to crash here because rand_meth_lock is NULL:

// the function is guarded by #ifndef OPENSSL_NO_DEPRECATED_3_0 a bit higher up

const RAND_METHOD *RAND_get_rand_method(void)
{
    const RAND_METHOD *tmp_meth = NULL;

    // this is just pthread_once, and returns after do_rand_init runs. If multiple
    // threads race they will wait until the first one finishes.
    if (!RUN_ONCE(&rand_init, do_rand_init))
        return NULL;

    if (!CRYPTO_THREAD_write_lock(rand_meth_lock)) // <----
        return NULL;

This is weird since do_rand_init is supposed to initialize it:

DEFINE_RUN_ONCE_STATIC(do_rand_init)
{
# ifndef OPENSSL_NO_ENGINE
    rand_engine_lock = CRYPTO_THREAD_lock_new();
    if (rand_engine_lock == NULL)
        return 0;
# endif

# ifndef OPENSSL_NO_DEPRECATED_3_0
    rand_meth_lock = CRYPTO_THREAD_lock_new();
    if (rand_meth_lock == NULL)
        goto err;
# endif

    if (!ossl_rand_pool_init())
        goto err;

    rand_inited = 1;
    return 1;

 err:
# ifndef OPENSSL_NO_DEPRECATED_3_0
    CRYPTO_THREAD_lock_free(rand_meth_lock);
    rand_meth_lock = NULL;
# endif
# ifndef OPENSSL_NO_ENGINE
    CRYPTO_THREAD_lock_free(rand_engine_lock);
    rand_engine_lock = NULL;
# endif
    return 0;
}

Checking the state of the rand_inited variable, it's 0, so we apparently haven't succeeded in initializing the lock. Now why would RUN_ONCE(&rand_init, do_rand_init) return non-zero if that failed?

#  define DEFINE_RUN_ONCE_STATIC(init)            \
    static int init(void);                     \
    static int init##_ossl_ret_ = 0;            \
    static void init##_ossl_(void)              \
    {                                           \
        init##_ossl_ret_ = init();              \
    }                                           \
    static int init(void)

// snip

#  define RUN_ONCE(once, init)                                            \
    (CRYPTO_THREAD_run_once(once, init##_ossl_) ? init##_ossl_ret_ : 0)

// snip

int CRYPTO_THREAD_run_once(CRYPTO_ONCE *once, void (*init)(void))
{
    if (pthread_once(once, init) != 0) // zero on success
        return 0;

    return 1;
}

So, if CRYPTO_THREAD_run_once succeeds, RUN_ONCE returns the value of do_rand_init_ossl_ret_, which has been set to the return value of do_rand_init. In the core dump, do_rand_init_ossl_ret_ is 1 while rand_inited is 0, which should not be possible. I'd really like to see some more core dumps if that can be arranged.
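
For reference, hand-expanding the macros above for do_rand_init gives roughly the following (illustrative only; every name comes from the quoted OpenSSL source):

static int do_rand_init(void);              /* the real initializer shown above */
static int do_rand_init_ossl_ret_ = 0;      /* records do_rand_init's return value */
static void do_rand_init_ossl_(void)
{
    do_rand_init_ossl_ret_ = do_rand_init();
}

/* ... and RUN_ONCE(&rand_init, do_rand_init) expands to: */
(CRYPTO_THREAD_run_once(&rand_init, do_rand_init_ossl_) ? do_rand_init_ossl_ret_ : 0)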

How long had the node been running before the crash? Did any previous calls to crypto:strong_rand_bytes/1 succeed?

hareen-kancharla commented 1 year ago

I'd really like to see some more core dumps if that can be arranged.

These coredumps were seen in our Jenkins job and unfortunately our automation didn't collect the corefiles. We have updated that now, and if we see the issue again I shall upload the corefiles here.

How long had the node been running before the crash? Did any previous calls to crypto:strong_rand_bytes/1 succeed?

This was in our Jenkins job; the node can't have been running for more than 40 minutes. Unfortunately our Jenkins job had other failures and we couldn't collect the logs either, so we don't know whether any previous calls were made to crypto:strong_rand_bytes/1.

hareen-kancharla commented 1 year ago

@jhogberg: Another instance of the crash was seen recently in our Jenkins job. The new corefile is here: https://github.com/hareen-kancharla/erlang-coredumps/blob/main/archived_core_dumps.tar-2.bz2.

jhogberg commented 1 year ago

Thanks, the symptoms near the crash are the same (rand_inited is 0, do_rand_init_ossl_ret_ is 1), so it doesn't seem as random as I'd expect from an ordinary memory-corruption bug.

What flags have you compiled OTP and OpenSSL with? What other NIFs are in use?

hareen-kancharla commented 1 year ago

@jhogberg: The configure flags we use are: --enable-smp-support --disable-hipe --disable-fp-exceptions --without-javac --without-et --without-debugger --without-megaco --enable-jit

The OpenSSL version is 3.0.7-3.

In terms of the NIFs we use: all the libraries and NIFs are under the archived_core_dumps/home/couchbase/jenkins/workspace/ns-server-serverless-simple-test directory.

install/lib has:

total 59008
-rwxr-xr-x@ 1 hareen.kancharla staff 20910192 Mar 25 2021 libstdc++.so.6
-rwxr-xr-x@ 1 hareen.kancharla staff   592272 Mar 25 2021 libgcc_s.so.1
-rwxr-xr-x@ 1 hareen.kancharla staff   279728 Apr  7 2022 libtinfo.so.6
-rwxr-xr-x@ 1 hareen.kancharla staff  3026280 Jun 16 2022 libsodium.so.23
-rwxr-xr-x@ 1 hareen.kancharla staff  5394160 Jan 11 15:32 libcrypto.so.3

ns-server-serverless-simple-test/ns_server - (main) ~ fd "nif.so"
deps/chronicle/priv/sync_nif.so
deps/enacl/priv/enacl_nif.so

Let me know if there is more information you need.

jhogberg commented 1 year ago

Okay, would it be possible to run these parts of the pipeline without these NIFs? Is the crash somewhat deterministic in regards to when it crashes (during or near a certain testcase or similar)?

hareen-kancharla commented 1 year ago

Okay, would it be possible to run these parts of the pipeline without these NIFs?

Sure - we'll try that.

Is the crash somewhat deterministic in regards to when it crashes (during or near a certain testcase or similar)?

TBH, not sure :) We have seen it in different tests, and the first instance was when we were trying to generate a v4 UUID (which needed a call to strong_rand_bytes). These UUIDs are generated frequently in our application, so my hunch is that it's not that deterministic; otherwise we would have seen more crashes.

timofey-barmin commented 1 year ago

I think I reproduced it on vanilla Erlang 25.3 and 24.3.4.4 with no external NIFs loaded, on Ubuntu 18.04 and on macOS 13.3. Basically I generate bytes in a loop and then terminate the VM using erlang:halt(0, [{flush, false}]). I think the [{flush, false}] option to halt() triggers it somehow; I could not reproduce it without that option.

It doesn't happen in 100% of cases, but it does happen pretty often (see below).

Script to reproduce:

$ cat test.erl
-module(test).

-export([run/0]).

run() ->
    spawn(fun test/0),
    timer:sleep(300),
    erlang:halt(0, [{flush, false}]).

test() ->
    crypto:strong_rand_bytes(10),
    test().

Running:

$ for i in {1..20}; do erl -run test run; done
Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Segmentation fault (core dumped)
Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Segmentation fault (core dumped)
Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]

Eshell V13.2  (abort with ^G)
1> Segmentation fault (core dumped)
Erlang/OTP 25 [erts-13.2] [source-7552343] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns]
max-au commented 1 year ago

It really reminded me of https://github.com/erlang/otp/issues/5325.

Could it be that OpenSSL is unloaded while the NIF is still running?

jhogberg commented 1 year ago

Thank you! I can reproduce it locally :)

Could it be that OpenSSL is unloaded while the NIF is still running?

Sort of: the exit(2) that tears the emulator down calls all atexit/on_exit callbacks, which races with the scheduler thread that is calling strong_rand_bytes_nif.

I'll see what I can do to fix this.
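
To make the failure mode concrete, here is a minimal, self-contained C sketch of that kind of race (purely illustrative; it is not OTP or OpenSSL code, and all names in it are invented): an atexit handler on the exiting thread tears down a lock while a worker thread is still using it.

/* Build (assumption): cc -pthread race_sketch.c -o race_sketch */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_rwlock_t *meth_lock;  /* stands in for a library-internal lock */

/* atexit() handler: runs on the exiting thread, like the cleanup callbacks
 * mentioned above, and tears the lock down. */
static void cleanup(void)
{
    pthread_rwlock_destroy(meth_lock);
    free(meth_lock);
    meth_lock = NULL;                /* the worker thread may still be running */
}

/* Stands in for a scheduler thread that is still inside the NIF. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Crashes (use-after-free or NULL dereference) once cleanup() has run. */
        pthread_rwlock_wrlock(meth_lock);
        pthread_rwlock_unlock(meth_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;

    meth_lock = malloc(sizeof(*meth_lock));
    pthread_rwlock_init(meth_lock, NULL);
    atexit(cleanup);

    pthread_create(&tid, NULL, worker, NULL);
    usleep(1000);                    /* let the worker get going */
    exit(0);                         /* runs cleanup() while the worker is in flight */
}

Whether a program like this actually segfaults depends entirely on timing, which is consistent with the intermittent crashes shown in the reproduction above.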

jhogberg commented 1 year ago

I can confirm it's very similar to https://github.com/erlang/otp/issues/5325: the crash was on a normal rather than a dirty scheduler, but the root cause is the same -- we're terminating the emulator while NIFs are in flight.

I don't think there's anything we can do about this without breaking backwards compatibility with NIFs that expect their registered atexit functions to run. I recommend upgrading to OTP 26 or using erlang:halt/2 without {flush,false}.