deltachat / deltachat-node

Email-based instant messaging for Node.js.
GNU General Public License v3.0

Core crash on Desktop startup #503

Closed. Simon-Laux closed this issue 2 years ago.

Simon-Laux commented 3 years ago
name = 'deltachat_ffi'
operating_system = 'unix:Manjaro'
crate_version = '1.54.0'
explanation = '''
Panic occurred in file '~/.cargo/registry/src/github.com-1ecc6299db9ec823/concurrent-queue-1.2.2/src/bounded.rs' at line 160
'''
cause = 'index out of bounds: the len is 1000 but the index is 1020'
method = 'Panic'
backtrace = '''

   0: 0x7f23caf83bd3 - async_channel::Receiver<T>::try_recv::hb133967b16c05b88
   1: 0x7f23caf0e584 - <async_std::task::builder::SupportTaskLocals<F> as core::future::future::Future>::poll::hc10285cb0024f608
   2: 0x7f23caf3e028 - deltachat::events::EventEmitter::recv_sync::h7cb379ea5b9dfba3
   3: 0x7f23cb1595e6 - dc_get_next_event
   4: 0x7f23cabded96 - event_handler_thread_func
   5: 0x7f23d6e4d299 - start_thread
   6: 0x7f23d57bf053 - clone
   7:        0x0 - <unresolved>'''

Desktop crashes too often for me (more than ~60% of first startups after a system restart, and 10-20% of other startups). Most of the time I don't get a crash report, but this time I got one.

link2xt commented 3 years ago

This looks like something to report upstream to https://github.com/smol-rs/concurrent-queue

link2xt commented 3 years ago

@Simon-Laux is it an amd64 system?

link2xt commented 3 years ago

@Simon-Laux Could you try with deltachat/deltachat-core-rust#2444? I'm pretty sure it is a bug in the concurrent-queue bit twiddling at https://github.com/smol-rs/concurrent-queue/blob/master/src/bounded.rs. I think you can report the bug upstream in their repo.
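The bit twiddling in question packs a lap counter and a slot index into a single `usize`. Below is a minimal Rust model of that scheme, assuming a crossbeam-style layout where `mark_bit = (cap + 1).next_power_of_two()` and the slot index occupies the bits below `mark_bit`. The `Layout` type and its methods are hypothetical and are not concurrent-queue's API. Under this model, advancing a position never produces an index at or beyond `cap`, which is why an index of 1020 into a 1000-slot buffer is suspicious.

```rust
/// Hypothetical model of a crossbeam-style bounded queue position encoding.
/// ASSUMPTION: mark_bit = (cap + 1).next_power_of_two(), the slot index lives
/// in the bits below mark_bit, and the lap counter lives above one_lap - 1.
struct Layout {
    cap: usize,
    mark_bit: usize,
    one_lap: usize,
}

impl Layout {
    fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be positive");
        let mark_bit = (cap + 1).next_power_of_two();
        Layout { cap, mark_bit, one_lap: mark_bit * 2 }
    }

    /// Slot index encoded in a packed position.
    fn index(&self, pos: usize) -> usize {
        pos & (self.mark_bit - 1)
    }

    /// Lap counter encoded in a packed position (mark bit and index cleared).
    fn lap(&self, pos: usize) -> usize {
        pos & !(self.one_lap - 1)
    }

    /// Advance by one slot, wrapping to index 0 of the next lap at `cap`.
    fn next(&self, pos: usize) -> usize {
        if self.index(pos) + 1 < self.cap {
            pos + 1
        } else {
            self.lap(pos).wrapping_add(self.one_lap)
        }
    }

    /// A well-formed position never indexes past the end of the buffer.
    fn is_valid(&self, pos: usize) -> bool {
        self.index(pos) < self.cap
    }
}
```

For example, `Layout::new(1000)` gives a 10-bit index field (mask 1023), so a position with index 1020 decodes without complaint but would index past a 1000-element buffer, matching the shape of the reported panic.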

link2xt commented 3 years ago

Upstream issue: https://github.com/smol-rs/concurrent-queue/issues/11

Simon-Laux commented 3 years ago

Here are a few backtraces from segfaults. I got the first one multiple times among the 6 crash dumps I looked at:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd99375fb90 in event_listener::Event::listen () from ./deltachat-node/build/Release/deltachat.node
[Current thread is 1 (Thread 0x7fd97f65a640 (LWP 33629))]
(gdb) bt
#0  0x00007fd99375fb90 in event_listener::Event::listen () at ./deltachat-node/build/Release/deltachat.node
#1  0x00007fd9933ad6c4 in <async_std::task::builder::SupportTaskLocals<F> as core::future::future::Future>::poll ()
    at ./deltachat-node/build/Release/deltachat.node
#2  0x00007fd9935fbdf3 in dc_get_next_event () at ./deltachat-node/build/Release/deltachat.node
#3  0x00007fd9930a5d96 in event_handler_thread_func () at ./deltachat-node/build/Release/deltachat.node
#4  0x00007fd99f2f9299 in start_thread () at /usr/lib/libpthread.so.0
#5  0x00007fd99dc6b053 in clone () at /usr/lib/libc.so.6

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f14ccdbf14d in dc_get_event_emitter () from ./deltachat-node/build/Release/deltachat.node
[Current thread is 1 (Thread 0x7f14a33bf640 (LWP 38774))]
(gdb) bt
#0  0x00007f14ccdbf14d in dc_get_event_emitter () at ./deltachat-node/build/Release/deltachat.node
#1  0x00007f14cc869d62 in event_handler_thread_func () at ./deltachat-node/build/Release/deltachat.node
#2  0x00007f14d82bc299 in start_thread () at /usr/lib/libpthread.so.0
#3  0x00007f14d6c2e053 in clone () at /usr/lib/libc.so.6

Some of these could also be bugs in the node bindings 🤷. I was also surprised to see this crash:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fd90e4fb44d in dc_get_chat_id_by_contact_id ()
   from ./deltachat-desktop/node_modules/deltachat-node/build/Release/deltachat.node
[Current thread is 1 (Thread 0x7fd8cf4e6640 (LWP 12048))]
(gdb) bt
#0  0x00007fd90e4fb44d in dc_get_chat_id_by_contact_id ()
    at ./deltachat-desktop/node_modules/deltachat-node/build/Release/deltachat.node
#1  0x0000000000000000 in  ()

No simple function like that should be able to crash the program.

link2xt commented 3 years ago

These functions are fine: they are written in safe Rust and don't crash on Android. Most likely the node bindings call them with an already freed dc_context_t.

Simon-Laux commented 3 years ago

Strange, the NAPI_DCN_CONTEXT macro checks for an empty context: https://github.com/deltachat/deltachat-node/blob/6c941d2e73632abaf2eddc6b2c29b253357b3327/src/napi-macros-extensions.h#L17

For the event emitter crash, I opened a PR fixing the garbage collection / closed-context check: https://github.com/deltachat/deltachat-node/pull/502

link2xt commented 3 years ago

> strange the macro NAPI_DCN_CONTEXT checks for empty context

Maybe someone called dc_context_unref and then kept using the pointer? A NULL check is not enough to catch this.

Edit: I see, the pointer is set to NULL after dc_context_unref. I'm not sure how thread-safe that is, though: what if one thread frees the context after another thread has already loaded the pointer from memory?
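With a raw pointer, the race looks like this: thread A loads the non-NULL pointer and passes the check, thread B calls dc_context_unref and sets the field to NULL, and thread A then dereferences freed memory. A minimal Rust sketch of one common way to close that window, assuming the binding can keep a reference-counted handle instead of a raw pointer: the "is it still open?" check and taking a reference happen under one lock, and closing only drops the handle's own reference. `Context` and `Handle` are hypothetical stand-ins, not deltachat-node or core APIs (the real bindings are C on top of N-API).

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the core context (the real one is dc_context_t).
struct Context {
    name: String,
}

/// Hypothetical binding-side handle holding a reference-counted context
/// instead of a raw pointer that is NULL-checked and freed separately.
struct Handle {
    inner: Mutex<Option<Arc<Context>>>,
}

impl Handle {
    fn new(ctx: Context) -> Self {
        Handle { inner: Mutex::new(Some(Arc::new(ctx))) }
    }

    /// Take a strong reference for the duration of one call. The check and
    /// the clone happen under the same lock, so no other thread can free the
    /// context between "is it still open?" and "use it".
    fn acquire(&self) -> Option<Arc<Context>> {
        self.inner.lock().unwrap().clone()
    }

    /// Equivalent of unref: forget the handle's reference. The context is
    /// dropped only when the last in-flight Arc clone goes away.
    fn close(&self) {
        let _ = self.inner.lock().unwrap().take();
    }
}
```

A call that acquired the context before `close` keeps using valid memory, while calls made afterwards see `None` and can return an error instead of crashing.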

link2xt commented 3 years ago

Related core issue: https://github.com/deltachat/deltachat-core-rust/issues/2280

Simon-Laux commented 3 years ago

Got the concurrent-queue crash again:

name = 'deltachat_ffi'
operating_system = 'unix:Manjaro'
crate_version = '1.55.0'
explanation = '''
Panic occurred in file '~/.cargo/registry/src/github.com-1ecc6299db9ec823/concurrent-queue-1.2.2/src/bounded.rs' at line 160
'''
cause = 'index out of bounds: the len is 1024 but the index is 1806'
method = 'Panic'
backtrace = '''

   0: 0x7f7c1e27e353 - async_channel::Receiver<T>::try_recv::hf80e615fbe9ca3bb
   1: 0x7f7c1e1eb654 - <async_std::task::builder::SupportTaskLocals<F> as core::future::future::Future>::poll::h04d16837fdd0a69e
   2: 0x7f7c1e439e03 - dc_get_next_event
   3: 0x7f7c1dee3d96 - event_handler_thread_func
   4: 0x7f7c2a137299 - start_thread
   5: 0x7f7c28aa9053 - clone
   6:        0x0 - <unresolved>'''

link2xt commented 3 years ago

You can revert https://github.com/deltachat/deltachat-core-rust/pull/2444. It should have used 1023 instead of 1024 anyway (because of this off-by-one error, the cap is currently 2048), and the real problem looks like a use-after-free, not a bug in concurrent-queue.
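The 1023 vs 1024 arithmetic can be checked with a small sketch, assuming (as in crossbeam-style bounded queues) that the index space is sized as `(cap + 1).next_power_of_two()`. Both helper names are hypothetical. Under that assumption a capacity of 2^k - 1 fits exactly into 2^k index values, while a capacity of 2^k rounds up to 2^(k+1).

```rust
/// Hypothetical helper. ASSUMPTION: the queue sizes its index field as
/// (cap + 1).next_power_of_two(), like crossbeam-style bounded queues.
fn index_space(cap: usize) -> usize {
    (cap + 1).next_power_of_two()
}

/// Largest capacity whose index space is exactly `space` under the same
/// assumption; `space` must be a power of two of at least 2.
fn largest_cap_within(space: usize) -> usize {
    assert!(
        space >= 2 && space.is_power_of_two(),
        "space must be a power of two >= 2"
    );
    space - 1
}
```

With this model, `index_space(1023)` is 1024 and `index_space(1024)` is 2048, which is the doubling described above.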

Simon-Laux commented 2 years ago

Let's reopen if it happens again.