aws / aws-iot-device-sdk-cpp

SDK for connecting to AWS IoT from a device using C++
http://aws-iot-device-sdk-cpp-docs.s3-website-us-east-1.amazonaws.com
Apache License 2.0
124 stars 113 forks source link

random crashes on disconnects #65

Closed mzanetti closed 6 years ago

mzanetti commented 6 years ago

Every once in while our application crashes when the MBedTLS connection goes down. I managed to catch it in the debugger now, here's the stack trace:

The crash happens in Thread 13

` Thread 13 (Thread 0x7fffd22e6700 (LWP 14027)):

0 0x00007ffff65dbcc0 in __libc_write (fd=32, buf=0x7fffbc00c3c8, nbytes=31) at ../sysdeps/unix/sysv/linux/write.c:26

    resultvar = 18446744073709551584
    sc_cancel_oldtype = 0
    sc_ret = <optimized out>
    nbytes = 31
    buf = 0x7fffbc00c3c8
    fd = 32

1 0x00007ffff72c4aff in mbedtls_net_send () from /usr/lib/x86_64-linux-gnu/libmbedtls.so.10

No symbol table info available.

2 0x00007ffff72d165c in mbedtls_ssl_flush_output () from /usr/lib/x86_64-linux-gnu/libmbedtls.so.10

No symbol table info available.

3 0x00007ffff72d400f in mbedtls_ssl_write_record () from /usr/lib/x86_64-linux-gnu/libmbedtls.so.10

No symbol table info available.

4 0x00007ffff72d5d0f in mbedtls_ssl_send_alert_message () from /usr/lib/x86_64-linux-gnu/libmbedtls.so.10

No symbol table info available.

5 0x00007ffff72d5e17 in mbedtls_ssl_close_notify () from /usr/lib/x86_64-linux-gnu/libmbedtls.so.10

No symbol table info available.

6 0x00007ffff77819d7 in awsiotsdk::network::MbedTLSConnection::DisconnectInternal (this=0x555555b66f50) at ../../guh/libguh-core/MbedTLS/MbedTLSConnection.cpp:320

    ret = 0

7 0x0000555555589373 in awsiotsdk::NetworkConnection::Disconnect() ()

No symbol table info available.

8 0x00005555555a9c0c in awsiotsdk::mqtt::DisconnectActionAsync::PerformAction(std::shared_ptr, std::shared_ptr) ()

No symbol table info available.

9 0x00005555555b904e in awsiotsdk::ClientCoreState::PerformAction(awsiotsdk::ActionType, std::shared_ptr, std::chrono::duration<long, std::ratio<1l, 1000l> >) ()

No symbol table info available.

10 0x00005555555a6d24 in awsiotsdk::mqtt::KeepaliveActionRunner::PerformAction(std::shared_ptr, std::shared_ptr) ()

No symbol table info available.

11 0x00005555555b8a32 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind<awsiotsdk::ResponseCode (awsiotsdk::Action::*(std::unique_ptr<awsiotsdk::Action, std::default_delete >, std::shared_ptr, std::shared_ptr))(std::shared_ptr, std::shared_ptr)> > > >::_M_run() ()

No symbol table info available.

12 0x00007ffff63000ff in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

No symbol table info available.

13 0x00007ffff65d17fc in start_thread (arg=0x7fffd22e6700) at pthread_create.c:465

    pd = 0x7fffd22e6700
    now = <optimized out>
    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140736719644416, 1081798041271840026, 140737106032670, 140737106032671, 140736719644416, 23, -1081719558347641574, -1081781811933565670}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
    not_first_call = <optimized out>
    pagesize_m1 = <optimized out>
    sp = <optimized out>
    freesize = <optimized out>
    __PRETTY_FUNCTION__ = "start_thread"

14 0x00007ffff5d61b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

No locals.

Thread 12 (Thread 0x7fffe8b63700 (LWP 14026)):

0 0x00007ffff65dc5f8 in __GI___nanosleep (requested_time=0x7fffe8b62b30, remaining=0x7fffe8b62b30) at ../sysdeps/unix/sysv/linux/nanosleep.c:27

    resultvar = 18446744073709551100
    sc_cancel_oldtype = 0

1 0x00005555555b5cf5 in awsiotsdk::Action::ReadFromNetworkBuffer(std::shared_ptr, std::vector<unsigned char, std::allocator >&, unsigned long) ()

No symbol table info available.

2 0x00005555555ab639 in awsiotsdk::mqtt::NetworkReadActionRunner::ReadPacketFromNetwork(unsigned char&, std::vector<unsigned char, std::allocator >&) ()

No symbol table info available.

3 0x00005555555ac19c in awsiotsdk::mqtt::NetworkReadActionRunner::PerformAction(std::shared_ptr, std::shared_ptr) ()

No symbol table info available.

4 0x00005555555b8a32 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind<awsiotsdk::ResponseCode (awsiotsdk::Action::*(std::unique_ptr<awsiotsdk::Action, std::default_delete >, std::shared_ptr, std::shared_ptr))(std::shared_ptr, std::shared_ptr)> > > >::_M_run() ()

No symbol table info available.

5 0x00007ffff63000ff in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

No symbol table info available.

6 0x00007ffff65d17fc in start_thread (arg=0x7fffe8b63700) at pthread_create.c:465

    pd = 0x7fffe8b63700
    now = <optimized out>
    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737097643776, 1081798041271840026, 140737106032670, 140737106032671, 140737097643776, 21, -1081847998807757542, -1081781811933565670}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
    not_first_call = <optimized out>
    pagesize_m1 = <optimized out>
    sp = <optimized out>
    freesize = <optimized out>
    __PRETTY_FUNCTION__ = "start_thread"

7 0x00007ffff5d61b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

No locals.

Thread 10 (Thread 0x7fffea366700 (LWP 14024)):

0 0x00007ffff65dc5f8 in __GI___nanosleep (requested_time=0x7fffea365c60, remaining=0x7fffea365c60) at ../sysdeps/unix/sysv/linux/nanosleep.c:27

    resultvar = 18446744073709551100
    sc_cancel_oldtype = 0

1 0x00005555555baa9d in awsiotsdk::ClientCoreState::ProcessOutboundActionQueue(std::shared_ptr<std::atomic >) ()

No symbol table info available.

2 0x00005555555b88fa in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::_Bind<void (awsiotsdk::ClientCoreState::*(std::shared_ptr, std::shared_ptr<std::atomic >))(std::shared_ptr<std::atomic >)> > > >::_M_run() ()

No symbol table info available.

3 0x00007ffff63000ff in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

No symbol table info available.

4 0x00007ffff65d17fc in start_thread (arg=0x7fffea366700) at pthread_create.c:465

    pd = 0x7fffea366700
    now = <optimized out>
    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737122821888, 1081798041271840026, 140737488339614, 140737488339615, 140737122821888, 21, -1081842497491522278, -1081781811933565670}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
    not_first_call = <optimized out>
    pagesize_m1 = <optimized out>
    sp = <optimized out>
    freesize = <optimized out>
    __PRETTY_FUNCTION__ = "start_thread"

5 0x00007ffff5d61b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

No locals.

Thread 8 (Thread 0x7fffd2ae7700 (LWP 13637)):

0 0x00007ffff5d55951 in __GI___poll (fds=0x7fffc40182f0, nfds=1, timeout=135119) at ../sysdeps/unix/sysv/linux/poll.c:29

    resultvar = 18446744073709551100
    sc_cancel_oldtype = 0

1 0x00007ffff29d6169 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

2 0x00007ffff29d627c in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

3 0x00007ffff6acb47f in QEventDispatcherGlib::processEvents(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

4 0x00007ffff6a70e3a in QEventLoop::exec(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

5 0x00007ffff68903ca in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

6 0x00007ffff689529d in ?? () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

7 0x00007ffff65d17fc in start_thread (arg=0x7fffd2ae7700) at pthread_create.c:465

    pd = 0x7fffd2ae7700
    now = <optimized out>
    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140736728037120, 1081798041271840026, 140737488343278, 140737488343279, 140736728037120, 140736728037824, -1081720659469882086, -1081781811933565670}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
    not_first_call = <optimized out>
    pagesize_m1 = <optimized out>
    sp = <optimized out>
    freesize = <optimized out>
    __PRETTY_FUNCTION__ = "start_thread"

8 0x00007ffff5d61b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

No locals.

Thread 3 (Thread 0x7fffeb46d700 (LWP 13632)):

0 0x00007ffff5d55951 in __GI___poll (fds=0x7fffdc002de0, nfds=1, timeout=9997) at ../sysdeps/unix/sysv/linux/poll.c:29

    resultvar = 18446744073709551100
    sc_cancel_oldtype = 0

1 0x00007ffff29d6169 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

2 0x00007ffff29d627c in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

3 0x00007ffff6acb47f in QEventDispatcherGlib::processEvents(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

4 0x00007ffff6a70e3a in QEventLoop::exec(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

5 0x00007ffff68903ca in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

6 0x00007ffff689529d in ?? () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

7 0x00007ffff65d17fc in start_thread (arg=0x7fffeb46d700) at pthread_create.c:465

    pd = 0x7fffeb46d700
    now = <optimized out>
    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737140676352, 1081798041271840026, 140737488344318, 140737488344319, 140737140676352, 140737140677056, -1081841256782844646, -1081781811933565670}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
    not_first_call = <optimized out>
    pagesize_m1 = <optimized out>
    sp = <optimized out>
    freesize = <optimized out>
    __PRETTY_FUNCTION__ = "start_thread"

8 0x00007ffff5d61b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

No locals.

Thread 2 (Thread 0x7fffec5e3700 (LWP 13631)):

0 0x00007ffff5d55951 in __GI___poll (fds=0x7fffe40053d0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29

    resultvar = 18446744073709551100
    sc_cancel_oldtype = 0

1 0x00007ffff29d6169 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

2 0x00007ffff29d627c in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

3 0x00007ffff6acb47f in QEventDispatcherGlib::processEvents(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

4 0x00007ffff6a70e3a in QEventLoop::exec(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

5 0x00007ffff68903ca in QThread::exec() () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

6 0x00007ffff41b1e45 in ?? () from /usr/lib/x86_64-linux-gnu/libQt5DBus.so.5

No symbol table info available.

7 0x00007ffff689529d in ?? () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

8 0x00007ffff65d17fc in start_thread (arg=0x7fffec5e3700) at pthread_create.c:465

    pd = 0x7fffec5e3700
    now = <optimized out>
    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737158985472, 1081798041271840026, 140737488347166, 140737488347167, 140737158985472, 140737158986176, -1081838996556305126, -1081781811933565670}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
    not_first_call = <optimized out>
    pagesize_m1 = <optimized out>
    sp = <optimized out>
    freesize = <optimized out>
    __PRETTY_FUNCTION__ = "start_thread"

9 0x00007ffff5d61b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

No locals.

Thread 1 (Thread 0x7ffff7fa9640 (LWP 13627)):

0 0x00007ffff5d55951 in __GI___poll (fds=0x555555cc29b0, nfds=22, timeout=870) at ../sysdeps/unix/sysv/linux/poll.c:29

    resultvar = 18446744073709551100
    sc_cancel_oldtype = 0

1 0x00007ffff29d6169 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

2 0x00007ffff29d627c in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0

No symbol table info available.

3 0x00007ffff6acb47f in QEventDispatcherGlib::processEvents(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

4 0x00007ffff6a70e3a in QEventLoop::exec(QFlags) () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

5 0x00007ffff6a79da4 in QCoreApplication::exec() () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

No symbol table info available.

6 0x00005555555833a1 in main (argc=12, argv=0x7fffffffe2c8) at ../../guh/server/main.cpp:281

    userId = 1000
    ret = -1
    application = {<QCoreApplication> = {<No data fields>}, <No data fields>}
    loggingFiltersPlugins = {{d = 0x55555584efc0, e = 0x55555584efc0}}
    translator = <incomplete type>
    __PRETTY_FUNCTION__ = "int main(int, char**)"
    parser = {d = 0x55555584c490}
    applicationDescription = {static null = {<No data fields>}, d = 0x555555850700}
    foregroundOption = {d = {d = 0x55555584a870}}
    debugDescription = {static null = {<No data fields>}, d = 0x555555871940}
    sortedFilterList = {<QList<QString>> = {<QListSpecialMethods<QString>> = {<No data fields>}, {p = {static shared_null = {ref = {atomic = {_q_value = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = -1}, <No data fields>}}}, alloc = 0, begin = 0, end = 0, array = {0x0}}, d = 0x55555584dc10}, d = 0x55555584dc10}}, <No data fields>}
    sortedPluginList = {<QList<QString>> = {<QListSpecialMethods<QString>> = {<No data fields>}, {p = {static shared_null = {ref = {atomic = {_q_value = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = -1}, <No data fields>}}}, alloc = 0, begin = 0, end = 0, array = {0x0}}, d = 0x555555850b10}, d = 0x555555850b10}}, <No data fields>}
    allOption = {d = {d = 0x55555584d850}}
    logOption = {d = {d = 0x55555584f3d0}}
    dbusOption = {d = {d = 0x55555584b630}}
    debugOption = {d = {d = 0x55555584f0f0}}
    startForeground = true
    service = {<QtService<QCoreApplication>> = {<QtServiceBase> = {_vptr.QtServiceBase = 0x10500000002, d_ptr = 0x5555555bdce9}, app = 0x5555555be4c0 <main::__PRETTY_FUNCTION__>}, <No data fields>}
    ret = 0

`

The contents of "this" and "ssl": ` Locals
ret 0 int this @0x555555b66f50 awsiotsdk::network::MbedTLSConnection [awsiotsdk::NetworkConnection] @0x555555b66f50 awsiotsdk::NetworkConnection cacert
@0x555555b70950 mbedtls_x509crt clicert @0x555555b70b78 mbedtls_x509crt conf @0x555555b707d0 mbedtls_ssl_config ctrdrbg @0x555555b70490 mbedtls_ctr_drbg_context device_certlocation "/home/micha/Develop/cloudstuff/testserver/certificate.pem" awsiotsdk::util::String device_private_keylocation "/home/micha/Develop/cloudstuff/testserver/private.key" awsiotsdk::util::String endpoint_ "a2addxakg5juii.iot.eu-west-1.amazonaws.com" awsiotsdk::util::String endpointport 8883 uint16t entropy @0x555555b67050 mbedtls_entropycontext flags 0 uint32_t isconnected @0x555555b67009 std::atomicbool pkey @0x555555b70da0 mbedtls_pk_context requiresfree @0x555555b70db4 std::atomic_bool root_calocation "/home/micha/Develop/cloudstuff/symantec.pem" awsiotsdk::util::String serverfd @0x555555b70db0 mbedtls_net_context server_verificationflag @0x555555b67008 std::atomicbool ssl @0x555555b70618 mbedtls_ssl_context alpn_chosen 0x0 char badmac_seen 0 unsigned int cli_id 0x0 unsigned char cli_id_len 0 size_t client_auth 1 int conf @0x555555b707d0 mbedtls_ssl_config f_get_timer 0x0 mbedtls_ssl_get_timer_t f_recv 0x0 mbedtls_ssl_recv_t f_recv_timeout mbedtls_ssl_recv_timeout_t f_send mbedtls_ssl_send_t f_set_timer 0x0 mbedtls_ssl_set_timer_t handshake 0x0 mbedtls_ssl_handshake_params hostname "a2addxakg5juii.iot.eu-west-1.amazonaws.com" char in_buf "" unsigned char in_ctr "" unsigned char in_epoch 0 uint16_t in_hdr "\027\003\003" unsigned char in_hslen 0 size_t in_iv "" unsigned char in_left 0 size_t in_len "" unsigned char in_msg "�" unsigned char in_msglen 0 size_t in_msgtype 23 int in_offt 0x0 unsigned char in_window 0 uint64_t in_window_top 0 uint64_t keep_current_message 0 int major_ver 3 int minor_ver 3 int nb_zero 0 int next_record_offset 0 size_t out_buf "" unsigned char out_ctr "" unsigned char out_hdr "\025\003\003" unsigned char out_iv "" unsigned char out_left 31 size_t out_len "" unsigned char out_msg "����H���XAe\r垴1W\022���C,���Q��vQt{o�Z�5-\t�$FVq����Z��\002[��\013\t�\031\005@u\023�����y�\001\006n��\025U��~Ν�{�m\t��$�Z\014\2776L͋�"... (unknown length) unsigned char out_msglen 26 size_t out_msgtype 21 int own_verify_data "46�?�3q���\tQ" char[12] p_bio 0x555555b70db0 void p_timer 0x0 void peer_verify_data "�|j����ΞdyN" char[12] renego_records_seen 0 int renego_status 0 int secure_renegotiation 1 int session @0x7fffbc010520 mbedtls_ssl_session session_in @0x7fffbc010520 mbedtls_ssl_session session_negotiate 0x0 mbedtls_ssl_session session_out @0x7fffbc010520 mbedtls_ssl_session split_done '\0' 0 0x00 signed char state 16 int transform @0x7fffbc007480 mbedtls_ssl_transform transform_in @0x7fffbc007480 mbedtls_ssl_transform transform_negotiate 0x0 mbedtls_ssl_transform transform_out @0x7fffbc007480 mbedtls_ssl_transform verify_data_len 12 size_t tls_handshaketimeout @0x555555b67010 std::chrono::milliseconds tls_readtimeout @0x555555b67018 std::chrono::milliseconds tls_writetimeout @0x555555b67020 std::chrono::milliseconds Inspector
Expressions
Return Value
Tooltip
`

image

vareddy-zz commented 6 years ago

Hi @mzanetti, This seems similar to the issue observed in the OpenSSL wrapper, when the connection goes down abruptly and a SIGPIPE is generated (#60). Which platform are you seeing this crash on? Could you try adding signal(SIGPIPE, SIG_IGN); in the ConnectInternal() function? The MbedTLS wrapper does not have an init function so the signal ignore call will have to be made elsewhere (in the application main perhaps?). Let me know if that works for you. Thanks! Varun

mzanetti commented 6 years ago

Hi @vareddy

I am seeing this on Ubuntu (both, Ubuntu classic and Ubuntu Core aka Snappy).

It indeed seems to be a SIGPIPE triggering this, yes. Now, the issue is that I can't seem to find a way to reliably reproduce this. This only happens every few days of our service running when the aws connection drops for $random_reason. A manual disconnect call on the library does not show this behavior, or, at least in very rare occations.

I will add the ignoring of the SIGPIPE signal as you suggested and I am confident that it will help. In case I still see this happening after the change I'd come back to you.

Thanks a lot!

vareddy-zz commented 6 years ago

Hi @mzanetti, There haven't been any updates on this in a long time, were you able to fix the issue? Thanks! Varun

vareddy-zz commented 6 years ago

Hi @mzanetti, Since this issue has been inactive for a long time, we are closing it. Please open another issue if you have any more questions. Thanks! Varun