facebook / squangle

SQuangLe is a C++ API for accessing MySQL servers
Other
123 stars 54 forks source link

AsyncConnectionPool::CleanUpTimer segfaults #13

Open y3llowcake opened 3 years ago

y3llowcake commented 3 years ago

We (slack) have been seeing a very slow trickle of segfaults from code paths in the cleanup timer in our production environment. This issue is not new, it's been occurring for a while. I do not yet have a repro for these segfaults.

Relevant version of squangle we are running:

┌─(cy@zebu:~/sl/hhvm/third-party/squangle/src)
└─(29)% git log -n 1 --pretty=short  
commit 9b3d6adf34d4f1ec1c1713a54b9def947384b17b (HEAD)
Author: Jay Edgar <jkedgar@fb.com>

    Update state_ inside mutex

There are two unique stack traces we see. The first is more frequent and appears to occur on a call to std::unordered_map::erase():

0:string"raise at ../sysdeps/unix/sysv/linux/raise.c:51"
1:string"HPHP::bt_handler at /build/hhvm/hphp/runtime/base/crash-reporter.cpp:270"
2:string"std::_Hashtable<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/hashtable.h:1627"
3:string"std::_Hashtable<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/hashtable.h:1864"
4:string"std::_Hashtable<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/hashtable.h:755"
5:string"std::unordered_map<facebook::common::mysql_client::PoolKey, at /usr/include/c++/7/bits/unordered_map.h:797"
6:string"facebook::common::mysql_client::AsyncConnectionPool::ConnStorage::cleanupConnections at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:619"
7:string"facebook::common::mysql_client::AsyncConnectionPool::CleanUpTimer::timeoutExpired at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:519"

The second one looks like use of an invalid map iterator ('ref_iter->second' where ref_iter is probably pointing to end()?) :

0:string"raise at ../sysdeps/unix/sysv/linux/raise.c:51"
1:string"HPHP::bt_handler at /build/hhvm/hphp/runtime/base/crash-reporter.cpp:270"
2:string"facebook::common::mysql_client::AsyncMysqlClient::activeConnectionRemoved at /build/hhvm/third-party/squangle/src/squangle/mysql_client/AsyncMysqlClient.h:357"
3:string"facebook::common::mysql_client::MysqlConnectionHolder::~MysqlConnectionHolder at /build/hhvm/third-party/squangle/squangle/mysql_client/Connection.cpp:63"
4:string"facebook::common::mysql_client::MysqlConnectionHolder::~MysqlConnectionHolder at /build/hhvm/third-party/squangle/squangle/mysql_client/Connection.cpp:64"
5:string"std::default_delete<facebook::common::mysql_client::MysqlPooledHolder>::operator() at /usr/include/c++/7/bits/unique_ptr.h:78"
6:string"std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/bits/unique_ptr.h:263"
7:string"__gnu_cxx::new_allocator<std::_List_node<std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/ext/new_allocator.h:140"
8:string"std::allocator_traits<std::allocator<std::_List_node<std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/bits/alloc_traits.h:487"
9:string"std::__cxx11::list<std::unique_ptr<facebook::common::mysql_client::MysqlPooledHolder, at /usr/include/c++/7/bits/stl_list.h:1815"
10:string"facebook::common::mysql_client::AsyncConnectionPool::ConnStorage::cleanupConnections at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:613"
11:string"facebook::common::mysql_client::AsyncConnectionPool::CleanUpTimer::timeoutExpired at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncConnectionPool.cpp:519"
12:string"folly::AsyncTimeout::libeventCallback at /build/hhvm/third-party/folly/src/folly/io/async/AsyncTimeout.cpp:171"
13:string"folly::EventBase::loopBody at /build/hhvm/third-party/folly/src/folly/io/async/EventBase.cpp:394"
14:string"folly::EventBase::loop at /build/hhvm/third-party/folly/src/folly/io/async/EventBase.cpp:312"
15:string"folly::EventBase::loopForever at /build/hhvm/third-party/folly/src/folly/io/async/EventBase.cpp:535"
16:string"facebook::common::mysql_client::AsyncMysqlClient::<lambda()>::operator() at /build/hhvm/third-party/squangle/squangle/mysql_client/AsyncMysqlClient.cpp:80"
17:string"std::__invoke_impl<void, at /usr/include/c++/7/bits/invoke.h:60"
18:string"std::__invoke<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/bits/invoke.h:95"
19:string"std::thread::_Invoker<std::tuple<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/thread:234"
20:string"std::thread::_Invoker<std::tuple<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/thread:243"
21:string"std::thread::_State_impl<std::thread::_Invoker<std::tuple<facebook::common::mysql_client::AsyncMysqlClient::init()::<lambda()> at /usr/include/c++/7/thread:186"
22:string"start_thread at pthread_create.c:463"
23:string"clone at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95"

Additionally noteworthy is our non standard hacklang pool configuration:

new AsyncMysqlConnectionPool(darray[
    'per_key_connection_limit' => 20,
    'idle_timeout_micros' => 4000000,
    'expiration_policy' => "IdleTime",
]);
jupyung commented 3 years ago

Thanks for reporting the issue. Our team started to look at this issue. We will get back to you soon.

y3llowcake commented 3 years ago

Curious if you have any intuition about what might be causing this. Based on other crashes, we think we might be victims to subtle memory corruption bugs in HHVM, but given the frequency and predictability of these stack traces I am more inclined to think the bug is in squangle.

y3llowcake commented 2 years ago

I am curious if the following commit is related to this issue: https://github.com/facebook/squangle/commit/9737cfd6ddf4bdfa3e1fac26659c03ba8d6d5e54

jupyung commented 2 years ago

Yes, the commit you mentioned was meant to fix rare segfault happening in connection cleanup, which looks related to this issue.

fredemmott commented 2 years ago

Yep, cherry-picking that commit fixed this issue for Slack :)