eBay / NuRaft

C++ implementation of Raft core logic as a replication library
Apache License 2.0
1.02k stars 241 forks source link

Data race in peer #493

Open as51340 opened 8 months ago

as51340 commented 8 months ago

I found data race using TSAN in my code. The issue seems to be in peer.

WARNING: ThreadSanitizer: data race (pid=182953)
  Write of size 8 at 0x7b5800000e50 by thread T5 (mutexes: write M2013, write M1987):
    #0 operator delete(void*) /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126 (coordinator_instance+0x2c2d215)
    #1 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:184 (coordinator_instance+0x486baa4)
    #2 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:705 (coordinator_instance+0x486baa4)
    #3 std::__shared_ptr<nuraft::peer, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:1154 (coordinator_instance+0x486baa4)
    #4 std::__shared_ptr<nuraft::peer, (__gnu_cxx::_Lock_policy)2>::reset() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/shared_ptr_base.h:1272 (coordinator_instance+0x486baa4)
    #5 nuraft::raft_server::reconfigure(std::shared_ptr<nuraft::cluster_config> const&) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_commit.cxx:717 (coordinator_instance+0x486baa4)

  Previous atomic read of size 1 at 0x7b5800000e50 by thread T4:
    #0 pthread_mutex_lock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4249 (coordinator_instance+0x2ba6914)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:749 (coordinator_instance+0x4894ad7)
    #2 std::mutex::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:100 (coordinator_instance+0x4894ad7)
    #3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:229 (coordinator_instance+0x4894ad7)
    #4 nuraft::timer_helper::set_duration_us(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/include/libnuraft/internal_timer.hxx:57 (coordinator_instance+0x4894ad7)
    #5 nuraft::timer_helper::set_duration_ms(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/include/libnuraft/internal_timer.hxx:62 (coordinator_instance+0x4894ad7)
    #6 nuraft::peer::handle_rpc_result(std::shared_ptr<nuraft::peer>, std::shared_ptr<nuraft::rpc_client>, std::shared_ptr<nuraft::req_msg>&, std::shared_ptr<nuraft::cmd_result<std::shared_ptr<nuraft::resp_msg>, std::shared_ptr<nuraft::rpc_exception> > >&, std::shared_ptr<nuraft::resp_msg>&, std::shared_ptr<nuraft::rpc_exception>&) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/peer.cxx:146 (coordinator_instance+0x4894ad7)

  Mutex M2013 (0x7b6c00001298) created at:
    #0 pthread_mutex_trylock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1296 (coordinator_instance+0x2b9a324)
    #1 __gthread_mutex_trylock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:758 (coordinator_instance+0x4867b9f)
    #2 std::mutex::try_lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/std_mutex.h:111 (coordinator_instance+0x4867b9f)
    #3 std::unique_lock<std::mutex>::unique_lock(std::mutex&, std::try_to_lock_t) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:78 (coordinator_instance+0x4867b9f)
    #4 nuraft::raft_server::commit_in_bg_exec(unsigned long) /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_commit.cxx:161 (coordinator_instance+0x4867b9f)

  Mutex M1987 (0x7b6c000011f0) created at:
    #0 pthread_mutex_lock /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4249 (coordinator_instance+0x2ba6914)
    #1 __gthread_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:749 (coordinator_instance+0x4880979)
    #2 __gthread_recursive_mutex_lock(pthread_mutex_t*) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/x86_64-linux-gnu/bits/gthr-default.h:811 (coordinator_instance+0x4880979)
    #3 std::recursive_mutex::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/mutex:108 (coordinator_instance+0x4880979)
    #4 std::unique_lock<std::recursive_mutex>::lock() /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:139 (coordinator_instance+0x4880979)
    #5 std::unique_lock<std::recursive_mutex>::unique_lock(std::recursive_mutex&) /opt/toolchain-v4/bin/../lib/gcc/x86_64-linux-gnu/11.2.0/../../../../include/c++/11.2.0/bits/unique_lock.h:69 (coordinator_instance+0x4880979)
    #6 nuraft::raft_server::restart_election_timer() /home/andi/Memgraph/code/memgraph/libs/nuraft/src/handle_timeout.cxx:164 (coordinator_instance+0x4880979)
    #7 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #8 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #9 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #10 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #11 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

  Thread T5 'nuraft_commit' (tid=182959, running) created by main thread at:
    #0 pthread_create /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:977 (coordinator_instance+0x2bbb63a)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/include/x86_64-linux-gnu/bits/gthr-default.h:663 (coordinator_instance+0x60720b9)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:147 (coordinator_instance+0x60720b9)
    #3 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #4 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #5 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #7 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

  Thread T4 'nuraft_w_0' (tid=182958, running) created by main thread at:
    #0 pthread_create /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:977 (coordinator_instance+0x2bbb63a)
    #1 __gthread_create /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/include/x86_64-linux-gnu/bits/gthr-default.h:663 (coordinator_instance+0x60720b9)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /root/memgraph/environment/toolchain/build/gcc-11.2.0/build/x86_64-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:147 (coordinator_instance+0x60720b9)
    #3 memgraph::coordination::RaftState::MakeRaftState(memgraph::coordination::CoordinatorInstanceInitConfig const&, std::function<void ()>&&, std::function<void ()>&&) /home/andi/Memgraph/code/memgraph/src/coordination/raft_state.cpp:104 (coordinator_instance+0x2dc1ace)
    #4 memgraph::coordination::CoordinatorInstance::CoordinatorInstance(memgraph::coordination::CoordinatorInstanceInitConfig const&) /home/andi/Memgraph/code/memgraph/src/coordination/coordinator_instance.cpp:36 (coordinator_instance+0x2cb89eb)
    #5 CoordinatorInstanceTest_ConnectCoordinators_Test::TestBody() /home/andi/Memgraph/code/memgraph/tests/unit/coordinator_instance.cpp:67 (coordinator_instance+0x2c30777)
    #6 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ??:? (coordinator_instance+0x2fa05b5)
    #7 main /home/andi/Memgraph/code/memgraph/tests/unit/main.cpp:19 (coordinator_instance+0x2cb83be)

SUMMARY: ThreadSanitizer: data race /root/memgraph/environment/toolchain/build/llvm-13.0.0/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126 in operator delete(void*)
=

Relevant code that is using NuRaft here is related to adding servers, we do:

srv_config const srv_config_to_add(static_cast<int>(config.coordinator_id), endpoint);

  auto cmd_result = raft_server_->add_srv(srv_config_to_add);

  if (cmd_result->get_result_code() == nuraft::cmd_result_code::OK) {
    spdlog::info("Request to add server {} to the cluster accepted", endpoint);
  } else {
    throw RaftAddServerException("Failed to accept request to add server {} to the cluster with error code {}",
                                 endpoint, int(cmd_result->get_result_code()));
  }

  // Waiting for server to join
  constexpr int max_tries{10};
  auto maybe_stop = utils::ResettableCounter<max_tries>();
  constexpr int waiting_period{200};
  bool added{false};
  while (!maybe_stop()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(waiting_period));
    const auto server_config = raft_server_->get_srv_config(static_cast<nuraft::int32>(config.coordinator_id));
    if (server_config) {
      spdlog::trace("Server with id {} added to cluster", config.coordinator_id);
      added = true;
      break;
    }
  }

  if (!added) {
    throw RaftAddServerException("Failed to add server {} to the cluster in {}ms", endpoint,
                                 max_tries * waiting_period);
  }
greensky00 commented 8 months ago

From what you provided, it seems there is a race between srv_to_join_.reset() and peer::lock_. However, that is not possible as peer::handle_rpc_result has myself not to make the reference counter of the shared pointer become 0. I haven't seen TSAN alert during the normal server add/removal process.

as51340 commented 8 months ago

Thanks for the reply!

We are using NuRaft v2.1.0 Yes, TSAN alerts on every run.

Can you instruct me on how to get logs?

greensky00 commented 7 months ago

I tried the same code with the same NuRaft version, still didn't get TSAN alert.

Could you share your build environment? Such as OS version, compiler version, etc. Also please share Raft parameters (heartbeat, election lower/upper timeout, ...).