eclipse-cyclonedds / cyclonedds

Eclipse Cyclone DDS project
https://projects.eclipse.org/projects/iot.cyclonedds

cyclonedds (ros2 foxy) ddsrt_mutex_lock abort (crash) #2103

Open dogchenya opened 1 month ago

dogchenya commented 1 month ago

The program crashed twice while using ROS 2 (Foxy). I can't share the code because it is from a private code base.

Platform: aarch64

E1008 05:38:55.100389 4253 log_comm.h:104] [bm] Aborted at 1728337135 (unix time) try "date -d @1728337135" if you are using GNU date
E1008 05:38:55.115846 4253 log_comm.h:104] [bm] PC: @ 0x0 (unknown)
E1008 05:38:55.118182 4253 log_comm.h:104] [bm] SIGABRT (@0x3e800000f93) received by PID 3987 (TID 0x7f6f7f4c20) from PID 3987; stack trace:
E1008 05:38:55.127180 4253 log_comm.h:104] [bm] @ 0x7f8e222f38 google::(anonymous namespace)::FailureSignalHandler()
E1008 05:38:55.133519 4253 log_comm.h:104] [bm] @ 0x7f8eb3b7c0 ([vdso]+0x7bf)
E1008 05:38:55.140120 4253 log_comm.h:104] [bm] @ 0x7f8d11dd78 gsignal
E1008 05:38:55.145430 4253 log_comm.h:104] [bm] @ 0x7f8d10aaac abort
E1008 05:38:55.147627 4253 log_comm.h:104] [bm] @ 0x7f81e13144 ddsrt_mutex_lock
E1008 05:38:55.148913 4253 log_comm.h:104] [bm] @ 0x7f81df8a9c dds_entity_status_signal
E1008 05:38:55.149859 4253 log_comm.h:104] [bm] @ 0x7f81df18e0 (unknown)
E1008 05:38:55.150660 4253 log_comm.h:104] [bm] @ 0x7f81d9ddfc deliver_locally_allinsync
E1008 05:38:55.151491 4253 log_comm.h:104] [bm] @ 0x7f81dd5c78 (unknown)
E1008 05:38:55.152331 4253 log_comm.h:104] [bm] @ 0x7f81dd5dac (unknown)
E1008 05:38:55.153183 4253 log_comm.h:104] [bm] @ 0x7f81dd6e60 (unknown)
E1008 05:38:55.154292 4253 log_comm.h:104] [bm] @ 0x7f81dd8150 (unknown)
E1008 05:38:55.155452 4253 log_comm.h:104] [bm] @ 0x7f81dd9fd8 recv_thread
E1008 05:38:55.156653 4253 log_comm.h:104] [bm] @ 0x7f81ddb15c (unknown)
E1008 05:38:55.157874 4253 log_comm.h:104] [bm] @ 0x7f81e136c8 (unknown)
E1008 05:38:55.164083 4253 log_comm.h:104] [bm] @ 0x7f8d961624 start_thread
E1008 05:38:55.170007 4253 log_comm.h:104] [bm] @ 0x7f8d1bb62c (unknown)

E1010 16:17:59.833220 3622 log_comm.h:104] [bm] Aborted at 1728548279 (unix time) try "date -d @1728548279" if you are using GNU date
E1010 16:17:59.845293 3622 log_comm.h:104] [bm] PC: @ 0x0 (unknown)
E1010 16:17:59.851938 3622 log_comm.h:104] [bm] SIGABRT (@0x3e800000db7) received by PID 3511 (TID 0x7f83fa6970) from PID 3511; stack trace:
E1010 16:17:59.859593 3622 log_comm.h:104] [bm] @ 0x7f94f207c0 ([vdso]+0x7bf)
E1010 16:17:59.866200 3622 log_comm.h:104] [bm] @ 0x7f92880d78 gsignal
E1010 16:17:59.872292 3622 log_comm.h:104] [bm] @ 0x7f9286daac abort
E1010 16:17:59.873351 3622 log_comm.h:104] [bm] @ 0x7f8731b18c ddsrt_mutex_unlock
E1010 16:17:59.874254 3622 log_comm.h:104] [bm] @ 0x7f872f98e0 (unknown)
E1010 16:17:59.875172 3622 log_comm.h:104] [bm] @ 0x7f872a5c50 deliver_locally_one
E1010 16:17:59.876324 3622 log_comm.h:104] [bm] @ 0x7f872ddb90 (unknown)
E1010 16:17:59.877462 3622 log_comm.h:104] [bm] @ 0x7f872dddac (unknown)
E1010 16:17:59.878531 3622 log_comm.h:104] [bm] @ 0x7f872deff8 (unknown)
E1010 16:17:59.879562 3622 log_comm.h:104] [bm] @ 0x7f872e0150 (unknown)
E1010 16:17:59.880486 3622 log_comm.h:104] [bm] @ 0x7f872e1fd8 recv_thread
E1010 16:17:59.881397 3622 log_comm.h:104] [bm] @ 0x7f872e315c (unknown)
E1010 16:17:59.882284 3622 log_comm.h:104] [bm] @ 0x7f8731b6c8 (unknown)
E1010 16:17:59.889055 3622 log_comm.h:104] [bm] @ 0x7f93169624 start_thread
E1010 16:17:59.897092 3622 log_comm.h:104] [bm] @ 0x7f9291e62c (unknown)

eboasson commented 3 weeks ago

Strange, I've never seen that one before ... 🤔

I don't think it is relevant that it is Foxy, or ROS 2 for that matter, because this very much looks like pthread_mutex_lock returned an error while trying to lock an internal mutex on an internal thread (recv_thread). That smells like a race condition, memory corruption, use-after-free or similar. Given that it is aarch64, I am concerned that it might be a missing memory barrier somewhere, because x86/x64 tend to be more forgiving of such mistakes ...

Any idea what the program was doing at the time? Deleting readers, perhaps?
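To make the diagnosis above concrete: both traces show a mutex wrapper (ddsrt_mutex_lock, ddsrt_mutex_unlock) calling abort, which is consistent with the underlying pthread call returning a non-zero error code. The following is only a minimal sketch of that failure mode, assuming POSIX threads. `CheckedMutex` and `lock_or_abort` are hypothetical names made up for illustration, not Cyclone DDS code. It uses an error-checking mutex so that misuse shows up as a defined return code rather than undefined behaviour.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical helper, not part of Cyclone DDS: wraps a POSIX mutex of type
// PTHREAD_MUTEX_ERRORCHECK so that misuse is reported via the return code.
struct CheckedMutex {
  pthread_mutex_t m;

  CheckedMutex() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    int rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when NDEBUG is defined
    pthread_mutexattr_destroy(&attr);
  }
  ~CheckedMutex() { pthread_mutex_destroy(&m); }

  CheckedMutex(const CheckedMutex&) = delete;
  CheckedMutex& operator=(const CheckedMutex&) = delete;

  // Both calls return 0 on success or an errno-style code on failure.
  int lock() { return pthread_mutex_lock(&m); }
  int unlock() { return pthread_mutex_unlock(&m); }
};

// Hypothetical "abort on any error" policy, in the spirit of what the stack
// traces suggest: a failed lock is treated as an unrecoverable program bug.
inline void lock_or_abort(CheckedMutex& mtx) {
  if (mtx.lock() != 0) {
    std::abort();
  }
}
```

With such a mutex, unlocking without holding it yields EPERM and relocking from the owning thread yields EDEADLK. A mutex that has already been destroyed or whose memory has been freed or overwritten gives no such guarantee, which is why an error return on an internal mutex points towards a lifetime or corruption problem.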

dogchenya commented 3 weeks ago

Thank you for your reply. We have reproduced the bug under stress testing: the program crashes after running for about 10 minutes, on both ARM and x86. It reproduces every time.

Code

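#include <chrono>
#include <cstdint>
#include <memory>
#include <string>
#include <thread>

#include "rclcpp/rclcpp.hpp"

// CommonService (a service type) and the AERROR/AINFO logging macros come from
// the reporter's private code base and are not shown here.
class CrashTest {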
 public:
  CrashTest(int32_t index) : index_(index) {}
  void Start() {
    client_node_ = rclcpp::Node::make_shared("test_node" + std::to_string(index_));
    thread_ = std::thread([this]() {
      for (size_t i = 0; i < 0xFFFFFFFFFFFF; i++) {
        auto qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable().lifespan(std::chrono::seconds(60));
        auto client = client_node_->create_client<CommonService>("/test", qos.get_rmw_qos_profile());
        auto ros_req = std::make_shared<CommonService::Request>();
        auto& req = *ros_req;
        req.type = "";
        req.data = "{}";
        if (client->wait_for_service(std::chrono::seconds(10))) {
          auto result_future = client->async_send_request(ros_req);
          auto spin_result = rclcpp::spin_until_future_complete(client_node_, result_future, std::chrono::seconds(10));
          if (spin_result != rclcpp::FutureReturnCode::SUCCESS) {
            AERROR << "call failed: " << static_cast<int32_t>(spin_result);
          } else {
            AINFO << "call success!";
          }
        } else {
          AERROR << "call failed!";
        }
      }
    });
  }

  void Join(){
    if (thread_.joinable()) {
      thread_.join();
    }
  }

 private:
  int32_t index_;
  rclcpp::Node::SharedPtr client_node_;
  std::thread thread_;
};

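int main(int argc, char** argv) {
  // rclcpp must be initialized before any node is created.
  rclcpp::init(argc, argv);

  // Three independent nodes, each creating and destroying clients in a tight loop.
  auto crash_ptr = std::make_shared<CrashTest>(1);
  crash_ptr->Start();
  auto crash_ptr2 = std::make_shared<CrashTest>(2);
  crash_ptr2->Start();
  auto crash_ptr3 = std::make_shared<CrashTest>(3);
  crash_ptr3->Start();

  crash_ptr->Join();
  crash_ptr2->Join();
  crash_ptr3->Join();

  rclcpp::shutdown();
  return 0;
}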