autowarefoundation / autoware.universe

https://autowarefoundation.github.io/autoware.universe/
Apache License 2.0
1k stars 647 forks source link

goal_planner (as well as start_planner) module of behavior_path_planner is not thread safe and we have a lot of room for refactoring #6718

Closed soblin closed 6 months ago

soblin commented 7 months ago

Checklist

Description

7 non-thread-safe variables

data race for plannerdata

The PlannerManager runs each sub modules inside the critical section of BehaviorPathPlannerNode::planner_data_, and the onTimer and onFreespaceParkingTimer belong to callback groups which are dependent on the callback group of bahavior_path_planner::run, so the access to GoalPlannerModule::planner_data_ from those timer callbacks are not thread-safe. What I found out is that when the onTimer tries to access planner_data_->dynamic_objects, it could be modified by the msg callback planner_data_->dynamic_objects = msg at the same time because they are actually same shared_ptr object and GoalPlannerModule/behavior_path_planner does not share the mutex to it.

You may think "how could that be ? We are using shared_ptr to share the resource across the threads, and if the consumer thread's access to shared_ptr::dynamic_objects coincides with producer thread's update to shared_ptr::dynamic_objects, then the latter one just repoints to new msg while keeping the former pointer, right ?"

One thing to note is that when behavior_path_planner calls PlannerManager::run() it passes its plannerdata to each submodules via PlannerManager::setData(). As you can learn from its implementation, this copy assignment increments the reference count of shared plannerdata but its internal shared_ptrs themselves are not copied. You can check that in the following snippet.

https://wandbox.org/permlink/Yxz8dNN3URSPxgbY

This means that although GoalPlannerModule::planner_data_ and behavior_path_planner::planner_data_ share the same resource, the PlannerData instance on the heap and its internal member shared_ptr variables are not shared and their reference count is still 1. So GoalPlannerModule::planner_data_.dynamic_objects and behavior_path_planner::planner_data_.dynamic_objects is identical object.

As the C++ standard says Read and Write access to different shared_ptr variables are thread-safe even if they point to same resource. In the following example p1 and p2 are different shared_ptr variables so we can freely manipulate them without mutex (We do not need to care about the use_count() because it is atomically added/subtracted).

shared_ptr<int> p1 = make_shared<int>();
auto p2 = p1;

// now p1.use_count() == p2.use_count() == 2

thread th1([&](){
shared_ptr<int> msg = make_shared<int>();
p1 = msg; // p1 is written
// now p1.use_count() 1
});

thread th2([&](){
cout << *p2 << endl; // p2 is read
// p2.use_count() == 1
});

This does not hold for our case because we are actually accessing to the same shared_ptr variable.

For the plannerdata variable, the goal_planner and start_planner have to copy the PlannerData inside them to allow their custom timer thread access to it in thread-safe manner. Also we need to pay attention to the route_handler, because it also holds shared_ptr variable in it.

GoalPlannerModule::run() {
   {
       // copy latest planner_data_ to my_planner_data for the onTimer to use the latest information
       // also we copy the planner_data as value to increment the reference count as a new owner
       lock_guard<mutex> guard(goal_planner_mutex);
       my_planner_data = *planner_data_; // copy as value, not shared_ptr
       my_planner_data.route_handler = make_shared<RouteHandler>(*planner_data_->route_handler);
   }
  // then we do GoalPlanner stuff
  // in the following we should use planner_data_
}

GoalPlannerManager::onTimer() {
   PlannerData local_planner_data;
   {
   // here we copy my_planner_data to an another local data to minimize critical section
   lock_guard<mutex> guard(goal_planner_mutex);
   local_planner_data = my_planner_data;
   local_planner_data.route_handler = make_shared<RouteHandler>(my_planner_data.route_handler);
   }
   // then do onTimer stuff using local_planner_data
}

The onTimer() is called in MutuallyExclusive manner so we will not have more than 2 onTimer() callbacks running, so we do not need to be worried about the case where multiple local_planner_data are created, waiting for the timer to exit and owing huge size messages for a long time.

Also we should explicitly copy the internal shared_ptr variables and route_handler in setPlannerData to pass a thread-safe planner_data in the future.

data race for getCurrentStatus(), getPreviousModuleOutput()

The value of getCurrentStatus(), getPreviousModuleOutput() is updated through setPreviousModuleOutput and updateCurrentStatus in the callback group of bahavior_path_planner::run so the access from onTimer causes data race. We should put these variables into the planner_data as well.

data race for occupancy_gridmap, last_previous_moduleoutput, prevdata, parameters_

minimize critical sections

Currently GoalPlannerModule has too many access to ThreadSafeData -- about 30-40 critical sections at least and maybe more than 100 if we count the net number of calls per one cycle. To be honest this is not a good design. I will post another refactoring roadmap.

Expected behavior

By compiling with -fsanitize=thread option and installing libtsan0-<arch> we can detect data race at runtime if it happened. I think it is almost technically impossible to statically detect data race because behavior_path_planner modules are divided into different shared libraries, possibly preventing static tools to trace data paths.

I'll check if this option does not tell false positive data races from external libraries including rclcpp.

Actual behavior

Sometimes behavior_path_planner dies from goal_planner logic.

Steps to reproduce

There is no way to reproduce.

Versions

No response

Possible causes

No response

Additional context

No response

soblin commented 7 months ago

Here is the code to mock the data race before/after the fix

sudo apt-get install libtsan0-amd64-cross
g++ -fsanitize=thread tsan.cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>
#include <optional>
#include <thread>

using namespace std;

class PlannerData
{
public:
  PlannerData(const shared_ptr<int> ptr) : odometry(ptr) {}
  shared_ptr<int> odometry{nullptr};
};

int current()
{
  shared_ptr<PlannerData> planner_data = make_shared<PlannerData>(make_shared<int>(0));
  mutex planner_data_mutex;

  shared_ptr<PlannerData> my_planner_data = make_shared<PlannerData>(make_shared<int>(0));
  mutex my_planner_data_mutex;

  const int sim_duration = 10000;  // 10000ms = 10s
  int i = 0;
  const int tick = 10;
  thread odometryCallback([&]() {
    if (i > sim_duration / tick) {
      return;
    }
    {
      lock_guard<mutex> guard(planner_data_mutex);
      planner_data->odometry = make_shared<int>(i);
      ++i;
    }
    // 0.01秒ごとにplanner_data->odometryをインクリメントする(自己位置の更新の感じ)
    this_thread::sleep_for(chrono::milliseconds(tick));
  });

  int j = 0;
  const int tock = 500;
  thread onTimer([&]() {
    if (j > sim_duration / tock) {
      return;
    }
    // mutexなしで直接アクセスする
    // 自己位置を使った計算の模擬
    cout << *(planner_data->odometry) << endl;
    ++j;
    // 0.5秒ごとに読み出し
    this_thread::sleep_for(chrono::milliseconds(tock));
  });
  odometryCallback.join();
  onTimer.join();

  return 0;
}

int fixed()
{
  // これはOKっぽい?
  shared_ptr<PlannerData> planner_data = make_shared<PlannerData>(make_shared<int>(0));
  mutex planner_data_mutex;

  PlannerData my_planner_data(make_shared<int>(0));
  mutex my_planner_data_mutex;

  const int sim_duration = 20;  // 10s

  int i1 = 0;
  const double tick = 0.01;
  thread odometryCallback([&]() {
    while (i1 < sim_duration / tick) {
      {
        const auto msg = make_shared<int>(i1);
        lock_guard<mutex> guard(planner_data_mutex);
        planner_data->odometry = msg;
      }
      // 0.01秒ごとにplanner_data->odometryをインクリメントする(自己位置の更新の感じ)
      i1 += 1;
      this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * tick)));
    }
  });

  int i2 = 0;
  const double tack = 0.1;
  thread onGoalPlannerMain([&]() {
    while (i2 < sim_duration / tack) {
      {
        // 10Hzで動作
        // ここはPlannerManagerのスコープということで
        lock_guard<mutex> guard(planner_data_mutex);
        {
          // GoalPlannerのスコープ
          {
            // my_planner_dataをコピー
            lock_guard<mutex> my_guard(my_planner_data_mutex);
            my_planner_data = *planner_data;
          }
          // GoalPlannerのメイン処理はplanner_dataを使えば良い
          // 1/5でサブ処理
          if (planner_data->odometry) {
            const int a = *(planner_data->odometry);
          }
          this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * tack / 5.0)));
        }
        // 4/5で残りの処理
        i2 += 1;
        this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * 4.0 * tack / 5.0)));
      }
    }
  });

  int i3 = 0;
  const double tock = 0.2;  // 0.5秒ごとに動作
  thread onTimer([&]() {
    while (i3 < sim_duration / tock) {
      optional<PlannerData> local_planner_data{nullopt};
      {
        lock_guard<mutex> guard(my_planner_data_mutex);
        local_planner_data = my_planner_data;
      }
      if (local_planner_data) {
        cout << *(local_planner_data.value().odometry) << endl;
      }
      i3 += 1;
      this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * tock)));
    }
  });

  odometryCallback.join();
  onGoalPlannerMain.join();
  onTimer.join();

  return 0;
}

int main()
{
  fixed();
  return 0;
}
soblin commented 7 months ago

NOTE

soblin commented 7 months ago

Actually occupancy_gridmap ptr is thread safe, but it is no longer used in PullOverPlannerBase and its derived classes, so we can removed those members.

For goalsearcher, its only functionality is to generate sorted goals, but once they are obtained we can pass them via ThreadSafeData. So we can replace the access to goal_searcher in the thread by thread_safedata/

idorobotics commented 7 months ago

@soblin can this issue be closed as completed?

meliketanrikulu commented 6 months ago

Thanks for your contribution @soblin . Is there anything missing regarding this issue? Can we close this ?

soblin commented 6 months ago

We can close this issue. I will work on refactoring in the future PRs

soblin commented 6 months ago

We can close this issue. I will work on refactoring in the future PRs