Closed soblin closed 6 months ago
Here is the code to mock the data race before/after the fix
sudo apt-get install libtsan0-amd64-cross
g++ -fsanitize=thread tsan.cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>
#include <optional>
#include <thread>
using namespace std;
class PlannerData
{
public:
PlannerData(const shared_ptr<int> ptr) : odometry(ptr) {}
shared_ptr<int> odometry{nullptr};
};
int current()
{
shared_ptr<PlannerData> planner_data = make_shared<PlannerData>(make_shared<int>(0));
mutex planner_data_mutex;
shared_ptr<PlannerData> my_planner_data = make_shared<PlannerData>(make_shared<int>(0));
mutex my_planner_data_mutex;
const int sim_duration = 10000; // 10000ms = 10s
int i = 0;
const int tick = 10;
thread odometryCallback([&]() {
if (i > sim_duration / tick) {
return;
}
{
lock_guard<mutex> guard(planner_data_mutex);
planner_data->odometry = make_shared<int>(i);
++i;
}
// 0.01秒ごとにplanner_data->odometryをインクリメントする(自己位置の更新の感じ)
this_thread::sleep_for(chrono::milliseconds(tick));
});
int j = 0;
const int tock = 500;
thread onTimer([&]() {
if (j > sim_duration / tock) {
return;
}
// mutexなしで直接アクセスする
// 自己位置を使った計算の模擬
cout << *(planner_data->odometry) << endl;
++j;
// 0.5秒ごとに読み出し
this_thread::sleep_for(chrono::milliseconds(tock));
});
odometryCallback.join();
onTimer.join();
return 0;
}
int fixed()
{
// これはOKっぽい?
shared_ptr<PlannerData> planner_data = make_shared<PlannerData>(make_shared<int>(0));
mutex planner_data_mutex;
PlannerData my_planner_data(make_shared<int>(0));
mutex my_planner_data_mutex;
const int sim_duration = 20; // 10s
int i1 = 0;
const double tick = 0.01;
thread odometryCallback([&]() {
while (i1 < sim_duration / tick) {
{
const auto msg = make_shared<int>(i1);
lock_guard<mutex> guard(planner_data_mutex);
planner_data->odometry = msg;
}
// 0.01秒ごとにplanner_data->odometryをインクリメントする(自己位置の更新の感じ)
i1 += 1;
this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * tick)));
}
});
int i2 = 0;
const double tack = 0.1;
thread onGoalPlannerMain([&]() {
while (i2 < sim_duration / tack) {
{
// 10Hzで動作
// ここはPlannerManagerのスコープということで
lock_guard<mutex> guard(planner_data_mutex);
{
// GoalPlannerのスコープ
{
// my_planner_dataをコピー
lock_guard<mutex> my_guard(my_planner_data_mutex);
my_planner_data = *planner_data;
}
// GoalPlannerのメイン処理はplanner_dataを使えば良い
// 1/5でサブ処理
if (planner_data->odometry) {
const int a = *(planner_data->odometry);
}
this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * tack / 5.0)));
}
// 4/5で残りの処理
i2 += 1;
this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * 4.0 * tack / 5.0)));
}
}
});
int i3 = 0;
const double tock = 0.2; // 0.5秒ごとに動作
thread onTimer([&]() {
while (i3 < sim_duration / tock) {
optional<PlannerData> local_planner_data{nullopt};
{
lock_guard<mutex> guard(my_planner_data_mutex);
local_planner_data = my_planner_data;
}
if (local_planner_data) {
cout << *(local_planner_data.value().odometry) << endl;
}
i3 += 1;
this_thread::sleep_for(chrono::milliseconds(static_cast<int>(1000 * tock)));
}
});
odometryCallback.join();
onGoalPlannerMain.join();
onTimer.join();
return 0;
}
int main()
{
fixed();
return 0;
}
NOTE
Actually occupancy_gridmap ptr is thread safe, but it is no longer used in PullOverPlannerBase and its derived classes, so we can removed those members.
For goalsearcher, its only functionality is to generate sorted goals, but once they are obtained we can pass them via ThreadSafeData. So we can replace the access to goal_searcher in the thread by thread_safedata/
@soblin can this issue be closed as completed?
Thanks for your contribution @soblin . Is there anything missing regarding this issue? Can we close this ?
We can close this issue. I will work on refactoring in the future PRs
We can close this issue. I will work on refactoring in the future PRs
Checklist
Description
7 non-thread-safe variables
data race for plannerdata
The PlannerManager runs each sub modules inside the critical section of
BehaviorPathPlannerNode::planner_data_
, and theonTimer
andonFreespaceParkingTimer
belong to callback groups which are dependent on the callback group ofbahavior_path_planner::run
, so the access toGoalPlannerModule::planner_data_
from those timer callbacks are not thread-safe. What I found out is that when the onTimer tries to accessplanner_data_->dynamic_objects
, it could be modified by the msg callbackplanner_data_->dynamic_objects = msg
at the same time because they are actually same shared_ptr object and GoalPlannerModule/behavior_path_planner does not share the mutex to it.You may think "how could that be ? We are using shared_ptr to share the resource across the threads, and if the consumer thread's access to shared_ptr::dynamic_objects coincides with producer thread's update to shared_ptr::dynamic_objects, then the latter one just repoints to new msg while keeping the former pointer, right ?"
One thing to note is that when behavior_path_planner calls
PlannerManager::run()
it passes its plannerdata to each submodules viaPlannerManager::setData()
. As you can learn from its implementation, this copy assignment increments the reference count of shared plannerdata but its internal shared_ptrs themselves are not copied. You can check that in the following snippet.https://wandbox.org/permlink/Yxz8dNN3URSPxgbY
This means that although
GoalPlannerModule::planner_data_
andbehavior_path_planner::planner_data_
share the same resource, the PlannerData instance on the heap and its internal member shared_ptr variables are not shared and their reference count is still 1. SoGoalPlannerModule::planner_data_.dynamic_objects
andbehavior_path_planner::planner_data_.dynamic_objects
is identical object.As the C++ standard says Read and Write access to different shared_ptr variables are thread-safe even if they point to same resource. In the following example
p1
andp2
are different shared_ptr variables so we can freely manipulate them without mutex (We do not need to care about the use_count() because it is atomically added/subtracted).This does not hold for our case because we are actually accessing to the same shared_ptr variable.
For the plannerdata variable, the goal_planner and start_planner have to copy the PlannerData inside them to allow their custom timer thread access to it in thread-safe manner. Also we need to pay attention to the route_handler, because it also holds shared_ptr variable in it.
The onTimer() is called in
MutuallyExclusive
manner so we will not have more than 2 onTimer() callbacks running, so we do not need to be worried about the case where multiplelocal_planner_data
are created, waiting for the timer to exit and owing huge size messages for a long time.Also we should explicitly copy the internal shared_ptr variables and route_handler in setPlannerData to pass a thread-safe planner_data in the future.
data race for getCurrentStatus(), getPreviousModuleOutput()
The value of getCurrentStatus(), getPreviousModuleOutput() is updated through setPreviousModuleOutput and updateCurrentStatus in the callback group of
bahavior_path_planner::run
so the access from onTimer causes data race. We should put these variables into the planner_data as well.data race for occupancy_gridmap, last_previous_moduleoutput, prevdata, parameters_
bahavior_path_planner::run
through GoalPlannerManager::updateData so the access to it from GoalPlannerModule::onFreespaceParkingTimer through GoalPlannerModule::isStuck through GoalPlanneManagerModule::checkOccupancyGridCollision is not thread safeminimize critical sections
Currently GoalPlannerModule has too many access to ThreadSafeData -- about 30-40 critical sections at least and maybe more than 100 if we count the net number of calls per one cycle. To be honest this is not a good design. I will post another refactoring roadmap.
Expected behavior
By compiling with
-fsanitize=thread
option and installinglibtsan0-<arch>
we can detect data race at runtime if it happened. I think it is almost technically impossible to statically detect data race because behavior_path_planner modules are divided into different shared libraries, possibly preventing static tools to trace data paths.I'll check if this option does not tell false positive data races from external libraries including rclcpp.
Actual behavior
Sometimes behavior_path_planner dies from goal_planner logic.
Steps to reproduce
There is no way to reproduce.
Versions
No response
Possible causes
No response
Additional context
No response