kyoichi-sugahara opened this issue 1 year ago
Thanks for the issue!
Similar problems occurred with both the start and goal planners. The logs are a little different, but I think it is the same kind of bug.
It seems to be caused by the fact that open_list_ sometimes contains 0x0.
@NorahXiong Hello, and sorry for the sudden mention. I haven't been able to identify the cause yet, but is there a possibility that this issue is influenced by this PR? I apologize in advance if the change turns out to be unrelated.
I tried many times, but the crash never happened and no clue was found in the related code. Is there any special step not mentioned in the reproduction steps?
> Thanks for the issue! Similar problems occurred with both the start and goal planners. The logs are a little different, but I think it is the same kind of bug. It seems to be caused by the fact that open_list_ sometimes contains 0x0.
I think the 0x0 elements are in the underlying wrapping data structure (vector) rather than in the queue. It would not be very likely that the empty pointers are pushed into the queue as the pointers have all been visited before being pushed.
@NorahXiong Thank you so much for the response. I tried to reproduce the problem again and succeeded (it's very difficult to reproduce, though...). The reproducibility is not perfect. The situation is:
@NorahXiong Thanks for the reply and for trying! In my case, I was able to reproduce it by setting the goal many times.
This pull request has been automatically marked as stale because it has not had recent activity.
@NorahXiong We have not been able to fix this issue. We are sorry to bother you, but could you please try to reproduce it? :pray:
@VRichardJP I think you are familiar with memory, so if you could advise me on this I would appreciate it. :bow:
It looks like a concurrency issue, as the program seems to crash at different places:
Astar::search()
~Astar()
Astar::search()
Did you try to run the program with valgrind? e.g. with launch-prefix="gnome-terminal -- valgrind --tool=memcheck --leak-check=yes"
@VRichardJP thanks for the advice. We did not use valgrind so we will try!
I am not sure I totally understand how the modules are running, but the StartPlannerModule creates a new callback queue for the FreespacePullOut module here:
Is my understanding correct?
- the behavior path planner manager creates/destroys planning modules at runtime depending on the situation
- the manager calls the modules' run() function one after the other
Obviously, both don't happen at the same time. Then, what happens to the callback created by FreespacePullOut? I am not sure what the behavior is when the default callback queue is used. Maybe the module shares the same callback queue as the manager. In that case the timed callbacks are mutually exclusive, and the freespace pull out timed callback cannot run at the same time as the manager.
But here, the callback is running in a different queue, so you may have one thread running inside the FreespacePullOut timed callback while another thread is in the manager trying to destroy the module (or modifying some data required by the module).
For instance, what happens if you put a sleep right before the planFreespacePath() line here:
I guess it will crash right away.
@VRichardJP Thank you very much for your very detailed look!!
> Is my understanding correct?
> ・the behavior path planner manager creates/destroys planning modules at runtime depending on the situation
> ・the manager calls the modules' run() function one after the other. Obviously, both don't happen at the same time.
Yes, your understanding is correct.
The manager deletes or std::moves modules depending on the situation. For example, when a new route is received, the module instance is cleared. https://github.com/autowarefoundation/autoware.universe/blob/feat/avoidance_pull_over/planning/behavior_path_planner/src/behavior_path_planner_node.cpp#L379
If FreespacePullOut is running in a separate thread at that moment, its data could be rewritten out from under it, causing the crash.
So I feel that locking out the manager's clear while FreespacePullOut's callback is running, or running FreespacePullOut as a separate instance (building a server), etc., might be a solution.
> For instance, what happens if you put a sleep right before the planFreespacePath() line here: autoware.universe/planning/behavior_path_start_planner_module/src/start_planner_module.cpp, lines 99 to 101 in 2252226
>
> if (isStuck() && is_new_costmap) {
>   planFreespacePath();
> }
>
> I guess it will crash right away.
I would like to confirm this as well: the idea is that it dies during the planFreespacePath process, and the sleep goes right before that process, i.e. the intention is to generate a time delay so that clearing is more likely to occur during planFreespacePath. Is that right? Specifically, should I attempt the same reproduction by doing the following?
if (isStuck() && is_new_costmap) {
  std::this_thread::sleep_for(std::chrono::seconds(10));
  planFreespacePath();
}
@kosuke55
Yes, if the issue is what I think it is, then while you have one thread sleeping before planFreespacePath(), the behavior path planner manager will continue doing its work:
if (isStuck() && is_new_costmap) {
  std::this_thread::sleep_for(std::chrono::seconds(10));
  planFreespacePath();
}
In particular, if you reset the goal in that 10s window, the freespace object (or the things it is referring to) is likely to be destroyed/moved, and I guess you will get some sort of segmentation fault.
> @NorahXiong We have not been able to fix this issue. We are sorry to bother you, but could you please try to reproduce it? 🙏
Have you found out the reason for the segmentation fault? I'm sorry; I can try it again later if you still need.
@NorahXiong Sorry for the delay. No we have not been able to proceed with any analysis yet.
@kosuke55 @kyoichi-sugahara I tried again but still no segmentation fault occurred. Here's the video link. Any suggestions to help me reproduce the bug?
@NorahXiong Thank you very much for trying again. The ego vehicle needs to be in a parking_lot to run FreespacePullOver() (and preferably also in a lane). The parking_lot is the light yellow area, and the red rectangle is an example of the ego position.
@NorahXiong Oh, sorry, currently the ego has to be close enough to the goal to execute the goal planner, and the braking distance determines that threshold. If the parameter below is made large enough, it will be triggered regardless of the braking distance:
minimum_request_length: 100.0
And the goal needs to be placed in the road_shoulder.
@NorahXiong
As @VRichardJP indicated, adding the sleep could easily produce a crash:
if (isStuck() && is_new_costmap && needPathUpdate(path_update_duration)) {
  std::this_thread::sleep_for(std::chrono::seconds(10));
  planFreespacePath();
}
https://github.com/autowarefoundation/autoware.universe/pull/6322 may fix the issue; we will test more.
@kosuke55 I followed your steps by
Here is some information that may help you confirm the cause:
Env Info:
Checklist
Description
While executing the goal_planner, the program crashes due to a segmentation fault. Based on the stack trace, the issue seems to arise when a std::unordered_map holding values of type freespace_planning_algorithms::AstarNode is being deallocated.
goal_planner_issue_5154.webm
Expected behavior
The intended behavior is for the nodes to remain alive, and for freespace_planning_algorithms::AstarNode to successfully generate a path to the goal and reach it without issues.
Actual behavior
The crash is not reproducible 100% of the time when generating paths using freespace_planning_algorithms, but after several repetitions the node eventually crashes.
Here is the stack trace:
Steps to reproduce
Please use attached lanelet map virtual_G_dev_road_shoulder.zip
Possible causes
The mutex in the goal_planner, which runs across multiple threads, may not be operating correctly.

Additional context
This issue results in an occasional crash of the node, affecting the reliability of the path planning process, and thereby requires prompt attention and resolution.