Localization performance get worst over time

SteveMacenski / slam_toolbox

Slam Toolbox for lifelong mapping and localization in potentially massive maps with ROS

GNU Lesser General Public License v2.1

1.67k stars 525 forks source link

Localization performance get worst over time #334

Open kenza-ifollow opened 3 years ago

kenza-ifollow commented 3 years ago

Required Info:

Operating System:
- Ubuntu 18.04.5 LTS
Installation type:
- from source : slam_toolbox melodic-devel
ROS Version
- ROS1
Version or commit hash:
- 5a8c76f (release 2.3.0)
Laser unit:
- Sick nanoScan 3

Steps to reproduce issue

roslaunch slam_toolbox localization.launch with a given map and localize the robot using /initialpose
run slam_toolbox's localization on the same path, back and forth. The path is three long corridors in a "Z" shape. The total length of these corridors is about 130 m. The speed of the robot used is under 1 m/s and the environment is dynamic (a type of warehouse with moving objects, people, etc...)
run the localization in the environment, which is pretty dynamic

Expected behavior

Good localization with uniform performances over time

Actual behavior

Localization is really good during ~ 20 minutes and then it becomes impossible to re-localize the robot using /initialpose. If the slam_toolbox's node is killed and then relaunch at the same place with no other changes, re-localization is possible constant_deloc

Additional information

Several leads have been investigated to try to solve this problem :

Memory consumption : slam_toolbox's memory usage has been monitored along the way but the usage seems coherent and stable (around 1% of the total memory capacity). The size of the posegraph is also correct (around 800 points) and does not looks like it is increasing
CPU consumption : slam_toolbox's cpu usage has been monitored along the way. Noticeable peaks have been noticed in the most changed areas (with respect to the .data .posegraph map), around 4 cores are used during 10/15 seconds.
Mapping process : mapping has been done by passing several times in the same corridors and the passages are overlapping (a lot of points are really close to each others). Could it be possible that this has an impact on the localization's performances ?

The configuration used is :
```
Plugin params
solver_plugin: solver_plugins::CeresSolver
ceres_linear_solver: SPARSE_NORMAL_CHOLESKY
ceres_preconditioner: SCHUR_JACOBI
ceres_trust_strategy: LEVENBERG_MARQUARDT
ceres_dogleg_type: TRADITIONAL_DOGLEG
ceres_loss_function: None
```

ROS Parameters odom_frame: odom map_frame: map base_frame: base_footprint scan_topic: /scan mode: localization

debug_logging: false throttle_scans: 1 transform_publish_period: 0.02 #if 0 never publishes odometry map_update_interval: 5.0 resolution: 0.05 max_laser_range: 40.0 #for rastering images minimum_time_interval: 0.5 transform_timeout: 0.2 tf_buffer_duration: 30. stack_size_to_use: 40000000 #// program needs a larger stack size to serialize large maps

General Parameters use_scan_matching: true use_scan_barycenter: true minimum_travel_distance: 0.2 minimum_travel_heading: 0.1 scan_buffer_size: 10 scan_buffer_maximum_scan_distance: 10 link_match_minimum_response_fine: 0.1 link_scan_maximum_distance: 1.5 do_loop_closing: true loop_match_minimum_chain_size: 3 loop_match_maximum_variance_coarse: 3.0 loop_match_minimum_response_coarse: 0.35 loop_match_minimum_response_fine: 0.45

Correlation Parameters - Correlation Parameters correlation_search_space_dimension: 0.8 correlation_search_space_resolution: 0.05 correlation_search_space_smear_deviation: 0.3

Correlation Parameters - Loop Closure Parameters loop_search_space_dimension: 7.0 loop_search_space_resolution: 0.05 loop_search_space_smear_deviation: 0.3 loop_search_maximum_distance: 7.0

Scan Matcher Parameters distance_variance_penalty: 0.5 angle_variance_penalty: 1.0

fine_search_angle_offset: 0.00349 coarse_search_angle_offset: 0.43 coarse_angle_resolution: 0.0349 minimum_angle_penalty: 0.9 minimum_distance_penalty: 0.5 use_response_expansion: true

SteveMacenski commented 3 years ago

Have you printed to see if the initial pose message is being received and processed? You say after 20 minutes it stops localizing as well, but does it recover if you keep moving around? If it doesn't recover, we should look into it more. If it does, then that's potentially part of the technique if you ran into a patch of significant odometry drift and the local window was too large so it took a few updates to clear out that bad data chain. I won't say that this is an end-all-be-all solution suited for every person. This really does work best with good, regular odometry where you don't expect substantial slippage and you're just dealing with typical drift.

Unlike an MCL, there's no particle filter in the backend (its similar to an EKF as the optimizer backend) - it has to let the bad data run its course through the system if it was so bad the matcher couldn't account for it with the given settings.

kenza-ifollow commented 3 years ago

@SteveMacenski , first of all, thank you very much for your answer and sharing this repository. Here are some observations we have made on this issue :

In the situation that I have described above, /initialpose was being processed (apparition of the LocalizePoseCallback logs) and moving around would not help with relocalization.
Based on your suggestion, we have tried to localize the robot on the same path with a better odometry (without remapping though) and have witnessed the same issue.
We have also that by disabling loop closure, the robot relies on odometry only to localize itself and in this case, we do not have this issue anymore.
In addition, we have seen that you push a new feature on melodic to clear the localization buffer while giving a /initialpose and we have tried it, without any improvement on this issue.

We have investigated our code and realized that the problem might come from the fact that the posegraph has been shifted along the path. Indeed, during the past months, we had the same issue as the one described here : because Karto through loop closures slightly modify the posegraph which might not be aligned with the one which is uploaded when we launch the localization node. Therefore , the map would get distorded and would not overlap anymore with our static navigation path (with respect to the world/map frame). Because of that, the robot would follow a wrong path in reality while localized in rviz. To temporary solve this, we disabled the map's update (because it was indicated that the update was only for visualization purposes) but did not take into account that the graph was moving as well as the map.

--> The issue that we see on the image above is therefore the result of the map being shifted with respect to the posegraph : on the non-updated map, the robot is not localized but in reality it is well localized with respect to the updated posegraph.

You said that this could be a result of a bad odometry. Our odometry is acceptable but our path is more than 100 m long with no loop closure possible which affect the quality of the map. Do you have any empirical observations to improve the mapping ? Do you confirm that our problem could come from that ? If so, do you see any leads to prevent the modification of the posegraph overtime during localization ? (we would like the posegraph to stay just as it is when we launch slam_toolbox's node) Thank you again for your time !

SteveMacenski commented 3 years ago

We have also that by disabling loop closure, the robot relies on odometry only to localize itself and in this case, we do not have this issue anymore.

Huh, that is interesting. Does this result in any worse localization performance? That could be a very real solution to this issue since localization is just a rolling chain of local scans - if you do not see downsides. If you're only keeping some very local set of localization scans, there's no reason to loop close within the localization rolling chain. Loop closure against the localization chain and the base map... I do not believe... has any significant value to the localization mode. You're still scan matching against the map to correct drift and I could potentially see how updating the pose-graph many times and then removing the constraints could push the optimization problem into another local minima, showing "drift". I'm not sure I see any upside of enabling loop closure in localization mode.

We have investigated our code and realized that the problem might come from the fact that the posegraph has been shifted along the path.

That's odd, this is supposed to lock the pose / orientation of the graph in place https://github.com/SteveMacenski/slam_toolbox/blob/melodic-devel/slam_toolbox/solvers/ceres_solver.cpp#L183 since that first node cannot move. That should be preventing drift. I'm now wondering if for some reason that isn't called on deserialization in localization mode, but I'm almost certain that I validated that. So perhaps that's not doing "enough"?

To temporary solve this, we disabled the map's update (because it was indicated that the update was only for visualization purposes) but did not take into account that the graph was moving as well as the map.

Can you explain how this solved your problem? How does not publishing /map mess with loop closure drift in the pose-graph? I understand how loop-closures would resolve it. In localization mode, removing loop closures actually seems like a very reasonable action to take.

You said that this could be a result of a bad odometry.

I think you've done enough debugging and the visualizations you provided make me think that's not the case. Disregard those comments.

do you see any leads to prevent the modification of the posegraph overtime during localization ?

I think you should try checking that that Ceres first node doesn't move. If it does, we definitely have a problem to look into. I'm also trying to think about ways we can verify what's drifting (is it the pose-graph or the occupancy grid, or what). Have you found some evidence pointing you specifically to the pose-graph? I think that's also likely the case, but I want to verify before going down that rabbit hole. If you had a map, I'd try publishing the pose of some nodes far away from the robot (such that it wouldn't be effected by local updates) and publish the XY pose of it and see how / if it drifts with the map to show that the entire pose-graph is drifting. Though from what I talked about above, I could see it happening from a local minima issue.

SteveMacenski commented 3 years ago

@kenza-ifollow @cblesing any follow up here? Is that a reasonable / functional work around? I can add some docs on this if so

kenza-ifollow commented 3 years ago

Hi @SteveMacenski,

Indeed, we went on the field last week to gather more information about the issue raised here.

We could indeed see a visible drift of the posegraph, as seen on this images below : at_launch_posegraph_overlap_voronoi

Here we can see that at launch, the posegraph (in red) is overlapping with the navigation graph (in blue) that is suppose to be followed by the robot during navigation. This graph is fixed in the world frame so it cannot move during navigation.

Here is the same place, at the end of the path of the robot : at_end_no_overlapping

We can see that the graph has shifted about 40 cm down. I've tried to understand why it would move like this so I followed the movements of the posegraph during the path of the robot, with the help of the debug logs of slam_toolbox. I've noticed that the posegraph would shift when the robot abruptly relocalized by slam_toolbox, as we can see on the image below (it is given with the logs at the same time) :

solver_jump_beginning_corridor

So I've linked this issue to the solver issued and I went back to the slam_toolbox's issues and I found the one about the Huberloss function. There is no glass in this environment but still, changing to Huberloss function helped with this issue as you can this on this image, representing the end of the same path (from rosbag) using Huberloss function :

with_huberloss_fct

On this rosbag, it worked perfectly but on others, in which the robot got slightly delocalized again, it still got shifted.

I did not really have time to do further tests, but I've still checked that the first node is indeed fixed which is the case.

Also, As mentioned in this recent issue, I also had recurrent "mutex lock failed" crashes, that I once thought was liked to available memory on the robot (get less occurrence when disk was far from being full) but also happened when the memory was half full at another time so this is not concluant.

I had this problem when I've set size_buffer_scan above 10 in localization but it is not impossible that I've set size_buffer_scan > 10 while mapping, I will try which would explain the crashes, I will try to remap and retest slam_toolbox in this environment to see if I still notice crashes.

I should have more time to do additional test in the following weeks and would update you about this situation

SteveMacenski commented 3 years ago

changing to Huberloss function helped with this issue as you can this on this image, representing the end of the same path (from rosbag) using Huberloss function

This tells me that this is probably a local minima issue and that this loss function is better at removing the optimization problem's instability around some solution. A change in loss function can do wonders for removing the symptoms of a problem (until it grows to a point it can't since it looks like a legit source of pose error in the system) but its not a solution in it of itself for anything.

Did you try setting the loop closures to false like we talked about above? Did that fully solve the issue? Any side effects? If the optimizer never triggers, then this would never be a problem. This is obviously not a real solution for SLAM since you NEED those, but for pure localization, I don't actually foresee any downside. If you don't see any downside, I can document this feature of localization mode with loop closure and change the default localization parameters and we're done with this issue.

I had this problem when I've set size_buffer_scan above 10 in localization but it is not impossible that I've set size_buffer_scan > 10 while mapping, I will try which would explain the crashes, I will try to remap and retest slam_toolbox in this environment to see if I still notice crashes.

Yeah sorry about that, if you leave it at 10, you'll see the same behavior you were getting before (since that parameter wasn't being read in). We're working on it in the other ticket, but its unrelated to this issue.

PGotzmann commented 3 years ago

Just a short side note to avoid confusion here:

Yeah sorry about that, if you leave it at 10, you'll see the same behavior you were getting before (since that parameter wasn't being read in). We're working on it in the other ticket, but its unrelated to this issue.

If I understand this correctly @kenza-ifollow opened this issue for eloquent so he should not have been effected by the ROS2 read parameter issue (#331) you mentioned here. Never the less using a scan_buffer_size in localization mode greater or equal than the one used for mapping seems to avoid the "mutex lock failed" issue (#355), but let's keep the discussion separated :wink:

kenza-ifollow commented 3 years ago

@PGotzmann yes indeed I am not using ROS2 so the read parameter issue should not affect me and is not related to this issue :)

@SteveMacenski

Did you try setting the loop closures to false like we talked about above? Did that fully solve the issue? Any side effects?

I did try without loop closing in localization mode and I could not notice any shift in the posegraph but the localization was really bad, as if it was completely relying on the odometry and never trying to correct the pose of the robot.

Normally I would be able to record a bag of this behavior next week

SteveMacenski commented 3 years ago

@PGotzmann yes indeed I am not using ROS2 so the read parameter issue should not affect me and is not related to this issue :)

But if you used 10 for mapping, you'll still have the exact same issue as #355 in ROS1. So its really "dont use anything for scan_buffer_size less than that of which you mapped" for all versions until we figure it out and backport a solution. It was just exposed in ROS2 by that parameter update ticket. We can continue discussion over there, but lets not mix these.

Without loop closure was bad? I would have expected that without scan matching enabled, but I don't know why without loop closure that would happen. Didn't you mention above what you tested that briefly before and got reasonable results?

SteveMacenski commented 3 years ago

Any word?

kenza-ifollow commented 3 years ago

Sorry for the long delay @SteveMacenski , we had a lot of other tests to do on the field.

To show you what happen when loop closure is disabled in localization, here are some captions of a tests we have made .

begin_bag At the beginning of the test, the robot is perfectly well localized (red arrow) as we can see on the picture above : the lidar'points cloud (purple points) are overlapping with the map.

The robot navigate at a pretty low speed (less than 1 m/s) for about 30 seconds and everything is still good : I've monitored the tfs between map and odom and it is still published still_good_bag

After a while, we can see that there is a small orientation problem, which is not getting corrected : lidars do not match the map anymore on the right things_getting_bad_bag

This is never getting corrected and less than a minute after the navigation has begun, we have a delocalization bad_behavior_bag

I did not get time to dig deeper in it but some leads :

In a first iteration, I have changed some mapping's parameters such as scan_buffer_size. Regarding our conversation I thought that it might be the problem so I've rerunned a mapping from a rosbag with new params to prevent pthread issues. Then, I've played the same rosbag with the old and the new map and I've noticed that the delocalization happen quicker in the first case
In some places, slam_toolbox's localization without loop closure works well (it seems like the position gets corrected) and it is areas in which I think that odometry is better : could it be that without loop closure, slam_toolbox can only correct position's drift in some extend ?

Regarding what I have tested that got reasonable results : it was changing ceres_loss_function loss function to Huberloss

Here is my config file for localization :

# Plugin params
solver_plugin: solver_plugins::CeresSolver
ceres_linear_solver: SPARSE_NORMAL_CHOLESKY
ceres_preconditioner: SCHUR_JACOBI
ceres_trust_strategy: LEVENBERG_MARQUARDT
ceres_dogleg_type: TRADITIONAL_DOGLEG
ceres_loss_function: None #HuberLoss

mode: localization #mapping

debug_logging: false
throttle_scans: 1
transform_publish_period: 0.02 #if 0 never publishes odometry
map_update_interval: 5.0
resolution: 0.05
max_laser_range: 45.0 #for rastering images
minimum_time_interval: 0.5
transform_timeout: 0.2
tf_buffer_duration: 30.
stack_size_to_use: 40000000 #// program needs a larger stack size to serialize large maps

# General Parameters
use_scan_matching: true
use_scan_barycenter: true
minimum_travel_distance: 0.2
minimum_travel_heading: 0.1
scan_buffer_size: 10
scan_buffer_maximum_scan_distance: 10
link_match_minimum_response_fine: 0.1
link_scan_maximum_distance: 1.5
do_loop_closing: false
loop_match_minimum_chain_size: 3
loop_match_maximum_variance_coarse: 3.0
loop_match_minimum_response_coarse: 0.35
loop_match_minimum_response_fine: 0.45

# Correlation Parameters - Correlation Parameters
correlation_search_space_dimension: 0.5
correlation_search_space_resolution: 0.01
correlation_search_space_smear_deviation: 0.1

# Correlation Parameters - Loop Closure Parameters
loop_search_space_dimension: 7.0
loop_search_space_resolution: 0.05
loop_search_space_smear_deviation: 0.03
loop_search_maximum_distance: 5.0

# Scan Matcher Parameters
distance_variance_penalty: 0.5
angle_variance_penalty: 1.0

fine_search_angle_offset: 0.00349
coarse_search_angle_offset: 0.43
coarse_angle_resolution: 0.0349
minimum_angle_penalty: 0.9
minimum_distance_penalty: 0.5
use_response_expansion: true

SteveMacenski commented 3 years ago

Got it - so perhaps a combination of the loop closure off + HuberLoss could help? I fully recognize whenever the loss function "fixes" an issue, that means there's another real issue to be fixed and that its just doing a good job at masking it. However, if the optimizer isn't running at all, there should be no reason any of the original graph nodes should move (you can inspect the code yourself to verify that claim) I do not believe. As such, I'm a little at a loss as to where that issue could be showing up, unless the do_loop_closure parameter wasn't properly disabling the optimizer. I do not believe there is anywhere else in the code where any object has the ability to change the positions of the nodes to cause the "pose-graph-drift" behavior that you've reported. Adding nodes to the graph wouldn't change the original node's properties at all unless an optimization run was executed.

Kaetera commented 3 years ago

Hi, I get the same behavior on my robot: the localization "drift over time" and slam_toolbox fails to re-localize it (typically we get 0.5m-1m offset after 20min of navigation in our office environment). The drift is firstly most visible on the heading.

Operating System: Ubuntu 20.04 LTS
Installation type: apt
ROS Version: ROS2 Foxy
Version or commit hash: 2.4.1-1focal.20211006.001625
Laser unit: Hokuyo UST-20LX

I have params do_loop_closing: false and ceres_loss_function: None.

One interesting thing I noted is that even if my robot is left still, in a purely static environment (no people nor stuff moving around), I get a progressive drift on the heading:

I expected the slam_toolbox to correct this drift. On the image above, I use pure laser odometry. There is then a small noise on the estimated velocities, in the order of magnitude of x: 0.00149702 m/s, y: -4.279126e-07 m/s, yaw: -0.0034228 rad/s.

I suspected the noise to have some effect, so I did the same test with pure wheel odometry, to have these velocities exactly at 0.0, without noise (since no pulses on the encoders). The behavior change a little: the location of the robot keep slightly jumping around the same point (that was not the case with pure laser odometry). At the beginning (t = 0 min), there is nearly no jump, but the jumps starts to be bigger and bigger as the time goes. The progressive increase of the heading offset seems to be at the same rate as the test with pure laser odometry. Like if the heading was still drifting but this time the slam_toolbox succeed in relocalize it, but the drift is not "reset" (after about 15 min in the image below):

2021_10_21pure_wheel_odom_jumps

tugbakara commented 1 year ago

To temporary solve this, we disabled the map's update (because it was indicated that the update was only for visualization purposes) but did not take into account that the graph was moving as well as the map.

which parameter did you change to achieve this?

NhatTanXT3 commented 11 months ago

I just want to add more info about the effect of disabling /map update on the localization mode. For my use case (ROS1 noetic branch):

When enabled, the updateMap() function uses 100% 1 core CPU at about 1 sec for my map size ~630 vertices. From the publishVisualizations loop, it will lock the main scan processing loop

So I implemented a dirty change of first_map_only feature that publishes /map one time only. Something like below

Kaetera commented 6 months ago

While investigating some TF timing issues on my system, I noticed that the process time took by addScan() increases progressively with time (timed with two get_clock()->now() before and after): Few sec after navigation start, process time is about 150ms After 30s, it is around 400ms After 1min, around 600ms After 1m30, around 1s (and then everything breaks on my system due to obsolete TF timestamps)

Just dropping this here, may be a clue. Like if something was accumulating somewhere. (nb: my system has changed a lot since my last post: more nodes which consume more CPU, higher resolution lidar, slam_toolbox v2.6.8 for Humble)

EDIT1: It's actually ProcessLocalization() from Karto that takes more and more time.

EDIT2: Ok my bad. It happened we had the param 'scan_buffer_size' set to the huge value of 50. Then as scans accumuate in the buffer, link chain processing took more and more time until the process time exceed the 'transform_timeout', hence my issue. May not be related to this original ticket.