Ros2 Robot Localization package problems at high frequencies

JoeFrancis7 commented 1 year ago

Hi, I am using the robot localization package for ROS 2 Galactic on the fix/galactic/load_parameters branch (I also tried the galactic-devel branch). I have encountered an issue with the running frequency. Initially, the frequencies in both config YAML files, for the ekf node and the navsat transform node, are set to 30 Hz. If I keep the frequencies the same or go lower, everything is fine.

The problem is that when I increase the frequency above 30 Hz, the filtered topics stop publishing altogether. Increasing gradually, I noticed that at 50 Hz, for example, the topics publish fine for a short while and then stop completely. At 80 Hz they publish for an even shorter time before stopping, and at 100 Hz they sometimes publish for only 1 second and most of the time do not publish at all.

I also tried setting the nodelay parameter for the inputs to true to disable Nagle’s algorithm, which had no effect.

My CPU usage is well below the limit.

I am playing back data (IMU, GPS, ...) from a ros2 bag db3 file in which all data was recorded at 100 Hz, and I can see it streaming at that same frequency.

My question is: is the ROS 2 robot localization package capable of running at 100 Hz? I am interested in running it at 100 Hz or higher using ROS 2 Galactic.

ayrton04 commented 1 year ago

This should probably be on answers.ros.org. It may already be; I get pretty much zero time to support this package.

What I will say is that I've run the EKF with five IMUs, each running at 100 Hz, and the package didn't even break a sweat.

We're not using ROS 2 (or r_l), but the core of the package is the same between ROS 1 and ROS 2, so my money is on something ROS 2-specific, like the DDS configuration. It should be easy enough to add some log statements in critical locations to track down what's delayed.

JoeFrancis7 commented 1 year ago

Thanks for the response. In fact, it does not seem to be related to the DDS configuration, as I had already tried different configs. Over the past few days I did extensive testing. In summary, it seems the ekf_node probably gets stuck due to a numerical stability issue.

I have 3 inputs to the ekf_node: 2 odom topics at 30 Hz and one IMU at 100 Hz.

The ekf_node runs reliably when its frequency is set below 70 Hz. Surprisingly, it also runs reliably at around 300 Hz without any issues, although with more noise, which is expected.

The big issue is when the EKF frequency is set to around 100 Hz: the filter gets stuck and stops publishing.

Looking back at the logs, the filter gets stuck right after it produces a very large, strange output that differs hugely from the previous ones. At that point the filter either gets stuck or sometimes continues but drops to 5 Hz as it tries to converge back to the normal value, and it then keeps running at that very low frequency.

Here is an example of how the filter gets stuck when the EKF output suddenly becomes very strange:

[INFO 1679252054.030418765]: odom_filtered_processed: x_m=202.875045, y_m=671.698274, Vx_mps=49.858315
[INFO 1679252054.040244311]: odom_filtered_processed: x_m=202.456659, y_m=670.798500, Vx_mps=49.712010
[INFO 1679252054.050041336]: odom_filtered_processed: x_m=202.469011, y_m=670.736804, Vx_mps=49.712723
[INFO 1679252054.060192011]: odom_filtered_processed: x_m=202.221122, y_m=670.267018, Vx_mps=49.714091
[INFO 1679252054.071162777]: odom_filtered_processed: x_m=-379580260536456.500000, y_m=167726813056023.937500, Vx_mps=1986287.756188

Additionally, I have log statements on all sensor inputs to the filter to check for anomalies at these timestamps. I found nothing there; the sensor values are perfectly normal.

Moreover, if I set the "permit_corrected_publication" parameter to true, the filter does not stop at all and continues to run at 100 Hz, but it diverges when it outputs very strange values like the ones above and cannot converge back.

The filter appears to have a problem when it operates at the same high frequency as the sensors, e.g. 100 Hz. It seems to have more to do with the filter itself and its stability or frequency response.

I will continue investigating the issue and it would be nice if you have some thoughts to share about this particular issue.

ayrton04 commented 1 year ago

Ah, I have seen this behavior before when covariance starts to explode. I'm more convinced that this is a ROS Answers question, but if we solve it here, maybe you'd be willing to repost it there for the benefit of others.

Can you share your full config and sample sensor messages?

JoeFrancis7 commented 1 year ago

Yes, true, I previously asked this on ROS Answers. Sure, I will update it there once we reach a solution.

Regarding the issue, in the past days I have found a possible solution.

I added the parameter "predict_to_current_time" to ekf.yaml and set it to true before launching the ekf node. By default this parameter is set to false in the filter, specifically in "ros_filter.cpp". I noticed that there is an if statement there that sets this parameter to true if the delta becomes large, but it seems the parameter is not set successfully when needed. Setting it to true in the config before launching the node solved the issue, and the filter now runs very well at 100 Hz and at any other frequency, higher or lower. I am not sure, but this may put more load on the filter: very rarely I get a warning to reduce the frequency or the number of sensors because the filter took slightly longer than the configured period. This has no practical effect, as the filter continues to operate very well at the selected frequency.

By the way, I have attached my current config and a ros2 bag recorded from the 3 sensors, so you can try running the EKF on it and see the behavior. In the attached ekf.yaml, the parameter "predict_to_current_time" is currently set to true, so everything should run fine. If this parameter is removed, you should see the filter get stuck as described in my previous comment.

It could be a ros2 bag or communication issue, but the filter should not get stuck, and the covariances should not explode, when there are some delays. I think the key is this parameter, which should be triggered automatically when needed; for now, a possible solution is to keep it set to true in the config, which, as I said, worked well for me.

I hope this description helps!

ekf_100hz_issue.zip

ayrton04 commented 1 year ago

> it could be a ros2 bag or communication issue, but the filter should not get stuck, and the covariances should not explode, when there are some delays

I am not going to have time to review your bag files, but I fundamentally disagree with your stance that covariance should not explode if some sensors are delayed. If you are running predict/update cycles on your filter and you are failing to measure some dimensions of the state space, then the covariance will (and mathematically, must) explode without bound. The correlation between state variables in the state transition matrix will cause those errors to propagate throughout the system. That's just how Kalman filters work.
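
To make the covariance argument concrete, here is a minimal sketch (not taken from robot_localization) of the covariance recursion for a two-state constant-velocity Kalman filter in plain Python. The model, the noise values `q` and `r`, the initial covariance, and the function names are all assumptions chosen for illustration. It compares 1000 prediction-only steps against 1000 predict/update cycles with a position measurement.

```python
def predict(P, dt, q):
    """Covariance prediction P <- F P F^T + Q for F = [[1, dt], [0, 1]], Q = q * I."""
    (p00, p01), (p10, p11) = P
    return [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]


def update_position(P, r):
    """Covariance update for a position-only measurement (H = [1, 0], variance r)."""
    (p00, p01), (p10, p11) = P
    s = p00 + r                # innovation variance
    k0, k1 = p00 / s, p10 / s  # Kalman gain
    return [
        [p00 - k0 * p00, p01 - k0 * p01],
        [p10 - k1 * p00, p11 - k1 * p01],
    ]


def run(steps, dt=0.01, q=1e-4, r=0.01, measure=True):
    """Return the covariance after `steps` cycles at 100 Hz (dt = 0.01 s).

    All numeric values are illustrative assumptions, not tuned for any robot.
    """
    P = [[0.01, 0.0], [0.0, 0.01]]
    for _ in range(steps):
        P = predict(P, dt, q)
        if measure:
            P = update_position(P, r)
    return P


tracked = run(1000, measure=True)
free = run(1000, measure=False)
print("position variance with updates:   ", tracked[0][0])
print("position variance without updates:", free[0][0])
```

In this toy model the position variance stays below the measurement variance while updates arrive and keeps growing once they stop. The off-diagonal terms are what carry the velocity uncertainty into the position estimate.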

If there's a set of parameters that works, then I doubt it's a communication issue, but I really doubt it's a processor load issue. You can easily verify this via htop. Every time I've seen this behavior, it's always been an issue with covariance, whether from the sensor measurements or from incomplete state measurement.

What you need to do is set the filter's config to the values that caused it to lock up, then paste the last few filtered output messages (with the covariance!) here. You should also provide sample sensor messages.

If I had to guess, this is happening:

1. Lagged or incorrect measurements (with bad covariance) are causing your state estimate's covariance to explode
2. That is causing the state projection step to produce massive values for orientation variables (e.g., roll values of 1e20 or something insane)
3. This code is being used to wrap the state angles:

https://github.com/cra-ros-pkg/robot_localization/blob/galactic-devel/include/robot_localization/filter_utilities.hpp#L67-L78

Those while loops will take forever to wrap extremely large angles. This is admittedly a very silly way for me to be wrapping state angles; it should be using the angles library. I can fix that, for sure (I had it in my head that I already had), but the root of the issue, at least with the information I have so far, sounds very much like a covariance problem.
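
The cost of loop-based wrapping can be illustrated with a short sketch. The code below is not the linked robot_localization code: `wrap_loop` assumes a wrap that repeatedly adds or subtracts 2π, with a hypothetical iteration cap added so that the example terminates, and `wrap_mod` is a constant-time alternative built on `math.fmod`. At 1e20 a double cannot even represent a change of 2π, so the subtraction leaves the value unchanged and an uncapped loop would never finish.

```python
import math

TAU = 2.0 * math.pi


def wrap_loop(angle, max_iter=1_000_000):
    """Wrap into [-pi, pi] by repeated +/- 2*pi; returns (angle, iterations).

    max_iter is a safety cap for this demo only.
    """
    iterations = 0
    while angle > math.pi or angle < -math.pi:
        angle += -TAU if angle > math.pi else TAU
        iterations += 1
        if iterations >= max_iter:
            raise RuntimeError("wrap_loop did not converge")
    return angle, iterations


def wrap_mod(angle):
    """Wrap into the range from -pi to pi in constant time."""
    wrapped = math.fmod(angle + math.pi, TAU)
    if wrapped < 0.0:
        wrapped += TAU
    return wrapped - math.pi


print(wrap_loop(3.5), wrap_mod(3.5))   # one iteration is enough here
print(wrap_loop(1e6)[1])               # iteration count grows linearly with the angle
print(1e20 - TAU == 1e20)              # the subtraction is lost to rounding
try:
    wrap_loop(1e20, max_iter=10_000)
except RuntimeError as err:
    print("loop version:", err)
print("fmod version:", wrap_mod(1e20))
```

The two versions agree for ordinary angles, apart from the boundary case where the loop returns π and the modulo version returns -π.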

JoeFrancis7 commented 1 year ago

Hi again @ayrton04. First, I am very sorry for the very late response; I have been very busy lately.

I already solved the issue on my side a while ago.

Regarding your last comment: yes, I absolutely agree with you on the fundamentals of how the Kalman filter works. Maybe I did not state my previous position clearly. What I mean by "the filter should not lock up when measurement delays are detected" is that the filter should keep estimating by automatically projecting the state forward, and when measurements return, the state is corrected and the usual predict/update cycles continue.

I have done many investigations on different setups to pin down the issue. It is not related to any particular setup, such as a specific config or covariance values. I am sure that everything is measurable and set up correctly, and my covariances are tuned very well. It is also not related to the angle wrapping issue; I do not have a problem there. When my filter locked up, the last estimated angles had normal values, nothing insane.

I am totally sure it is due to some very tiny measurement delays (fractions of a millisecond) causing the filter to lock up. By adding several logging statements in ros_filter.cpp, I saw that when a sensor timeout is detected, the param "predict_to_current_time" should be set to true automatically to trigger projecting the state forward and continue the estimation. The param was being set to true correctly when a timeout was detected, but the state projection was not actually triggered because of a small issue with a variable declaration. In that case, the filter ended up doing nothing when a timeout was detected, causing the covariance to explode and the filter to get stuck.

Maybe this issue was not encountered much before because most users run the filter at a low frequency, around 30-50 Hz, where a tiny measurement delay is not caught during a predict/update cycle, so the filter continues working as normal. When operating at a high frequency, around 100 Hz, there is a much higher probability of catching such a tiny measurement delay within one estimation cycle. If one is lucky and has no delays at all, nothing happens, but in many scenarios these tiny measurement delays can happen very frequently.
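
A rough way to see this effect is to count how many filter cycles pass with no new measurement. The sketch below is a toy timing model, not robot_localization code: it assumes a single 100 Hz sensor stream with one hypothetical 12 ms pause, uses integer microseconds to avoid rounding, and compares a filter running at roughly 30 Hz with one running at 100 Hz. All function names and numbers are illustrative.

```python
def sensor_stamps(period_us, count, pause_after_us, pause_us):
    """Arrival times of a periodic sensor with one pause inserted after pause_after_us."""
    stamps = []
    for k in range(1, count + 1):
        t = k * period_us
        stamps.append(t if t <= pause_after_us else t + pause_us)
    return stamps


def empty_cycles(stamps, filter_period_us, end_us):
    """Count filter cycles in (0, end_us] during which no measurement arrived."""
    empty = 0
    i = 0
    tick = filter_period_us
    while tick <= end_us:
        seen = False
        # Consume every measurement that arrived since the previous tick.
        while i < len(stamps) and stamps[i] <= tick:
            seen = True
            i += 1
        if not seen:
            empty += 1
        tick += filter_period_us
    return empty


# 100 Hz sensor for one second, with a 12 ms pause after t = 0.5 s.
stamps = sensor_stamps(10_000, 100, pause_after_us=500_000, pause_us=12_000)
slow = empty_cycles(stamps, filter_period_us=33_333, end_us=1_000_000)
fast = empty_cycles(stamps, filter_period_us=10_000, end_us=1_000_000)
print("empty cycles at ~30 Hz:", slow)
print("empty cycles at 100 Hz:", fast)
```

In this toy run the slower filter never sees an empty cycle while the 100 Hz filter does, which is consistent with short delays only mattering once the filter period approaches the sensor period.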

To be more certain, I also ran a more precise test in which I controlled the pause/resume cycle of the bag file feeding the filter with measurements. When the pause is shorter than the filter cycle period, nothing happens; when it is just slightly longer than the filter cycle, the filter gets stuck. This can happen at any frequency, low or high, but at low frequency the measurement timeout must be much longer for the lock-up to occur.

In summary, this issue is fixed in my recent pull request #819; the filter can now run robustly and reliably at any frequency without any issue.

ayrton04 commented 1 year ago

> due to small issue regarding variable declaration.

What was the issue with the variable declaration, sorry?

JoeFrancis7 commented 1 year ago

Yes, the issue mainly concerns whether the param "predict_to_current_time" is intended to be set permanently or to keep switching depending on the condition. I have explained the consequences in detail in my recent comment here: https://github.com/cra-ros-pkg/robot_localization/pull/819#discussion_r1244013130

Ros2 Robot Localization package problems at high frequencies #798