SteveMacenski / slam_toolbox

Slam Toolbox for lifelong mapping and localization in potentially massive maps with ROS
GNU Lesser General Public License v2.1

[rplidar users] Potential errors with drivers + 360 lidar support #198

Closed saschroeder closed 4 years ago

saschroeder commented 4 years ago

Steps to reproduce issue

I created a minimal repo with bags, config and launch files:
https://github.com/saschroeder/turtlebot_slam_testing

The bug always occurs with rosbags and sometimes in simulation, too. Disabling loop closure prevents the bug.
I can upload more rosbags if necessary.

Expected behavior

Record an accurate map and correctly use loop closure to improve the map.

Actual behavior

After some time the map is rotated by about 180° (which may or may not be related), and some time after that (seconds to minutes) the pose graph is optimized incorrectly, breaking the map (see images below).

Additional information

Example 1 (before/after optimization): [images: ex1_before_optimization, ex1_after_optimization]

Example 2 (before/after optimization): [images: ex2_before_optimization, ex2_after_optimization]

SteveMacenski commented 4 years ago

First off, I recommend the Melodic branch. It has a bunch more features and is better structured. It will work in Kinetic.

Second, it would be helpful to provide me with a bag file I can use to reproduce.

My steps to reproduce:

  1. Downloaded your bag file

  2. Ran

    rosbag filter bag_name.bag output.bag 'topic != "/tf" or topic == "/tf" and m.transforms[0].header.frame_id != "map" and m.transforms[0].child_frame_id != "odom"'

    on the bag file, to remove the transforms you recorded from your SLAM/localization runs, so I was left with just the raw data to provide my own.

  3. Ran: roslaunch slam_toolbox online_sync.launch with a roscore and sim time.

So make sure you still see this error when following those steps on your machine.

Third, how did you install Ceres? I've heard of similar issues that were the result of a dirty Ceres install that comes with Google Cartographer. If you've installed Google Cartographer, you will need to remove that Ceres install and install it again via rosdep; the Ceres that ships with Cartographer is only suitable for Cartographer. This thread https://github.com/SteveMacenski/slam_toolbox/issues/167 may be helpful.
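For reference, cleaning that up usually looks something like this (a sketch assuming Ceres was installed from source to the default /usr/local prefix; adjust the paths to your actual install):

    # remove a source-installed Ceres (default CMake install locations)
    sudo rm -rf /usr/local/include/ceres /usr/local/lib/libceres* /usr/local/lib/cmake/Ceres
    # then install the packaged version that rosdep resolves to
    sudo apt-get install --reinstall libceres-dev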

saschroeder commented 4 years ago

Thanks for the quick response.

1) Sounds good, I will have a look at it.

2) I did literally the same, except that I used the launch file provided in the repo (https://github.com/saschroeder/turtlebot_slam_testing) which loads a slightly different config file. Just in case, I reset all parameters to match the config file used by online_sync.launch (except for laser_frame and max_laser_range) and tested again. In fact it took much longer for the error to occur (~10min), but it still did.

3) I installed Ceres via apt (libceres-dev, libceres1). Cartographer is not and was never installed. However, after reading #167 I'm wondering if I have a similar issue and need to invert the laser. I'm going to test this on Monday and give you feedback.

saschroeder commented 4 years ago

Update:

Nothing wrong with the laser frame.

I did some further testing, all with the same rosbag and the steps you described above. I tested both the sync/async nodes with the default configs (adapted laser frame and range).

1) Previous system (Ubuntu 16.04, ROS Kinetic, libceres-dev 1.12.0)

The error occurred in every case. Are you using one of these Ceres versions @SteveMacenski ?

SteveMacenski commented 4 years ago

I have 1.13.0+dfsg0-2 installed via rosdep.

Please post a bag file and I can take a look. Include your tf and tf static and laser scan. Specify how your laser is mounted on the robot. Also specify how you've installed this.

saschroeder commented 4 years ago

This is the robot: [photo]. The laser is an RPLIDAR A2M8, installed with the cable pointing toward the robot's back.

This is my launch file for the lidar: https://github.com/saschroeder/rplidar_ros/blob/master/launch/rplidar.launch

This is the complete rosbag (previously cut to fit on github): https://drive.google.com/file/d/1Kly5dnjB9VTyjvI6uO32_Yzn-tEqmYA_/view?usp=sharing

SteveMacenski commented 4 years ago

What is angle_compensate? I can't seem to find any documentation about what it does, and the logic in the rplidar code isn't simple enough to understand at a glance.

Do you see any logging / errors from SLAM Toolbox or anything else when running?


Your TF tree also looks weird: why is your laser oriented backwards relative to the robot? Your wheel encoder TF also updates insanely slowly, only 2-3 Hz by eye (non-scientific). I don't know if that's related, but that's far too slow for quality autonomous navigation; you should be getting it at 50 Hz with the Kobuki if memory serves.
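(For anyone checking their own setup: the stock tf tooling reports actual transform rates. A sketch; the frame and topic names below are assumptions based on this robot.)

    rosrun tf tf_monitor odom base_footprint
    rostopic hz /odom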


I'm getting past where you had issues in your first example images, no issue.


Also made it past your second issue example, no problem.


I see a good loop closure here.

I got through the loop closure of the main corridor and a bit more, and finally had it happen around 610 seconds into the bag. I get that pretty deterministically across the 3 runs I did in a row.


I reset all parameters to match the config file used by online_sync.launch (except for laser_frame and max_laser_range)

Where do you see a laser_frame param? That's ancient. You shouldn't even know about that - are you using Melodic devel like I suggested? You shouldn't need to set any laser frame.

Do you reset odometry or otherwise do anything to affect TF / laser scans in your stack? IMU acting up? Anything? I've mapped spaces 10x larger than this on a regular basis and never had an issue. This is unique, and the other users that have reported similar things (if you check the closed tickets) all turned out to have messed-up URDFs or inverted lasers or something wacky. I also don't have any robots with 360° lasers, so perhaps there's something there to look at; there were a couple of PRs to enable 360° lidars, but maybe something got messed up.

My recommendation would be to first critically analyze your URDF, frames, odom, and transformations, since that has fixed things for all the users so far that have reported similar issues. My second recommendation would be to use laser_filters to turn your 360° lidar into a 270° lidar and feed that scan into SLAM Toolbox as a test; a sketch of such a filter is below. If no issues, then it's something 360-lidar related. If it still happens, then it's robot related.
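For reference, a minimal laser_filters setup for that test might look like this (a sketch, not a tested config: the filter type and its lower_angle/upper_angle parameters come from the laser_filters package, while the file, node, and topic names are assumptions):

    # crop_to_270.yaml
    scan_filter_chain:
    - name: crop_to_270
      type: laser_filters/LaserScanAngularBoundsFilter
      params:
        lower_angle: -2.356   # -135 degrees
        upper_angle: 2.356    # +135 degrees

loaded into the stock filter node, with its output remapped for SLAM Toolbox:

    <node pkg="laser_filters" type="scan_to_scan_filter_chain" name="laser_filter">
      <rosparam command="load" file="crop_to_270.yaml"/>
      <remap from="scan_filtered" to="/scan_cropped"/>
    </node>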

I ran 2 much larger bags that I had lying around while I was out for dinner, and confirmed as a sanity check that mapping a much larger space over the course of several hours still works fine. Both of these bags were collected with different robots and different lasers from different manufacturers, and each exceeded 2 hours in length.

tik0 commented 4 years ago

I am using vanilla Ubuntu 16.04/ROS Kinetic and slam_toolbox on melodic-devel as of today. I know the building, and it is literally a glasshouse. I've inspected the bag topics and transforms, and everything looks OK: /odom publishes at 20 Hz and the "inverted" frame rplidar_frame is correctly in place. My first guess is that the solver is not optimally configured for this recording, as broken loop closures happen only in very particular situations. However, I ran

$ rosbag filter part1.bag output.bag 'topic != "/tf" or topic == "/tf" and m.transforms[0].header.frame_id != "map" and m.transforms[0].child_frame_id != "odom"'
$ roscore
$ rosparam set use_sim_time true
$ rosbag play --clock output.bag
$ roslaunch slam_toolbox online_sync.launch

If I use vanilla online_sync.launch with ceres_loss_function=None (aka squared loss) I get the same reproducible result after ~600 seconds:

[screenshot]

If I change ceres_loss_function=HuberLoss (which is more robust to outliers), it looks quite good:

[screenshot]

@saschroeder and @SteveMacenski can you confirm this? @saschroeder can you elaborate on this behaviour with other bags?

SteveMacenski commented 4 years ago

That is really, really interesting. Is the area where the error happens also made of glass? I could see how this could happen if it gets some scans that look flipped from the reflection off the glass; then when it looks to loop close, it matches those warped scans, and when Ceres runs, it has some internal conflicts.

A different loss function would help with that. I had not included one before because I felt that each "measurement" was supposed to be "legit", so we shouldn't remove outliers. However, with glass or mirrors, you're right in thinking that would totally break my assumption that they're all "legit".

I would be interested to hear if that fixes your issues in other bag files as well. If so, that seems to be a completely valid solution to the problem.

I'd appreciate it if you submitted a PR to mention that behavior in the README.md file, so future folks can learn from our experience that in environments with glass/mirrors a loss function should be considered. Thanks for the debugging @tik0. I also believe in trying to give folks a really excellent out-of-box experience, so I would also be OK with that PR changing the default to HuberLoss. It may be useful as well to expose the loss function weight (currently hardcoded to 0.7, which I found to be generally good in my testing while developing) so that you can tune it too, since that capability seems necessary for your requirements.

I can confirm that the loss function fixes it for me too. I stopped around 1000 seconds because it seemed fine and I needed my CPU back.

PS: I'm reasonably happy with the performance of the RP lidar in this environment / dataset. That wasn't expected.

tik0 commented 4 years ago

I'd like to add some more information to keep this topic alive. I managed to get very stable results on 2020-04-08-11-01-32.bag (both cleaned-up bags available here) after tuning the optimizer's parameters one after another, as follows:

Change in mapper_params_online_sync.yaml
ceres_loss_function: NONE
to
ceres_loss_function: HuberLoss
stabilizes the big loop closure in the early stage of the recording.

Change in mapper_params_online_sync.yaml
ceres_trust_strategy: LEVENBERG_MARQUARDT
to
ceres_trust_strategy: DOGLEG
stabilizes the smaller loop closure in the later stage of the recording.

Change in solvers/ceres_solver.cpp
loss_function_ = new ceres::HuberLoss(0.7);
to
loss_function_ = new ceres::HuberLoss(50.0);
stabilizes very small loop closures in the later stage of the recording.
(Yes, 50. I first thought that HuberLoss might then be replaceable by NONE, but this is not the case. It might be very beneficial to see a distribution to find optimal values. Values over 70 make the graph collapse in loop closures.)

Change in mapper_params_online_sync.yaml
scan_buffer_size: 10
to
scan_buffer_size: 20
stabilizes some medium-size loop closures in the later stage of the recording (values like 30 still managed the loop closures, but caused misaligned walls).

Maybe @SteveMacenski can explain the parameters behind my empirical observations? However, I made the most intriguing observation while playing with the bag's replay speed: as long as I keep the CPU load below the number of real cores, everything looks perfect and Ceres never taints a loop closure. As soon as I exceed the CPU with too-high replay speeds, Ceres messes up loop closures very frequently, as seen in the posts before.

SteveMacenski commented 4 years ago

ceres_loss_function: HuberLoss

I think we've talked about this one already, so I'll skip it.

ceres_trust_strategy: LEVENBERG_MARQUARDT

I'm really surprised you had to change that. I don't have a good answer off the cuff. Are you sure, after changing all the other things, that this actually made a beneficial impact?

loss_function_ = new ceres::HuberLoss(50.0);

Like I mentioned above, I really didn't expect this to need to be used; I work mostly with professional long-range lidars, so this is not much of an issue for me. As you mention, there could be some benefit to adding this for mirrors/glass/cheap(?) lidars. Can you please submit a PR exposing this parameter (the value to assign to the loss function)? This makes sense to me as something to tune if you have graph-closure problems from outliers. Higher here will mean more rigor applied to each point about whether it really fits the graph.
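For what it's worth, exposing that value could look roughly like this (a hypothetical sketch, not the eventual PR: the parameter name is an assumption, and it assumes a ros::NodeHandle nh is in scope in solvers/ceres_solver.cpp):

    // replace the hardcoded 0.7 with a user-settable scale
    double huber_scale;
    nh.param("ceres_loss_function_scale", huber_scale, 0.7);  // hypothetical name
    loss_function_ = new ceres::HuberLoss(huber_scale);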

scan_buffer_size: 20

From the description

    "Scan buffer size is the length of the scan chain stored for scan "
    "matching. \"ScanBufferSize\" should be set to approximately "
    "\"ScanBufferMaximumScanDistance\" / \"MinimumTravelDistance\". The "
    "idea is to get an area approximately 20 meters long for scan "
    "matching. For example, if we add scans every MinimumTravelDistance == "
    "0.3 meters, then \"scanBufferSize\" should be 20 / 0.3 = 67.)",

This lets you have a large window of memory to match against, but what this also does (and what I think you're seeing) is affect the minimum chain size for what is considered a loop closure. E.g., if this were 5, on scan 6 you could "loop close" against scan 1. If it were 100, you could not loop close against scans 1-99 on scan 100, since they're all chained together. Making this larger effectively disables smaller loop closures, which could reduce the number of instances of this problem you see.

saschroeder commented 4 years ago

I tested the parameters proposed by @tik0 on my laptop, and they worked well as long as I played the bags at normal speed, even though the CPU load was temporarily higher than the number of real cores. However, I'm still having issues with the bag 04-27-a.bag (https://drive.google.com/open?id=1BxGWGJKLLBVZoQjnzj2JAQtSkixL-gM-)

I already did some testing based on @tik0's first comment and found that the HuberLoss function does have an interesting effect on the mapping behavior:

I uploaded some gifs to illustrate this: https://drive.google.com/open?id=1BxGWGJKLLBVZoQjnzj2JAQtSkixL-gM-

Seeing all these 180-degree turns, and considering that I can reproduce this in simulation, I don't think this is about an unreliable sensor/reflections, but rather a systematic error. So why does this problem occur with my setup and not with @SteveMacenski's? Option 1: the 360-degree lidar. Option 2: the fact that my lidar frame is oriented backwards. The latter is absolutely correct given the frame definition from the rplidar_ros package and the lidar installation, but probably different from your setup @SteveMacenski and thus not tested?

1) I applied a laser filter to test with a range of 270, 230 and 180 degrees. For 270 I get the same initial rotation, but not with 230 and 180 degrees. I stopped the bag (2020-03-27-...) after several minutes as the map quality got really bad.

2) I used the Gazebo sim for some quick testing. I set the lidar frame's z-axis rotation to 0, 90, 135 and 180 degrees relative to base_link and tried to reproduce the loop closures with 180° rotation. For 180° lidar rotation (as before) I could easily trigger these (the first is fine, the second breaks the map...). For 0°, no strange rotations, but plenty of good loop closures. For 90° and 135° I could trigger one map rotation each, but this time by about 90° and 135°. If that isn't suspicious...

I will do some more tests with the lidar frame facing forward (possibly on the robot) to verify that it solves the problem.
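For reference, each of those sim tests can be driven by a single static transform; a sketch for the 180° yaw case (the frame names and z offset are assumptions, and 3.14159 would be swapped for 1.5708 / 2.3562 / 0 for the other cases):

    <node pkg="tf2_ros" type="static_transform_publisher" name="lidar_tf"
          args="0 0 0.2 3.14159 0 0 base_link rplidar_link"/>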

tik0 commented 4 years ago

I've written a small script to flip the tf by 180° and the laser scans by -180° inside the bags, so I am doing virtually nothing to the data. However, after this, everything worked perfectly, even with the standard parameters. I can also push the CPU to even higher loads after the flip, and slam_toolbox does its job.

There might be a real issue in slam_toolbox if the laser frame is not co-aligned with the base_link frame. This shouldn't be an issue if all transforms are defined correctly, but in fact it is.

import rosbag

baginname = '2020-03-27-13-47-13_filtered.bag'
bagoutname = '2020-03-27-13-47-13_filtered_flipped.bag'

bagout = rosbag.Bag(bagoutname, 'w')
bagin = rosbag.Bag(baginname)

for topic, msg, t in bagin.read_messages():
    if topic == '/scan':
        # Rotate every 360-sample scan by 180 samples (180 deg at 1 deg/sample)
        msg.ranges = msg.ranges[180:] + msg.ranges[0:180]
        msg.intensities = msg.intensities[180:] + msg.intensities[0:180]
    elif topic == '/tf_static':
        for tf in msg.transforms:
            if tf.child_frame_id == 'rplidar_link' and tf.header.frame_id == 'base_link':
                # The quaternion was (0, 0, 1, 0), i.e. a 180 deg rotation
                # about z; set it to identity so it cancels the scan rotation.
                tf.transform.rotation.x = 0.0
                tf.transform.rotation.y = 0.0
                tf.transform.rotation.z = 0.0
                tf.transform.rotation.w = 1.0
    bagout.write(topic, msg, t)

bagout.close()
bagin.close()

SteveMacenski commented 4 years ago

There might be a real issue in slam_toolbox if the laser frame is not co-aligned with the base_link frame

I take your bet and raise you a laser frame yaw parameterization

https://github.com/SteveMacenski/slam_toolbox/blob/eloquent-devel/src/laser_utils.cpp#L95-L127

tik0 commented 4 years ago

I printed all variables in the parameterization and they look OK in all cases. However, I've found another minor issue regarding the detection of whether the scanner is 360°. I don't know if this holds for all 360° lidars, but for the RPLIDAR A2M8 from the bags, the following code changes should be considered:

The scanner has a valid 360° configuration and publishes as follows:
angle_min: -3.124139 (which is -180° + angle_increment)
angle_max: 3.141539 (which is the last reading at 180° that completes the circle)
angle_increment: 0.017453 (which is 1°)
len(ranges) = 360

Therefore, the condition in the parameterization always fails because of std::fabs(scan_.angle_min + M_PI) < 1e-3. One should rather check whether the residual is less than the increment. Something like

std::fabs(std::fabs(scan_.angle_max - scan_.angle_min - 2. * M_PI) - scan_.angle_increment) < 1e-3
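Plugging in the values above makes the difference concrete (a quick standalone check, not code from the repo):

    #include <cmath>
    #include <cstdio>

    int main() {
      // Values published by the RPLIDAR A2M8, as listed above.
      const double angle_min = -3.124139;
      const double angle_max = 3.141539;
      const double angle_increment = 0.017453;

      // Existing check: expects angle_min to be exactly -pi, but the rplidar
      // misses -pi by one increment (~0.01745), so this is false.
      const bool old_check = std::fabs(angle_min + M_PI) < 1e-3;

      // Proposed check: the sweep minus 2*pi is off by exactly one increment
      // for a full circle without a duplicated endpoint, so this is true.
      const bool new_check =
          std::fabs(std::fabs(angle_max - angle_min - 2.0 * M_PI) -
                    angle_increment) < 1e-3;

      std::printf("old: %s, new: %s\n", old_check ? "true" : "false",
                  new_check ? "true" : "false");
      return 0;
    }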

Furthermore, if the scanner is then correctly detected as a 360° scanner, this condition shows faulty behavior because of the later calculation of the number of scans: the residual should always be 1, because the calculation only counts the spaces between the scans, even in the 360° case. However, fixing this has no impact on the erroneous behavior seen by @saschroeder.

@SteveMacenski as the scan matcher always seems to work, I assume that maybe the orientation of the lidar is not respected when Ceres does the loop closure, because the faulty behavior only occurs in these cases. @SteveMacenski any ideas on that?

SteveMacenski commented 4 years ago

Therefore, the condition in the parameterization always fails because of std::fabs(scan_.angle_min + M_PI) < 1e-3. One should rather check whether the residual is less than the increment. Something like

Please submit a PR and I'll merge it. Looks good to me. That was a user-submitted issue with an RPLidar, and apparently it didn't totally work out.

@SteveMacenski as the scan matcher always seems to work, I assume that maybe the orientation of the lidar is not respected when Ceres does the loop closure, because the faulty behavior only occurs in these cases. @SteveMacenski any ideas on that?

I'm fairly certain it does; you can test yourself with the Magazino bags here: https://google-cartographer-ros.readthedocs.io/en/latest/data.html#id78 - their robot has 2 laser scanners at 90-degree angles from the base link and it builds fine. I used these bags in developing the (still incomplete) multi-robot synchronous distributed mapping features (so... damn... close... just need someone to help me troubleshoot some optimizer errors).

saschroeder commented 4 years ago

I ended up turning the lidar on my robot, as this was the fastest way to fix it. As described by @tik0, everything works fine now. I think the overall performance has improved, too.

Out of curiosity I tested the hallway_return.bag from the Cartographer docs with scan_front/scan_rear each, and the mapping does look strange to me. Though the map itself is built nicely, you can see in rviz that the laser scans are misaligned with the map, and I see the same characteristic map rotation that I described in my previous comments shortly after the robot turns back in the hallway. To me this is a strong indicator that there might be more issues; I guess the bag is just too short, or the map too simple, for more of them to show. You could also use @tik0's script to shift the laser frames in your own bags and try to reproduce the issues with those.

SteveMacenski commented 4 years ago

Hi,

I renamed the ticket to reflect that (mostly because the original name was non-descript).

This is an interesting topic, and maybe there is some issue. @saschroeder @tik0 do either of you see where that laser utils code for handling the angle may be wrong? Or, tracing back into Karto-land, where it may not be properly utilized? I'd love to help fix this, but I'm not sure I can without dropping everything else to debug it. If we can at least identify an area that might be wrong, that's something I can sit down and attack.

I think the overall performance has improved, too.

Can you qualify that a little? (or quantify, though I think probably not :-) )

... and I see the same characteristic map rotation

I'd also like a description of what the "same" is that you see, particularly pertaining to the behaviors you saw in your own setup. We might be able to get at something here.

Keep in mind that that bag has TF built into it. That offset, I think, is because you didn't use the rosbag filter command I showed above to remove the existing map->odom TF transforms. I've seen that as well, as a result of not conditioning the bag to "raw" data without SLAM or localization TF.

joekeo commented 4 years ago

I think this is the same issue I was having in https://github.com/SteveMacenski/slam_toolbox/issues/167, and I ended up playing a lot with the URDF files to modify the lidar pose and match the IMU to it. I have now come back to this, so I will post some updates with my own robot.

joekeo commented 4 years ago

So we are using the RPLidar S1 on some of our robots, and for some reason it provides the laser scan rotated by 180 degrees. When running slam_toolbox with loop closure enabled, the map gets flipped 180 degrees and then, after a while, becomes very inaccurate. I flipped the lidar 180 degrees on the robot (physically, along with the URDF) and now slam_toolbox works properly; loop closure is stable.

SteveMacenski commented 4 years ago

I just flipped the lidar 180 degrees in the robot

Mhm... that also makes me suspect that maybe the laser scans are backwards (though logic would say flipping both 180° should be identical). There are standards for how laser scan messages should be formulated (e.g. sweep direction). If flipping it works, maybe that's related?

The annoying thing is that this is only reported on RP-Lidars and by no one else. Their ROS driver is really bad. It sounds like this may be the result of multiple issues in the drivers, not just the frame. I'm not sure what I can do here until someone with this sensor finds what's wrong with either the driver or this package, or someone with a non-RP-lidar (or derivative) 360° sensor verifies it happens as well. This isn't something I can debug without a robot equipped with this sensor in front of me. I'd be more than happy to help whoever is willing to work on it any way I can, including explaining things, looking at datasets, and seeing if I can find the issue. @tik0 was pretty on point in figuring out the other issues; maybe they're interested?

joekeo commented 4 years ago

I will dig into the RPLidar S1 part, try to fix the frame issue, and see if that sorts it.

SteveMacenski commented 4 years ago

If you need justification to upstream that fix, REP-103 has your back: https://www.ros.org/reps/rep-0103.html. RP-lidar should be in compliance with it.

chrisl8 commented 4 years ago

I also have an RPLIDAR. I also have the cord facing the back of the robot. I am also seeing this same behavior.

  1. Is flipping my RPLIDAR around on the robot to face the cord forward the best solution? Or is that only going to partially help?

  2. Is there anything I can add in the way of testing, uploads, etc. that could help anybody else here to find the solution?

Unfortunately I am somewhat of a novice at this stuff, and this is a hobby project, so I don't have the time or skills to dig into the RPLIDAR driver myself.

I don't hold out a lot of hope at the moment that Slamtec will fix their driver; the last response I see from the maintainer to an issue on that repo is from 2018. I did do some diving into the "Network Graph" and found a few forks that look interesting. I may try some of them out to see if they behave differently. I can post more info here, but I don't know if this is the right place for that conversation.

joekeo commented 4 years ago

Hello, we solved this issue by modifying the rplidar_ros package: we made a fork, modified how the laser scan is read, and exposed the change via a parameter that can be set in the launch file. The fork is here: https://github.com/Ross-Robotics/rplidar_ros

For it to do the 'flipping' you need to set the parameter rotate_scan to true.

This is what our launch file looks like:

<launch>
    <arg name="frame_id" default="horizontal_laser_link"/>
    <arg name="serial_port" default="/dev/RP_LiDAR"/>
    <arg name="remap_as_raw" default="false"/>

    <node name="rplidarNode" pkg="rplidar_ros" type="rplidarNode" output="screen">
        <param name="serial_port" type="string" value="$(arg serial_port)"/>
        <param name="serial_baudrate" type="int" value="256000"/>
        <param name="frame_id" type="string" value="$(arg frame_id)"/>
        <param name="inverted" type="bool" value="false"/>
        <param name="angle_compensate" type="bool" value="true"/>
        <param name="scan_mode" type="string" value="Standard"/>
        <remap from="/scan" to="/scan_raw" if="$(arg remap_as_raw)"/>
        <param name="rotate_scan" type="bool" value="true"/>
    </node>

</launch>

Copy the repository into your workspace, do a catkin clean, then catkin build, then source (source devel/setup.bash) and it should work; a sketch of the commands is below. Let me know if it works for you.
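(Assuming a catkin_tools workspace at ~/catkin_ws; the path is an assumption.)

    cd ~/catkin_ws/src
    git clone https://github.com/Ross-Robotics/rplidar_ros.git
    cd ~/catkin_ws
    catkin clean -y && catkin build
    source devel/setup.bash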

chrisl8 commented 4 years ago

I tried the patch recommended by @joekeo, and I think it changed things for me but didn't fix them. Before, the entire map would rotate and offset from the obvious position of the world based on the scan data.

Now, instead of rotating, it just shifts.

It always happens, though, after 10 to 15 minutes of running.

Unfortunately I don't have a bag file for you yet, but I will work on that. What I do have is an image. I run the robot back and forth over this area repeatedly using 2D Nav Goal input. Despite the messy environment, it maps very well and navigates very well using teb_local_planner. However, as I said, after about 10 to 15 minutes the map offsets from the obvious scan position, and navigation then becomes impossible. [screenshot]

It is obvious in the image that it is offset in one direction significantly.

This was harder to see before when the map would rotate instead of only shift.

I may just try rotating the scanner on the robot too, to avoid needing code tricks to deal with this, although I'm also concerned this may not be the issue. It was already thought by some to be a bad URDF file and a bad Cartographer install, both of which turned out to be false.

I appreciate any input you may have on how I can debug this further, and I understand that I haven't given you enough information to actually diagnose this yet. I wanted to share my findings before I forgot, though.

Update: It still rotates as well, so my issue might not even be related to this? Interestingly, on the last run it shifted almost twice the width of the map, then recovered after a while, and later rotated about 90 degrees and could not recover. I did make a bag file; I will take a look at it and see what I can discover. [screenshot]

Update: What happens at each of these points (pictured above) is that the transform from map to odom suddenly becomes all zeros. So the "offset" is whatever the accumulated odometry drift has been since the start, which is why it seems somewhat "random" between runs and is sometimes highly rotated and other times just offset.

---
transforms:
  -
    header:
      seq: 0
      stamp:
        secs: 1595664325
        nsecs: 766761376
      frame_id: "map"
    child_frame_id: "odom"
    transform:
      translation:
        x: -2.3483314927042844
        y: -4.429779868765097
        z: 0.0
      rotation:
        x: 0.0
        y: 0.0
        z: -0.5969529408360271
        w: 0.8022762531866555
---
transforms:
  -
    header:
      seq: 0
      stamp:
        secs: 1595664325
        nsecs: 846779109
      frame_id: "map"
    child_frame_id: "odom"
    transform:
      translation:
        x: -0.0
        y: -0.0
        z: -0.0
      rotation:
        x: 0.0
        y: 0.0
        z: 0.0
        w: 1.0
---

Update: It is 100% repeatable and consistent with the bag file.
I am using the Melodic ROS Slam Toolbox package on the robot and the Noetic package on my laptop, with the same results.
The package version says 1.5.4, so I believe it includes the 360 lidar changes from January.

chrisl8 commented 4 years ago

Update with bag file, config, and video showing what happens and when: [video: SLAM Failure 2020-07-27 16-35]. I am using ROS Noetic installed via apt; this also happens on ROS Melodic. The image is from playing the bag file.

Here is the bag file: https://www.dropbox.com/s/fuma3x9c57os3iw/slamIssue-07-25-2020.bag?dl=1

Here is the configuration file used: https://www.dropbox.com/s/l1l2b6n3xod9jr2/slamIssue-07-25-2020-mapper_params_online_sync.yaml?dl=1

The RPLIDAR A3 is mounted backwards. This happens with or without the RPLIDAR driver code that flips the lidar image 180 degrees, but this bag file was recorded using that code, hence the transform for the LIDAR appears to face forward.

SteveMacenski commented 4 years ago

1.5.4 was incremented a month ago, so yeah, it's pretty new.

Overall, this issue is only being reported by people with RP-lidars (or potentially 360° 2D lidars, but so far only RP-lidar users have commented on this thread). I have run this package for hundreds of hours across dozens (if not low hundreds) of facilities of various shapes and sizes using typical 2D laser scanners (SICK, Hokuyo, and similar) without seeing an issue, so there's not much I can do so far since I cannot reproduce it.

I'd love to merge in a change to fix what you're seeing, but that requires someone to do the legwork to identify the issue and potentially resolve it. Even just telling me what the specific issue is could be something I can fix and have a user test.

That identity transform is interesting. Are you seeing errors like https://github.com/SteveMacenski/slam_toolbox/blob/noetic-devel/slam_toolbox/src/slam_toolbox_common.cpp#L358? That's the only place an identity transform would be set at runtime after the first iteration. I looked through the convert() method in TF2 and I don't see anywhere it might set an identity on failure (you may want to double-check me), though I didn't track down the transform() method.

If you can deterministically cause this to happen, awesome. I'd print or set breakpoints in the setTransformFromPoses() method and check the values at various steps. Maybe you'll see a clear location where it occurs.

chrisl8 commented 4 years ago

I will keep digging into it. I have this bag file now that does the same thing every time, so it will be my test data.

I appreciate any input on where to log out from the code. I will start with your suggestions above.

My hope is that your natural curiosity will compel you to at least try out the bag file and look for anything obvious.

I would not be surprised if it is me doing something stupid, but whatever it is must be in that bag file.

SteveMacenski commented 4 years ago

I'd try that function first. See if the inputs are OK, then see if the conversions / TF transforms / inversions or anything else there causes the issue. If the inputs are bad, then track back from that.

I might have time later this week to look, but this is not super high in the queue right now (#180 is next in my book, after I fix a nav2 issue I've been working on for a week). Best guess for when I could even start is early next week. But the more we can isolate the problem before then, the better (maybe we'll find it before then).

casiarobot commented 4 years ago

Using a Pepperl+Fuchs OMD25M-R2000 and a Hokuyo UTM-30LX, I've encountered this situation when the map has a loop.

1) The z-axis of the OMD25M-R2000 points up; its x-axis points opposite the robot's x-axis. 2) The UTM-30LX is inverted about the robot's z-axis; its x-axis points along the robot's x-direction.

chrisl8 commented 4 years ago

Update: I am a little confused because I feel like my results are not consistent now, but perhaps I was overlooking something before.

It now appears to be related to high CPU load. Up until Tuesday, it failed consistently on both the robot and my workstation laptop. On Tuesday it would not fail on my workstation laptop. Today I started artificially loading up the CPU on my i7-8850H (6 cores/12 threads, up to 4 GHz), and that does force the problem to happen.

If it is CPU related, that would explain why it is a bit random: my robot's CPU is slower, and even my powerful laptop is subject to whatever else I'm running plus thermal throttling.

I didn't think I saw this error before, but now I see it every time before the map "goes out of sync":

[ERROR] [1596036554.073900666, 1595664385.405057103]: Transform from base_link to odom failed: Lookup would require extrapolation into the past.  Requested time 1595664310.081902091 but the earliest data is at time 1595664355.599803672, when looking up transform from frame [base_footprint] to frame [odom]

and as you know this is from the line https://github.com/SteveMacenski/slam_toolbox/blob/noetic-devel/slam_toolbox/src/slam_toolbox_common.cpp#L358, after which it sets an "identity frame"; the map-to-odometry transform is then zeroed out and nothing can proceed.

So two thoughts, assuming this is really where things go badly:

  1. I presume this should run fine with the default config on a modern but not insane computer? My only changes to the config from default are:
    resolution: 0.025 # 0.05
    minimum_travel_distance: 0.1 # 0.5
    minimum_travel_heading: 0.1 # 0.5

    I increased the resolution because it appears to help with path planning through doorways. They aren't particularly narrow, but the path planners get hung up on corners when the pixels are big.

I decreased the travel distance because the path planner seemed to stall when the map wasn't updating during slow movement.

Are either of these adding an exponential increase in CPU usage? If so, do you have some guidance on tuning them? Like, how much is too much?

  2. Is this the best or only possible response to this failure? I feel like wiping out the entire map-to-odom transform is draconian, but obviously I'm looking at one very specific situation, and I presume you understand the reasoning better. In my very naive view of the situation, I would rather not send out any update, or repeat the previous transform, until I had data I could rely on, instead of making a very real and drastic change by setting the transform to all zeros.

I have seen the map "lose sync" before and then recover, but I have not been able to reproduce that via the bag file yet. So maybe you expect that to happen, but it isn't for some reason?

I am going to play around with some of my naive ideas, and I will update you again if I find either more data or a solution that at least patches this up for me.

Meanwhile, when you do get to this (and I am grateful if you get to it any time in the next few months, no rush at all), hopefully this tells you where to look, and that you will have to try it on a slow or overloaded CPU.

Update:

  1. Reducing the resolution to 0.05 does decrease the CPU load, and hence requires me to load up the CPU a lot more before I can cause the symptom, but it is still possible and repeatable.
  2. If I stop all CPU loading after the map to odom transform was zeroed out, it does sometimes recover.
  3. I honestly don't know what this has to do with the RPLIDAR. Perhaps I erroneously jumped onto this issue when mine is unrelated? Unless there is a convergence of things that I'm not aware of. I presume this reset of the transform happens to anybody whose CPU is too busy?

Also, this code is 10 years old, so I guess it is battle tested? [screenshot]

I'm going to dig into the error a bit more, because I don't fully understand it yet. Maybe when I do I can understand the process better:

[ERROR] [1596041206.551335169, 1595664078.661087607]: Transform from base_link to odom failed: Lookup would require extrapolation into the past.  Requested time 1595664048.240699958 but the earliest data is at time 1595664048.876555413, when looking up transform from frame [base_footprint] to frame [odom]

SteveMacenski commented 4 years ago

and as you know this is from line https://github.com/SteveMacenski/slam_toolbox/blob/noetic-devel/slam_toolbox/src/slam_toolbox_common.cpp#L358 after which it sets an "identity frame" and now the map to odometry transform is zeroed out and nothing can proceed.

Mhm. 2 thoughts. We should probably do both:

I'd say either way, it's really not good if you have TF timeout issues; that's a hallmark of other problems in your system (usually things not publishing fast enough, or you're really overworking your CPU and basic things aren't getting through). I'd look into resolving that in your system, but we can also fix the symptoms here!

Are either of these adding an exponential increase in CPU usage? If so, do you have some guidance on tuning them? Like, how much is too much?

Not exponential, but those are 3 of the highest CPU-impacting dials to turn. You're telling SLAM Toolbox to update 5x as often at 2x the resolution, so yeah, you should expect on the order of 10x the CPU. You should not do that if you don't have the compute juice to handle it. You probably shouldn't be updating your map that often anyway: matching-based SLAM algorithms rely on roughly a 70% new features / 30% old features ratio to work well. If you update that often, it's more like 80% old features and 20% new, and you're liable to decrease map quality. But whatever floats your needs.

Please profile your CPU usage; if it's over 80% steady state, you're in dangerous territory and likely other processes are having problems too, you just haven't noticed yet.

Reducing the resolution to 0.05 does decrease the CPU load, and hence requires me to load up the CPU a lot more before I can cause the symptom, but it is still possible and repeatable.

Makes sense.

If I stop all CPU loading after the map to odom transform was zeroed out, it does sometimes recover.

Also makes sense, since that update is only for exactly 1 cycle, unless it happens multiple times in a row, which you'd know from that warning.

when looking up transform from frame [base_footprint] to frame [odom]

Your odometry is publishing too slowly, or isn't being received because your CPU is on fire.

tl;dr: try adding a return after that LOG_ERROR() and see if you can reproduce. That will mask the problem, though you still have a real problem to solve on your end. If that works, let us know and submit PRs against the branches so we can call this one done.

Thanks for the diligent debugging efforts! CC @saschroeder @tik0 @joekeo does this sound like what you were experiencing too?

chrisl8 commented 4 years ago

I did some testing last night and here are my thoughts.

Failure Mode

By setting minimum_travel_distance and minimum_travel_heading to 0.1 (NOTE: BAD IDEA!) and artificially loading the CPU with stress --cpu 12, I can consistently create the failure scenario that results in the Transform from base_link to odom failed error, and having the code do an early return instead of broadcasting an "identity" transform is a far superior user experience. If I kill the stress process, everything recovers much more gracefully, even if the "robot" (bag file data) does not stop moving.

I'm inclined to agree that this is as far as it should go. No need to increase transform timeouts or anything else, as this is clearly a failure scenario, and the early return simply provides a gentler response while not attempting to ignore or cover up the underlying problem.

I have forked this repository and created branches to replace this single line:

https://github.com/SteveMacenski/slam_toolbox/blob/173ac394eb246fd3a92ef39cbaa37015eeb698e7/slam_toolbox/src/slam_toolbox_common.cpp#L359

with:

return odom_to_map;
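In context, the change is roughly this shape (paraphrased from memory, not the exact source; all names other than odom_to_map are assumptions):

    try {
      // transform the SLAM pose estimate into the odom frame
      tf_->transform(base_to_map_msg, odom_to_map_msg, odom_frame_);
    } catch (tf2::TransformException & e) {
      ROS_ERROR("Transform from base_link to odom failed: %s", e.what());
      return odom_to_map;  // keep the last good map->odom instead of identity
    }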

I will submit a PR for this to the melodic-devel, noetic-devel, and foxy-devel branches this weekend.

Configuration thoughts

  1. The default settings for minimum_travel_distance and minimum_travel_heading are 100% good and fine for my use. Clearly I was motivated by some unrelated factors to tweak those settings, and then I should have put them back; I thought they were helping me when they were not.
  2. Even with resolution: 0.025 and all other settings at default, I can 100% load all 12 threads of my i7 CPU with stress --cpu 12 and Slam Toolbox still works fine and never actually fails. Slam Toolbox is impressively efficient; my settings were just off the charts.
    • I will still test more to confirm whether I even need resolution: 0.025, but it was working so well last night that I didn't mess with it.
    • It appears to be minimum_travel_distance and minimum_travel_heading that create the biggest spike in CPU usage (and, as you pointed out, probably degrade mapping accuracy as well).

I apologize for not making my configuration edits clear sooner in this conversation. I attached my config file, but there was no reason to assume you would dig into it.

Documentation

A note on documentation (and chide me if I just failed to RTFM), but I thought that:

minimum_travel_distance: 0.1 # 0.5
minimum_travel_heading: 0.1 # 0.5

meant it would wait to move at least 0.1 before doing anything, not that it would do something every 0.1. I assumed that once it met the minimum travel distance, it would work at a pace set elsewhere. Clearly I misunderstood.

I also wonder how minimum_time_interval plays into the above, and whether increasing minimum_time_interval could make up for "letting" it update on shorter distances?

Any clarification on how these parameters affect when the map updates would be appreciated. (Although my testing shows there is zero need to tweak these anyway.)

I think an edit of this text would be excellent for the Wiki or Readme of this package, especially the part about how many new vs. old features SLAM wants to see:

...those are 3 of the highest CPU impacting dials to turn. You're telling SLAM Toolbox to update 5x as often at 2x the increased resolution. So yeah, you should expect on the order of 10x the CPU. You should not do that if you don't have the compute juice to handle it. You probably shouldn't be updating your map that often. Matching based slam algorithms rely on about a 70% new features 30% old features ratio to work well. If you do that often its more like 80% old features and 20% new and you're liable to decrease map quality. But whatever floats your needs.

I can add a PR for the Readme if you like, and you can discuss edits to it in the PR.


Please profile your CPU usage and if its over 80% steady state, you're way dangerous and likely other processes are having problems too you just haven't noticed yet.

Do you have any input on how to do this? I find the CPU usage fluctuates wildly during operation. It is certainly never at a steady-state 80%, but clearly it spikes periodically, as this proved.

Telemetry

Finally, since Slam Toolbox can catch up if the robot just chills for a moment, is there any way to get a signal out of it when this is happening? Perhaps on a topic or something? As it stands, if I automate the process of selecting destinations, it will essentially continue to overwhelm the CPU forever (or until the path planner times out, which isn't ideal), but if I could periodically check whether Slam Toolbox was failing in this way, I could automatically remove the goal, wait for it to catch up, and then reset the goal.

It may also be that just being able to monitor the queue size (PR #226) via a topic or something would be enough, or better, as I think the queue always backs up either before or at least when this failure occurs.

chrisl8 commented 4 years ago

I want to test some of the above .bag files to see if:

  1. We are actually solving the original issues.
  2. If this whole "backwards RPLIDAR" thing is actually an issue, or if it was a red herring.

So with 100% default settings, the 2020-03-27-13-47-13.bag file still fails badly: [image]

However, by just adding ceres_loss_function: HuberLoss, as suggested here, it looks beautiful, at least for one run: [image] At least I think it does. Not knowing what the building looks like, this could be a nice-looking but ultimately incorrect view, e.g. if some large area is duplicated.

Or maybe it looks like this? [image]

I will test some of the other bag files next...

SteveMacenski commented 4 years ago

I will submit a PR for this to the melodic-devel, noetic-devel, and foxy-devel branches this weekend.

I'll merge that. I'm not sure why I thought that was OK to do. My mistake.

I also wonder how the minimum_time_interval plays into the above, and if increasing the minimum_time_interval could make up for "letting" it update on shorter distances?

Time supersedes all: the time gate will block until complete, regardless of the distance / orientation change. It's mostly there to make sure that a really fast-moving robot doesn't update every 100 ms at 5 m/s. Once met, though, it won't trigger until the distance ones are also met, so you could in theory set it to 0.0001 or something super small and effectively disable it. Your new understanding of the minimum distance parameters is correct.
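A minimal sketch of that gating logic as described (names and structure are hypothetical, not the actual slam_toolbox source):

    from dataclasses import dataclass

    @dataclass
    class Params:
        minimum_time_interval: float = 0.5     # seconds
        minimum_travel_distance: float = 0.5   # meters
        minimum_travel_heading: float = 0.5    # radians

    def should_process_scan(dt, dist, dyaw, p):
        # Time gate supersedes all: nothing is processed until the minimum
        # interval has elapsed since the last processed scan.
        if dt < p.minimum_time_interval:
            return False
        # After the time gate, either motion threshold can trigger an update,
        # so a tiny minimum_time_interval effectively disables the time gate.
        return dist >= p.minimum_travel_distance or dyaw >= p.minimum_travel_heading

    print(should_process_scan(1.0, 0.05, 0.02, Params()))  # False: not enough motion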

I can add a PR for the Readme if you like, and you can discuss edits to it in the PR.

Not sure what in the blob is worth readme-ing?

Do you have any input on how to do this? I find the CPU usage to fluctuate wildly during operation.

C++ gives you tools to do this, but for a more external look, I like htop.

Finally, since Slam Toolbox can catch up if the robot just chills for a moment, is there any way to get a signal out of it when this is happening?

In synchronous mode, yes. In async it's "best effort", so it can't get backed up and need to "catch up". This is helpful for offline processing where you want to squeeze in every scan you can, or if you have a CPU with enough juice to be OK. I suppose we could expose a service to query the queue size (don't I have one?), and there's already a reset-queue service.

However, by just adding ceres_loss_function: HuberLoss, as suggested here, it looks beautiful, at lest for one run:

Please include in your PR a change of the default loss function to HuberLoss. I thought I'd asked for that already, but apparently we all forgot :-)

If this whole "backwards RPLIDAR" thing is actually an issue, or if it was a red herring.

I still don't know about that - all of the people in this thread continue to be RPlidar / 360° lidar users. But I'm more than happy to take a win to solve a particular problem. Maybe RPlidar users are running on tiny CPUs and that's what's causing these issues. Though the title "Potential errors in lidars not facing forward in optimizer" still hasn't been addressed. See https://github.com/SteveMacenski/slam_toolbox/issues/198#issuecomment-625533177

@tik0 did you submit a PR for https://github.com/SteveMacenski/slam_toolbox/issues/198#issuecomment-625783433? I don't recall

SteveMacenski commented 4 years ago

@tik0 please confirm this 360 check PR resolves your concerns https://github.com/SteveMacenski/slam_toolbox/pull/257

SteveMacenski commented 4 years ago

From reviewing this, I think we had the following issues, which are being resolved or are resolved:

I think that fixes all the errors this might have caused, and once done we can close this ticket - do you think that's fair, everyone?

I renamed the issue to summarize our conversation here better, and I'll leave it pinned for future rplidar users, because that workaround will still be required since their drivers / TF / something is obviously wrong. I believe the other fixes in the bullets above fix all other issues for all other users.

tik0 commented 4 years ago

@chrisl8 The last image shows the correct floor plan of the building; however, there are some missed loop closures in it. @SteveMacenski I regret that I am not able to work on or test this issue any time soon.

SteveMacenski commented 4 years ago

Agreed - @chrisl8 what's the difference between the last image and the prior one? You didn't mention a change.

@tik0 thanks for the update. It's fine not to test, but can you look at that PR https://github.com/SteveMacenski/slam_toolbox/pull/257 and just give me the OK that it resolves your is_360 detection issue so I can merge? (It's 1 line.)

tik0 commented 4 years ago

@SteveMacenski sure, just checked 360_check_m and reported positive results on is_360 in #257. Thanks for the fix!

chrisl8 commented 4 years ago

Agreed - @chrisl8 what's the difference between the last image and the prior one? You didn't mention a change.

@SteveMacenski Just look at them. :) Those are both from the same bag file, run from start to finish with the exact same settings, yet they clearly result in different ideas of how the building is shaped. However, we already know this may be the RPLIDAR issue? I was going to test the various bag files posted in this issue, as well as my own, with and without the "flip" script posted by @tik0, but the script is not working for me: it results in a /tf_static topic that creates a broken set of transforms. I haven't had time to debug it.

I worry that there is something else wrong with the RPLIDAR, or with Slam Toolbox's interaction with it, that will cause intermittent issues in the future. However, if everyone here is:

  1. Convinced there is something wrong with the RPLIDAR and/or the driver, and
  2. Satisfied with hacking the driver or physically reversing the unit as a reliable solution,

then I will just turn my RPLIDAR around on my robot and be done with it.

@SteveMacenski Take this thought or drop it as you wish: I'm not wealthy enough to buy you an RPLIDAR at the moment, but if you want to set up a fund, I will certainly contribute to it. Obviously it's up to you whether you feel motivated to chase down a specific piece of hardware. I think you will find more and more of these and similar "low-cost" 360° spinning lidar devices showing up in hobby and academic ROS projects. Anyway, I fully understand if you don't have the time or interest; I just wanted to suggest the idea.

I do agree that we've beaten this subject quite to death, and it is time to close it once your suggested PRs are merged. It might even be logical to move further discussion of RPLIDAR issues to https://answers.ros.org/ if they come up.

SteveMacenski commented 4 years ago

Those are both from the same bag file run from start to finish with the exact same settings

Interesting. I've run datasets before with some limited differences due to scan throttling or minimum distance / time parameters making things not exactly the same, but I don't recall ever re-running a dataset multiple times and getting such massive differences. I'd have to look in the backend at the loss functions and Ceres to see if there's anything super sensitive that could cause issues. I suppose the parameters used would also largely matter: because something like that could only happen from a number of false loop closures, I'd bet the parameters were stoked up (small loop chains, large search areas, etc.).

I don't think that's an RP-LIDAR-specific issue; I think it's a tuning issue and a bit of how the library works.

I'm not dying for an RP-lidar, and I think hour-for-hour my time is better spent on other tasks than supporting 1 particular hardware manufacturer. As far as I know, they're only really used in R&D, and this aims to be a production SLAM library for very large spaces that would likely not be an appropriate deployment for that vendor anyway.

I think with PRs on all of these largely merged (need a few more distros covered), we can close this. Future readers: this issue should be resolved with the RPlidar patches mentioned above plus rotation-direction inversion scripts / URDF changes.

chrisl8 commented 4 years ago

I'm not sure where you want me to put this, but PR #257 causes me to receive this error continuously when I launch slam_toolbox:

LaserRangeScan contains 1440 range readings, expected 1439

Using my RPLIDAR A3, and having made no other changes.

Reverting to the commit before that PR and rebuilding removes the error, and things work again as before.

I did some searching in the Issues for this error and found some breadcrumbs, but no clear idea of what to try next.

It seems odd that this change was added to accommodate an RPLIDAR, and yet it is actually causing a failure. Scratching my head.

@tik0 When you tested that patch, did you see any LaserRangeScan... errors? If not, can you tell me what fork and branch of the RPLIDAR driver you were using?

SteveMacenski commented 4 years ago

Your drivers aren't correctly implemented then, or the math for the number of readings is incorrect. For a 2D lidar, the expected number of readings is derived from the minimum/maximum scan angles and the angle increment in the message.

See https://github.com/SteveMacenski/slam_toolbox/blob/foxy-devel/lib/karto_sdk/src/Karto.cpp#L209 and https://github.com/SteveMacenski/slam_toolbox/blob/foxy-devel/lib/karto_sdk/include/karto_sdk/Karto.h#L4297
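
For anyone hitting this, the check is roughly the following. This is a Python sketch of the linked C++, not slam_toolbox's actual API, and the example values are taken from the rostopic echo later in this thread:

```python
def expected_readings(angle_min, angle_max, angle_increment):
    # karto derives the expected count from the angular extent divided by the
    # increment, plus one for the reading at angle_min itself
    return int(round((angle_max - angle_min) / angle_increment)) + 1

# Field values from the rostopic echo later in this thread:
print(expected_readings(-3.12413907051, 3.14159274101, 0.00435422640294))
# -> 1440; if len(msg.ranges) disagrees with this by one, karto raises the
# "LaserRangeScan contains N range readings, expected N-1" style error
```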

tik0 commented 4 years ago

@chrisl8 I am using the RPLIDAR A2M8 and rplidar_ros at this commit. I also had issues with a faulty calculation of the residual scan, which I have explained in detail here. However, PR #257 does not show this effect on my scanner.

chrisl8 commented 4 years ago

Here are my summarized findings for future RPLIDAR users who may find themselves here

Map catastrophic rotation due to RPLIDAR mounting axis

Sometimes the Map rotates itself 180 degrees underneath you. I have had this happen even in Localization mode: ROS Map Flipping

I currently have a ROS Answers post open on this issue in case it has anything to do with Slam Toolbox: https://answers.ros.org/question/359654/what-may-be-causing-my-map-to-sometimes-rotates-180-degrees-during-operation/

A similar or possibly the same problem is noted under Issue #281

This appears to be caused by the RPLIDAR being mounted "backwards", that is, with the cord toward the back of the robot.
The RPLIDAR's front is the point where the cord exits the unit. Many people have solved this issue by simply ensuring the RPLIDAR is mounted with the hardware "front" toward the literal front of the robot. That is, the edge the cord extends from is oriented toward the front of the robot.

~My RPLIDAR is currently mounted backwards, and this happens to me. I am currently researching the cause, but if I get tired of that I will rotate the RPLIDAR myself.~ My RPLIDAR is now facing "forward", and this has not happened again.

TL;DR: Ensure the "front" (cord end) of your RPLIDAR is oriented toward the FRONT of your robot. Nobody quite knows why, but it fixes this.

NOTE: As of September 2020 there are some alternate solutions if you are using the ROS2 driver provided by allenh1 as of this PR.

Angle Increment Calculation

You may see this error as soon as you start up Slam Toolbox: `LaserRangeScan contains <number> range readings, expected <number-1>`, which prevents mapping.

As of September 2020 this issue is fixed by this PR.
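
For driver authors, the essence of such a fix is keeping the scan metadata self-consistent. Below is a hedged sketch, not the exact code from that PR, assuming the first reading sits at angle_min and the last at angle_max:

```python
def consistent_increment(angle_min, angle_max, node_count):
    # With the first reading at angle_min and the last at angle_max, there are
    # node_count - 1 angular steps between them, so karto's
    # round((angle_max - angle_min) / angle_increment) + 1 comes out to
    # exactly node_count.
    return (angle_max - angle_min) / (node_count - 1)

# Reproduces the angle_increment from the rostopic echo below:
print(consistent_increment(-3.12413907051, 3.14159274101, 1440))  # ~0.0043542264
```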

Sweep Direction

The Sweep Direction of the RPLIDAR was called into question a few times.

I used this code to hack the output into a curve starting at the first point in the scan distance output array:

float read_value = (float) nodes[i].dist_mm_q2/4.0f/1000; // raw quarter-mm fixed point -> meters
// Debug hack: overwrite the real range with a ramp that grows with index i,
// which makes the sweep start, end, and direction obvious in RViz
read_value = scan_msg.range_min + (0.001 * i);
// The driver then stores each reading into ranges in reverse index order:
scan_msg.ranges[node_count - 1 - i] = read_value;

and the output is in a counter-clockwise direction, as ROS expects: RPLidarSweepDirection-InvertedFalse

NOTE: The RED axis is "pointing" in the direction of the cord on the RPLIDAR A3 unit.

It may not be immediately obvious, but the code actually lays down the data into the ranges list backwards, so that it ends up being "counter-clockwise":

header:
  seq: 50
  stamp:
    secs: 1596956675
    nsecs: 279228181
  frame_id: "rplidar"
angle_min: -3.12413907051
angle_max: 3.14159274101
angle_increment: 0.00435422640294
time_increment: 5.38406420674e-05
scan_time: 0.0774766877294
range_min: 0.15000000596
range_max: 25.0
ranges: [1.5889999866485596, 1.5880000591278076, 1.5870000123977661, 1.5859999656677246,

Clearly the 1.5... meter entry is the first in the list.

Based on some .bag files I downloaded, this seems to be the same data order that the Hokuyo uses: angle_min is negative and to the "right" of center, and the scan ends at angle_max, which is positive and to the left of center.
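
In other words, the i-th entry of ranges sits at angle_min + i * angle_increment. A tiny sketch using the sensor_msgs/LaserScan field names:

```python
def beam_angle(scan, i):
    # REP 103: positive angles are counter-clockwise (toward the robot's left),
    # so with a negative angle_min the first entry is the right-most reading
    return scan.angle_min + i * scan.angle_increment
```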

The sweep direction from the RPLidar driver is fine.

If anybody finds themselves here needing help with RPLIDAR and Slam Toolbox, feel free to ping me or contact me directly and I will share anything further I've found, or offer personal help hacking things up to where I am.

For me, with my RPLIDAR driver edits, Slam Toolbox works very well. If you have trouble, be sure that you are sticking to the default settings for Slam Toolbox, and don't mount the RPLIDAR backwards.

SteveMacenski commented 4 years ago

I don't think the cable-forward mounting is 100% necessary, but if you skip it, you must correct the wrong axes in the rplidar driver, since they haven't patched it themselves. Both fixing the axes and swapping the readings with the python script above would produce a similar output.
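
For completeness, the URDF-style correction amounts to rotating the laser frame 180 degrees in yaw relative to the base. A minimal ROS1 sketch with hypothetical frame names (base_link, laser); note this compensates the frame's orientation only, it does not fix a mirrored sweep direction:

```python
import math
import rospy
import tf2_ros
from geometry_msgs.msg import TransformStamped

rospy.init_node('laser_frame_yaw_fix')
broadcaster = tf2_ros.StaticTransformBroadcaster()
t = TransformStamped()
t.header.stamp = rospy.Time.now()
t.header.frame_id = 'base_link'   # hypothetical parent frame
t.child_frame_id = 'laser'        # hypothetical laser frame
t.transform.rotation.z = math.sin(math.pi / 2.0)  # quaternion for yaw = pi
t.transform.rotation.w = math.cos(math.pi / 2.0)
broadcaster.sendTransform(t)
rospy.spin()
```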

SteveMacenski commented 3 years ago

As an update, @WLwind has found something interesting in #325. I need to track through it, but it looks like the reason the rplidar had issues being "backwards" with its origin frame (which, by the way, should still be fixed by RPlidar to be REP105 compatible!) is that openKarto assumed the laser frame to be aligned with the base frame. I need to verify that, since I thought I'd checked it and GetSensorPose should contain orientation information, but assuming this PR works, it should put the final nail in this issue without rplidar-specific driver fixes (though they should still be made).
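
To make the suspected assumption concrete: projecting a range reading into the base frame has to apply the laser's yaw, not just its translation. A toy illustration of the failure mode, not openKarto's actual code:

```python
import math

def point_in_base(r, beam_angle, laser_x, laser_y, laser_yaw):
    # If laser_yaw were silently assumed to be zero, a lidar mounted 180
    # degrees "backwards" would produce a mirrored scan in the base frame.
    a = beam_angle + laser_yaw
    return (laser_x + r * math.cos(a), laser_y + r * math.sin(a))
```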

I'm a little uncertain, then, why this never came up on some of my test bag files where the robot's lidars were mounted at 45 degree angles from forward. They're relatively short bags (15 minutes), so maybe the optimizer is just good at correcting for it? Maybe that's also why changing the loss function had an outsized impact in hiding the issue?

@chrisl8 can you test this PR? From your last comment it sounded like you wanted to mount the lidar the opposite way. This might be a good way to see if this fixes the map flipping issue for you once and for all!

chrisl8 commented 3 years ago

@SteveMacenski I rotated my RPLIDAR to set the cord toward the front of my robot last year, which solved the spurious map rotation issue for me.

I will try running some bag files through both the old and new code this weekend, to see if I can obtain a consistently negative result with the old code and a consistently positive result with the new.

chrisl8 commented 3 years ago

Noetic #325 appears to entirely fix the reported issue!

I ran the original bag file from this issue through the current main branch a few times and got different results every time, none of which looked reasonable:

slam_toolbox_previous-Mach2021a

slam_toolbox_previous-March2021b

Then I used Noetic #325 many times and obtained this result every time:

slam_toolbox_fixed-March2021

The fact that Noetic #325 not only creates a reasonable map, but is also 100% consistent on every run of this bag file, is wonderful!