PepperlFuchs / pf_lidar_ros_driver

ROS driver for Pepperl+Fuchs R2000 and R2300 laser scanners
https://www.pepperl-fuchs.com/global/en/23097.htm
Apache License 2.0
39 stars 38 forks source link

R2000 Driver produces corrupt data under high load #9

Closed bk-mtg closed 4 years ago

bk-mtg commented 4 years ago

We experienced an issue during field testing a week ago where the system was unable to keep up with the log data rate for whatever reason, and this appears to have led to corrupt data logged from the R2000 sensor attached. Missing data would obviously be expected if the system is unable to keep up, however in this case it appears that data is incorrectly assigned to points in the wrong laserscan or something, leading to things like the ground plane of the scan being duplicated and mis-aligned.

I haven't had time to reproduce this in the lab yet, but it appears to me that the current driver code (e.g. here) doesn't do any validation that packets are arriving and being parsed in order without missing packets, and so if packets are out of order or dropped, the subsequent scan points end up being incorrectly assigned to the wrong angular slots in a ScanData object. I am hoping to have time later this afternoon to do some further testing of this in the office.

Of interest: the R2300 we were logging at the same time doesn't appear to have any issues with corrupt data - unclear to me exactly why that is; maybe each scan fits entirely in one packet or something?

hsd-dev commented 4 years ago

@bk-mtg thanks for reporting the issue. Please update the thread once you have more information. I'll try to include packet validation.

maxlein commented 4 years ago

I have these problems also and added a quick fix for not publishing corrupt data.
But I don't know what triggers this. I am using the TCP Connection btw.

        auto numberOfRangeReadings = static_cast<int32_t>( round((scanmsg.angle_max - scanmsg.angle_min) / scanmsg.angle_increment));
        if( numberOfRangeReadings != num_points)
        {
          log_error("num_points({}) differs from expected ({}), spipping scan", num_points, numberOfRangeReadings);
          return;
        } 
r2000_node-2] [02/03/20 08:50:23:444:444588] [default] [error] num_points(3984) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:24:187:187075] [default] [error] num_points(944) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:25:391:391355] [default] [error] num_points(1608) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:27:036:036870] [default] [error] num_points(2272) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:30:351:351188] [default] [error] num_points(4648) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:32:003:003775] [default] [error] num_points(2272) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:32:486:486738] [default] [error] num_points(664) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:34:143:143148] [default] [error] num_points(2272) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:35:793:793388] [default] [error] num_points(2272) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:36:304:304518] [default] [error] num_points(664) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:37:484:484872] [default] [error] num_points(1608) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:38:669:669769] [default] [error] num_points(1608) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:41:585:585493] [default] [error] num_points(3880) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:47:492:492695] [default] [error] num_points(8248) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:48:913:913824] [default] [error] num_points(1992) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:55:572:572637] [default] [error] num_points(9192) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:50:59:677:677294] [default] [error] num_points(5592) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:51:05:382:382893] [default] [error] num_points(7968) differs from expected (3600), spipping scan
[r2000_node-2] [02/03/20 08:51:05:858:858530] [default] [error] num_points(664) differs from expected (3600), spipping scan
hsd-dev commented 4 years ago

@maxlein thanks for the information. This should help in debugging the problem. Could you please report on how you create a "high load" scenario?

maxlein commented 4 years ago

Sorry, missed your question.
I don't think we have a "high load" scenario.
It just happens under "normal" usage, although we have another scanner ( Sick microscan ) on the bus.
But we are not maxing out network bandwidth.

To sum it up, I can't really tell you what leads to these errors or how to reproduce them other than starting the laser node.

bk-mtg commented 4 years ago

Wonder if maybe this is somehow related to TCP retransmissions or something? e.g. if a second device is sending on the network, you may occasionally get packet collisions and the R2000 may re-send packets, which could arrive out of order or similar?

Unfortunately, I won't be able to reproduce this for the foreseeable future, as our project involving R2000/R2300 is on hold.

@maxlein if you are able to reproduce this reliably, maybe a packet capture with wireshark or tcpdump might be helpful for debugging?

hsd-dev commented 4 years ago

I have recently added quite a few commits. Could you please pull the latest master and try this again and report back your finding?

bk-mtg commented 4 years ago

Don't currently have a ROS/R2000 setup handy, but it looks to me like #14 addresses the core bug I thought I saw in the older version of the code.

That said (based on a quick read-through of the code, so I may have missed some things), I think there are still some issues when packets go missing or arrive out of order:

  1. No data is ever written in to the slots in the LaserScan message corresponding to a missing packet (as far as I can tell)
  2. If the scan never receives its final packet, it never gets published; could add that someplace like here for example
  3. If the first packet of a scan never arrives (or arrives late), a new message is not started and any other packets from that scan are discarded

I think the correct behavior when 100% of packets arrive (and are in order) is pretty easy, but things get a little more difficult if packets go missing. Should partial scans be returned at all? (I suspect yes.) If there are missing packets in a scan, should we give up on them immediately (when we start receiving the next scan) or wait a while to see if they turn up? (I'd be inclined to be lazy and not wait.) So then, the algorithm might be something like:

If I get some time in the next week or two, I'll try to spin up a copy of ROS so that I can do some testing.

Edit: BTW, nice work here! A lot of the refactoring in #14 looks like it really cleans things up.

bk-mtg commented 4 years ago

Been doing some more thinking about this, and I'm not sure my suggestion of waiting for the scan number to increment is a good one. Raises all sorts of tricky questions like what if you get unplugged and miss all the packets from one or more scans. Or what if the lidar gets unplugged and resets its scan number/timestamp. And so on and so forth. Hm... Maybe it doesn't matter too much, since any scan that has all its packets should work fine anyway.

bk-mtg commented 4 years ago

Got a test setup for this this evening. For future reference, a good way to induce fun packet orders seems to be to connect to the lidar via a wifi link.

Seems like the original issue where after some time the entire data stream tends to become corrupt is fixed, but that there's still some weird things going on. Specifically,

  1. I seemed to be getting quite a lot of missing data
  2. The rviz plot shows lots of spurious data points dancing around the outside; they seem to show up at one or two radius values, which makes me think they're bad data rather than
  3. With the older version of the driver, rviz was able to do rainbow coloring based on intensity, but the new version seems to break that somehow?
  4. Loads of messages in the console about Received data smaller than header size.

In addition into an issue with something in the build. I'm not familiar enough with ROS to fix this properly (I assume it involves re-ordering CMakeLists.txt?) but when I did the upgrade, it was complaining about not being able to find PFR2000Header.h, which AFAICT is generated from the corresponding .msg file. I manually invoked cd build && pf_driver_generate_messages and after that I was able to catkin_make successfully.

I'll see if I can do a little troubleshooting (and I'll try to take a tcpdump or something at least), but I'm likely going to need to turn in the lidar I'm borrowing for this sometime early next week, so I wanted to ping you about it now with the hope that you might have some time to look at it soon.

bk-mtg commented 4 years ago

Submitted #24 to fix the issue I had with missing PFR2000Header.h.

bk-mtg commented 4 years ago

I have fixes for several of the other protocol issues I've seen; will submit a PR for them sometime in the next day or so (have one more issue I want to track down first).

bk-mtg commented 4 years ago

Just submitted #26 with fixes for most of the outstanding issues here. I think the core issue (assigning points the wrong angles) has already been fixed, so I'm going to go ahead and close this issue, but definitely take a look at that PR.