SICKAG / sick_scan_xd

Based on the sick_scan drivers for ROS1, sick_scan_xd merges sick_scan, sick_scan2 and sick_scan_base repositories. The driver supports both Linux (native, ROS1, ROS2) and Windows (native and ROS2).
Apache License 2.0

Scanner timeout in standby mode #361

Open jashangills opened 1 month ago

jashangills commented 1 month ago

Hi, just need some advice regarding the following

When the LiDAR is in standby mode, either after the stopMeas command has been called (I am using SickScanApi with Python 3.12) or before the point cloud messages have been published, the TCP receive thread finishes after an inconsistent amount of time and I get the following error:

Screenshot 2024-07-15 at 1 16 10 PM

Is there a way to override or change the timeout period that causes the TCP receive thread to trigger and deinitialize the scanner? If so, how can this be achieved? Also, is there a certain time after which this is expected?

Thanks in advance

rostest commented 1 month ago

Thanks for your feedback. The sick_scan_xd driver monitors the lidar telegrams and restarts and reinitializes after a timeout of 150 seconds by default. You can disable the monitoring with message_monitoring_enabled=False and/or setting the timeout to the max. value 2147483647 in the launchfile:

<param name="message_monitoring_enabled" type="bool" value="False" />     <!-- Enable message monitoring with reconnect+reinit in case of timeouts, default: true -->
<param name="read_timeout_millisec_kill_node" type="int" value="2147483647"/> <!-- 150 sec pointcloud timeout, ros node will be killed if no point cloud published within the last 150 sec., default: 150000 milliseconds -->
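Since you are using the Python API, the same parameters can, to my knowledge, also be overridden on the command line with the name:=value syntax instead of editing the launchfile (the IP address below is a placeholder for your lidar):

```shell
# Override the monitoring parameters at start without editing the launchfile
# (hostname and paths are placeholders for your setup):
python3 ./test/python/sick_scan_xd_api/sick_scan_xd_api_test.py ./launch/sick_lms_4xxx.launch hostname:=192.168.0.1 message_monitoring_enabled:=False read_timeout_millisec_kill_node:=2147483647
```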

See https://github.com/SICKAG/sick_scan_xd/blob/develop/USAGE.md#driver-states-timeouts for details.

jashangills commented 1 month ago

Thank you for your response. We’ve encountered the same error both during startup and intermittently during measurements. Could this issue be related to a network problem?

rostest commented 1 month ago

Thank you for your reply. Network problems can indeed cause timeout errors.

Wireshark is a powerful network diagnostic tool. Start Wireshark, select the ethernet interface used for the lidar connection and watch the network traffic between sick_scan_xd and your lidar. As long as the lidar is sending data via tcp (or udp in case of multiScan or picoScan), sick_scan_xd should not timeout. If Wireshark does not see any tcp data from the lidar, the most likely cause is a network error.

If you encounter sick_scan_xd timeout errors while your lidar is still sending data, please let us know. In this case, please save and send a complete sick_scan_xd logfile and the network traffic captured by wireshark as a pcapng file.
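If the Wireshark GUI is not available on your target, the capture can also be made from the command line with dumpcap, which ships with Wireshark and writes pcapng by default (interface name and lidar IP below are placeholders for your setup):

```shell
# Capture only the lidar traffic on the wire (placeholder interface and IP):
sudo dumpcap -i eth0 -f "host 192.168.0.1" -w sick_scan_capture.pcapng
```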

Which lidar are you using? If you are using an older version of the sick_scan_xd driver, we recommend that you update to the latest version 3.5.

jashangills commented 1 month ago

Thank you for your reply. We are in the middle of testing and if the problem persists I will let you know. We are using an LMS41xxx.

spark-res commented 1 week ago

Hi @rostest, I'm working with @jashangills on this topic. We're still seeing these intermittent TCP issues and trying to narrow down the cause.

In our deployment environment we are encountering this issue intermittently during an active measurement (possibly standby too).

image (6)

In this instance, no stop command has been issued by the software - the TcpRecvThread finishing is unexpected.

Things we have tried:

Currently our next step is to bypass the network/switch and connect the lidar directly to the PC. Are there any other troubleshooting steps you might be able to suggest?

Thanks!

rostest commented 1 week ago

Thanks for following up and for the further information! Your investigation and procedure look very solid and thorough.

The pcapng file shows the last scandata from the lidar at time 27.1394 followed by a [FIN] message from the PC (i.e. sick_scan_xd) to the lidar at 27.1399:

image

This does not look like an exit due to timeouts. Can you see some error message or any other reason that might have caused this exit? Did sick_scan_xd stop and exit or re-initialize the TCP socket? Can you post the full logfile (i.e. the console output your screenshot shows including previous messages), if it is still available?

spark-res commented 1 week ago

Thanks @rostest,

Here is a full screen capture of the console output, but unfortunately no text log (messages from other apps are interleaved).

image

It seems to be only the TcpRecvThread message. No additional exit or re-initialise messages.

After the TcpRecvThread message the app still runs, but no more pointclouds are generated.

rostest commented 1 week ago

Thanks for your reply. The output in the screenshot is unexpected; there should be some error or informational messages before sick_generic_laser and TcpRecvThread exit.

To reduce complexity and narrow down potential errors, please update to the latest release 3.5 (if not already done), rebuild the sick_scan_xd library and run sick_scan_xd_api_test.py on your target:

export LD_LIBRARY_PATH=.:./build:$LD_LIBRARY_PATH
export PYTHONPATH=.:./python/api:$PYTHONPATH
python3 ./test/python/sick_scan_xd_api/sick_scan_xd_api_test.py ./launch/sick_lms_4xxx.launch hostname:=10.95.76.102 2>&1 | tee -a sick_scan_xd_api_test_lms_4xxx.log

The python script sick_scan_xd_api_test.py just initializes the lidar, registers callbacks for point cloud and log messages and prints a short text in the callbacks. Can you reproduce the error in this reduced test case? If so, please post a full logfile of the console messages. Is your target the NVIDIA Jetson Xavier as mentioned in #376?

spark-res commented 5 days ago

Hi @rostest,

Appreciate the help so far.

Attached are the results from running the above. These are three runs; they all ended prematurely, I think due to segmentation faults (not caught in the logs, but visible in the console).

run_1_sick_scan_xd_api_test_lms_4xxx.log run_2_sick_scan_xd_api_test_lms_4xxx.log run_3_sick_scan_xd_api_test_lms_4xxx.log

The SickScanApiSendSOPAS function is causing a segmentation fault as per this issue https://github.com/SICKAG/sick_scan_xd/issues/376 - could the TcpRecvThread issue be related? I'll run through the gdb steps and see where that leads, too.

I commented out the SickScanApiSendSOPAS command and it runs ok. I've run some longer tests with no TcpRecvThread issue, but need to do more testing to rule it out as it's quite intermittent at times.

rostest commented 5 days ago

Thanks for your reply. The logfiles do not show abnormalities. This issue can indeed be related to #376.
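To localize a crash from the Python side (complementing the gdb approach), Python's built-in faulthandler module can print a Python-level traceback when the process receives SIGSEGV inside the shared library. A minimal sketch (the commented call site is a placeholder, not part of the test script):

```python
# Enable faulthandler before calling into the C API: if the process later
# segfaults inside the shared library, Python dumps the current tracebacks
# to stderr, showing which API call was in flight.
import faulthandler

faulthandler.enable()

# ... load the sick_scan_xd library and call the suspect function here,
# e.g. the SickScanApiSendSOPAS call mentioned above ...

print(faulthandler.is_enabled())  # True
```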

spark-res commented 2 days ago

Hi @rostest,

Looks like we're still seeing the TcpRecvThread exiting early issue, although it now looks like an interaction with the OS or other parts of our code. We don't see it in sick_scan_xd_api_test.py or even in our own simplified examples.

It seems to reliably occur when we call a specific function in our overall application (a function that's working with a copy of the point cloud data). This function in theory shouldn't interact with the sick_scan_api thread or data.

We added the diagnostic callback and these are the messages we see:

image

Is there any way to determine where the exit status code = 4 came from?

CPU/Memory usage looks ok.

rostest commented 1 day ago

Thanks for your reply. A diagnostic message with status code = 4 just means that sick_scan_xd is currently exiting. Just a thought: Did you make a deep copy of the scan data received in the python API callback? Note that the memory of the scan data buffer (type ctypes.POINTER) is released after the callback has been executed. Using a shallow copy instead of a deep copy may produce unexpected results. See https://github.com/SICKAG/sick_scan_xd/blob/develop/doc/sick_scan_api/sick_scan_api.md#usage-example for the memory layout and deep copies of the point cloud.
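To illustrate the difference, here is a self-contained sketch in which a plain ctypes array stands in for the driver's message buffer (the SickScanApi types themselves are not needed to show the effect):

```python
import ctypes

# Stand-in for the point cloud buffer the API callback receives: in
# sick_scan_xd the callback gets a ctypes.POINTER whose memory is
# released after the callback returns.
buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

# Shallow "copy": only keeps the pointer, so it dangles once the
# driver frees or reuses the buffer.
shallow = ptr

# Deep copy: materialize the values into Python-owned memory while
# the buffer is still valid.
deep = [ptr[i] for i in range(4)]

# Simulate the driver releasing/reusing the buffer after the callback.
for i in range(4):
    buf[i] = 0.0

print(deep)        # [1.0, 2.0, 3.0, 4.0]  - safe, survives the release
print(shallow[0])  # 0.0 - the shallow view reflects the overwritten memory
```

The same pattern applies to numpy views created over the ctypes buffer: call .copy() on them inside the callback before the callback returns.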