PepperlFuchs / pf_lidar_ros_driver

ROS driver for Pepperl+Fuchs R2000 and R2300 laser scanners
https://www.pepperl-fuchs.com/global/en/23097.htm
Apache License 2.0

Very high cpu usage with R2300 #63

Closed Arthav24 closed 2 years ago

Arthav24 commented 3 years ago

I am using an R2300 with an Intel i5-6500TE. In htop I can see the pf driver consuming almost 80-90% CPU. Is anyone else facing the same? I haven't modified the driver. Please guide me on how to debug the cause. Just launching r2300.launch causes this much CPU usage, without even launching the data processing node. The high CPU usage is causing computational bottlenecks in my application, so I am looking to reduce it to a reasonable level. System details: Ubuntu 18.04, ROS Melodic, driver v1.1.1

Also, since I am only using one of the four planes for now, is there any way to optimize in terms of reducing load and data?

ptruka commented 3 years ago

Hello Arthav, I have an i5-8350U here. Even when I start the driver in a virtual machine, I don't get such high CPU usage. Is RViz running? If so, how does the system behave when RViz is closed?

If you are only using one layer, you could disable the unused ones. This can be done via ROS dynamic_reconfigure, or directly via HTTP, for example: http://<sensor IP>/cmd/set_parameter?layer_enable=on,off,off,off Further information can be found in section "2.6.7 Layer configuration (layer_enable)" of the R2300 Ethernet communication protocol. Please let us know if this setting changes anything about your CPU load.

Arthav24 commented 3 years ago

Hey, no, I am not running RViz in normal operation; I only use it occasionally for visualisation, and it doesn't seem to affect the CPU usage. I did try switching off the unused planes, and also commented out the static tf publishers for the unused planes in the launch file, but that actually increased the usage by about 10%. Reverting brought it back down to the 80s.

hsd-dev commented 3 years ago

@Arthav24 are you using the binaries, or have you built the repo from source? If you are building from source, please build it in debug mode:

catkin build -DCMAKE_BUILD_TYPE=Debug

and use valgrind to report if there are any memory leaks:

launch-prefix="valgrind"

So the node in the launch file would look something like

<node pkg="pf_driver" type="ros_main" name="r2300_node" output="screen" launch-prefix="valgrind --tool=memcheck --leak-check=yes  --log-file=memleak.log">

Please report the valgrind log output here.

Arthav24 commented 3 years ago

Please find the attached logs:

memleak_old_4tf.log – tag 1.1.1, all tfs (ring1/2/3/4) enabled in the launch file
memleak_old_1tf.log – tag 1.1.1, only the one plane I use enabled
memleak_latest_4tf.log – main branch, all tfs (ring1/2/3/4) enabled in the launch file
memleak_latest_1tf.log – main branch, only the one plane I use enabled

hsd-dev commented 3 years ago

The logs don't seem to be alarming. I will do a test with

launch-prefix="valgrind --tool=callgrind --callgrind-out-file=callgrind.log"

and then maybe we can compare the logs.

Arthav24 commented 3 years ago

Hey @ipa-hsd, any update from your test?

hsd-dev commented 3 years ago

@Arthav24 when I compile the latest main branch with

catkin build -DCMAKE_BUILD_TYPE=Release

the CPU usage fluctuates between 30-60%. Not as high as you reported, but maybe it can be improved.

Compiling with the Debug option and using callgrind produced this log file: callgrind.log. I will need to go through the log in detail, but at a quick look it seems most of the time-consuming calls are ROS-specific.

How long did you run the driver before you noticed the high CPU usage? Or has it been like this from the beginning?

hsd-dev commented 3 years ago

I just realized you mentioned that you are working with version 1.1.1, but we fixed some issues with the low frequency after the release: https://github.com/PepperlFuchs/pf_lidar_ros_driver/commits/main. Could you try pulling in the latest main branch and compiling with Release option?

Arthav24 commented 3 years ago

Yeah, but for me the CPU usage was always >60%. When I was using the driver in my stack I had BUILD_TYPE=Release, but saw similar results. I have used the driver for very long runtimes, but the usage is high from the beginning, as soon as the node starts.

Yes, I did check the latest commits in the main branch fixing frequency and accumulation and tried them, but saw similar results. That was also built with the Release option.

hsd-dev commented 3 years ago

Can you try with

launch-prefix="valgrind --tool=callgrind --callgrind-out-file=callgrind.log"

and share the log?

hsd-dev commented 3 years ago

I had another look at the callgrind.log and it looks like this while loop is the culprit. It is a fast while loop without any sleep, and it tries to read data even when there isn't any. For 25 Hz data, that is a lot of wasted cycles.

hsd-dev commented 3 years ago

@Arthav24 could you please try https://github.com/PepperlFuchs/pf_lidar_ros_driver/pull/65? For me the consumption has come down from 30-60% to less than 10%. This is the log file after the fix: callgrind_fix.log. I have marked the PR with [WIP] because there might be some possibility to improve it slightly more.

Arthav24 commented 3 years ago

Sorry for the delayed response. I see you have come up with a great solution, which I will definitely try and report back on.

I definitely think it could be improved further: I have also used an R2000 with a different driver, and there I never saw usage >5%. Thanks to @konu-droid for providing logs and checking things out.

hsd-dev commented 3 years ago

@Arthav24 any updates?

Arthav24 commented 3 years ago

Again, sorry for the delayed response. These are the observations from 2 days of testing:

1. CPU usage dropped to 13% with all the layers enabled. 👍
2. CPU usage jumped to 80% (peak of 98%) if we disable 3 layers from both the sensor and the launch file.
3. Reverting 2) brought it back to 13%. 👎
4. The usage in 2) seems to creep up slowly: initially after launching it was around 13-15%, but after some time it went up to that level. 👎
5. The fix in the while loop seems to affect the data frequency: we noticed that in 50Hz mode the actual rate was 6Hz and in 100Hz mode it was 12Hz, half of the expected rate in both cases.
6. While debugging we tried cutting the wait time you introduced in the fix in half, to 5 ms, which brought back the expected frequencies in both modes.

Once again thanks @konu-droid for running all the tests.

hsd-dev commented 3 years ago

Thanks for the detailed analysis!

While debugging we tried cutting the wait time you introduced in the fix in half, to 5 ms, which brought back the expected frequencies in both modes.

You mean here? https://github.com/PepperlFuchs/pf_lidar_ros_driver/blob/2162e5d203291e01de30f7774a7366a06cde3b4b/pf_driver/include/pf_driver/pf/pipeline.h#L137

Maybe the sleep was not required in that loop in the first place. I will revert it.

CPU usage jumped to 80%( peak of 98%) if we disable 3 layers from both the sensor and launch file.

That seems strange. But in any case, I should introduce a condition variable that wakes up the reader thread only when there is data available.

The usage in 2) seems to creep up slowly as initially after launching it was around 13-15% but after sometime it went up to that level.

That sounds more like a memory leak (?)

Arthav24 commented 3 years ago

Exactly, this one: std::this_thread::sleep_for(std::chrono::milliseconds(5));

I also thought the sleep was not required, but cutting it in half brought the frequencies back to the expected values, so I didn't try removing it completely. It would be perfect if threads were spawned only for the enabled layers. Yes, it looks and sounds like a memory leak, like something piling up, but I couldn't debug it further, so right now I am using the driver with all layers enabled.

hsd-dev commented 2 years ago
CPU usage jumped to 80% (peak of 98%) if we disable 3 layers from both the sensor and launch file. Reverting 2) brought it back to 13%. 👎

Below is the section that is most likely the culprit:

https://github.com/PepperlFuchs/pf_lidar_ros_driver/blob/1e14af5bdd5778b405ae4bea2d00083a6306a1e7/pf_driver/src/ros/scan_publisher.cpp#L139-L152

hsd-dev commented 2 years ago

Pushed new changes to https://github.com/PepperlFuchs/pf_lidar_ros_driver/pull/65 which maintain the expected frequency. Also, consumption is around 10-15% with all layers enabled and 5-8% with only one layer enabled. The memory leak is currently fixed by resetting the cloud message when the scan parameters change (e.g. if layers_enabled is changed before all layers have been published for the current scan).