ethz-asl / ethzasl_icp_mapping

3D mapping tools for robotic applications

Hotspots in function point_cloud/rosMsgToPointMatcherCloud(sensor_msgs::PointCloud2, bool) #81

Closed: YoshuaNava closed this issue 4 years ago

YoshuaNava commented 4 years ago

Hi. As part of my efforts to benchmark libpointmatcher, I implemented a ROS node that uses libpointmatcher_ros/point_cloud to serialize and deserialize point cloud data. The node receives a point cloud message, deserializes it, applies a few filters, and publishes the resulting point cloud. I ran it for 100+ seconds.

I found right away that the most expensive method called in my program (even more expensive than a surface normal data points filter run on every iteration) was rosMsgToPointMatcherCloud(sensor_msgs::PointCloud2, bool) from point_cloud.cpp.

I used Intel VTune community edition to find hotspots and Intel Advisor for vectorization advice. Below I describe my search for hotspots, followed by a short analysis.

Hotspots

CPU

Screenshot_2020-07-20_20-12-38

Memory access

Screenshot_2020-07-20_20-14-37

Memory writing

Screenshot_2020-07-20_20-18-39

Vectorization advice

Screenshot_2020-07-20_20-27-49

Analysis

I found 3 main CPU-time hotspots:

  1. Casting of contiguous values from an array to fill up features/descriptors is done in a quadruple nested loop (ROS msg fields -> point cloud height -> point cloud width -> data length). This takes ~12% of CPU time. (Line 283 of point_cloud.cpp)
  2. Pre-filling of empty point cloud containers with "padding" values. This takes ~3% of CPU time. (Line 114 of point_cloud.cpp)
  3. The NaN filter is always applied, assuming that our point clouds are dense-a-la-PCL. This filter in particular seems to be inefficient, so ~10% of CPU time is lost here. (Line 315 of point_cloud.cpp)
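To illustrate point 3, here is a minimal sketch of a NaN filter that runs in a single pass and compacts the data in place, so no second buffer or per-point reallocation is needed. It is not the code in point_cloud.cpp: the `Cloud` struct and the `removeNaNPoints` function are hypothetical stand-ins, and the column-major layout (one column per point) is an assumption made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a feature matrix: column-major storage,
// `rows` values per point, one column per point.
struct Cloud {
    std::size_t rows = 0;
    std::vector<float> data;  // size == rows * number of points

    std::size_t points() const { return rows == 0 ? 0 : data.size() / rows; }
};

// Removes every point that has at least one NaN value.
// Single pass, in place: valid columns are shifted left over invalid ones,
// and the storage is shrunk once at the end.
// Returns the number of points that were removed.
inline std::size_t removeNaNPoints(Cloud& cloud) {
    assert(cloud.rows > 0 && cloud.data.size() % cloud.rows == 0);
    const std::size_t n = cloud.points();
    std::size_t kept = 0;
    for (std::size_t i = 0; i < n; ++i) {
        bool valid = true;
        for (std::size_t r = 0; r < cloud.rows; ++r) {
            if (std::isnan(cloud.data[i * cloud.rows + r])) {
                valid = false;
                break;
            }
        }
        if (!valid) continue;
        if (kept != i) {
            // Move the valid column into the next free slot.
            for (std::size_t r = 0; r < cloud.rows; ++r) {
                cloud.data[kept * cloud.rows + r] = cloud.data[i * cloud.rows + r];
            }
        }
        ++kept;
    }
    cloud.data.resize(kept * cloud.rows);
    return n - kept;
}
```

When the incoming message is already flagged as dense, the call could be skipped entirely; whether that flag can be trusted depends on the sensor driver.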

In terms of memory access, number 1 from the above list is also a strong hotspot. When it comes to memory writing, all paged memory is cleared by the function, and the allocations are neither large nor numerous (compared to other methods, e.g. ROS TCP).

Intel Advisor recommends optimizing the "RGB loop" first of all, then the quadruple nested loop described in point 1 of the CPU hotspots, as well as a loop in libnabo.
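One possible direction for the nested loop from point 1, sketched under assumptions: resolve each field's byte offset once, flatten height and width into a single point count, and copy that field for all points in one tight strided loop. The `FieldView` struct and `copyFloatField` function are hypothetical and only loosely modelled on the offset/point_step layout of sensor_msgs::PointCloud2; the sketch handles float32 fields in host byte order only and is not the existing implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified description of one float32 field
// inside a packed point buffer.
struct FieldView {
    std::size_t offset;     // byte offset of the field inside one point
    std::size_t pointStep;  // size of one point in bytes
};

// Copies one float32 field of every point into a contiguous output row.
// The field lookup and type decision happen once, outside this loop;
// the loop body is a fixed-size strided copy.
inline void copyFloatField(const std::vector<std::uint8_t>& raw,
                           const FieldView& field,
                           std::size_t pointCount,  // height * width
                           float* outRow) {
    assert(field.offset + sizeof(float) <= field.pointStep);
    assert(raw.size() >= pointCount * field.pointStep);
    for (std::size_t i = 0; i < pointCount; ++i) {
        const std::uint8_t* src = raw.data() + i * field.pointStep + field.offset;
        // memcpy avoids unaligned float loads from the byte buffer.
        std::memcpy(&outRow[i], src, sizeof(float));
    }
}
```

The caller would loop over the fields and invoke this once per field, writing into the corresponding row of the features or descriptors matrix. Whether this actually beats the current loop would need to be confirmed with the same VTune measurements.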

pomerlef commented 4 years ago

It doesn't come as a big surprise, as that code was originally intended for my personal research projects. The ROS layer was never really optimized, just patched over time for different needs.

This opens another topic: I discontinued support for that code a while ago.

We moved the ROS <-> libpointmatcher connection to a separate repo: https://github.com/norlab-ulaval/libpointmatcher_ros

But, looking around, there are a couple of those repos out there...

YoshuaNava commented 4 years ago

Hi @pomerlef,

Thank you for your very fast response :slightly_smiling_face:

I understand that it was used for your personal projects. I am reporting it to motivate optimizing it, and also as part of the tests I mentioned I would be doing mid-year.

Would you like me to move this issue to the new repo? I was about to write a short proposal for optimizing this repo.

On a more general note: I'm extracting similar information from other filters. Would it be helpful if I open similar issues describing the performance of each?

pomerlef commented 4 years ago

If you don't mind, I would prefer to carry on development over there. I'll give you the proper access.

Of course! Every bit of data and analysis is super useful! It's already good to know that the binder is a huge bottleneck.

YoshuaNava commented 4 years ago

Will do! Thank you.