doisyg commented 5 years ago

As pointed in https://github.com/SteveMacenski/spatio_temporal_voxel_layer/issues/110, under Melodic, we observe a serious performance issue that we traced to the PCL fonction pcl::fromPCLPointCloud2. Possibly caused by the change of PCL version between Kinetic and Melodic (1.7.x to 1.8.x) but not investigated further. With Melodic, on a i5 computer move_base completely overload a core the CPU with a 30Hz 320x240 poincloud (about 74MB/s). Whereas with kinetic and the same config, move_base use about 40% of a core.

To avoid using pcl::fromPCLPointCloud2, I did a test by modifying the Pointcloud2 Callback: https://github.com/SteveMacenski/spatio_temporal_voxel_layer/blob/faaf1498649e666bbd6901a98ae135c1d51a5536/src/spatio_temporal_voxel_layer.cpp#L326-L328 to get the message directly in the pcl::PointCloud<pcl::PointXYZ> format, and let roscpp take care of the converstion (see http://wiki.ros.org/pcl_ros#ROS_C.2B-.2B-_interface).

void SpatioTemporalVoxelLayer::PointCloud2Callback( \
                const pcl::PointCloud<pcl::PointXYZ>::ConstPtr& message, \
                const boost::shared_ptr<buffer::MeasurementBuffer>& buffer)

The performance is better and makes STVL usable with melodic but there is still a decrease of performance compared to Kinetic (65% CPU). I can do a quick PR if someone is interested.

SteveMacenski commented 5 years ago

@doisyg have you looked at the migration for TF -> TF2 (also just melodic) in other projects? Costmap 2d comes to mind and this is the way their buffer handles it (see alot of similarities ;) )

https://github.com/moriarty/navigation/blob/57399e086e8bd4174f6f3f8c938bc8cefe69c5b0/costmap_2d/src/observation_buffer.cpp#L108

and its PR https://github.com/ros-planning/navigation/pull/755/files

I wouldnt recommend necessarily working with the pcl::pointcloud as the raw input since that's not the API anyone else uses and depth cameras themselves by an REP standard requires sensor_msgs::PointCloud2.

You sure its not just an issue with how its being done? I don't see any melodic reports of their buffers getting heavier with melodic

SteveMacenski commented 5 years ago

If we get that squared up, I can run a bloom release on the melodic side, I think that's the only standing issue in the melodic migration

doisyg commented 5 years ago

Hum, thanks for the pointers. It is interesting to see that the costmap_2d observation buffer switched completely to sensor_msgs::PointCloud2 and hence got rid of the problematic pcl::fromPCLPointCloud2 conversion function. This could be a solution for STLV but this would require a major rework of the current code (at least the way I understand it) as STVL is based entirely on the pcl::PointCloud<pcl::PointXYZ> structure: https://github.com/SteveMacenski/spatio_temporal_voxel_layer/blob/faaf1498649e666bbd6901a98ae135c1d51a5536/include/spatio_temporal_voxel_layer/measurement_buffer.hpp#L123 and https://github.com/SteveMacenski/spatio_temporal_voxel_layer/blob/faaf1498649e666bbd6901a98ae135c1d51a5536/include/spatio_temporal_voxel_layer/measurement_reading.h#L53-L102

A simpler solution would be to make PCL (by PCL I mean mainly the pcl::fromPCLPointCloud2 function ) work as efficiently under melodic as in kinetic. I still doesn't understand why this specific function (which can be isolated from STVL) has a 30x drop of performance in melodic. I would be interested if anybody can replicate the issue. Though migrating STVL to sensor_msgs::PointCloud2 and drop the PCL dependency could be a worthwhile investment for the future. What do you think? I may try and begin the work.

SteveMacenski commented 5 years ago

I'm curious about the conversions being used from the ROS side vs. the PCL side. That function is from PCL but ROS also provides pcl_conversions which from a little reading seems to do functionally the same thing, moveFromROSMsg:

http://docs.ros.org/hydro/api/pcl_conversions/html/namespacepcl.html

That is what I expect under the hood the ROS conversion is doing when subscribing to the pointcloud2. I would ideally like to do that conversion in the code so its above board for everyone what's happening and no one's left confused by they're publishing a PC2 and yet magically working with a pcl pointcloud.

The question still remains why you see a jump from 40% to 75% of a core on the migration. I'm interested to know if anything else has changed in your configuration from Kinetic to Melodic to result in that increase (longer range, more cameras, higher resolution, higher frequency,etc). Even if 35% is tractable, that's a very large increase.

110 addressed the transformation of the pointcloud into the global frame, not anything to do with the PC2->Pcl PC conversion. Are you sure its in that sensor_msgs to PCL conversion and not still in the transformation in the global frame? (by the way, good catch that `transformMsgToTF` doesnt transfer the header information, that's very lazy of them)

doisyg commented 5 years ago

I'm curious about the conversions being used from the ROS side vs. the PCL side. That function is from PCL but ROS also provides pcl_conversions which from a little reading seems to do functionally the same thing, moveFromROSMsg:

http://docs.ros.org/hydro/api/pcl_conversions/html/namespacepcl.html

The pcl_conversions methods are actually calling the problematic fonction pcl::fromPCLPointCloud2; see https://github.com/ros-perception/perception_pcl/blob/melodic-devel/pcl_conversions/include/pcl_conversions/pcl_conversions.h, L550 I actually tried with the same results.

That is what I expect under the hood the ROS conversion is doing when subscribing to the pointcloud2. I would ideally like to do that conversion in the code so its above board for everyone what's happening and no one's left confused by they're publishing a PC2 and yet magically working with a pcl pointcloud.

Agreed

The question still remains why you see a jump from 40% to 75% of a core on the migration. I'm interested to know if anything else has changed in your configuration from Kinetic to Melodic to result in that increase (longer range, more cameras, higher resolution, higher frequency,etc). Even if 35% is tractable, that's a very large increase.

It is a mystery to me too. It is actually the exact same hardware.

Are you sure its in that sensor_msgs to PCL conversion and not still in the transformation in the global frame? (by the way, good catch that transformMsgToTF doesnt transfer the header information, that's very lazy of them)

100% sure that the blocking point is the pcl::fromPCLPointCloud2, diagnosed with sysprof, and by measuring its execution time. 30x longer on melodic.

and its PR https://github.com/ros-planning/navigation/pull/755/files

That was very interesting to see how they did it: they got rid of the format pcl::PointCloud<pcl::PointXYZ>and rebased the observation buffer on sensor_msgs::PointCloud2. Also replaced pcl::PointCloud<pcl::PointXYZ> with std::list<geometry_msgs::Point32> when better suited (in the grid). This is what I did in this new PR: https://github.com/SteveMacenski/spatio_temporal_voxel_layer/pull/120/commits/90ee8db0958c6fec615665fc3852553158422386 And when needing of PCL filters, I converted sensor_msgs::PointCloud2 to pcl::PCLPointCloud2 with pcl_conversions::toPCL (this one actually works ok). I then used native PCL PCLPointCloud2 filters and did a bit of optimization (voxel and passthrough filters can be combined). It seemed to solve the performance issue and even better improves it: I am below 30% CPU on the same hardware (against around 40% in kinetic). Side effect, we can natively tf transform the observations since they are now in the supported sensor_msgs::PointCloud2 format. This of course needs to be cross tested but it is an encouraging first attempt.

SteveMacenski commented 5 years ago

Do you know what ROS uses under the hood then to do that conversion when you publish a PC2?

I see the PR but it will take me some cycles to get through it, there's alot going on there. I was out last week deploying some robots so I'm in catch up mode right now. Have you tested sufficiently on your side to say "checkmark, works for me"?

I still see a toPCL for the filters, if you're going to PCL why go back? I see you're also changing the approx. voxel filter to the exact one and removing the nan-removal. I did that to be lighter weight, are you seeing different results?

doisyg commented 5 years ago

Do you know what ROS uses under the hood then to do that conversion when you publish a PC2?

Not really, but the functionality is enabled by including this header: https://github.com/ros-perception/perception_pcl/blob/melodic-devel/pcl_ros/include/pcl_ros/point_cloud.h

I see the PR but it will take me some cycles to get through it, there's alot going on there. I was out last week deploying some robots so I'm in catch up mode right now. Have you tested sufficiently on your side to say "checkmark, works for me"?

I will not say it was tested sufficiently enough on our side, I just did one round of navigation to check it was working. But a least, this shows that the performance issue on melodic can be solved and even better, that the CPU usage can be optimized. I will report progress and experiment here.

I still see a toPCL for the filters, if you're going to PCL why go back? I see you're also changing the approx. voxel filter to the exact one and removing the nan-removal. I did that to be lighter weight, are you seeing different results?

The way I understand it, PCL dropped the dependencies on ROS by creating an equivalent of sensor_msgs::PointCloud2 : pcl::PCLPointCloud2. Conversion between these types is simple and doesn't involve remapping the data of the cloud, hence it costs almost no CPU (it is the conversion to pcl::PointCloud< PointT > which is costly). We need the conversion to pcl::PCLPointCloud2 to use the native PCL filters, however the list of available filters for the type pcl::PCLPointCloud2 and pcl::PointCloud< PointT > are not strictly the same, see here: http://docs.pointclouds.org/1.8.1/group__filters.html We miss the filters pcl::ApproximateVoxelGrid and pcl::removeNaNFromPointCloud for the type pcl::PCLPointCloud2. However pcl::VoxelGrid< pcl::PCLPointCloud2 > seems to be equivalent, no more costly and, similarly to pcl::PassThrough< pcl::PCLPointCloud2 > that I use to crop the Z values (height restriction) instead of the for loop, I believe it removes NaN from its output. In any case, I observe a reduction of CPU usage of 25% compared to the kinetic version with these changes.

SteveMacenski commented 5 years ago

Its the write and read methods: https://github.com/ros-perception/perception_pcl/blob/melodic-devel/pcl_ros/include/pcl_ros/point_cloud.h#L159. Yeah, that's more convoluted than I'm really looking to introduce :)

Ok, I think that makes sense. Perhaps I need to start following the PCL community closer, but I have no concept of why they would release things so much heavier on basic methods.

I'll move comments to the PR now, I understand the decisions made. With that said, I dont have any melodic robots running around right now so I'm not going to very easily be able to independently verify.

SteveMacenski commented 5 years ago

I filed a ticket on PCL https://github.com/PointCloudLibrary/pcl/issues/2923

doisyg commented 5 years ago

Thanks for the code review, I am away from robots for the moment and will respond next week. I should have probably started by opening a ticket on PCL in the first place but was not sure that it could be reproduced on other setups. Btw, it is not a 30% increase of CPU that I observe but an increase of 30 times (which seems extra weird). Can you edit the ticket?

SteveMacenski commented 5 years ago

I will thanks for the correction

SteveMacenski commented 5 years ago

Follwup from the PCL ticket, what version of PCL exactly are you using? And are you compiling from source or from the ROS install / changing the default PCL version from the ROS-shipped version?

Anything else non-standard worth noting in your configuration? PCL folks make a good point that their code hasnt changed in a very long time. I looked pretty deep into the dependencies of that function and its called functions and I dont see anything in there that could have changed to make anything heavier, its just a bunch of memcopys and other castings that shouldn't functionally change.

doisyg commented 5 years ago

I replied on the PCL ticket, I could not reproduce the issue outside PCL with minimal code. And I also replied on the PR about your review. I believe getting rid of the old PCL format could led to performance improvement and code simplification independently from the difference between kinetic and melodic. I tested it a bit more in our facilities on a real robot, it shows very good real time performance.

SteveMacenski commented 5 years ago

Ok, I'd like to understand why it seems to have been an issue for you. While getting rid of PCL for what this does today makes sense if possible, there may be future capabilities that require more PCL tools than offered by the shim class. If this is just a passing issue on your platform that would be good to know vs. a deeper navigation issue.

In theory I have no problem with your PR, I'd like to understand what's happening though, and have enough testing done to make sure that it's not going to breaking anything, especially if I bloom release it to the larger community. From reading it over I don't see any reason for things to be broken but I am a little nervous that you're seeing this behavior and right now I don't have a concrete way of independently verifying.

doisyg commented 5 years ago

I agree that the root cause of the performance issue should be found. I don't have that much time to investigate more immediately but I will keep it in mind. In the meantime, if somebody else can reproduce the issue, it would be really helpful.

SteveMacenski / spatio_temporal_voxel_layer

[melodic] pcl::fromPCLPointCloud2 performance issue #117