[Eloquent] ROS2 CPU efficiency compare to ROS1

BarzelS commented 3 years ago

Hi,

I am using the ROS2 version of this package. CPU consumption is above 160%, whereas in ROS1 with the same configuration it was about 40%. I'm using the voxel map publishing for collision detection, but even with it turned off the CPU usage is still very high compared to ROS1 (~140%).

The sensor is: Realsense D435i

settings:

local_costmap:
  local_costmap:
    ros__parameters:
      plugin_names: ["static_layer", "stvl_layer"] # For Foxy and earlier
      global_frame: odom
      plugins: ["static_layer", "stvl_layer"] # For Galactic and later
      plugin_types: ["nav2_costmap_2d::StaticLayer", "spatio_temporal_voxel_layer/SpatioTemporalVoxelLayer"] # For Foxy and earlier
      robot_base_frame: camera_link
      update_frequency: 15.0
      rolling_window: True
      static_map: False
      stvl_layer/width: 0.5
      stvl_layer/height: 0.5
      stvl_layer/resolution: 0.15
      raytrace_range: 5.5
      robot_radius: 0.5
      inflation_radius: 0.55
      stvl_layer:
        plugin: "spatio_temporal_voxel_layer/SpatioTemporalVoxelLayer" # For Galactic and later
        enabled: true
        voxel_decay: 30.
        decay_model: 0
        voxel_size: 0.05
        track_unknown_space: true
        max_obstacle_height: 30.0
        unknown_threshold: 15
        mark_threshold: 0
        update_footprint_enabled: true
        combination_method: 1
        origin_z: 0.0
        publish_voxel_map: false
        transform_tolerance: 0.2
        mapping_mode: false
        map_save_duration: 60.0
        observation_sources: pointcloud
        pointcloud:
          data_type: PointCloud2
          topic: /camera/depth/color/points
          marking: true
          clearing: true
          obstacle_range: 5.0
          min_obstacle_height: -30.0
          max_obstacle_height: 30.0
          expected_update_rate: 0.0
          observation_persistence: 0.0
          inf_is_valid: false
          voxel_filter: true
          clear_after_reading: true
          max_z: -30.0
          min_z: 30.0
          vertical_fov_angle: 0.7
          horizontal_fov_angle: 1.048
          decay_acceleration: 1.
          model_type: 0

BTW, I have posted multiple questions on ROS Answers but got no answers, which is why I'm posting here.

SteveMacenski commented 3 years ago

I'd recommend profiling it to figure out where the exact time is spent. I'm going to guess it relates to PCL versions since that's been a cause for some users to experience higher CPU loads in the past, but it would be good to know where the issue is to fix it.

Are you comparing with the exact same parameters in both?


      stvl_layer/width: 0.5
      stvl_layer/height: 0.5
      stvl_layer/resolution: 0.15

These aren't in the namespace

BarzelS commented 3 years ago

How can I profile the node? I searched for a tool that can perform call stack analysis but could not find one that works with ROS2 nodes. Do you have any recommendation?
Which version of PCL do you recommend? (I'm using 1.8.1.)
Yes, I'm comparing with the exact same parameters in both ROS1 and ROS2. I removed the namespace and I still face the same problem. Actually, I'm using just the 3D voxel layer for collision detection, so the costmap size doesn't matter.

Thanks

SteveMacenski commented 3 years ago

The way you'd profile any other C++ code: with valgrind / callgrind. It produces output files you can visualize to see where the CPU time is spent, so you can compare ROS1 against ROS2 and see where the difference comes from.
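As a coarse cross-check alongside callgrind (which runs the program considerably slower than native), wall-clock time can also be accumulated per code section at normal speed. The sketch below is a different technique from callgrind, not a replacement for it, and every name in it (`SectionStats`, `StatsTable`, `SectionTimer`, `mean_ms`) is hypothetical and not part of STVL or ROS.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Accumulated wall-clock time and call count for one named section.
struct SectionStats {
  std::chrono::nanoseconds total{0};
  std::size_t calls = 0;
};

using StatsTable = std::map<std::string, SectionStats>;

// RAII timer: measures from construction to destruction and adds the
// elapsed time to the table entry with the given name.
class SectionTimer {
public:
  SectionTimer(StatsTable & table, std::string name)
  : table_(table), name_(std::move(name)),
    start_(std::chrono::steady_clock::now()) {}

  ~SectionTimer() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    SectionStats & s = table_[name_];
    s.total += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
    ++s.calls;
  }

  SectionTimer(const SectionTimer &) = delete;
  SectionTimer & operator=(const SectionTimer &) = delete;

private:
  StatsTable & table_;
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};

// Mean time per call in milliseconds for a section that ran at least once.
inline double mean_ms(const SectionStats & s) {
  assert(s.calls > 0);
  return std::chrono::duration<double, std::milli>(s.total).count() /
         static_cast<double>(s.calls);
}
```

Wrapping, say, the filtering step and the publishing step in separate `SectionTimer` scopes in both the ROS1 and ROS2 builds would give per-section means that can be compared directly.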

I don't have a particular PCL version to recommend beyond whatever is shipping with your ROS distribution. In general it's very hard to have multiple next to each other without issues.

The fact that your costmap is only 0.5x0.5 makes me really curious how this is working for you. That's a very peculiar configuration.

BarzelS commented 3 years ago

Thanks @SteveMacenski

I used callgrind with KCachegrind and got these results (ROS2):

If I understand the output correctly, the bottleneck seems to be DDS converting point cloud messages, right? BTW, this output was with RTI's RMW, but I have also tried FASTRTPS, which produces even higher CPU usage.

As I said, I'm using only the generated voxel map (point cloud), so the size of the costmap (0.5 x 0.5) really doesn't matter. It would be very useful, though, to have a rolling window option and to define the size of the voxel map the same way it works for the 2D costmap (I need it in 3D). I already asked about this here: https://answers.ros.org/question/367209/rolling-window-in-spatio-temporal-voxel-layer/ but there was no answer. (The question was asked with reference to ROS1, but I need it in ROS2.)

Thanks

SteveMacenski commented 3 years ago

I agree, based on that entry. What about the entries above it? How about trying Cyclone, just to round out the available options? I don't think that will change anything, but it's worth verifying.

I'd also be curious if you messed with the packet sizes in your DDS configuration how much that would help. Pointclouds are heavy to publish even in ROS1. ROS2 on DDS actually makes that a bit worse out of the box without some tweaking of the message fragmentation size.
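To give a feel for why fragmentation matters, here is a back-of-the-envelope sketch of how many fragments a single point cloud message would need. All figures used with it are assumptions for illustration, not values measured in this thread: a 640x480 cloud, 16 bytes per point, a 65,500-byte fragment payload, and a 30 Hz publish rate. The real fragment limit depends on the DDS vendor and transport settings. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Payload size in bytes of one point cloud: width * height points,
// each occupying point_step bytes (as in PointCloud2::point_step).
constexpr std::uint64_t cloud_bytes(
  std::uint64_t width, std::uint64_t height, std::uint64_t point_step)
{
  return width * height * point_step;
}

// Number of fragments needed to carry payload_bytes when each fragment
// holds at most fragment_bytes (ceiling division).
inline std::uint64_t fragments_needed(
  std::uint64_t payload_bytes, std::uint64_t fragment_bytes)
{
  assert(fragment_bytes > 0);
  return (payload_bytes + fragment_bytes - 1) / fragment_bytes;
}

// Sustained payload throughput in bytes per second at a given rate.
constexpr std::uint64_t bytes_per_second(
  std::uint64_t payload_bytes, std::uint64_t rate_hz)
{
  return payload_bytes * rate_hz;
}
```

Under those assumed figures, `cloud_bytes(640, 480, 16)` is 4,915,200 bytes, which needs 76 fragments per message and about 147 MB/s of payload at 30 Hz, so per-fragment overhead is paid thousands of times per second.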

@EduPonz @JaimeMartin this is definitely an issue, can you give us some feedback on how FastDDS can be configured to make this reasonable? Publishing a pointcloud shouldn't take this kind of time. @SBarzz there's almost no chance we can get any support from RTI on this so I suggest you work with eProsima or Cyclone.

SteveMacenski commented 3 years ago

Do you have ROS 2 security enabled on that topic?

BarzelS commented 3 years ago

I'm not sure how to check whether security is enabled on that topic.

Basically, I'm just using the Intel RealSense ROS wrapper to publish the point cloud to STVL (nav2).

This is the configuration I currently have in base_realsense_node.cpp:

    rclcpp::QoS m_qos(rclcpp::QoSInitialization::from_rmw(rmw_qos_profile_sensor_data));
    _pointcloud_publisher = _node.create_publisher("depth/color/points", m_qos);

Are you referring to the security described here: https://design.ros2.org/articles/ros2_dds_security.html ?

Thanks

SteveMacenski commented 3 years ago

If you don't know how, you didn't do it :smile: so no worries. Just asking in case you did, as that would have a particularly powerful impact on CPU load on large topics.

BarzelS commented 3 years ago

Hi @SteveMacenski, Thanks for the support!

I've also tried eProsima (fastrtps); CPU usage with eProsima is higher than with RTI (210% vs 120%). The callgrind output when using eProsima is shown in the image below (very hard to understand):

BTW, here: _voxel_pub = rclcppnode->create_publisher( "voxel_grid", rclcpp::QoS(1)); why did you use QoS(1)? What does it mean, and which configuration does it select? I tried changing it to the sensor data QoS, but there was no change in CPU usage.

EduPonz commented 3 years ago

Hi @SBarzz ,

From your capture I can see that you're using Fast DDS v1.9.3. Are you using Eloquent? Have you tried to reproduce the issue on Foxy or Rolling?

BarzelS commented 3 years ago

That will be problematic for me because I'm using a Jetson TX2, which only supports Ubuntu 18.04 at the moment.

SteveMacenski commented 3 years ago

Try Docker. Unfortunately, Eloquent is EOL, so neither Eduardo nor I will be able to reproduce this for you in Eloquent. There were a bunch of updates in Foxy that might actually just resolve this, though I'm not sure.

Don't worry about the QoS at this point, as you mention, changing that didn't seem to impact your results. More info can be found here.

vanem commented 3 years ago

I might know what your problem is: there was a configuration change between the current branch heads. For Melodic:

    voxel_filter: false # default off, apply voxel filter to sensor, recommend on

became, for Noetic and ROS2:

    filter: "passthrough" # default passthrough, apply "voxel", "passthrough", or no filter to sensor data, recommend on

So if you use "voxel_filter: true" for ROS2, it will be ignored and no voxel filter will be applied, which leads to very high CPU load (in my case move_base jumps from 80% to 140% with STVL active in the local and global costmaps, on a Core i7 Gen9 laptop with a single ZED Mini camera). Instead, you want filter: "voxel" on ROS2. I just started using STVL myself.
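To illustrate why skipping the voxel filter raises the load, here is a minimal sketch of voxel downsampling using only the standard library. It is a simplified stand-in for the idea behind pcl::VoxelGrid, not PCL's implementation: every point falling in the same cubic cell is replaced by the cell's centroid, so downstream processing touches far fewer points. `Point`, `VoxelKey`, `VoxelKeyHash`, and `voxel_downsample` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

struct Point { float x, y, z; };

// Integer index of the cubic cell a point falls into.
struct VoxelKey {
  std::int64_t ix, iy, iz;
  bool operator==(const VoxelKey & o) const {
    return ix == o.ix && iy == o.iy && iz == o.iz;
  }
};

struct VoxelKeyHash {
  std::size_t operator()(const VoxelKey & k) const {
    // Combine the three per-axis hashes into one value.
    std::size_t h = std::hash<std::int64_t>{}(k.ix);
    h ^= std::hash<std::int64_t>{}(k.iy) + 0x9e3779b9u + (h << 6) + (h >> 2);
    h ^= std::hash<std::int64_t>{}(k.iz) + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h;
  }
};

// Returns one centroid per occupied cell of edge length `leaf` (metres).
// The order of the returned points is unspecified.
std::vector<Point> voxel_downsample(const std::vector<Point> & cloud, float leaf)
{
  assert(leaf > 0.0f);
  struct Accum { double sx = 0.0, sy = 0.0, sz = 0.0; std::size_t n = 0; };
  std::unordered_map<VoxelKey, Accum, VoxelKeyHash> cells;

  for (const Point & p : cloud) {
    const VoxelKey key{
      static_cast<std::int64_t>(std::floor(p.x / leaf)),
      static_cast<std::int64_t>(std::floor(p.y / leaf)),
      static_cast<std::int64_t>(std::floor(p.z / leaf))};
    Accum & a = cells[key];
    a.sx += p.x; a.sy += p.y; a.sz += p.z;
    ++a.n;
  }

  std::vector<Point> out;
  out.reserve(cells.size());
  for (const auto & kv : cells) {
    const Accum & a = kv.second;
    const double n = static_cast<double>(a.n);
    out.push_back({static_cast<float>(a.sx / n),
                   static_cast<float>(a.sy / n),
                   static_cast<float>(a.sz / n)});
  }
  return out;
}
```

With a dense depth cloud and a leaf size such as the 0.05 m voxel_size from the configuration above, many input points typically collapse into each cell, which is where the saving comes from.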

Is 80% load with 2 costmaps reasonable and expected? I'm worried that if I add more cameras the CPU load will shoot up. The same goes for switching from the laptop to an Nvidia Xavier NX (although I have not tested that setup yet).
How do I keep the load down? So far it seems that, aside from disabling STVL on the global costmap (which I obviously don't want to do), the other settings I've tried (voxel_decay, decay_acceleration, voxel_size, publish_voxel_map: false) have minimal impact on the CPU load.

BarzelS commented 3 years ago

Hi @vanem, thanks for your answer. I'm using the Eloquent version of STVL, and I think you are talking about the Foxy version, right? Are you referring to this part of the code:

    if (_voxel_filter) {
      pcl::VoxelGrid<pcl::PCLPointCloud2> sor;
      sor.setInputCloud(cloud_pcl);
      sor.setFilterFieldName("z");
      sor.setFilterLimits(_min_obstacle_height, _max_obstacle_height);
      sor.setDownsampleAllData(false);
      float v_s = static_cast<float>(_voxel_size);
      sor.setLeafSize(v_s, v_s, v_s);
      sor.setMinimumPointsNumberPerVoxel(static_cast<unsigned int>(_voxel_min_points));
      sor.filter(*cloud_filtered);
    } else {
      pcl::PassThrough<pcl::PCLPointCloud2> pass_through_filter;
      pass_through_filter.setInputCloud(cloud_pcl);
      pass_through_filter.setKeepOrganized(false);
      pass_through_filter.setFilterFieldName("z");
      pass_through_filter.setFilterLimits(
        _min_obstacle_height, _max_obstacle_height);
      pass_through_filter.filter(*cloud_filtered);
    }

If so, then in the Eloquent version simply passing voxel_filter: true sets the "_voxel_filter" flag to true. But thanks for trying to help.

Actually, I've just figured out that on my laptop the CPU usage is very low compared to the usage I reported (which was on my Jetson TX2). Are you not encountering any CPU problems on your Xavier?

SteveMacenski commented 3 years ago

You're aware the Jetson boards have much weaker CPUs than your computer. Are you comparing CPU % on the same CPUs?

BarzelS commented 3 years ago

Yes, the ROS1 and ROS2 comparison was performed on the same Jetson.

SteveMacenski commented 3 years ago

How are you installing STVL on Eloquent? Could you try the Foxy branch with the changes in config that @vanem suggests? I think from your profiling we've identified it as probably being a DDS related jump, but @vanem is saying that he's getting things to work fine (what ROS version are you comparing @vanem?) after that change.

If it is indeed DDS related, there's not much I can suggest here specifically. You'll need to dig into your DDS configs to optimize for the larger pointcloud movements. But like both @vanem and I suggest, try moving to Foxy. Substantial DDS improvements have been made to all the Tier 1 DDS vendors and this might be moot because they were solved.

BarzelS commented 3 years ago

I'm compiling STVL from source.
Unfortunately, for reasons of development convenience, working from Docker is not possible, so for now I can only work in Eloquent on the Jetson TX2.

SteveMacenski commented 3 years ago

We can't offer you any support then. Eloquent is EOL.

SteveMacenski commented 3 years ago

My understanding of this ticket is that it's DDS performance related, which is not in the scope of this project. Improving the RMWs or reporting poor performance to the DDS vendors seems like the more appropriate action, since this isn't a problem with STVL but with handling point clouds over DDS in ROS2.

Closing under that understanding. If that's not accurate and there's something in STVL that's significantly heavier, we can reopen and discuss.

SteveMacenski / spatio_temporal_voxel_layer

[Eloquent] ROS2 CPU efficiency compare to ROS1 #188