IntelRealSense / realsense-ros

ROS Wrapper for Intel(R) RealSense(TM) Cameras
http://wiki.ros.org/RealSense
Apache License 2.0

Major performance problem with align_depth option #1929

Closed dgrnbrg closed 2 years ago

dgrnbrg commented 3 years ago

Hello, I am using a RealSense D435i on a Raspberry Pi 4, with the intended use case of streaming back to a nearby computer to run mapping and object detection.

I am seeing very, very poor performance with the align_depth option. Initially, I thought I had network issues, but I've since reproduced the problem solely on the Raspberry Pi 4 (so I've eliminated the network as a cause).

When I look at rostopic hz /camera/depth/image_rect_raw on the RPi4, I see ~15 images per second. When I look at rostopic hz /camera/color/image_raw, I also see ~15 images per second. However, when I look at rostopic hz /camera/aligned_depth_to_color/image_raw, I only see ~1.2 images per second. Clearly, the aligned topic has much lower throughput. Looking at the system metrics, I have 7.3 GB of memory free and I'm using about 250% of the 400% total CPU available.

What can I do to get aligned images so that I can use RTAB-Map, but without the major performance loss? Does the alignment happen on the host CPU rather than on the RealSense hardware? If so, are there build flags I should use to ensure I get reasonable performance [1]?

Thank you for your help.

[1] This seems strange, because I do not see the CPUs even close to full utilization.

doronhi commented 3 years ago

To isolate the issue: do you see the same performance drop running rs-align? It is available after installing librealsense2-utils. rs-align does not show FPS, but the difference between 15 and 1.2 fps is visible enough, and you can also check the CPU usage. Also, what method did you use to install the realsense2_camera wrapper?

dgrnbrg commented 3 years ago

Do you have a link for how to run rs-align? Is that something that's part of realsense-viewer, or part of the ROS wrapper?

I installed the camera & the wrapper from source. I'm using librealsense 2.45.0 and realsense-ros 2.3.0. I configured librealsense with cmake .. -DBUILD_EXAMPLES=true -DCMAKE_BUILD_TYPE=Release -DFORCE_LIBUVC=true

doronhi commented 3 years ago

Once you've built with -DBUILD_EXAMPLES=true, rs-align should be found here: /build/examples/align/rs-align. Also, as a side note, -DFORCE_LIBUVC is deprecated; going forward, you should use -DFORCE_RSUSB_BACKEND instead.

dgrnbrg commented 3 years ago

I checked out rs-align, and whether I select "align to depth" or "align to color", it is snappy enough (definitely in the 10+ FPS range, never in the 1 FPS range). It is noticeably slower with "align to color" selected: roughly ~20 fps with "align to depth" vs ~10 fps with "align to color". I used htop to watch the CPU usage, and the rs-align process consistently uses around 185% CPU, which should be fine (this device has 4 CPUs).

p.s. Sorry it took me so long to reply--I forgot to enable notifications. I'll be able to respond more quickly going forward.

dgrnbrg commented 3 years ago

Hey @doronhi -- what would you suggest as a next debugging step?

doronhi commented 3 years ago

The wrapper uses librealsense's align filter to align the depth image. From then onwards, nothing differs from the way other frames are handled. Assuming you have just this one copy of librealsense2, which rs-align also uses with much better performance, I can't think of a source for the difference. Did you specify higher resolutions than the defaults rs-align uses?

Another thought I had: rs-align uses an rs2::pipeline object to synchronize the depth and color images, while realsense2_camera uses the class rs2::asynchronous_syncer directly. An issue with the syncer or the timestamps could cause frames to arrive separately most of the time, so alignment would rarely happen - hence reduced FPS with no extra CPU usage. You can test this hypothesis by changing ROS_DEBUG to ROS_INFO here and here. Do you receive any "Single video frame arrived" messages (and if so, at what rate), or only "Frameset arrived." messages, as expected?
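
For reference, the change described above is just a log-level bump on those two messages in BaseRealSenseNode::frame_callback (base_realsense_node.cpp); schematically (a sketch - the exact macros and surrounding context vary by version):

-    ROS_DEBUG("Frameset arrived.");
+    ROS_INFO("Frameset arrived.");

-    ROS_DEBUG("Single video frame arrived");
+    ROS_INFO("Single video frame arrived");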

dgrnbrg commented 3 years ago

I will investigate this and get back to you asap with this result.

Also, if it's helpful, I can share my raspberry pi image with all of the software built & installed, so that you can see the exact launches/behavior I'm seeing.

dgrnbrg commented 3 years ago

Here is the launch file I'm using:

<launch>
    <arg name="output"             default="screen"/>  <!-- Control node output (screen or log) -->
    <!-- Localization-only mode -->
    <arg name="localization"       default="false"/>
    <arg name="wait_for_transform" default="0.2"/>

    <arg name="simulated_hardware"     default="false"/>
    <arg name="static_body_transforms" default="true"/>
    <arg name="realsense_sync"         default="false"/>
    <arg name="inertial_odom_fusion"   default="false"/>

    <include unless="$(arg simulated_hardware)"
             file="$(find realsense2_camera)/launch/rs_camera.launch">
        <arg name="align_depth" value="true"/>
        <arg name="linear_accel_cov" value="1.0"/>
        <arg name="enable_gyro" value="true"/>
        <arg name="enable_accel" value="true"/>
        <!--<arg name="initial_reset" value="true"/>-->
        <arg name="enable_sync" value="$(arg realsense_sync)"/> <!-- only syncs the cameras -->
        <arg name="unite_imu_method" value="linear_interpolation"/>
    </include>

    <node pkg="imu_filter_madgwick" type="imu_filter_node" name="ImuFilter">
        <param name="use_mag" type="bool" value="false"/>
        <param name="publish_tf" type="bool" value="false"/>
        <param name="world_frame" type="string" value="enu"/>
        <remap if="$(arg simulated_hardware)" from="/imu/data_raw" to="/camera/imu/sample"/>
        <remap unless="$(arg simulated_hardware)" from="/imu/data_raw" to="/camera/imu"/>
        <remap from="/imu/data" to="/imu/data/madgwick"/>
    </node>

    <!-- sync the rgbd topics into a new topic /camera/rgbd_relay -->
    <node pkg="nodelet" type="nodelet" name="rgbd_sync" args="standalone rtabmap_ros/rgbd_sync" output="$(arg output)">
        <remap from="rgb/image"       to="/camera/color/image_raw"/>
        <remap from="depth/image"     to="/camera/aligned_depth_to_color/image_raw"/>
        <remap from="rgb/camera_info" to="/camera/color/camera_info"/>
        <remap from="rgbd_image"      to="/camera/rgbd_relay"/>
        <param name="approx_sync"     type="bool"   value="false"/>
    </node>
</launch>

It's using the default resolution from rs_camera.launch, which is 640x480 I believe. Do you think this is too high? I don't see where the resolution is set in rs-align.cpp.

I made the logging changes, and I only see "Frameset arrived." Every 50 framesets or so, I see ERROR [2512385040] (uvc-streamer.cpp:106) uvc streamer watchdog triggered on endpoint: 130. It's too bad this isn't the problem :(

dgrnbrg commented 3 years ago

I did some more digging into the launch files (since the issue evidently lies somewhere in the ROS bridge code), and I determined this: the color and aligned_depth image topics are publishing 1280x720 images, while the depth topic is publishing 848x480 images. I'm not sure if this is useful information, but I want to provide as much detail as I can :)

doronhi commented 3 years ago

Seeing your launch file, could you try something simpler: roslaunch realsense2_camera rs_camera.launch align_depth:=true, and then rostopic hz /camera/aligned_depth_to_color/image_raw? Also, for the record, which versions of librealsense2 and realsense2_camera are you using, and how did you install them?

dgrnbrg commented 3 years ago

Ok, I ran roslaunch realsense2_camera rs_camera.launch align_depth:=true and checked the topic rates again.

While doing this, I did some very unscientific and inaccurate profiling with htop (totally not a profiler), and it seems like there's something funky going on with the alignment step. The impression I have is that when I run rostopic hz /camera/aligned_depth_to_color/image_raw, I always see one core pegged around 95-100%, while for the other two topics the cores sometimes all drop to 40-60% usage. I'm assuming the calculations only happen when the topic has a subscriber (this could be incorrect). Perhaps the latency arises because the alignment step runs single-threaded on the CPU, sequentially with the other realsense-related processing, and this bottlenecks the pipeline.

If this hypothesis makes sense, perhaps I could write some code to do the alignment on another thread, configure the camera to output lower-resolution/correctly-aligned frames without software processing, write some code to leverage the Pi's GPU, or check whether the ARM compiler options are failing to generate SIMD instructions that would speed up the current implementation. The processor is a Cortex-A72 @ 1.5GHz. Do you think any of these theories are worth following up on? If you point me to the functions to look into, I'd be happy to take a shot at implementing some of my suggestions.

I installed the camera library & the wrapper from source (since there aren't Raspberry Pi packages). I'm using librealsense 2.45.0 and realsense-ros 2.3.0. I configured librealsense with cmake .. -DBUILD_EXAMPLES=true -DCMAKE_BUILD_TYPE=Release -DFORCE_LIBUVC=true. I built realsense-ros just by running catkin_make in the workspace.

RealSenseSupport commented 3 years ago

@dgrnbrg Could you please try with the lower resolution and turn off IMU data if you don't need it?

dgrnbrg commented 3 years ago

@RealSenseSupport Which of the resolution-related parameters should I use? Also, I do need the IMU data, as the application is for a mobile robotic platform.

Also, would you suggest a resolution setting?
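
(For reference, rs_camera.launch exposes per-stream resolution and FPS arguments, so a lower-resolution run would look something like the following; the argument names are those in rs_camera.launch, and the values here are only illustrative:)

    roslaunch realsense2_camera rs_camera.launch align_depth:=true \
        depth_width:=424 depth_height:=240 depth_fps:=15 \
        color_width:=640 color_height:=480 color_fps:=15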

doronhi commented 3 years ago

Regarding operating the filters (align, colorizer, etc.) on a different thread: it sounds like a good idea. It should be configurable, though, since some users may not wish to allow the node to use all the available CPUs. A sketch of the idea follows.
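
(A minimal sketch of that idea, using an rs2::frame_queue to hand framesets to a dedicated alignment thread; illustrative only, not the wrapper's actual structure:)

    #include <librealsense2/rs.hpp>
    #include <thread>

    int main()
    {
        rs2::pipeline pipe;
        pipe.start();

        rs2::frame_queue align_input(1);              // capacity 1: keep only the newest frameset
        rs2::align align_to_color(RS2_STREAM_COLOR);

        std::thread worker([&]() {
            while (true)
            {
                // Blocks until the camera thread enqueues a frameset.
                rs2::frameset fs = align_input.wait_for_frame();
                rs2::frameset aligned = align_to_color.process(fs);
                (void)aligned;  // publish aligned.get_depth_frame() here, off the camera thread
            }
        });

        // Camera side: enqueue and return immediately, so capture never waits on alignment.
        while (true)
            align_input.enqueue(pipe.wait_for_frames());
    }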

Regarding the compiler options and GPU usage: most of the hard work is done in the librealsense2 library, not in the realsense2_camera node, which is essentially a wrapper. I haven't tried it myself, but from what I understand, building librealsense2 with the cmake flag -DBUILD_GLSL_EXTENSIONS=true creates an object rs2::gl::align that can replace the rs2::align used in realsense2_camera (inside base_realsense_node.cpp). I would start by trying to build the librealsense2 example rs-gl. It uses rs2::gl::pointcloud and rs2::gl::colorizer, but the idea of using rs2::gl::align should be the same. Also, it would be good to measure the performance difference on the example before deciding to merge it into realsense2_camera. It sounds interesting and I am curious to learn the results.
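
(Schematically, that swap would look something like the following; a sketch, assuming librealsense2 was built with -DBUILD_GLSL_EXTENSIONS=true, and noting that the exact init_processing overloads vary by release:)

    #include <librealsense2/rs.hpp>
    #include <librealsense2-gl/rs_processing_gl.hpp>

    int main()
    {
        // Initialize GLSL processing; pass false to fall back to CPU processing.
        rs2::gl::init_processing(true);

        rs2::pipeline pipe;
        pipe.start();

        // GPU-backed drop-in replacement for rs2::align
        rs2::gl::align align_to_color(RS2_STREAM_COLOR);

        while (true)
        {
            rs2::frameset fs = pipe.wait_for_frames();
            rs2::frameset aligned = align_to_color.process(fs);
            // use aligned.get_depth_frame() ...
        }
    }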

dgrnbrg commented 3 years ago

I made significant progress in trying to get the GPU usage running, and I would appreciate your help in figuring out what the next debugging step is. Here's what I've done so far:

First, I rebuilt librealsense2 with -DBUILD_GLSL_EXTENSIONS=true; then, in realsense-ros, I added #include <librealsense2-gl/rs_processing_gl.hpp> to realsense_node_factory.h and changed base_realsense_node.cpp to use rs2::gl::align. I didn't see a performance increase, but I determined that this is probably due to the "backup" node structure in librealsense's rs-gl.cpp, which automatically falls back to the CPU version, so I hacked that out with this patch:

--- a/src/gl/rs-gl.cpp
+++ b/src/gl/rs-gl.cpp
@@ -135,11 +135,12 @@ rs2_processing_block* rs2_gl_create_align(int api_version, rs2_stream to, rs2_er
 {
     verify_version_compatibility(api_version);
     auto block = std::make_shared<librealsense::gl::align_gl>(to);
-    auto backup = std::make_shared<librealsense::align>(to);
-    auto dual = std::make_shared<librealsense::gl::dual_processing_block>();
-    dual->add(block);
-    dual->add(backup);
-    return new rs2_processing_block { dual };
+    //auto backup = std::make_shared<librealsense::align>(to);
+    //auto dual = std::make_shared<librealsense::gl::dual_processing_block>();
+    //dual->add(block);
+    //dual->add(backup);
+    //return new rs2_processing_block { dual };
+    return new rs2_processing_block { block };
 }

I validated that rs-gl could run, which required some coaxing because the RPi4 exposes OpenGL ES, which wasn't passing the OpenGL version check. I addressed this by exporting MESA_GL_VERSION_OVERRIDE=3.0 and MESA_GLSL_VERSION_OVERRIDE=130, after which rs-gl ran and displayed an image.

Next, I convinced roslaunch realsense2_camera rs_camera.launch align_depth:=true to start by setting those same environment variables, plus export LD_PRELOAD=/usr/local/lib/librealsense2-gl.so.2.45 (I don't know CMake, so this is what got the linker satisfied at runtime).
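
Putting those workarounds together, the node was started roughly like this:

    export MESA_GL_VERSION_OVERRIDE=3.0
    export MESA_GLSL_VERSION_OVERRIDE=130
    export LD_PRELOAD=/usr/local/lib/librealsense2-gl.so.2.45
    roslaunch realsense2_camera rs_camera.launch align_depth:=true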

At this point, I was seeing an unexpected crash at startup in the nodelet manager, so I added launch-prefix="xterm -e gdb --args" to the crashing nodelet manager in order to get a backtrace of the crash site. Can you help me figure out what is going on? I'm pretty sure that the RPi4's GPU is capable of running the GLSL-based aligner, but I need help understanding why rs2::options::set_option / rs2::pointcloud::map_to are being invoked on a null object (note this=0x0 in frames #0 and #1) from the align code.

#0  0xb6dec2b8 in rs2::options::set_option(rs2_option, float) const (this=0x0, value=30, option=RS2_OPTION_STREAM_FILTER)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_options.hpp:101
#1  0xb6dec2b8 in rs2::pointcloud::map_to(rs2::frame) (this=this@entry=0x0, mapped=...)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_processing.hpp:454
#2  0xb6de95b4 in librealsense::gl::align_gl::align_z_to_other(rs2::video_frame&, rs2::video_frame const&, rs2::video_stream_profile const&, float)
    (this=this@entry=0xa221f52c, aligned=..., depth=..., other_profile=..., z_scale=<optimized out>)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_frame.hpp:403
#3  0xb4f4a138 in librealsense::align::align_frames(rs2::video_frame&, rs2::video_frame const&, rs2::video_frame const&)
    (this=this@entry=0xa221f52c, aligned=..., from=..., to=...) at /home/pi/librealsense/src/proc/align.cpp:246
#4  0xb4f4c6ac in librealsense::align::process_frame(rs2::frame_source const&, rs2::frame const&) (this=0xa221f52c, source=..., f=...)
    at /home/pi/librealsense/src/proc/align.cpp:279
#5  0xb4f63b84 in librealsense::generic_processing_block::<lambda(rs2::frame, const rs2::frame_source&)>::operator() (__closure=0xa22ad9e4, source=..., f=...)
    at /home/pi/librealsense/src/proc/synthetic-stream.cpp:75
#6  0xb4f63b84 in rs2::frame_processor_callback<librealsense::generic_processing_block::generic_processing_block(char const*)::<lambda(rs2::frame, const rs2::frame_source&)> >::on_frame(rs2_frame *, rs2_source *) (this=0xa22ad9e0, f=<optimized out>, source=<optimized out>)
    at /home/pi/librealsense/build/../include/librealsense2/hpp/rs_processing.hpp:128
#7  0xb4f603f8 in librealsense::processing_block::invoke(librealsense::frame_holder) (this=0xa221f52c, f=...) at /home/pi/librealsense/src/proc/synthetic-stream.h:41
#8  0xb50c335c in rs2_process_frame(rs2_processing_block*, rs2_frame*, rs2_error**) (block=<optimized out>, frame=<optimized out>, error=error@entry=0x95bbd02c)
    at /home/pi/librealsense/src/core/streaming.h:139
#9  0xb6debbbc in rs2::processing_block::invoke(rs2::frame) const (f=..., this=0xa22b70d4)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_processing.hpp:303
#10 0xb6debbbc in rs2::filter::process(rs2::frame) const (this=0xa22b70d4, frame=...)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_processing.hpp:354
#11 0xaf5a6d10 in realsense2_camera::BaseRealSenseNode::frame_callback(rs2::frame) () at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#12 0xaf5a8620 in rs2::frame_callback<realsense2_camera::BaseRealSenseNode::setupDevice()::{lambda(rs2::frame)#1}>::on_frame(rs2_frame*) ()
    at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#13 0xb5111db0 in librealsense::frame_source::invoke_callback(librealsense::frame_holder) const (this=0xa2203c64, frame=...) at /home/pi/librealsense/src/source.cpp:125
#14 0xb4f5f7b8 in librealsense::synthetic_source::frame_ready(librealsense::frame_holder) (this=<optimized out>, result=...)
    at /home/pi/librealsense/src/core/streaming.h:147
#15 0xb4f6982c in librealsense::syncer_process_unit::<lambda(librealsense::frame_holder, librealsense::synthetic_source_interface*)>::operator()
    (__closure=0xa2223e64, __closure=0xa2223e64, frame=..., source=0xb4f6982c <librealsense::internal_frame_processor_callback<librealsense::syncer_process_unit::syncer_process_unit(std::initializer_list<std::shared_ptr<librealsense::bool_option> >, bool)::<lambda(librealsense::frame_holder, librealsense::synthetic_source_interface*)> >::on_frame(rs2_frame *, rs2_source *)+1172>) at /home/pi/librealsense/src/core/streaming.h:147
#16 0xb4f6982c in librealsense::internal_frame_processor_callback<librealsense::syncer_process_unit::syncer_process_unit(std::initializer_list<std::shared_ptr<librealsense::bool_option> >, bool)::<lambda(librealsense::frame_holder, librealsense::synthetic_source_interface*)> >::on_frame(rs2_frame *, rs2_source *)
    (this=0xa2223e60, f=<optimized out>, source=<optimized out>) at /home/pi/librealsense/src/core/processing.h:67
#17 0xb4f603f8 in librealsense::processing_block::invoke(librealsense::frame_holder) (this=0xa2203c18, f=...) at /home/pi/librealsense/src/proc/synthetic-stream.h:41
#18 0xb50c335c in rs2_process_frame(rs2_processing_block*, rs2_frame*, rs2_error**) (block=<optimized out>, frame=<optimized out>, error=0x95bbd61c)
    at /home/pi/librealsense/src/core/streaming.h:139
#19 0xaf5b568c in std::_Function_handler<void (rs2::frame), realsense2_camera::PipelineSyncer>::_M_invoke(std::_Any_data const&, rs2::frame&&) ()
    at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#20 0xaf5b16c8 in rs2::frame_callback<std::function<void (rs2::frame)> >::on_frame(rs2_frame*) () at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#21 0xb50f4d4c in librealsense::synthetic_sensor::<lambda(librealsense::frame_holder)>::operator() (__closure=0xa235ec34, f=...)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#22 0xb50f4d4c in librealsense::internal_frame_callback<librealsense::synthetic_sensor::start(librealsense::frame_callback_ptr)::<lambda(librealsense::frame_holder)> >::on_frame(rs2_frame *) (this=0xa235ec30, fref=<optimized out>) at /home/pi/librealsense/src/types.h:969
#23 0xb5111db0 in librealsense::frame_source::invoke_callback(librealsense::frame_holder) const (this=0x99a397b0, frame=...) at /home/pi/librealsense/src/source.cpp:125
#24 0xb4f5f7b8 in librealsense::synthetic_source::frame_ready(librealsense::frame_holder) (this=<optimized out>, result=...)
    at /home/pi/librealsense/src/core/streaming.h:147
#25 0xb50c15f8 in rs2_synthetic_frame_ready(rs2_source*, rs2_frame*, rs2_error**) (source=<optimized out>, frame=<optimized out>,
    frame@entry=0x99ab41a0, error=error@entry=0x95bbd890) at /home/pi/librealsense/src/core/streaming.h:147
#26 0xb4f63e34 in rs2::frame_source::frame_ready(rs2::frame) const (this=0x95bbd858, result=...)
    at /home/pi/librealsense/build/../include/librealsense2/hpp/rs_frame.hpp:590

doronhi commented 3 years ago

Hi @dgrnbrg, I want you to know that, from my point of view, you are in uncharted waters. I think you are doing a great job, and I hope to find the time to learn from your experience and make your work part of the official version. Aligning an image means calculating the 3D point for each pixel in camera1's coordinates and then projecting it onto camera2. In other words, it means creating a pointcloud based on camera1 and projecting it onto camera2 - hence the usage of pointcloud.
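
(Concretely, the per-pixel projection described above can be written with the rsutil.h helpers from librealsense2; a sketch, where the wrapper function itself is hypothetical:)

    #include <librealsense2/rsutil.h>

    // Map one depth pixel into the color image (illustrative helper).
    void depth_pixel_to_color_pixel(const rs2_intrinsics& depth_intrin,
                                    const rs2_extrinsics& depth_to_color,
                                    const rs2_intrinsics& color_intrin,
                                    const float depth_pixel[2], float depth_meters,
                                    float color_pixel[2])
    {
        float depth_point[3], color_point[3];
        // 1. Back-project the depth pixel to a 3D point in the depth camera's frame.
        rs2_deproject_pixel_to_point(depth_point, &depth_intrin, depth_pixel, depth_meters);
        // 2. Transform that point into the color camera's frame.
        rs2_transform_point_to_point(color_point, &depth_to_color, depth_point);
        // 3. Project the 3D point onto the color image plane.
        rs2_project_point_to_pixel(color_pixel, &color_intrin, color_point);
    }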

You mentioned that rs-gl now runs. I think the next step should be modifying the rs-align example to use rs2::gl::align. That way you could both compare performance (which is important for deciding whether to pursue this direction) and have a more contained environment for debugging rs2::gl::align. I always find that debugging librealsense2 through the realsense2_camera node complicates things.

dgrnbrg commented 3 years ago

I understand I'm in uncharted waters :). I'm trying to thoroughly document my work so that it's possible to retrace my steps. It's been years since I've written C++, and I'm still not familiar with librealsense's design. I'm not sure if it's realistic, but do you have any ideas, based on that backtrace, of what could be causing that null pointer? I'd love to combine your RealSense expertise with my own knack for hacking stuff together (in this case, the GLSL approach).

Ultimately, all I really care about is getting 10 FPS from the camera on the RPi4, so that I can use it on this robot platform to do mapping by streaming the feed over wifi to a faster computer nearby. If you think the threading approach would be easier to implement than debugging the GLSL approach, I'm open to whichever accomplishes that goal. Either way, I'm happy to share all my work and setup so that other customers can have this working out of the box.

dgrnbrg commented 3 years ago

In this situation, it seems like some "stream" is null. Are there docs on the programming model design? What's a stream, and what could make it null without a failure during program initialization?

dgrnbrg commented 3 years ago

Hey @doronhi -- do you think I should stay on this issue, or open a new one on librealsense to understand why the OpenGL align filter segfaults?

RealSenseSupport commented 2 years ago

@dgrnbrg Sorry for the late reply. I would suggest opening a new ticket on the librealsense project for the OpenGL align filter segfault. Thanks!

RealSenseSupport commented 2 years ago

@dgrnbrg Any other questions about this ticket? Looking forward to your update. Thanks!

RealSenseSupport commented 2 years ago

@dgrnbrg Any other questions for this ticket? Please note that this will be closed if we don't hear from you for another 7 days. Thanks!

paulacvalle commented 2 years ago

> I am seeing very, very poor performance with the align_depth option. Initially, I thought I had network issues, but I've since reproduced the problem solely on the Raspberry Pi 4 (so I've eliminated the network as a cause).

Hello, I am having a similar issue with my D435 and a Jetson NX (640x480 resolution; the problem reproduces when calling roslaunch realsense2_camera rs_camera.launch). It does not reproduce on my Linux PC.

From what I have tried here, the only conclusion I came to was to write a script based on rs-align to align the images outside of the camera's main stream... I don't know if it will work for my application.

It is kind of sad, because align_depth:=true is such a simple solution.

Let me know if you were able to find any other solutions to this!

dgrnbrg commented 2 years ago

I bought a LattePanda Alpha and was able to confirm that it can handily process at 15+ FPS with plenty of resources to spare.

@paulacvalle, if you know ARM assembly, you could try porting the x86 SIMD align-depth code to ARM.

paulacvalle commented 2 years ago

> I bought a LattePanda Alpha and was able to confirm that it can handily process at 15+ FPS with plenty of resources to spare.
>
> @paulacvalle, if you know ARM assembly, you could try porting the x86 SIMD align-depth code to ARM.

@dgrnbrg Thank you for your suggestion! I was able to fix my problem by making sure I installed everything from source with CUDA enabled... which I thought I had done before, but I was wrong.

Anyhow, nice to know that the LattePanda Alpha can also handle this processing!
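
(For reference, a librealsense source build with CUDA enabled on a Jetson typically uses the -DBUILD_WITH_CUDA flag; a sketch, and the exact flag set may vary by release:)

    cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_WITH_CUDA=true -DFORCE_RSUSB_BACKEND=true
    make -j4 && sudo make install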

RealSenseSupport commented 2 years ago

@dgrnbrg Glad to know the issue is resolved. Can we close this accordingly? Thanks!

RealSenseSupport commented 2 years ago

@dgrnbrg Any other questions about this issue? Looking forward to your reply. Thanks!

RealSenseSupport commented 2 years ago

@dgrnbrg Issue resolved. Closing the ticket accordingly. Thanks!