NDT execution is very time-consuming

liuzf1988 commented 1 year ago

After porting NDT algorithm from ROS2 ecosystem to DORA framework, I conducted some tests on the performance of NDT algorithm. The results show that NDT execution is very time-consuming. I've been trying to find out the reason these days, but can't solve the problem until now. The following is the detailed debugging process.

NDT operator execution time on Dora side

First, add necessary print info in ndt_scan_matcher.cpp to obtain the time required for NDT execution.

// ndt_scan_matcher.cpp
geometry_msgs::msg::PoseWithCovarianceStamped NDTScanMatcher::alignUsingMonteCarlo(
  const std::shared_ptr<NormalDistributionsTransformBase<PointSource, PointTarget>> & ndt_ptr,
  const geometry_msgs::msg::PoseWithCovarianceStamped & initial_pose_with_cov)
{
    // ...
    for (unsigned int i = 0; i < initial_poses.size(); i++) {
        const auto exe_start_time = std::chrono::system_clock::now(); // for debug
        // ...
        ndt_ptr->align(*output_cloud, initial_pose_matrix);
        // ...
        const auto exe_end_time = std::chrono::system_clock::now(); // for debug
        const double exe_time = std::chrono::duration_cast<std::chrono::microseconds>(exe_end_time - exe_start_time).count() / 1000.0;
        std::cout << "The alignUsingMonteCarlo execute time is: "<< exe_time << " ms, " << "and NumIteration is: " << num_iteration << std::endl;  // for debug
        // ...
    }
}

Where ndt_ptr->align(*output_cloud, initial_pose_matrix); calls a multi-core version of NDT algorithm. This multi-core implementation actually uses the OpenMP based multi-threading mechanism. The corresponding parallel settings is in ndt_omp_impl.hpp, extracted as follows.

// ndt_omp_impl.hpp
template<typename PointSource, typename PointTarget> double
pclomp::NormalDistributionsTransform<PointSource, PointTarget>::computeDerivatives(Eigen::Matrix<double, 6, 1> &score_gradient,
    Eigen::Matrix<double, 6, 6> &hessian,
    PointCloudSource &trans_cloud,
    Eigen::Matrix<double, 6, 1> &p,
    bool compute_hessian)
{
    // ...
    // const auto parallel_start_time = std::chrono::system_clock::now(); // for debug
#pragma omp parallel for num_threads(num_threads_) schedule(guided, 8)
    for (std::size_t idx = 0; idx < input_->points.size(); idx++)
    {
        int thread_n = omp_get_thread_num();
        // ...
    }
    // const auto parallel_end_time = std::chrono::system_clock::now(); // for debug
    // const double parallel_exe_time = std::chrono::duration_cast<std::chrono::microseconds>(parallel_end_time - parallel_start_time).count() / 1000.0;
    // printf("---- computeDerivatives parallel part execute time is %f ms ----\n", parallel_exe_time); // for debug

    // const auto serial_start_time = std::chrono::system_clock::now(); // for debug
    for (std::size_t i = 0; i < input_->points.size(); i++) {
        // ...
    }
    if (regularization_pose_) {
        // ...
    }
    // const auto serial_end_time = std::chrono::system_clock::now(); // for debug
    // const double serial_exe_time = std::chrono::duration_cast<std::chrono::microseconds>(serial_end_time - serial_start_time).count() / 1000.0;
    // printf("---- computeDerivatives serial part execute time is %f ms ----\n", serial_exe_time); // for debug
}

The test results are shown in the figures below. It can be found that the NDT execution time is usually above 400 ms, far from meeting the real-time requirements of localization. And by using htop and pstree commands, it is indicated that the multi-threading mechanism works properly.

dora-side-ndt-exetime-a

dora-side-ndt-threads-a

dora-side-ndt-threads-b

To find out which code is time-consuming to execute, I added more print info to "ndt_omp_impl.hpp", seeing the above code commented with "//" for details. The result shows that the execution time of the "computeDerivatives" function is about 70 ms, and this function will be called many times each time the NDT algorithm is run. Below is the corresponding screenshot.

dora-side-ndt-exetime-b

NDT execution time on ROS2 side

For comparison, on ROS2 side, the screenshots of NDT execution time are as follows. It only takes about 0.7 ms for the "computeDerivatives" function to execute and the total NDT execution time is about 20 ms.

ros2-side-ndt-threads-a

ros2-side-ndt-exetime-b

NDT node execution time on Dora side

Considering that the Dora operator itself is a thread, in order to eliminate the influence of this factor on the multi-threading mechanism of ndt, I rewrite the "ndt_scan_matcher" as Dora node and record its execution time. The relevant results are as follows:

dora-side-ndtnode-exetime-a

dora-side-ndtnode-threads-a

dora-side-ndtnode-threads-b

dora-side-ndtnode-exetime-b

It remains similar to the case of porting "ndt_scan_matcher" as Dora operator.

I carefully analyzed the code in ndt_omp_impl.hpp, and there seems nothing special. So it is strange that the same code only takes about 20 ms to execute on the ROS2 side, but it takes about 1000 ms on the Dora side. Looking forward to your advice.

haixuanTao commented 1 year ago

I'm trying to reproduce your issue, but the build seems stuck at imu_corrector. How long should I wait? Screenshot from 2023-01-31 20-09-09

liuzf1988 commented 1 year ago

@haixuanTao Thanks. Has the compilation passed? According to my experience, there are two possibilities. One is that the first compilation really takes some time, about 20 minutes. Another possibility is that it will prompt "Blocking waiting for file lock on package cache", and at this time, you only need to temporarily disable the rust-analyzer plug-in and recompile the code. From the figure, it seems to be the first possibility.

haixuanTao commented 1 year ago

It has been running for hours. I have cancelled the build and started again, but it does not seem to make any progress... Mind sharing on what environment are you running this dataflow? Linux, x86, cpu...

liuzf1988 commented 1 year ago

It has been running for hours. I have cancelled the build and started again, but it does not seem to make any progress... Mind sharing on what environment are you running this dataflow? Linux, x86, cpu...

the running environment: Ubuntu 20.04, x86_64, ROS2 Galactic

running-env

The compilation process
- If this is the first time compilation, use the following command

cd dora/examples/autoware-dataflow/localization
source /opt/ros/galactic/setup.bash
cargo run --example localization-dataflow

If this is the second time compilation or later, add an additional command to recompile the ROS2 side code of Dora operators if necessary. The reason for adding this extra command is as follows: The build.rs file of each Dora operator in "dora/examples/autoware-dataflow/localization/src/operators" contains code like println!("cargo:rerun-if-changed=src/imu_corrector.cpp"); which can trigger recompilation if the traced files (e.g. "imu_corrector.cpp") changed. However sometimes the "build", "log" and "install" folders generated by ROS2 compilation may be deleted manually by the user during debugging, resulting in println!("cargo:rerun-if-changed=src/xxx.xxx"); in build.rs not triggering recompilation. At this time, we need to recompile the ROS2 side code explicitly.

cd dora/examples/autoware-dataflow/localization
source /opt/ros/galactic/setup.bash
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release  # optional
cargo run --example localization-dataflow

Run the dataflow

cd dora/examples/autoware-dataflow/localization
source install/setup.bash
../../../target/debug/dora-coordinator --run-dataflow dataflow.yml --runtime ../../../target/debug/dora-runtime

haixuanTao commented 1 year ago

After discussion, we found the issue which was that the size of one buffer used to pass data in one node was not sized appropriately, making computation in the following node a lot more costly due to the bigger size.

The issue was not directly linked to Dora but having higher level types could potentially have helped not encounter this issue.

liuzf1988 commented 1 year ago

@haixuanTao Thanks for your great help, and now the error has been corrected. Actually, the issue is caused by incorrect understanding of the clear() function of "stringstream". Taking "pointcloud_downsample.cpp" as an example, the original purpose of using ss.clear() is to clear the buffer of ss, however the correct usage should be ss.str("") or define a new "stringstream" object (e.g, ss2) to construct the output of Dora nodes or operators. I was misled by the function name of clear().

dora-ndt-perf

dora-rs / dora

NDT execution is very time-consuming #179