autowarefoundation / autoware.universe

https://autowarefoundation.github.io/autoware.universe/
Apache License 2.0
976 stars 636 forks source link

Analyze and improve compile time with clang #2773

Closed soblin closed 1 year ago

soblin commented 1 year ago

Checklist

Description

Build process of autoware.universe is getting heavier as the codebase and its functionality grows. Especially packages like behavior_velocity_planner and behavior_path_planner consume a lot of memory and time during their build process. I believe analyzing the time build (compilation + link) process takes is beneficial for developers.

For the first step, I'll start by applying clang compiler's -ftime-trace option as well as Ninja build tool to autoware.universe to get access to detailed compile process information. They produce json files for each cmake project or source file, and we can visualize them on google chrome interactively.

https://opensource.adobe.com/lagrange-docs/dev/compilation-profiling/

Purpose

Possible approaches

I think the most simple and effective way to this is to reduce the size of each translation unit as much as possible by the following approaches:

The third item is also effective in mitigating the influence of header file changes in those library packages on other client packages. Since any small change of header files in motion_utils and tier4_autoware_utils propagates to almost all source files of autoware.universe (imagine how many packages include them), each client packages should also include only necessary headers from them.

For the fourth item, I'm guessing some developers are misunderstanding about inline functions that inlined functions are literally inlined in translation unit. But in my opinion that is fault, because the inline keyword is just a hint for the compiler and does not guarantee that to happen. If not inlined they are compiled in each translation unit to which they are propagated, assigned "W" (weak) symbol in the object file, and then unified by the linker. I rather think constexpr matches for most cases.

I also noticed some template functions in motion_utils are mostly instantiated for vector<PathPointWithLaneId> and vector<TrajectoryPoint> in planning modules. It is a waste of time for the compiler to instantiate those functions and compile the same content each time in each source file, so I suggest the usage of extern template feature.

If a template function is marked as extern template in a header file for T1, T2, etc., instantiated in a source file as explicit specifications for T1, T2, etc., and compiled to object file, then on the client side those functions are not instantiated for T1, T2, etc. and instead they are linked to the object file. This saves the time for compiling duplicate content many times (and link process).

I'm not sure of this, but the usage of Boost.Geometry functions is also one of the causes.

Definition of done

kenji-miyake commented 1 year ago

@soblin Thank you for creating this issue. :+1:

I generally agree with your thoughts. For specific changes, it would be great to split tasks into small parts and discuss them with the package maintainers in PRs. The reason is, I think that the following are usually just a trade-off between portability/maintainability and compilation time, in an extreme argument.

Also, talking about the history of why the implementations are so now, in the early stage of Autoware Core/Universe, we wrote many functions in header files considering the maintenance costs (we didn't want to write almost the same thing twice). However, since the functions are relatively stable lately, and since I know the compilation time is increasing, I think we can apply some optimization techniques that you proposed now.

Although I wrote a lot, my conclusion is "A super nice proposal! Let's make Autoware extremely faster!"

soblin commented 1 year ago

@kenji-miyake That's right, the portability/maintainability should be cautiously cared if I'm going to make PRs. And as a developer I totally agree that in a development phase we do not need to (or afford to actually haha) pay so much attention to things like compilation time, code quality, etc.

But yeah as I mentioned, I'll first get some statistics/evidence on compilation time and think about the room for improvements. Also it would be useful to make the analysis method a reusable one for future continuous integration.

soblin commented 1 year ago

This branch can be compiled by clang and Ninja using following command

colcon build --symlink-install --continue-on-error --event-handlers console_cohesion+  \
--cmake-args " -GNinja" -DCMAKE_BUILD_TYPE=RelWithDebInfo  \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_C_COMPILER=/usr/bin/clang \
-DCUDA_NVCC_FLAGS="-allow-unsupported-compiler" \
-DCMAKE_CXX_FLAGS="-ftime-trace"

This produces build/behavior_velocity_planner/.ninja_log file, which can be converted to trace.json using ninjatracing

./ninjatracing -e --showall build/behavior_velocity_planner/.ninja_log \
> trace.json

We can visualize this file on chrome://tracing.

VRichardJP commented 1 year ago

@soblin Could you consider pushing the fix you have made to compile with clang aswell? I know everyone compiles Autoware with gcc, but clang provides many interesting complementary diagnostic tools (such as -ftime-trace).

Moreover, it helps to stick closer to the C++ standard: if gcc accepts something clang does not, most likely it is a gcc only thing, not standard C++ (like the nullptr_t -> std::nullptr_t)

I don't advocate for an official support of clang, but rather to keep the code base compiler independant as much as we can.

soblin commented 1 year ago

@VRichardJP In terms of the code quality I totally agree with it ! In my branch I needed to relax the compile option to -Wno-everything so there should be a lot rooms.

kenji-miyake commented 1 year ago

For Clang support, it might be good to start with for example adding a daily-scheduled workflow .github/workflows/build-and-test-clang.yaml. Also, it's also good to add some rules to https://autowarefoundation.github.io/autoware-documentation/main/contributing/coding-guidelines/languages/cpp/ to support both gcc and clang.

soblin commented 1 year ago

Notes:

For behavior_path_planner, behavior_velocity_planner, dividing the build targets into several libraries using ament_auto_add_library increased the build time contrary to general workaround.

Weirdly these PRs adds up CMake processing time. This maybe because of the repeated calls to ament_auto_add_library, but it is not clear which part slows down CMake part.

VRichardJP commented 1 year ago

@soblin Sorry do you mean that the total build time has increased? or just cmake generation time?

soblin commented 1 year ago

@VRichardJP Yes the total build time increased, apparently due to the overhead from CMake. I have the suspicion that these lines are inhibiting parallel build of ament_auto targets.

If we divide the source files into libA, libB, and libC using ament_auto_add_library, those lines seem to add these targets to ${PROJECT_NAME}_LIBRARIES variable and then, the items in this variable are linked to later targets. Possibly

If that was the case it makes sense that each target is not processed in parallel.

ament_auto_find_build_dependencies (which is called in autoware_cmake() internally) command populates theses variables, so I think manually building each library target using these variables can help to parallelize the build process.

VRichardJP commented 1 year ago

@soblin Indeed, each call to ament_auto_add_library adds the library to ${PROJECT_NAME}_LIBRARIES and link the previous ones... I have to confess I don't see the point.

What about using ARG_NO_TARGET_LINK_LIBRARIES before calling ament_auto_add_library? It seems it could disable that behavior.

soblin commented 1 year ago

behavior_velocity_planner

CMake part takes 7.5 minutes unfortunately.

I'm guessing repeated call to ament_target_dependencies is slowing CMake process.

VRichardJP commented 1 year ago

I am checking this on my machine too. I also observe with behavior_path_planner that splitting the lib into multiple sub libraries greatly increase the overall build time (~from 3:30 to ~4:40 without tests). But when you think about it, why should it be otherwise? all the cpp.o files are compiled in parallel already, so splitting them into different libs does nothing but add more cmake generation and sync time, no?

However, I can see something interesting with the ninjatracing output: behavior_path_planner_node.cpp.o takes almost 2 minutes to compile (and a big share is optimizer passes). Most other object files are compiled in less than 30 seconds (except utilities.cpp.o which takes ~50s). The consequence is that during almost 1min30, 7 worker threads are waiting for behavior_path_planner_node.cpp.o task to finish.

How come? the file does not look that big though.

VRichardJP commented 1 year ago

Found the culprit: each create_subscription line takes 15 seconds to optimize...

Basically 1 create_subscription = +15s of compilation. That is absolutely insane...

Is it like this on all ROS2 packages?

Example with only the 7 create_subscription lines remaining and pretty much all the rest commented out.

image

EDIT: After few more tests, it seems to be just a clang thing. With gcc I don't have such inflated numbers. For instance, I have tested the 2 scenarios:

A. compile behavior_path_planner_node.cpp only B. compile behavior_path_planner_node.cpp only, with the create_subscription lines commented out.

And with the same build command (I have a few extra -Wno-* in autoware_cmake for clang only):

$ rm -r {build,install}/behavior_path_planner; CC=clang CXX=clang++ colcon build --symlink-install --continue-on-error --event-handlers console_cohesion+  --cmake-args " -GNinja" "-DBUILD_TESTING=0" -DCMAKE_BUILD_TYPE=RelWithDebInfo  -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCUDA_NVCC_FLAGS="-allow-unsupported-compiler" --packages-select behavior_path_planner

With clang I get: A. 2min 13s B. 1min 3s

And with gcc I get: A. 1min 36 B. 1min 3s

Although I cannot see where gcc spend its time, there is definitely a big difference here. Clang seems to be the only one to get stuck in endless optimization passes.

soblin commented 1 year ago

BTW I found out repeated calls to ament_auto macros in a CMakeLists.txt slow down cmake because these list variables are extended every time this macro is called, and it will contain duplicate elements by the numbers of its call.

The workaround is to add

list(REMOVE_DUPLICATES ${PROJECT_NAME}_FOUND_DEFINITIONS)

at the end of the macro. For example I could reduce cmake processing time in this PR.

I'll try making PRs to ament_cmake or ament_cmake_auto.

stale[bot] commented 1 year ago

This pull request has been automatically marked as stale because it has not had recent activity.

soblin commented 1 year ago

@VRichardJP Thanks for your proactive work !

I also tried making a PR to ament_cmake

As described here it looks like we cannot immediately merge these PRs to the next release of ROS2. For the improvement of ament_auto, we have the option to create a fork of ament_cmake under AWF group, add it to autoware.repos so we can use the improved ament_auto when we build Autoware.

Apart from it, I think we should put efforts into removing unnecessary/huge header files. I recently found out that only compiling behavior_velocity_planner/src/scene_module/intersection/scene_intersection.cpp takes a bit long time. Here is what I did:

  1. in compile_commands.json you can find a entry for the build command of behavior_velocity_planner/src/scene_module/intersection/scene_intersection.cpp.
  2. You can get a standalone build command for this as follows:
    g++ -DBOOST_ALLOW_DEPRECATED_HEADERS -DBOOST_ALL_NO_LIB -DBOOST_ATOMIC_DYN_LINK \
    -DBOOST_CHRONO_DYN_LINK -DBOOST_DATE_TIME_DYN_LINK -DBOOST_FILESYSTEM_DYN_LINK  \
    -DBOOST_GRAPH_DYN_LINK -DBOOST_IOSTREAMS_DYN_LINK -DBOOST_LOCALE_DYN_LINK \
    -DBOOST_LOG_DYN_LINK -DBOOST_LOG_SETUP_DYN_LINK -DBOOST_PRG_EXEC_MONITOR_DYN_LINK \
    -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_RANDOM_DYN_LINK -DBOOST_REGEX_DYN_LINK \
    -DBOOST_SERIALIZATION_DYN_LINK -DBOOST_SYSTEM_DYN_LINK -DBOOST_THREAD_DYN_LINK \
    -DBOOST_TIMER_DYN_LINK -DBOOST_UNIT_TEST_FRAMEWORK_DYN_LINK -DBOOST_WAVE_DYN_LINK \
    -DBOOST_WSERIALIZATION_DYN_LINK -DDEFAULT_RMW_IMPLEMENTATION=rmw_fastrtps_cpp -DDISABLE_PCAP \
    -DEIGEN_DENSEBASE_PLUGIN=\"grid_map_core/eigen_plugins/DenseBasePlugin.hpp\" \
    -DEIGEN_FUNCTORS_PLUGIN=\"grid_map_core/eigen_plugins/FunctorsPlugin.hpp\" -DQT_CORE_LIB -DQT_GUI_LIB \
    -DQT_NO_DEBUG -DQT_OPENGL_LIB -DQT_WIDGETS_LIB -DRCUTILS_ENABLE_FAULT_INJECTION \
    -DROS_DISTRO_HUMBLE -DROS_PACKAGE_NAME=\"behavior_velocity_planner\" \
    -Dbehavior_velocity_planner_EXPORTS -Dkiss_fft_scalar=double \
    -I/home/autoware/src/autoware/universe/planning/behavior_velocity_planner/include \
    -I/opt/ros/humble/include/rosidl_runtime_c -I/opt/ros/humble/include/rosidl_typesupport_interface \
    -I/opt/ros/humble/include/shape_msgs -I/opt/ros/humble/include/fastcdr \
    -I/opt/ros/humble/include/rosidl_runtime_cpp -I/opt/ros/humble/include/rosidl_typesupport_fastrtps_cpp \
    -I/opt/ros/humble/include/rmw -I/opt/ros/humble/include/rosidl_typesupport_fastrtps_c \
    -I/opt/ros/humble/include/rosidl_typesupport_introspection_c \
    -I/opt/ros/humble/include/rosidl_typesupport_introspection_cpp -I/opt/ros/humble/include/action_msgs \
    -I/opt/ros/humble/include/ament_index_cpp -I/opt/ros/humble/include/libstatistics_collector \
    -I/opt/ros/humble/include/rcl -I/opt/ros/humble/include/rcl_interfaces \
    -I/opt/ros/humble/include/rcl_logging_interface -I/opt/ros/humble/include/rcl_yaml_param_parser \
    -I/opt/ros/humble/include/libyaml_vendor -I/opt/ros/humble/include/tracetools -I/opt/ros/humble/include/rcpputils \
    -I/opt/ros/humble/include/statistics_msgs -I/opt/ros/humble/include/rosgraph_msgs \
    -I/opt/ros/humble/include/rosidl_typesupport_cpp -I/opt/ros/humble/include/rosidl_typesupport_c \
    -I/opt/ros/humble/include/class_loader -I/opt/ros/humble/include/composition_interfaces \
    -I/opt/ros/humble/include/rclcpp_action -I/opt/ros/humble/include/rcl_action \
    -I/home/autoware/install/autoware_common_msgs/include/autoware_common_msgs \
    -I/opt/ros/humble/include/geographic_msgs -I/opt/ros/humble/include/rosbag2_storage -I/usr/include/vtk-9.1 \
    -I/usr/include/jsoncpp -I/usr/include/freetype2 -I/usr/include/x86_64-linux-gnu/qt5 -I/usr/include/x86_64-linux-gnu/qt5/QtOpenGL \
    -I/usr/include/x86_64-linux-gnu/qt5/QtWidgets -I/usr/include/x86_64-linux-gnu/qt5/QtGui \
    -I/usr/include/x86_64-linux-gnu/qt5/QtCore -I/usr/lib/x86_64-linux-gnu/qt5/mkspecs/linux-g++ -isystem /usr/include/eigen3 -isystem /home/autoware/install/tier4_v2x_msgs/include/tier4_v2x_msgs \
    -isystem /home/autoware/install/tier4_api_msgs/include/tier4_api_msgs \
    -isystem /home/autoware/install/tier4_planning_msgs/include/tier4_planning_msgs \
    -isystem /home/autoware/install/autoware_auto_planning_msgs/include/autoware_auto_planning_msgs \
    -isystem /home/autoware/install/autoware_auto_perception_msgs/include/autoware_auto_perception_msgs \
    -isystem /home/autoware/install/autoware_auto_mapping_msgs/include/autoware_auto_mapping_msgs \
    -isystem /home/autoware/install/autoware_adapi_v1_msgs/include/autoware_adapi_v1_msgs \
    -isystem /opt/ros/humble/include/cv_bridge -isystem /opt/ros/humble/include/diagnostic_msgs \
    -isystem /opt/ros/humble/include/geometry_msgs -isystem /opt/ros/humble/include/message_filters \
    -isystem /opt/ros/humble/include/nav_msgs -isystem /opt/ros/humble/include/rclcpp \
    -isystem /opt/ros/humble/include/rclcpp_components -isystem /opt/ros/humble/include/sensor_msgs \
    -isystem /opt/ros/humble/include/tf2 -isystem /opt/ros/humble/include/tf2_eigen \
    -isystem /opt/ros/humble/include/tf2_geometry_msgs -isystem /opt/ros/humble/include/tf2_ros \
    -isystem /opt/ros/humble/include/visualization_msgs \
    -isystem /home/autoware/install/motion_velocity_smoother/include -isystem /home/autoware/install/vehicle_info_util/include -isystem /home/autoware/install/planning_test_utils/include \
    -isystem /home/autoware/install/component_interface_utils/include \
    -isystem /home/autoware/install/tier4_system_msgs/include/tier4_system_msgs \
    -isystem /home/autoware/install/rtc_interface/include \
    -isystem /home/autoware/install/tier4_rtc_msgs/include/tier4_rtc_msgs \
    -isystem /home/autoware/install/route_handler/include -isystem /home/autoware/install/motion_utils/include \
    -isystem /home/autoware/install/interpolation/include -isystem /home/autoware/install/grid_map_utils/include \
    -isystem /home/autoware/install/tier4_autoware_utils/include \
    -isystem /home/autoware/install/tier4_debug_msgs/include/tier4_debug_msgs \
    -isystem /home/autoware/install/osqp_interface/include -isystem /home/autoware/install/lanelet2_extension/include -isystem /home/autoware/install/component_interface_specs/include \
    -isystem /home/autoware/install/autoware_utils/include \
    -isystem /home/autoware/install/autoware_planning_msgs/include/autoware_planning_msgs \
    -isystem /home/autoware/install/autoware_auto_vehicle_msgs/include/autoware_auto_vehicle_msgs \
    -isystem /home/autoware/install/autoware_auto_tf2/include \
    -isystem /home/autoware/install/autoware_auto_system_msgs/include/autoware_auto_system_msgs \
    -isystem /home/autoware/install/autoware_auto_geometry_msgs/include/autoware_auto_geometry_msgs \
    -isystem /home/autoware/install/autoware_auto_control_msgs/include/autoware_auto_control_msgs \
    -isystem /home/autoware/install/autoware_auto_common/include \
    -isystem /opt/ros/humble/include/builtin_interfaces -isystem /opt/ros/humble/include \
    -isystem /opt/ros/humble/include/pluginlib -isystem /opt/ros/humble/include/grid_map_msgs \
    -isystem /opt/ros/humble/include/nav2_msgs -isystem /opt/ros/humble/include/rcutils \
    -isystem /opt/ros/humble/include/rosbag2_cpp -isystem /opt/ros/humble/include/std_msgs \
    -isystem /opt/ros/humble/include/pcl_msgs -isystem /opt/ros/humble/include/unique_identifier_msgs \
    -isystem /opt/ros/humble/include/tf2_msgs -isystem /usr/include/opencv4 -isystem /usr/include/pcl-1.12 \
    -isystem /usr/include/ni \
    -isystem /usr/include/openni2 \
    -Wno-deprecated-declarations -O2 -g -DNDEBUG -fPIC -Wall -Wextra -Wpedantic -Werror -fopenmp -fPIC -std=c++17 \
    -c ./scene_intersection.cpp

With this command you can only compile ./scene_intersection.cpp. First of all I used the current source file and measured how much time it takes to compile

time ./build.sh
--> 41.65 secs

Next I removed everything except for #include statement from this file. This file should have nothing to compile as shown below,

#include <grid_map_cv/grid_map_cv.hpp>
#include <grid_map_ros/grid_map_ros.hpp>
#include <lanelet2_extension/regulatory_elements/road_marking.hpp>
#include <lanelet2_extension/utility/message_conversion.hpp>
#include <lanelet2_extension/utility/query.hpp>
#include <lanelet2_extension/utility/utilities.hpp>

#include <cv_bridge/cv_bridge.h>
#include <magic_enum.hpp>
#include <opencv2/imgproc.hpp>
#include <scene_module/intersection/scene_intersection.hpp>
#include <scene_module/intersection/util.hpp>
#include <utilization/boost_geometry_helper.hpp>
#include <utilization/path_utilization.hpp>
#include <utilization/trajectory_utils.hpp>
#include <utilization/util.hpp>

#include <lanelet2_core/geometry/Polygon.h>
#include <lanelet2_core/primitives/BasicRegulatoryElements.h>

#include <algorithm>
#include <memory>
#include <vector>

namespace behavior_velocity_planner {}

but surprisingly, it took 18 secs! If there are only include statements for the standard library then it completed instantly.

VRichardJP commented 1 year ago

Unfortunately it's a tricky problem to solve. The deeper the more dependencies you have, and the more likely a single header is going to pull a hell lot of things with it.

I think using forward declaration could improve this, but that would mean writing forward.h headers for many packages.