RobotLocomotion / ros-drake-vendor

Maintainer scripts that package Drake in the ROS build farm

Limiting the building threads to 1 for compiling in the ROS buildfarm #1

Closed j-rivero closed 6 months ago

j-rivero commented 9 months ago

While testing the compilation of Drake, I've found that it can take several dozen GB of RAM, especially when processing the Python bindings, since Bazel will launch a bunch of compilation threads.

The Drake buildfarm (if I'm not wrong) uses a user.bazelrc that defines a --jobs parameter calculated from the number of processors and the logic in bazel.cmake. In the ROS buildfarm, the rule is to use single-threaded builds for memory and CPU predictability.

To turn the Bazel build in Drake into a single-threaded build, one option is to use the ament_vendor CMake API to apply a simple patch against tools/bazel.rc:

diff --git a/tools/bazel.rc b/tools/bazel.rc
index 59aedf4..909aac4 100644
--- a/tools/bazel.rc
+++ b/tools/bazel.rc
@@ -1,6 +1,9 @@
 # Don't use bzlmod yet.
 common --enable_bzlmod=false

+# Limit the building threads to 1
+build --jobs=1
+
 # Default to an optimized build.
 build -c opt

I did not find a better way using environment variables or other approaches that don't require patching the source code.

jwnimmer-tri commented 9 months ago

... bazel will launch a bunch of compilation threads.

Yes. By default, Bazel will try to use all of the CPU and RAM resources on the machine.

In the ROS buildfarm the rule is to use single threaded builds for memory and cpu predictability.

Given the current packaging build timing, I'd estimate that a ROS build of Drake using a single-threaded build will take approximately 4 hours (assuming no caching from prior builds). Is that satisfactory?

I did not find a better way ...

If the buildfarm builds are only supposed to use 1 CPU, then to me the obvious way to implement that would be to only provide a single virtualized CPU in the build machine VMs, at the infrastructure level. Why would the buildfarm VMs provide >1 CPU when the policy is that more than one CPU must not be used? Solving this by dialing back every build tool's individual limit seems like playing whack-a-mole.

In any case, if we assume that this needs to be a bazel-specific option, then see the docs at https://bazel.build/run/bazelrc. Instead of patching the source tree, we can put the build --jobs=1 line into an rcfile in either /etc or $HOME. Since this is a buildfarm-specific rule, having the buildfarm set up the file in /etc to match its policies seems like the right place.
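
A minimal sketch of that rcfile approach, assuming the system-wide file is /etc/bazel.bazelrc and the per-user file is $HOME/.bazelrc (the default locations described in the Bazel docs linked above). The helper name `ensure_jobs_limit` is hypothetical, and this is not necessarily how the buildfarm provisions its agents:

```python
from pathlib import Path


def ensure_jobs_limit(rcfile: Path, jobs: int = 1) -> bool:
    """Append `build --jobs=N` to a bazelrc file unless it is already present.

    Returns True if the file was modified, False if it was left untouched.
    """
    line = f"build --jobs={jobs}"
    text = rcfile.read_text() if rcfile.exists() else ""
    if any(entry.strip() == line for entry in text.splitlines()):
        return False  # Already configured; do not add a duplicate entry.

    # Start on a fresh line if the existing file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    rcfile.parent.mkdir(parents=True, exist_ok=True)
    with rcfile.open("a") as handle:
        handle.write(f"{prefix}# Buildfarm policy: single-threaded builds\n{line}\n")
    return True
```

Calling `ensure_jobs_limit(Path.home() / ".bazelrc")` would cover a single user, while `ensure_jobs_limit(Path("/etc/bazel.bazelrc"))` (run with sufficient privileges) would apply the limit machine-wide without touching the Drake source tree.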

j-rivero commented 9 months ago

In the ROS buildfarm the rule is to use single threaded builds for memory and cpu predictability.

Given the current packaging build timing, I'd estimate that a ROS build of Drake using a single-threaded build will take approximately 4 hours (assuming no caching from prior builds). Is that satisfactory?

4 hours might be problematic. If I'm not wrong, the limit for ROS buildfarm release jobs is currently set to 120 minutes for Rolling amd64. I'll check with the rest of the infra team, and I will open another issue to discuss potential reductions of this time.

I did not find a better way ...

If the buildfarm builds are only supposed to use 1 CPU, then to me the obvious way to implement that would be to only provide a single virtualized CPU in the build machine VMs, at the infrastructure level. Why would the buildfarm VMs provide >1 CPU when the policy is that more than one CPU must not be used? Solving this by dialing back every build tool's individual limit seems like playing whack-a-mole.

There is parallelization in the ROS buildfarm, but it happens at the executor level rather than the build level (it can parallelize across packages but uses a single thread for each package).
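
To illustrate the distinction, here is a rough sketch in which several packages build concurrently while each individual build is limited to one job. The package names and the `build_package` stand-in are hypothetical; the real buildfarm schedules jobs through its own executors, not through this code:

```python
from concurrent.futures import ThreadPoolExecutor


def build_package(name: str, jobs: int = 1) -> str:
    """Stand-in for one package build; a real build would invoke the build tool here."""
    return f"{name}: built with --jobs={jobs}"


def build_all(packages: list[str], executors: int) -> list[str]:
    """Run up to `executors` package builds at once, each one single-threaded."""
    with ThreadPoolExecutor(max_workers=executors) as pool:
        # map() preserves the input order even though builds overlap in time.
        return list(pool.map(build_package, packages))
```

In this arrangement the number of concurrent builds is set by the executor count alone, so resource use does not depend on each build tool's own parallelism defaults.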

In any case, if we assume that this needs to be a bazel-specific option, then see the docs at https://bazel.build/run/bazelrc. Instead of patching the source tree, we can put the build --jobs=1 line into an rcfile in either /etc or $HOME. Since this is a buildfarm-specific rule, having the buildfarm set up the file in /etc to match its policies seems like the right place.

+1. I'll send a PR to patch the ROS buildfarm agents.

j-rivero commented 9 months ago

Drafted a PR to be discussed with the ROS infra team: https://github.com/ros-infrastructure/ros_buildfarm/pull/1016

jwnimmer-tri commented 6 months ago

4 hours might be problematic, ... I'll check with the rest of the infra team but will open another issue to discuss potential reductions of this time.

Are there any updates on this side of the question?

I do anticipate that the Drake build will keep growing in size (build time) in future versions, so I'd like to get out in front of any potential challenges there.

j-rivero commented 6 months ago

Are there any updates on this side of the question?

We have discussed this internally in the OSRF infra team. The decision not to support long (and/or memory-intensive) builds was made consciously, to ease the operations (and the cost) of the ROS buildfarm by encouraging users to optimize for resource consumption and build times. This puts Drake in a special case. That said, we have plans to support the Drake compilation:

j-rivero commented 6 months ago

https://github.com/ros-infrastructure/ros_buildfarm/pull/1016 was merged.