RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Excessive system load compiling rigid body tree #2074

Closed mwoehlke-kitware closed 6 years ago

mwoehlke-kitware commented 8 years ago

Some of drake's source files take an excessive amount of time and memory to compile. RigidBodyTree{,SDF,URDF}.cpp are three particular offenders, requiring:

RigidBodyTree.cpp     104 sec, 4.1 GiB
RigidBodyTreeSDF.cpp   83 sec, 3.6 GiB
RigidBodyTreeURDF.cpp  77 sec, 3.5 GiB

Since these files are closely related, and therefore have a tendency to get built at the same time by parallel builds, on systems with only 16 GiB of RAM (assuming some running background applications, such as an IDE and web browser), it's entirely possible for this trio to consume all available memory. In particular, my (current) main machine becomes effectively unusable for tens of minutes when this trio of files hits the compile queue due to thrashing.
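The arithmetic behind this concern can be sketched as follows. The per-file peaks are the figures listed above; the `fits` helper and the 6 GiB reserved for background applications are hypothetical, chosen only to illustrate the check.

```python
# Peak compiler memory per file in GiB, copied from the measurements above.
PEAK_GIB = {
    "RigidBodyTree.cpp": 4.1,
    "RigidBodyTreeSDF.cpp": 3.6,
    "RigidBodyTreeURDF.cpp": 3.5,
}


def concurrent_peak_gib(files):
    """Worst-case memory if all the given files compile at the same time."""
    return sum(PEAK_GIB[name] for name in files)


def fits(files, ram_gib, reserved_gib):
    """True if the files fit in RAM after reserving space for other apps.

    reserved_gib is a hypothetical allowance for an IDE, a browser, etc.
    """
    return concurrent_peak_gib(files) <= ram_gib - reserved_gib


trio = list(PEAK_GIB)
print(f"trio peak: {concurrent_peak_gib(trio):.1f} GiB")
print("fits in 16 GiB with 6 GiB reserved:", fits(trio, 16, 6))
```

With these assumed numbers the trio alone exceeds what is left of a 16 GiB machine, which is consistent with the thrashing described above.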

The first source file can be split into pieces, which helps, but the latter two are much less amenable to this process. From some experimentation, the problem seems to be due to the various joint classes (commenting out all code except #include "joints/DrakeJoints.h" still takes 40 sec, 2.1 GiB).

The above numbers were produced using /usr/bin/time --format=%U,%M to measure hand-compiling the aforementioned files with -g -O2 (roughly equivalent to CMAKE_BUILD_TYPE=RelWithDebInfo). Even with no debug/optimization flags, time and memory use is about half the above numbers, which is still fairly high.
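For scripting the same measurement, a rough Python equivalent of `/usr/bin/time --format=%U,%M` might look like the sketch below. It assumes Linux, where `ru_maxrss` is reported in KiB; the `measure` helper is a hypothetical name, and the command being timed here is a trivial placeholder rather than a real compile.

```python
import resource
import subprocess
import sys


def measure(cmd):
    """Run cmd and return (user_cpu_seconds, peak_rss_kib) for its children.

    Assumes Linux, where ru_maxrss is in KiB. RUSAGE_CHILDREN accumulates
    over every child this process has waited for, so user time is taken as
    a delta. The peak is a high-water mark across all such children, so it
    is only specific to cmd when nothing larger ran earlier.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return after.ru_utime - before.ru_utime, after.ru_maxrss


# Placeholder workload; substitute the real compiler command line.
user_s, peak_kib = measure([sys.executable, "-c", "sum(range(10**6))"])
print(f"{user_s:.2f},{peak_kib}")
```

Running one measurement per Python process keeps the peak figure specific to the command of interest.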

david-german-tri commented 8 years ago

Thanks! This analysis is really helpful. It's a pain point for us too, but as far as I know no one is looking at it yet, so I'll provisionally assign myself.

@liangfok, you may also be interested.

liangfok commented 8 years ago

What does "joint classes" mean in this context?

+1 for a redesign of the URDF / SDF parsers. I propose a RigidBodyTree class that contains no parsers, and separate DrakeURDFParser and DrakeSDFParser classes that inherit from an abstract ModelParser class.

mwoehlke-kitware commented 8 years ago

Thanks, @david-german-tri; it's good to know it's not just me. One of the reasons I wanted to open an issue is because I'm not sure what our minimum system requirements are for building drake, and I could imagine some "average" user with an underpowered machine getting bitten by this.

What does "joint classes" mean in this context?

Subclasses of DrakeJoint. In particular, see those classes defined via includes in drake/systems/plants/joints/DrakeJoints.h.

I have a (very stale) branch (REBASE-rbt-split-templates in my fork) that splits RigidBodyTree.cpp into several pieces, with the worst needing only about 20 sec, 1.5 GiB to compile (most are about 10 sec, 1.0 GiB). Although this takes non-trivially more CPU cycles in total to compile, it's nearly a wash on higher-end machines, and does actually help on mid-range machines. (As an added bonus, the template instantiations are also much more legible.) However, it is not possible to do anything similar for RigidBodyTree{SDF,URDF}.cpp. The problem there seems to be with instantiating the various joint classes, and I'm not aware of anything that can be done short of redesigning those classes to avoid the problem. (This is why I opened an issue rather than a PR; I don't have such a redesign available and wouldn't presume to be the best person to attempt such a change.) I expect that such a redesign would also help RigidBodyTree.cpp considerably.

That said, I'd be happy to look at redoing the changes to split RigidBodyTree.cpp if you feel that would be valuable.

sherm1 commented 8 years ago

On Windows with VS 2015 RelWithDebInfo I'm seeing about 1.5GB in use when compiling those files. That's much bigger than a typical compile but still reasonable.

mwoehlke-kitware commented 8 years ago

On a related note... I tried to do a build last Friday, and noticed 8 compile tasks at about 1.5 GiB each. Unfortunately, this broke my computer :cry: (read: caused it to become entirely non-responsive, and it did not recover after being left alone over the weekend), so I can't do a postmortem to determine what it was trying to build at the time.

jwnimmer-tri commented 8 years ago

The compiler grinding to a halt on these files is really starting to annoy me, too. If there's a patch nearby to break out some code into separate files, that'd be a good start. Once that's in, we could try to recruit someone at TRI to tackle the compiler-bogging problem with drake/systems/plants/joints code.

jwnimmer-tri commented 8 years ago

Perhaps we should also see if this compiler flag can be removed, once this issue is fixed:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /bigobj") # after receiving RigidBodyTree.cpp : fatal error C1128: number of sections exceeded object file format limit: compile with /bigobj

david-german-tri commented 8 years ago

This has generated a bunch of complaints recently, and is a barrier to switching to Ninja, so I'm bumping it to priority: high and will start working on it instead of System2 for the next couple days.

david-german-tri commented 8 years ago

The @mwoehlke-kitware WIP to split RigidBodyTree into pieces is here: https://github.com/mwoehlke-kitware/drake/commit/e7717a062366b81d3b400dddc863c77ce56151f4

jwnimmer-tri commented 8 years ago

@david-german-tri Thanks for this! #2614 in particular made a huge difference. Even though more improvements are possible (and should continue), I wonder if we can call this ticket closed, or at least lower its priority? I haven't needed more than 8 GB during -j8 builds since these changes.

david-german-tri commented 8 years ago

There are two related issues that I've been working under the aegis of this ticket:

  1. The three RigidBodyTree compilation units named in the OP are huge.
  2. The RigidBody* headers pull a large amount of template code into hundreds of downstream compilation units that include them.

I agree with you that good progress has been made on item (2), but for item (1) we're still in the earlier stages. I think breaking the dependencies on drakeGeometryUtil will make a huge difference. So, I'd like to keep this ticket open until that's done, but I've dropped the priority to medium.

liangfok commented 8 years ago

What about the plan to extract the URDF / SDF parsing code from RigidBodyTree? Is that going to be part of this issue or should that be a separate issue?

david-german-tri commented 8 years ago

I'm not planning to factor out URDF/SDF parsing as part of this issue. It's a great idea for many reasons, but I'm not sure it will make a big difference to memory footprint (since the parsers are already separate compilation units).

jwnimmer-tri commented 8 years ago

So, I retract my proposal to close this, and my high watermark evidence of 8G upthread. It turns out that when I switched to Ninja, I stopped compiling in Release mode (CMAKE_BUILD_TYPE was undefined). Having switched back to release mode now, memory is topping out above 16GB even under -j4 again.

mwoehlke-kitware commented 8 years ago

We've gone backwards here somewhere...

joints/RollPitchYawFloatingJoint.cpp  68 sec, 3.4 GiB
parser_urdf.cc                       109 sec, 3.8 GiB
RigidBodyTreeSDF.cpp                 108 sec, 4.5 GiB
RigidBodyTree.cpp                    268 sec, 7.3 GiB

(This is from stats collected from the entire build¹, on Linux using GCC 4.9.2 in RelWithDebInfo mode. The good news is that the next worst offender is examples/Quadrotor/runLQR.cpp at a little under 2 GiB and 41 seconds, after which nothing exceeds 1.5 GiB or 31 seconds.)

(¹ Of drake itself. Stats for the externals were not collected.)

mwoehlke-kitware commented 8 years ago

For grins, here's a plot (log scale) of build time versus memory usage. Most of the build is in a reasonable region between 1-15 sec and about 150 KiB - 1 GiB.

[Figure: build-times]

liangfok commented 8 years ago

Benchmarks 1

System

Commands

$ cd drake-distro
$ rm -rf build
$ rm -rf externals
$ git reset --hard HEAD
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DDISABLE_MATLAB=TRUE -DWITH_DREAL=FALSE -DWITH_SNOPT=FALSE
$ time make -j4

Notice that I disable MATLAB, dReal, and SNOPT. I use -j4 since the CPU has four cores.

SHA

0442d21362dbd23b326238f8d190080f0aae248f

Results

Using Make with 4 Threads

$ time make -j4
...
real    11m17.079s
user    34m42.356s
sys     2m25.955s

Using Make with 8 Threads

Since the CPU supports hyper-threading, I decided to try using -j8. Here are the results:

real    10m48.138s
user    52m32.926s
sys     3m19.787s

I believe the only metric that really matters is "real" (see this article). On this particular machine, using -j8 marginally reduces the "real" build time by about 30 seconds.
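The "real" comparison can be reproduced from the two Make results with a small parser. This is only a sketch: `parse_time_field` is a hypothetical helper that handles the `XmY.YYYs` form shown in this comment and nothing else.

```python
import re


def parse_time_field(text):
    """Convert a shell `time` duration such as '11m17.079s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# "real" values copied from the make -j4 and make -j8 runs.
real_j4 = parse_time_field("11m17.079s")
real_j8 = parse_time_field("10m48.138s")
saved = real_j4 - real_j8
print(f"-j8 saves {saved:.1f} s ({100 * saved / real_j4:.1f}% of the -j4 time)")
```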

Using Ninja with 4 Threads

$ time ninja -j4
...
real    10m37.357s
user    49m45.990s
sys     2m41.122s

Using Ninja with 8 Threads

$ time ninja -j8
...
real    10m41.235s
user    50m15.326s
sys     2m40.993s

david-german-tri commented 8 years ago

@mwoehlke-kitware: Interesting. I think that shows we've gone backwards in two respects:

  1. It remains true that DrakeJoints.h (and in particular RollPitchYawFloatingJoint.h) pulls in a huge amount of template code. We've factored compilation units that include DrakeJoints.h into multiple pieces. However, as @liangfok's metrics point out, and indeed as you suggested upthread, that refactoring doesn't have much effect on user-perceived build times for multicore systems with 16GB or more of RAM. RigidBodyTree and the parsers are a dependency chokepoint for the entire build, so when we hit these compilation-mega-units, there are plenty of cores to spare.
  2. RigidBodyTree.cpp somehow picked up another 3+ GB of RAM usage. That's insane and alarming; it should be bisected for root cause.

Right now, this issue is assigned to me, but I don't have bandwidth to work on it. Someone else is welcome to chip in.

liangfok commented 8 years ago

I'll take on the investigation of why RigidBodyTree takes so much memory to compile since it is very likely my fault. My current suspicion is that it happened when I extracted the parser code into their own .h and .cc files.

Update Sept. 4, 2016: The high memory consumption problem was not caused by the extraction of the parsers. See update 3 below.

liangfok commented 8 years ago

Benchmarks 2

Platform

SHA

Using 0b1910cb7e0b8f0c0f144abe128061845e63e29d (September 1, 2016)

Procedure

I wanted to replicate what @mwoehlke-kitware reported in the original description of this issue. To do this, I first built Drake using VERBOSE=true to get the actual compile commands. I then found the command for compiling RigidBodyTree.cpp.

To manually build RigidBodyTree.cpp, I executed the following commands:

$ cd /home/liang/dev/drake-distro-2/build/drake/systems/plants
$ rm CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M /usr/bin/g++-4.9   -DHAVE_SPDLOG -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/build/drake/exports -I/home/liang/dev/drake-distro-2/build/install/include -I/home/liang/dev/drake-distro-2/build/drake -I/home/liang/dev/drake-distro-2/build/drake/lcmtypes -isystem /home/liang/dev/drake-distro-2/build/install/include/eigen3 -I/home/liang/dev/drake-distro-2/drake/thirdParty/bsd/spruce/include  -Werror=all -Werror=ignored-qualifiers -DGTEST_DONT_DEFINE_FAIL=1 -DGTEST_DONT_DEFINE_SUCCEED=1 -DGTEST_DONT_DEFINE_TEST=1 -O2 -g -DNDEBUG -fPIC -fvisibility=hidden -fvisibility-inlines-hidden   -std=gnu++14 -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp

Results

Here's what the last command above reported:

618.33,7894972

The above numbers indicate that the compile took 618.33 seconds (~10 minutes) of user CPU time and peaked at 7,894,972 kilobytes of memory. The build type was CMAKE_BUILD_TYPE:STRING=RelWithDebInfo. The maximum amount of memory used (7.89GB) is far higher than what was reported in the issue description.
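As a sketch of how the `%U,%M` line can be converted, the snippet below parses it and prints the peak in two units. It assumes that GNU time's `%M` counts units of 1024 bytes, which is an assumption about the tool rather than something stated in this thread; under that assumption, a figure such as 7.89GB is the raw kilobyte count divided by 10^6, and the binary (GiB) value comes out somewhat lower.

```python
def parse_time_output(line):
    """Split a '%U,%M' line from GNU time into (user_seconds, peak_kib)."""
    user_text, peak_text = line.strip().split(",")
    return float(user_text), int(peak_text)


user_s, peak_kib = parse_time_output("618.33,7894972")
print(f"user CPU: {user_s / 60:.1f} min")
# Assumption: %M is in units of 1024 bytes (KiB).
print(f"peak: {peak_kib / 1024**2:.2f} GiB")
print(f"peak: {peak_kib * 1024 / 1e9:.2f} GB (decimal)")
```

This matters mainly when comparing against the GiB figures in the issue description, which may use a different convention.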

Update 1 - Testing an April 12, 2016 Version of Drake

Using SHA a974831f0716fbcd7890b4fb6c0f2402bbb9acd0 (April 12, 2016):

$ cd /home/liang/dev/drake-distro-2/drake/pod-build/systems/plants
$ rm ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M  /usr/bin/g++-4.9   -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/pod-build/generated -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/drake/pod-build/exports -isystem /home/liang/dev/drake-distro-2/build/include -I/home/liang/dev/drake-distro-2/drake/pod-build/lcmgen -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /home/liang/dev/drake-distro-2/build/include/eigen3 -I/home/liang/dev/drake-distro-2/drake/thirdParty/spruce/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/cimg  -Wreturn-type -Wuninitialized -Wunused-variable -std=c++11 -O2 -g -DNDEBUG -fPIC   -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp
106.73,3907428
$ ls -lah ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
-rw-rw-r-- 1 liang liang 183M Sep  3 21:46 ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o

So it's true. Compiling the April 12, 2016 version of RigidBodyTree.cpp consumed 3.9GB of memory. This is much less than the 7.9GB of memory needed on September 1, 2016. It also closely matches @mwoehlke-kitware's measurements posted in this issue's description.

Update 2 - Testing a July 7, 2016 Version of Drake

On July 6, 2016, https://github.com/RobotLocomotion/drake/issues/2074#issuecomment-230825388 pointed out that RigidBodyTree.cpp regressed in terms of compiler memory footprint. The following tests a commit from July 7, 2016:

Using SHA 854a4589ebd5eb2a85f19fa4dd3bea854d2c9290 (July 7, 2016):

$ cd /home/liang/dev/drake-distro-2/drake/pod-build/systems/plants
$ rm ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M  /usr/bin/g++-4.9   -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"drake\" -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/pod-build/generated -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/drake/pod-build/exports -I/home/liang/dev/drake-distro-2/build/include -I/opt/ros/indigo/include -I/home/liang/dev/drake-distro-2/drake/pod-build -I/home/liang/dev/drake-distro-2/drake/pod-build/lcmtypes -isystem /home/liang/dev/drake-distro-2/build/include/eigen3 -I/home/liang/dev/drake-distro-2/drake/thirdParty/spruce/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/cimg  -Werror=all -Werror=ignored-qualifiers -DGTEST_DONT_DEFINE_FAIL=1 -DGTEST_DONT_DEFINE_SUCCEED=1 -DGTEST_DONT_DEFINE_TEST=1 -Wno-sign-compare -O2 -g -DNDEBUG -fPIC   -std=gnu++14 -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp
268.07,7577520
$ ls -lah ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
-rw-rw-r-- 1 liang liang 385M Sep  3 23:52 ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o

These results confirm that on July 7, 2016, RigidBodyTree.cpp required 7.6GB of RAM to compile. This is in contrast to April 12, 2016, where only 3.9GB was required.

Update 3 - Testing a June 28, 2016 Version of Drake

Does the switch from C++11 to C++14 make a difference?

One change from April 12 to July 7 is the switch from C++11 to C++14. This occurred in dba30b30d09e7fd98441570fecc4fa7852a03e3b. The immediately preceding commit is b3028662c4c76d6a339bdff8b2dfde0fe4180203. Unfortunately, it does not compile due to lcm-lua failing to find lua.h. Thus, I test the commit that immediately precedes b3028662c4c76d6a339bdff8b2dfde0fe4180203, which is ea0fe6362cfa959a39a961055da7038eb3da8498.

$ cd /home/liang/dev/drake-distro-2/drake/pod-build/systems/plants
$ rm ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M /usr/bin/g++-4.9   -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"drake\" -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/pod-build/generated -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/drake/pod-build/exports -isystem /home/liang/dev/drake-distro-2/build/include -I/opt/ros/indigo/include -I/home/liang/dev/drake-distro-2/drake/pod-build/lcmgen -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /home/liang/dev/drake-distro-2/build/include/eigen3 -I/home/liang/dev/drake-distro-2/drake/thirdParty/spruce/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/cimg  -Werror=all -Werror=ignored-qualifiers -DGTEST_DONT_DEFINE_FAIL=1 -DGTEST_DONT_DEFINE_SUCCEED=1 -DGTEST_DONT_DEFINE_TEST=1 -Wno-sign-compare -O2 -g -DNDEBUG -fPIC   -std=gnu++11 -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp
250.36,7470296
$ ls -lah ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
-rw-rw-r-- 1 liang liang 385M Sep  4 13:55 ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o

The results indicate that switching from C++11 to C++14 did not result in the increase in memory utilization. Version ea0fe6362cfa959a39a961055da7038eb3da8498 from June 28, 2016, requires 7.47GB of RAM to compile RigidBodyTree.cpp. Note that since this version is prior to the extraction of the parser code into parser_urdf.cc and parser_sdf.cc, we now know that the memory problem was not introduced by the extraction of the parsers.

Update 4 - Testing a June 1, 2016 Version of Drake

Arbitrarily select a SHA on June 1, 2016: 920bfcfe5b30bd30d27e378bf4194734a8bb28e7. This is an attempt to isolate the memory problem by bisecting the range of dates over which the problem must have been introduced.

$ cd /home/liang/dev/drake-distro-2/drake/pod-build/systems/plants
$ rm ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M /usr/bin/g++-4.9   -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"drake\" -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/pod-build/generated -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/drake/pod-build/exports -isystem /home/liang/dev/drake-distro-2/build/include -I/home/liang/dev/drake-distro-2/drake/pod-build/lcmgen -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /home/liang/dev/drake-distro-2/build/include/eigen3 -I/opt/ros/indigo/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/spruce/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/cimg  -Werror=all -Wno-sign-compare -DGTEST_DONT_DEFINE_FAIL=1 -DGTEST_DONT_DEFINE_SUCCEED=1 -DGTEST_DONT_DEFINE_TEST=1 -O2 -g -DNDEBUG -fPIC   -std=gnu++11 -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp
320.11,7371368
$ ls -lah ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
-rw-rw-r-- 1 liang liang 380M Sep  4 15:33 ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o

The excessive memory utilization problem existed prior to June 1, 2016. The results above show that on June 1, 2016, compiling RigidBodyTree.cpp required 7.37GB of RAM. We now know that the problem arose sometime between April 12, 2016 and June 1, 2016.

Update 5 - Testing a May 1, 2016 Version of Drake

Arbitrarily select a SHA on May 1, 2016: 729e64b2e4b03cb6fa9471b6aabf96415ef737a7. This is an attempt to isolate the memory problem by bisecting the range of dates over which the problem must have been introduced.

$ cd /home/liang/dev/drake-distro-2/drake/pod-build/systems/plants
$ rm ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M /usr/bin/g++-4.9   -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"drake\" -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/pod-build/generated -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/drake/pod-build/exports -isystem /home/liang/dev/drake-distro-2/build/include -I/home/liang/dev/drake-distro-2/drake/pod-build/lcmgen -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /home/liang/dev/drake-distro-2/build/include/eigen3 -I/opt/ros/indigo/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/spruce/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/cimg  -Wreturn-type -Wuninitialized -Wunused-variable -std=c++11 -O2 -g -DNDEBUG -fPIC   -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp
255.19,7371684
$ ls -lah ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
-rw-rw-r-- 1 liang liang 380M Sep  4 18:18 ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o

The excessive memory utilization problem existed prior to May 1, 2016. The results above show that on May 1, 2016, compiling RigidBodyTree.cpp required 7.37GB of RAM. We now know that the problem arose sometime between April 12, 2016 and May 1, 2016.

Update 6 - Testing an April 21, 2016 Version of Drake

Arbitrarily select a SHA on April 21, 2016: d6beee40827c327ba637297bb9ae891344f48321. This is an attempt to isolate the memory problem by bisecting the range of dates over which the problem must have been introduced.

$ cd /home/liang/dev/drake-distro-2/drake/pod-build/systems/plants
$ rm ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M /usr/bin/g++-4.9   -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"drake\" -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/pod-build/generated -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/drake/pod-build/exports -isystem /home/liang/dev/drake-distro-2/build/include -I/home/liang/dev/drake-distro-2/drake/pod-build/lcmgen -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /home/liang/dev/drake-distro-2/build/include/eigen3 -I/opt/ros/indigo/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/spruce/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/cimg  -Wreturn-type -Wuninitialized -Wunused-variable -std=c++11 -O2 -g -DNDEBUG -fPIC   -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp
289.13,7371724
$ ls -lah ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
-rw-rw-r-- 1 liang liang 380M Sep  4 20:19 ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o

The excessive memory utilization problem existed prior to April 21, 2016. The results above show that on April 21, 2016, compiling RigidBodyTree.cpp required 7.37GB of RAM. We now know that the problem arose sometime between April 12, 2016 and April 21, 2016.

Update 7 - Testing an April 18, 2016 Version of Drake

Arbitrarily select a SHA on April 18, 2016: c678bc7373bf69639503288191e1139d49c153a5. This is an attempt to isolate the memory problem by bisecting the range of dates over which the problem must have been introduced.

$ cd /home/liang/dev/drake-distro-2/drake/pod-build/systems/plants
$ rm ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
$ /usr/bin/time --format=%U,%M /usr/bin/g++-4.9   -DdrakeRBM_EXPORTS -I/home/liang/dev/drake-distro-2/drake/pod-build/generated -I/home/liang/dev/drake-distro-2/drake/.. -I/home/liang/dev/drake-distro-2/drake/pod-build/exports -isystem /home/liang/dev/drake-distro-2/build/include -I/home/liang/dev/drake-distro-2/drake/pod-build/lcmgen -isystem /usr/include/glib-2.0 -isystem /usr/lib/x86_64-linux-gnu/glib-2.0/include -isystem /home/liang/dev/drake-distro-2/build/include/eigen3 -I/home/liang/dev/drake-distro-2/drake/thirdParty/spruce/include -I/home/liang/dev/drake-distro-2/drake/thirdParty/cimg  -Wreturn-type -Wuninitialized -Wunused-variable -std=c++11 -O2 -g -DNDEBUG -fPIC   -o CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o -c /home/liang/dev/drake-distro-2/drake/systems/plants/RigidBodyTree.cpp
101.84,3906988
$ ls -lah ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o
-rw-rw-r-- 1 liang liang 183M Sep  4 23:27 ./CMakeFiles/drakeRBM.dir/RigidBodyTree.cpp.o

The excessive memory utilization problem did not exist prior to April 18, 2016. The results above show that on April 18, 2016, compiling RigidBodyTree.cpp required 3.9GB of RAM. We now know that the problem arose sometime between April 18, 2016 and April 21, 2016.
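The date-by-date search in the updates above amounts to a bisection over commits. The sketch below replays it on the recorded peak-memory figures. The 5,000,000 KB threshold separating "good" from "bad" is a hypothetical cutoff, and the search assumes the regression never reverted between the first and last measurement.

```python
# (date, commit prefix, peak KB) for RigidBodyTree.cpp, from the runs above,
# ordered by commit date.
MEASUREMENTS = [
    ("2016-04-12", "a974831f", 3907428),
    ("2016-04-18", "c678bc73", 3906988),
    ("2016-04-21", "d6beee40", 7371724),
    ("2016-05-01", "729e64b2", 7371684),
    ("2016-06-01", "920bfcfe", 7371368),
    ("2016-06-28", "ea0fe636", 7470296),
    ("2016-07-07", "854a4589", 7577520),
]

THRESHOLD_KB = 5_000_000  # hypothetical cutoff between "good" and "bad"


def is_bad(entry):
    """A measurement is 'bad' if its peak memory exceeds the cutoff."""
    return entry[2] > THRESHOLD_KB


def bisect_first_bad(entries):
    """Return (last_good, first_bad).

    Assumes entries[0] is good, entries[-1] is bad, and that every entry
    after the first bad one is also bad.
    """
    lo, hi = 0, len(entries) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(entries[mid]):
            hi = mid
        else:
            lo = mid
    return entries[lo], entries[hi]


good, bad = bisect_first_bad(MEASUREMENTS)
print("last good:", good[0], good[1])
print("first bad:", bad[0], bad[1])
```

In practice `git bisect` automates the same search directly on the commit history; the table here only replays the measurements already collected.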

Comparing the non-problematic version on April 18, 2016 (c678bc7373bf69639503288191e1139d49c153a5) with the problematic version on April 21, 2016 (d6beee40827c327ba637297bb9ae891344f48321), I believe I found the problem. The screenshot below shows a diff of RigidBodyTree.h in the two versions. The non-problematic version is on the left while the problematic version is on the right. Notice that the problematic version includes DrakeJoints.h:

[Screenshot: header_includes]

This header file was included due to the addition of RigidBodyTree::AddFloatingJoint().

[Screenshot: method_added]

I suspect the inclusion of DrakeJoints.h in RigidBodyTree.h results in the much higher memory footprint while compiling RigidBodyTree.cpp.

Note that @david-german-tri modified RigidBodyTree.h to include DrakeJoint.h instead of DrakeJoints.h on June 21, 2016 (https://github.com/RobotLocomotion/drake/commit/9fabbc1a1e93751d8512c023fc3c04a7c08bc437). Update 3 above tested a version of Drake after this optimization and it still took > 7GB of RAM to build RigidBodyTree.cpp.

liangfok commented 8 years ago

Optimization 1

I extracted RigidBodyTree::AddFloatingJoint() and the floating base types into their own .h and .cc files. See: https://github.com/liangfok/drake/tree/feature/extract_joint_types_and_add_floating_joints.

Benchmark Platform

Lenovo T430 laptop with an Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, and 16GB DDR3 1600 MHz RAM, running Ubuntu 14.04.

Benchmark Results

Optimized Custom Branch

The overall benchmark results for https://github.com/liangfok/drake/commit/51e481dea44aecb13d2ba4547911a6a4b73fe53e show that building Drake takes 1:03:51 of wall clock time, 10,344 user mode CPU seconds (2.87 user mode CPU hours), and 5.1 GB of RAM.

$ cd drake-distro
$ rm -rf build
$ rm -rf externals
$ git reset --hard HEAD
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DDISABLE_MATLAB=TRUE -DWITH_DREAL=FALSE -DWITH_SNOPT=FALSE 
$ /usr/bin/time --format=%E,%U,%M make -j4
1:03:51,10344.23,5124420
$ du -ch | grep total
6.1G    total

Building 51e481dea44aecb13d2ba4547911a6a4b73fe53e again, the total RAM footprint is 5.1GB. This matches the previous test and shows some level of consistency. The wall clock time varies quite a lot.

$ /usr/bin/time --format=%E,%U,%M make -j4
47:47.54,8183.17,5114984

Unoptimized Head of Master Branch

Using the current head of master (0b1910cb7e0b8f0c0f144abe128061845e63e29d), building Drake took 59:45.99 wall clock time, 9,080 user mode CPU seconds (2.52 hours), and 7.9GB of RAM:

$ cd drake-distro
$ rm -rf build
$ rm -rf externals
$ git reset --hard HEAD
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DDISABLE_MATLAB=TRUE -DWITH_DREAL=FALSE -DWITH_SNOPT=FALSE 
$ /usr/bin/time --format=%E,%U,%M make -j4
59:45.99,9080.09,7900176
$ du -ch | grep total
6.1G    total

Building the master (0b1910cb7e0b8f0c0f144abe128061845e63e29d) again, it took 1:07:58 of wall clock time, 10,221 user mode CPU seconds (2.8 hours), and 7.9GB of RAM.

$ /usr/bin/time --format=%E,%U,%M   make -j4
...
1:07:58,10221.44,7900800

Conclusions

Wall-clock time (about an hour) and CPU time are roughly the same for both versions, within run-to-run variation. However, the memory utilization is far less in https://github.com/liangfok/drake/commit/51e481dea44aecb13d2ba4547911a6a4b73fe53e (5.1 GB of RAM) versus the latest on master, which is 0b1910cb7e0b8f0c0f144abe128061845e63e29d (7.9GB of RAM).
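The four `%E,%U,%M` lines above can be summarized with a short script. This is a sketch only: `parse_elapsed`, `parse_run`, and `summarize` are hypothetical helpers, and with two runs per configuration the wall-clock spread is too large to read much into the timing difference.

```python
from statistics import mean


def parse_elapsed(text):
    """Convert GNU time %E output ('h:mm:ss' or 'm:ss.ss') to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_run(line):
    """Split a '%E,%U,%M' line into (wall_s, user_s, peak_kb)."""
    elapsed, user, peak = line.strip().split(",")
    return parse_elapsed(elapsed), float(user), int(peak)


# Result lines copied from the runs above.
OPTIMIZED = ["1:03:51,10344.23,5124420", "47:47.54,8183.17,5114984"]
MASTER = ["59:45.99,9080.09,7900176", "1:07:58,10221.44,7900800"]


def summarize(lines):
    """Return the wall-clock range and mean peak memory for a set of runs."""
    runs = [parse_run(line) for line in lines]
    walls = [run[0] for run in runs]
    return {
        "wall_min_s": min(walls),
        "wall_max_s": max(walls),
        "mean_peak_kb": mean(run[2] for run in runs),
    }


opt, base = summarize(OPTIMIZED), summarize(MASTER)
reduction = 1 - opt["mean_peak_kb"] / base["mean_peak_kb"]
for name, s in (("optimized", opt), ("master", base)):
    print(f"{name}: wall {s['wall_min_s'] / 60:.1f}-{s['wall_max_s'] / 60:.1f} min")
print(f"peak memory reduction: {reduction:.1%}")
```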

david-german-tri commented 8 years ago

Great analysis and results! I'd like to be a reviewer of this PR. As a preview, here are my top two comments/questions from skimming your branch.

liangfok commented 8 years ago

Benchmarks 3

I use my HP z460 workstation to compare the latest head of master with the extraction of DrakeJoint::FloatingBaseType and RigidBodyTree::AddFloatingJoints() into their own compilation units.

Platform

Commands

Since the workstation has 12 cores, I use -j12.

$ cd drake-distro
$ rm -rf build
$ rm -rf externals
$ git reset --hard HEAD
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DDISABLE_MATLAB=TRUE -DWITH_DREAL=FALSE -DWITH_SNOPT=FALSE 
$ /usr/bin/time --format=%E,%U,%M make -j12
$ du -ch | grep total

Results

Using the latest on master (8e6c585c73e573133a3cf76b02972e5c00ab433d):

$ /usr/bin/time --format=%E,%U,%M make -j12
16:29.95,4813.20,7903908
$ du -ch | grep total
6.1G    total

Using the optimized branch (https://github.com/liangfok/drake/commit/1bff65ce81aa9fe608eb23a87df46380e392a5e3):

$ /usr/bin/time --format=%E,%U,%M make -j12
13:13.46,4788.78,5097876
$ du -ch | grep total
6.1G    total

Conclusions

Extracting DrakeJoint::FloatingBaseType and RigidBodyTree::AddFloatingJoint() into their own compilation units decreased build times from ~16 minutes to ~13 minutes and, more importantly, decreased the maximum memory footprint from 7.9 GB to 5.1 GB.

Update 1 (Sept. 19, 2016) - Using clang + ninja instead of make + gcc

Using the same workstation as mentioned above and 7f48705eaca4215d43cd90efa3e710725aecbaab:

$ cd drake-distro/build
$ cmake .. -G Ninja -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DWITH_DREAL=FALSE -DDISABLE_MATLAB=TRUE
$ /usr/bin/time --format=%E,%U,%M ninja -j24
12:11.92,3777.46,3205664
$ du -ch | grep total
6.5G    total

Wow, ninja + clang uses significantly less memory than make + gcc (3.2GB vs. 5.1GB). ninja + clang is only about a minute faster (12 minutes vs. 13 minutes).
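One rough way to read the three workstation runs is the ratio of user CPU time to wall-clock time, which approximates how many cores were kept busy on average. The sketch below computes it from the result lines above. The `busy_cores` helper is hypothetical, and the ratio ignores system time, so it is an indicator rather than a precise utilization figure.

```python
def to_seconds(elapsed):
    """Convert GNU time %E output ('h:mm:ss' or 'm:ss.ss') to seconds."""
    total = 0.0
    for part in elapsed.split(":"):
        total = total * 60 + float(part)
    return total


# '%E,%U,%M' result lines copied from the runs above.
RUNS = {
    "master, make + gcc, -j12": "16:29.95,4813.20,7903908",
    "optimized branch, make + gcc, -j12": "13:13.46,4788.78,5097876",
    "7f48705e, ninja + clang, -j24": "12:11.92,3777.46,3205664",
}


def busy_cores(line):
    """User CPU seconds divided by wall-clock seconds for one result line."""
    elapsed, user, _peak = line.split(",")
    return float(user) / to_seconds(elapsed)


for label, line in RUNS.items():
    print(f"{label}: about {busy_cores(line):.1f} cores busy on average")
```

All three ratios fall well below the 12 available cores, which fits the earlier observation that the large compilation units act as a chokepoint.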

liangfok commented 8 years ago

Are we really sure that AddFloatingJoint belongs to drake::parsers? That approach (a) is semantically weird if we ever need to add a floating joint outside a parser and (b) creates a bunch of new dependencies on RigidBodyTree public data. Another option, which I also don't love, would be to leave AddFloatingJoint as a member of RigidBodyTree, and just move the implementation to a separate .cc file. Maybe there is a third option?

Yes, I believe RigidBodyTree::AddFloatingJoint() should be part of parsers since we only anticipate it ever being called by the parsers. Recall that it was originally added as a way to connect newly added models to an existing RigidBodyTree. It does this by searching through all bodies in the tree for those that are parent-less, and adding floating joints to them. This can be thought of as a hack that was needed based on the limitations of the parsers (At the time, parsers couldn't keep track of which models they were adding. Now, with the introduction of the model_instance_id, this is no longer the case.). Longer-term, I expect the parsers to be able to automatically add the floating joints as it is parsing the model and adding a model instance, meaning this method will no longer be necessary.

Note that even if we remove this method from RigidBodyTree, we can continue to programmatically add floating joints using a combination of the following two methods:

Regarding the concern about introducing new dependencies, my longer term plan is to pull the parsers into their own library, which is then linked against the existing drakeRBM library. In other words, users will interact directly with the parsers rather than go through the RigidBodySystem and RigidBodyTree to add models to them.

Factoring out the joint types into a separate header is a great idea. It would help readability to also format and namespace them properly in this PR, e.g. drake::systems::joints::kFixed.

Yeah, I agree. In the spirit of incremental PRs, however, I will probably not initially namespace the floating base types. I do have another WIP branch that adds namespaces. I'll submit that PR after the initial PR that brings down the memory footprint.

sherm1 commented 8 years ago

BTW, for another approach to adding floating joints see Simbody's MultibodyGraphMaker class which is independent of the parser and independent of the multibody tree. It is structured as a utility that absorbs body and joint information (as obtained by a parser typically) and then spits out a spanning tree plus loop constraints design for building the multibody tree, including any needed floating joints. We could consider that more flexible approach at some point, although I agree that Liang's proposal is an improvement.
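As a rough illustration of the "spanning tree plus loop constraints" idea (not Simbody's actual algorithm or API), the sketch below classifies joints with a union-find structure: a joint that connects two previously unconnected groups of bodies stays in the tree, a joint that would close a cycle becomes a loop constraint, and each group left unconnected to ground gets one floating joint. All names here (`GraphDesign`, `FindRoot`, `MakeGraphDesign`) are hypothetical, and body 0 is assumed to be ground.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical types for illustration; this is not Simbody's API.
struct GraphDesign {
  std::vector<int> tree_joints;      // Joints kept in the spanning tree.
  std::vector<int> loop_joints;      // Joints to model as loop constraints.
  std::vector<int> floating_bodies;  // Bodies needing a floating joint.
};

// Union-find root lookup with path halving.
int FindRoot(std::vector<int>* parent, int x) {
  while ((*parent)[x] != x) {
    (*parent)[x] = (*parent)[(*parent)[x]];
    x = (*parent)[x];
  }
  return x;
}

// Body 0 is assumed to be ground. Each joint connects two body indices.
GraphDesign MakeGraphDesign(int num_bodies,
                            const std::vector<std::pair<int, int>>& joints) {
  std::vector<int> parent(num_bodies);
  std::iota(parent.begin(), parent.end(), 0);
  GraphDesign design;
  for (int j = 0; j < static_cast<int>(joints.size()); ++j) {
    const int a = FindRoot(&parent, joints[j].first);
    const int b = FindRoot(&parent, joints[j].second);
    if (a == b) {
      design.loop_joints.push_back(j);  // Would close a cycle.
    } else {
      parent[a] = b;  // Merge the two groups.
      design.tree_joints.push_back(j);
    }
  }
  // Add one floating joint per group that is not connected to ground.
  for (int body = 1; body < num_bodies; ++body) {
    const int root = FindRoot(&parent, body);
    const int ground_root = FindRoot(&parent, 0);
    if (root != ground_root) {
      design.floating_bodies.push_back(body);
      parent[root] = ground_root;
    }
  }
  return design;
}
```

This only classifies joints and picks the lowest-indexed body of each free group as its floating base; a real graph maker would also orient the tree outward from ground and choose the base body more carefully.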

liangfok commented 8 years ago

Benchmarks 4

System

Commands

$ cmake .. -G Ninja -DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo -DDISABLE_MATLAB=TRUE
$ /usr/bin/time --format=%E,%U,%M ninja

SHA

3af99f19790b2341cc4901fa871cacd6d14c7634

Results

18:34.81,5875.01,3223556

Building Drake took about 18.5 minutes of wall-clock time, 1.63 CPU hours, and 3.2 GB of RAM.
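The conversion from the raw `%E,%U,%M` output to those figures can be sketched as follows. This assumes `%E` is formatted as `[hours:]minutes:seconds`, `%U` is user CPU seconds, and `%M` is the maximum resident set size in KiB; `BuildStats`, `ElapsedToSeconds`, and `Summarize` are hypothetical helper names.

```cpp
#include <sstream>
#include <string>

// Converts a GNU time %E value ("[h:]mm:ss.ss") to seconds.
double ElapsedToSeconds(const std::string& elapsed) {
  std::istringstream in(elapsed);
  std::string field;
  double seconds = 0.0;
  while (std::getline(in, field, ':')) {
    seconds = seconds * 60.0 + std::stod(field);
  }
  return seconds;
}

struct BuildStats {
  double wall_minutes;
  double cpu_hours;
  double avg_busy_cores;  // User CPU time divided by wall-clock time.
  double peak_rss_gib;    // Assumes the input is in KiB.
};

BuildStats Summarize(const std::string& elapsed, double user_seconds,
                     long max_rss_kib) {
  const double wall_seconds = ElapsedToSeconds(elapsed);
  return BuildStats{wall_seconds / 60.0, user_seconds / 3600.0,
                    user_seconds / wall_seconds,
                    max_rss_kib / (1024.0 * 1024.0)};
}
```

Calling `Summarize("18:34.81", 5875.01, 3223556)` gives about 18.58 wall-clock minutes and 1.63 CPU hours, i.e. an average of roughly 5.3 busy cores, and about 3.07 GiB under the KiB assumption, close to the rounded memory figure quoted above.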

jwnimmer-tri commented 7 years ago

So here's a good way to profile the build:

$ bazel build --profile profile.bin //...
$ bazel analyze-profile profile.bin --dump=raw > profile.csv
$ grep 'ACTION_EXECUTE.*Compiling' profile.csv |
  sort -t\| -k5,5rn | head -n40 | cut -d\| -f5,8 | 
  perl -pe 's/^(\d+)\d{9}\|/\1s /;' | 
  awk '{if ((NR-1) % 2 ==0) print}'

For me with default Bazel config (GCC, release), it blames:

177s Compiling drake/multibody/rigid_body_tree.cc
116s Compiling drake/multibody/parsers/sdf_parser.cc
107s Compiling drake/multibody/parsers/urdf_parser.cc
95s Compiling drake/multibody/parsers/parser_common.cc
85s Compiling drake/multibody/test/rigid_body_tree/rigid_body_collision_clique_test.cc
64s Compiling drake/multibody/collision/test/collision_filter_group_test.cc
53s Compiling drake/multibody/joints/roll_pitch_yaw_floating_joint.cc
49s Compiling drake/multibody/joints/roll_pitch_yaw_floating_joint.cc
45s Compiling drake/common/test/symbolic_mixing_scalar_types_test.cc
36s Compiling drake/multibody/constraint/rigid_body_constraint.cc
35s Compiling drake/systems/analysis/test/runge_kutta3_integrator_test.cc
34s Compiling drake/multibody/rigid_body_plant/test/compute_contact_result_test.cc
31s Compiling drake/solvers/test/optimization_examples.cc
30s Compiling drake/solvers/test/optimization_examples.cc
28s Compiling drake/multibody/joints/quaternion_floating_joint.cc
26s Compiling drake/systems/framework/test/diagram_test.cc
26s Compiling drake/systems/analysis/test/simulator_test.cc
24s Compiling drake/systems/framework/test/diagram_builder_test.cc
24s Compiling drake/systems/controllers/test/pid_controlled_system_test.cc
23s Compiling drake/multibody/joints/quaternion_floating_joint.cc

liangfok commented 7 years ago

Latest stats using Puget workstation:

$ bazel build --profile profile.bin //...
...........
INFO: Writing profile data to '/home/liang/dev/drake-distro-1/profile.bin'
WARNING: /home/liang/dev/drake-distro-1/drake/util/BUILD:63:1: target '//drake/util:app_util' is deprecated: Please use gflags instead of drakeAppUtil.h.
WARNING: /home/liang/.cache/bazel/_bazel_liang/ede03c0a430a52111efe35db021d2956/external/drake_visualizer/BUILD:2:48: soft_failure.bzl: @drake_visualizer//:drake-visualizer does not work because /home/liang/dev/drake-distro-1/build/install/bin/drake-visualizer was missing.
INFO: Found 2337 targets...
INFO: From Executing genrule //drake/automotive:speed_bump_genrule:
[2017-03-31 09:57:50.770] [console] [info] Loading road geometry.
[2017-03-31 09:57:50.772] [console] [info] Generating OBJ.
INFO: Elapsed time: 301.216s, Critical Path: 271.22s

Leader board:

$ grep 'ACTION_EXECUTE.*Compiling' profile.csv |
>   sort -t\| -k5,5rn | head -n40 | cut -d\| -f5,8 | 
>   perl -pe 's/^(\d+)\d{9}\|/\1s /;' | 
>   awk '{if ((NR-1) % 2 ==0) print}'
162s Compiling drake/multibody/rigid_body_tree.cc
71s Compiling drake/multibody/parsers/sdf_parser.cc
68s Compiling drake/multibody/parsers/urdf_parser.cc
68s Compiling drake/multibody/parsers/urdf_parser.cc
67s Compiling drake/multibody/parsers/sdf_parser.cc
64s Compiling drake/multibody/parsers/parser_common.cc
62s Compiling drake/multibody/parsers/parser_common.cc
57s Compiling drake/examples/Quadrotor/quadrotor_plant.cc
52s Compiling drake/math/discrete_algebraic_riccati_equation.cc
47s Compiling drake/solvers/moby_lcp_solver.cc
47s Compiling drake/solvers/moby_lcp_solver.cc
44s Compiling drake/examples/Acrobot/acrobot_run_lqr_w_estimator.cc
42s Compiling drake/examples/Acrobot/acrobot_plant.cc
40s Compiling drake/examples/QPInverseDynamicsForHumanoids/system/manipulator_inverse_dynamics_controller.cc
39s Compiling drake/automotive/single_lane_ego_and_agent.cc
38s Compiling drake/examples/QPInverseDynamicsForHumanoids/system/test/humanoid_plan_eval_system_test.cc
37s Compiling drake/examples/QPInverseDynamicsForHumanoids/system/test/qp_controller_system_test.cc
37s Compiling drake/examples/Valkyrie/test/robot_state_encoder_decoder_test.cc
37s Compiling drake/systems/framework/test/diagram_builder_test.cc
36s Compiling drake/examples/kuka_iiwa_arm/iiwa_world/iiwa_wsg_diagram_factory.cc

jwnimmer-tri commented 7 years ago

FYI my WIP branch on fixing this is https://github.com/jwnimmer-tri/drake/tree/rbt-build-time. I haven't had a chance to correctly reprofile and tune up the results, but it's a solution framework.

jwnimmer-tri commented 6 years ago

Hopefully with #8442 and #8543 merged, this is now "good enough".