Closed daljit46 closed 4 months ago
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
Managed to obtain further improvements by enabling PCHs for executables too (previously they were only used for the shared libraries). The new timings look like below:
dev | new | |
---|---|---|
MacOS 14 (AppleClang) | 124s | 84s |
Fedora Asahi (clang 17) | 139s | 84s |
Fedora Asahi (gcc 13) | 154s | 103s |
clang-tidy review says "All clean, LGTM! :+1:"
clang-tidy review says "All clean, LGTM! :+1:"
Are you able to glean insight into what it is that makes the mrregister
/ mrtransform
compilation so long? That would potentially be the place to be looking into precompiled headers / precompiled template instantiations.
Are you able to glean insight into what it is that makes the
mrregister
/mrtransform
compilation so long? That would potentially be the place to be looking into precompiled headers / precompiled template instantiations.
Looking at the output of ClangBuildAnalyzer
, for mrregister
it shows that only 20% of the time is spent in the compiler's front-end (parsing and instantiation templates), while the rest of the time the compiler is busy optimising the code (most likely on inlined templated functions).
Here is the json trace file for the command obtained by compiling with CMAKE_CXX_FLAGS="-ftime-trace"
(you can read it using https://ui.perfetto.dev)
mrregister.cpp.json
Hence I think the best course of action would be to split this file into multiple files (or at least move parts of its content somewhere else) so that the compiler can operate multithreaded.
Further finding on mrregister
: on my desk PC (a 20-core Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz), the compilation of mrregister.cpp.o
takes more than the time required to build all other the object files in the project (nearly 4 minutes)!
I noticed that registration/transform/base.h
took a very long time to parse. I moved some functions from the header to a new corresponding source file and this provided a nice build time improvement for mrregister
(about 10%). On my work desk PC compile times went down from 235 seconds to 212 seconds.
This is a pattern I noticed a lot in our codebase: there are many instances where (non-templated) functions are unnecessarily defined in header files. We should try to improve on this (at least for new code).
Another finding is that the main culprit of large compilation times for mrregister
is the compilation of all instantiations of Linear::run_masked
(in src/registration/linear.h
) which is responsible for at least 50% of the compilation times.
This is now ready for review. I've cleaned up a few things and added more explicit instantiations to reduce the build time for the registration code. I've tested the changes on three different machines. Generally speaking, you should observe an improvement of between 30-50% in compile times (the changes are most effective when compiling with Clang rather than GCC).
The compilation of mrregister
, while significantly faster than on dev
, remains a bottleneck. This is because mrregister.cpp
contains many template instantiations that take a lot of time to compile. The only viable strategy to mitigate this is to move these instantiations to separate .cpp files. I've already done this for some templates, but to see even further improvements one would need to (very tediously) take this further. However, this PR is already beyond the scope of what I initially intended. So, I would prefer, if this strategy is pursued further, to implement those changes in a future PR.
A point of concern, raised by @jdtournier, was that explicit instantiation of a templated function in a .cpp file will lead the compiler to not inline that function. While it's true that extern template
can hinder inlining, I don't think this is a problem. Firstly, trying to inline a function may not always lead to the compiler producing faster code. Inlining only really makes sense for small and inexpensive functions. On this note, I do think that we are slightly abusing this feature in our codebase as it mostly likely doesn't lead to better code, I wouldn't be surprised if it has the opposite effect. Secondly, Link Time Optimisation (cmake -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON
) can obviate this issue when using extern template
(for a demonstration see this very nice video).
@Lestropie @jdtournier any comments are welcome.
MR::Registration::Transform::Base::Base(size_t)
should be easy to fix the clang-tidy
warnings; those Eigen
member variables have corresponding initialisation syntax. Not essential for merge, just what I saw when reviewing.
I would have just pushed the change but I wasn't able to verify compilation because of:
robertes@9520l-003953-l:~/src/mrtrix3$ cmake --build build_debug/
[29/498] Linking CXX executable cmd/pch_cmd
FAILED: cmd/pch_cmd
: && /usr/bin/c++ -g -fuse-ld=lld cmd/CMakeFiles/pch_cmd.dir/pch_cmd.cpp.o -o cmd/pch_cmd && :
ld.lld: error: undefined symbol: MR::File::Config::config[abi:cxx11]
>>> referenced by config.h:34 (/home/unimelb.edu.au/robertes/src/mrtrix3/cmd/../core/file/config.h:34)
>>> cmd/CMakeFiles/pch_cmd.dir/pch_cmd.cpp.o:(MR::File::Config::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))
>>> referenced by config.h:35 (/home/unimelb.edu.au/robertes/src/mrtrix3/cmd/../core/file/config.h:35)
>>> cmd/CMakeFiles/pch_cmd.dir/pch_cmd.cpp.o:(MR::File::Config::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))
collect2: error: ld returned 1 exit status
[50/498] Building CXX object src/CMakeFiles/mrtrix-headless.dir/dwi/fmls.cpp.o
ninja: build stopped: subcommand failed.
Windows 11 machine through MSYS2, 16 cores, Ninja 1.12 (already available on MSYS2), average of 3 runs:
dev |
new | |
---|---|---|
GCC | 202s | 118s |
Clang | 183s | 106s |
Can't currently test on my work Linux laptop as it's partway through a script doing 72 bedpostx
runs and I don't particularly want to interrupt it.
I agree regarding abuse of inlining / too many implementations in header files / would benefit from use of extern template
. Chances are most of the benefits have already been obtained here, and there may be somewhat diminishing returns hunting for more. So I'm not sure it's worth delaying the merge. I will myself try to keep in mind whenever any given area of code is under modification to consider rectifying such things on a case-by-case basis.
This PR aims to significantly reduce the time to compile MRtrix3 and address the issue raised in #2876.
I have tested the changes locally on my laptop on Fedora Asahi and Mac OS 14 and, roughly speaking, I see an improvement of almost 30% compared to
dev
(without usingccache
).Here is a table showing indicative build times (in seconds) on my Macbook M2 Pro 2023 (I took the best of 3 successive builds).
This was achieved with a lot of trial and error and the help of
ClangBuildAnalyzer
. @MRtrix3/mrtrix3-devs it'd be great if you could test this on your machines to see if you can reproduce similar results. One caveat is that to achieve the most optimal results, you must use the latest version ofNinja
(v 1.12), released last week (see first point below). On MacOS, it's just a matter of runningbrew upgrade
. On Linux you will probably need to download it directly from their Github releases page and then runexport PATH=/ninja_dir/:$PATH
(it's unlikely that your distro has packaged the latest version). Before rebuilding, you obviously need to clear your build directory and compiler cache (ccache -C
).More in detail, there are three changes in this PR:
Increased build parallelism. During the initial stages of our CMake effort, it was pointed out by @jdtournier that the CPU load during the build phase didn't stay 100% throughout its entirety. So looking more into this, I found that when building
dev
, the Ninja build graph looks like this: If you look at the end of the graph, you can see thatmrregister.cpp.o
takes a long time to compile (about 45s) and in fact the rest of the object files finish compiling much earlier. So there is a period when the compiler is effectively operating using a single thread. Ideally, the compilation ofmrregister
should start at the beginning of the compilation phase, however, there is effectively no way to control the order of compilation in CMake (or in Ninja). To get around this, I used a simple trick (seecmd/CMakeLists.txt
): create a dummy executable and then addmrregister
as a dependency of this dummy (this is the relevant change in Ninja 1.12 that makes this possible). This is enough to trick Ninja and change the build graph to the following: Now, you see that we are operating multi-threaded throughout the entire compilation phase.Precompiled headers. Precompiled headers speed up compilation by storing parsed header information in a special file. During subsequent compilations, the compiler reuses this precompiled data instead of re-parsing headers. They are particularly suited for "stable" header files, i.e. files that do not change very often (e.g. STL headers). If files listed in a precompiled header file change, then a target using the precompiled header will need to be recompiled from scratch. In this PR, I only included the following headers (other than STL and third parties) in precompiled headers:
gl.h
andexception.h
. The first one is only used formrview/shview
and the second one is for all the code. If a dev needs to frequently edit these files, it'd be best to turn offMRTRIX_USE_PCH
as otherwise, any change in those files will trigger a rebuild of the whole project.Explicit template instantiations. C++11 introduced the
extern template
syntax, which lets you say, "Hey, the template's definition? It's elsewhere!". So if we use this in a header file, we can tell the compiler to avoid automatically instantiating a specific template in every .cpp file that includes the header (of course we have to make sure that the template is instantiated somewhere, e.g. in the corresponding .cpp for the header). After looking at the report byClangBuildAnalyzer
, I used this feature to explicitly instantiate some class and function templates to reduce compilation time. I did this fairly conservatively (I really don't like dealing with the complexity that templates bring), but the few changes I made still made a significant difference in build time.