MRtrix3 / mrtrix3

MRtrix3 provides a set of tools to perform various advanced diffusion MRI analyses, including constrained spherical deconvolution (CSD), probabilistic tractography, track-density imaging, and apparent fibre density
http://www.mrtrix.org
Mozilla Public License 2.0
291 stars 179 forks source link

Optimisation of compilation speed with CMake/Ninja #2877

Closed daljit46 closed 4 months ago

daljit46 commented 5 months ago

This PR aims to significantly reduce the time to compile MRtrix3 and address the issue raised in #2876.

I have tested the changes locally on my laptop on Fedora Asahi and Mac OS 14 and, roughly speaking, I see an improvement of almost 30% compared to dev (without using ccache).

Here is a table showing indicative build times (in seconds) on my Macbook M2 Pro 2023 (I took the best of 3 successive builds).

dev new
MacOS 14 (AppleClang) 124s 89s
Fedora Asahi (clang 17) 139s 91s
Fedora Asahi (gcc 13) 154s 110s

This was achieved with a lot of trial and error and the help of ClangBuildAnalyzer. @MRtrix3/mrtrix3-devs it'd be great if you could test this on your machines to see if you can reproduce similar results. One caveat is that to achieve the most optimal results, you must use the latest version of Ninja(v 1.12), released last week (see first point below). On MacOS, it's just a matter of running brew upgrade. On Linux you will probably need to download it directly from their Github releases page and then run export PATH=/ninja_dir/:$PATH (it's unlikely that your distro has packaged the latest version). Before rebuilding, you obviously need to clear your build directory and compiler cache (ccache -C).

More in detail, there are three changes in this PR:

github-actions[bot] commented 5 months ago

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] commented 5 months ago

clang-tidy review says "All clean, LGTM! :+1:"

daljit46 commented 5 months ago

Managed to obtain further improvements by enabling PCHs for executables too (previously they were only used for the shared libraries). The new timings look like below:

dev new
MacOS 14 (AppleClang) 124s 84s
Fedora Asahi (clang 17) 139s 84s
Fedora Asahi (gcc 13) 154s 103s
github-actions[bot] commented 5 months ago

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] commented 5 months ago

clang-tidy review says "All clean, LGTM! :+1:"

Lestropie commented 5 months ago

Are you able to glean insight into what it is that makes the mrregister / mrtransform compilation so long? That would potentially be the place to be looking into precompiled headers / precompiled template instantiations.

daljit46 commented 5 months ago

Are you able to glean insight into what it is that makes the mrregister / mrtransform compilation so long? That would potentially be the place to be looking into precompiled headers / precompiled template instantiations.

Looking at the output of ClangBuildAnalyzer, for mrregister it shows that only 20% of the time is spent in the compiler's front-end (parsing and instantiation templates), while the rest of the time the compiler is busy optimising the code (most likely on inlined templated functions).

Screenshot 2024-04-16 at 10 39 02

Here is the json trace file for the command obtained by compiling with CMAKE_CXX_FLAGS="-ftime-trace" (you can read it using https://ui.perfetto.dev) mrregister.cpp.json

Hence I think the best course of action would be to split this file into multiple files (or at least move parts of its content somewhere else) so that the compiler can operate multithreaded.

daljit46 commented 5 months ago

Further finding on mrregister: on my desk PC (a 20-core Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz), the compilation of mrregister.cpp.o takes more than the time required to build all other the object files in the project (nearly 4 minutes)!

daljit46 commented 5 months ago

I noticed that registration/transform/base.h took a very long time to parse. I moved some functions from the header to a new corresponding source file and this provided a nice build time improvement for mrregister (about 10%). On my work desk PC compile times went down from 235 seconds to 212 seconds. This is a pattern I noticed a lot in our codebase: there are many instances where (non-templated) functions are unnecessarily defined in header files. We should try to improve on this (at least for new code).

Another finding is that the main culprit of large compilation times for mrregister is the compilation of all instantiations of Linear::run_masked (in src/registration/linear.h) which is responsible for at least 50% of the compilation times.

daljit46 commented 5 months ago

This is now ready for review. I've cleaned up a few things and added more explicit instantiations to reduce the build time for the registration code. I've tested the changes on three different machines. Generally speaking, you should observe an improvement of between 30-50% in compile times (the changes are most effective when compiling with Clang rather than GCC).

The compilation of mrregister, while significantly faster than on dev, remains a bottleneck. This is because mrregister.cpp contains many template instantiations that take a lot of time to compile. The only viable strategy to mitigate this is to move these instantiations to separate .cpp files. I've already done this for some templates, but to see even further improvements one would need to (very tediously) take this further. However, this PR is already beyond the scope of what I initially intended. So, I would prefer, if this strategy is pursued further, to implement those changes in a future PR.

A point of concern, raised by @jdtournier, was that explicit instantiation of a templated function in a .cpp file will lead the compiler to not inline that function. While it's true that extern template can hinder inlining, I don't think this is a problem. Firstly, trying to inline a function may not always lead to the compiler producing faster code. Inlining only really makes sense for small and inexpensive functions. On this note, I do think that we are slightly abusing this feature in our codebase as it mostly likely doesn't lead to better code, I wouldn't be surprised if it has the opposite effect. Secondly, Link Time Optimisation (cmake -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON) can obviate this issue when using extern template (for a demonstration see this very nice video).

@Lestropie @jdtournier any comments are welcome.

Lestropie commented 5 months ago

MR::Registration::Transform::Base::Base(size_t) should be easy to fix the clang-tidy warnings; those Eigen member variables have corresponding initialisation syntax. Not essential for merge, just what I saw when reviewing.

I would have just pushed the change but I wasn't able to verify compilation because of:

robertes@9520l-003953-l:~/src/mrtrix3$ cmake --build build_debug/
[29/498] Linking CXX executable cmd/pch_cmd
FAILED: cmd/pch_cmd 
: && /usr/bin/c++ -g -fuse-ld=lld cmd/CMakeFiles/pch_cmd.dir/pch_cmd.cpp.o -o cmd/pch_cmd   && :
ld.lld: error: undefined symbol: MR::File::Config::config[abi:cxx11]
>>> referenced by config.h:34 (/home/unimelb.edu.au/robertes/src/mrtrix3/cmd/../core/file/config.h:34)
>>>               cmd/CMakeFiles/pch_cmd.dir/pch_cmd.cpp.o:(MR::File::Config::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))
>>> referenced by config.h:35 (/home/unimelb.edu.au/robertes/src/mrtrix3/cmd/../core/file/config.h:35)
>>>               cmd/CMakeFiles/pch_cmd.dir/pch_cmd.cpp.o:(MR::File::Config::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))
collect2: error: ld returned 1 exit status
[50/498] Building CXX object src/CMakeFiles/mrtrix-headless.dir/dwi/fmls.cpp.o
ninja: build stopped: subcommand failed.
Lestropie commented 4 months ago

Windows 11 machine through MSYS2, 16 cores, Ninja 1.12 (already available on MSYS2), average of 3 runs:

dev new
GCC 202s 118s
Clang 183s 106s

Can't currently test on my work Linux laptop as it's partway through a script doing 72 bedpostx runs and I don't particularly want to interrupt it.

I agree regarding abuse of inlining / too many implementations in header files / would benefit from use of extern template. Chances are most of the benefits have already been obtained here, and there may be somewhat diminishing returns hunting for more. So I'm not sure it's worth delaying the merge. I will myself try to keep in mind whenever any given area of code is under modification to consider rectifying such things on a case-by-case basis.