lifflander closed this issue 3 years ago.
`gcc -ftime-report` might be useful.
Also this: https://github.com/mikael-s-persson/templight, although it requires compiling Clang from source.
A couple of promising possibilities here:
Looking at the compilation of the `termination.cc` file (Ubuntu 20.04, Clang 10, with the `-ftime-trace` flag enabled):
Compilation time - 37 sec
Compilation time - ~16 sec
We have to keep in mind that before the Production mode PR was merged, we were building vt:develop with debug printing disabled, so I assume that's the main reason.
Production mode just enables assertions and debug prints (AFAIK). Assertions are fairly sparse, so it must be the debug prints taking all the time? Can we confirm that hypothesis? It's still very surprising to me that debug prints would add seconds to the build time!
@cz4rs @JacobDomagala
(`-ftime-trace` was used to generate the traces)
```shell
/usr/bin/clang++ -DFMT_HEADER_ONLY=1 -DFMT_USE_USER_DEFINED_LITERALS=0 -DHAS_DETECTION_COMPONENT=1 -I/vt/lib/fmt -I/vt/lib/CLI -I/vt/lib/libfort/lib -Irelease -I/vt/src -Ilib/checkpoint/src -I/vt/lib/checkpoint/src -I/vt/lib/detector/src -Wall -pedantic -Wshadow -Wno-unknown-pragmas -Wsign-compare -ftemplate-backtrace-limit=100 -ftemplate-depth=900 -DCLI11_EXPERIMENTAL_OPTIONAL=0 -Werror -O3 -DNDEBUG -fcolor-diagnostics -ftime-trace -fPIC -std=c++14 -MD -MT /build/output/test_file.cc.o -MF /build/output/test_file.cc.o.d -o /build/output/test_file.cc.o -c /vt/test_file.cc
```
```cpp
#include "vt/config.h"
int main(int /*argc*/, char** /*argv*/) {
  for (int i = 0; i < 20; ++i) {
    vt_debug_print(barrier, node, "Dummy string: val={}, val2={}\n", 10, 2);
  }
}
```
## Production mode = ON
![image](https://user-images.githubusercontent.com/9077677/111884860-8dc49c80-89c4-11eb-89ef-0bc3cac289f9.png)
**Total compile time** = 872ms
**Time spent on:** Mostly the frontend, i.e., parsing include files.
----------------------------------------------------------------------------
## Production mode = OFF
![image](https://user-images.githubusercontent.com/9077677/111884958-16dbd380-89c5-11eb-942a-bb6b40c57799.png)
**Total compile time** = 3.1 sec
**Time spent on:** Frontend time is roughly the same as before. Most of the time was spent in the backend, which I assume is optimization. Looking at the pink vertical bars that represent the actual functions, we can see that most of them are fmt functions and DebugPrint calls.
Some additional statistics (compilation times for `termination.cc` using clang-10):
develop
Time (mean ± σ): 18.308 s ± 0.115 s [User: 18.113 s, System: 0.137 s]
Range (min … max): 18.068 s … 18.411 s 10 runs
develop
Time (mean ± σ): 7.616 s ± 0.050 s [User: 7.486 s, System: 0.108 s]
Range (min … max): 7.582 s … 7.752 s 10 runs
Basically, the same results as @JacobDomagala posted.
Additionally, I compiled `termination.cc` at 9ec8955 (the last commit before the production mode PR). At that point, debug prints were disabled in release builds, but assertions (and soft errors) could be enabled using the `vt_ci_build` option.
Time (mean ± σ): 7.969 s ± 0.118 s [User: 7.828 s, System: 0.116 s]
Range (min … max): 7.891 s … 8.279 s 10 runs
Time (mean ± σ): 9.504 s ± 0.082 s [User: 9.386 s, System: 0.100 s]
Range (min … max): 9.369 s … 9.634 s 10 runs
We can see that just enabling assertions adds over 1.5 s of compilation time (almost 20% in this case).
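The "over 1.5 s / almost 20%" figure can be reproduced from the two means above. This is a minimal sketch with a hypothetical helper (`compileOverhead` and `Overhead` are invented names, not part of vt), assuming the 7.969 s run is the baseline without assertions and the 9.504 s run is the one with assertions enabled:

```cpp
#include <cassert>

// Absolute and relative difference between two mean compile times.
struct Overhead {
  double seconds;   // extra wall-clock time, in seconds
  double fraction;  // extra time relative to the baseline (0.19 == 19%)
};

// Hypothetical helper: compares a baseline mean against a variant mean.
inline Overhead compileOverhead(double baselineSec, double variantSec) {
  assert(baselineSec > 0.0);  // a non-positive baseline makes the ratio meaningless
  double const extra = variantSec - baselineSec;
  return Overhead{extra, extra / baselineSec};
}
```

With the numbers above, `compileOverhead(7.969, 9.504)` gives about 1.535 s and a fraction of about 0.193, which matches the "almost 20%" estimate.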
vt_ci_build: 1, 9ec8955 + following patch applied
```diff
--- a/src/vt/configs/debug/debug_masterconfig.h
+++ b/src/vt/configs/debug/debug_masterconfig.h
@@ -60,14 +60,7 @@
 namespace vt { namespace config {
-#if !vt_debug_enabled
-using DefaultConfig = Configuration<
;
-#endif
```
(this enables debug prints in release mode without pulling in any other changes from the production mode PR)
Time (mean ± σ): 17.975 s ± 0.106 s [User: 17.827 s, System: 0.129 s]
Range (min … max): 17.894 s … 18.259 s 10 runs
Thanks for all the detailed performance analysis @cz4rs and @JacobDomagala!
Maybe we can discuss this more in the Tuesday meeting. I'm thinking that it might be related to the templates for `vt_debug_print`. A long time ago, debug prints were implemented with macros, which had some issues, so we switched to templates. We might be able to simplify that code, but I'm not sure.
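To illustrate why the switch from macros to templates might matter for build times, here is a minimal, hypothetical C++14 sketch. The names `DEMO_DEBUG_ENABLED`, `DEMO_DEBUG_MACRO`, `DemoPrinter`, and `demoAppend` are invented, and this is not vt's actual `vt_debug_print` implementation. With a macro, a compile-time-disabled print vanishes in the preprocessor; with a template, the call is still compiled and instantiated for each distinct argument-type list, even when the body does nothing:

```cpp
#include <string>

// Hypothetical compile-time switch standing in for a vt debug flag.
#define DEMO_DEBUG_ENABLED 0

// Minimal stand-in for fmt: concatenates std::to_string of each argument.
inline void demoAppend(std::string&) {}

template <typename T, typename... Rest>
void demoAppend(std::string& out, T const& first, Rest const&... rest) {
  out += std::to_string(first);
  demoAppend(out, rest...);
}

// Macro style: when disabled, the preprocessor discards the arguments,
// so the compiler never parses them or instantiates anything for them.
#if DEMO_DEBUG_ENABLED
  #define DEMO_DEBUG_MACRO(out, ...) demoAppend(out, __VA_ARGS__)
#else
  #define DEMO_DEBUG_MACRO(out, ...) ((void)0)
#endif

// Template style: every call is parsed, and print<Args...> is instantiated
// for each distinct argument-type list, even when the body is empty.
template <bool Enabled>
struct DemoPrinter {
  template <typename... Args>
  static void print(std::string& out, Args const&... args) {
    demoAppend(out, args...);
  }
};

template <>
struct DemoPrinter<false> {
  template <typename... Args>
  static void print(std::string&, Args const&...) {}
};

// Alias bound to the same compile-time switch as the macro.
using DemoDebug = DemoPrinter<DEMO_DEBUG_ENABLED != 0>;
```

Both disabled variants produce no output at run time; the difference is only in how much work the compiler does per call site. Whether this accounts for the numbers above would still need to be checked against the `-ftime-trace` data.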
I'm a little confused why assertions would add 20%. That's a pretty lightweight macro!
Here's an idea: https://github.com/fmtlib/fmt/issues/1537.
TODO: check whether using `core.h` instead of `format.h` speeds things up.
```diff
--- a/src/vt/configs/debug/debug_print.h
+++ b/src/vt/configs/debug/debug_print.h
@@ -51,7 +51,7 @@
 #include "vt/configs/debug/debug_var_unused.h"
-#include <fmt/format.h>
+#include <fmt/core.h>
```
Edit: switching to `core.h` doesn't work. This is officially discouraged (https://github.com/fmtlib/fmt/blob/master/doc/api.rst#core-api), and it breaks the build anyway.
https://github.com/aras-p/ClangBuildAnalyzer looks like something that could be really useful. It also uses `-ftime-trace` and generates a really nice report. I will try it locally and see how it works.
ClangBuildAnalyzer output:
Edit: updated the output with the report for the entire project (including tests and examples), not only `/vt/src`.
Possibly related: https://github.com/DARMA-tasking/vt/issues/247.
Here's my latest profile analysis using the changes to checkpoint that explicitly instantiate the templates for the `DiagnosticBase` polymorphic serialization code.
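As a generic illustration of explicit instantiation (this is not the checkpoint code; `DemoSerializer` is a hypothetical name), the usual pattern pairs an `extern template` declaration visible to every translation unit with a single explicit instantiation definition in one source file, so other translation units stop re-instantiating the same members. A minimal sketch:

```cpp
#include <string>

// Hypothetical class template standing in for heavy serialization code.
template <typename T>
struct DemoSerializer {
  // Declared here, defined out of class: an in-class (implicitly inline)
  // definition would not be suppressed by the extern template below.
  std::string describe(T const& value) const;
};

template <typename T>
std::string DemoSerializer<T>::describe(T const& value) const {
  return "value:" + std::to_string(value);
}

// Normally in the shared header: an explicit instantiation declaration.
// Translation units that see it do not instantiate DemoSerializer<int>
// members themselves; they link against the definition below.
extern template struct DemoSerializer<int>;

// Normally in exactly one .cc file: the explicit instantiation definition,
// which emits the code for DemoSerializer<int> once. It sits in the same
// file here only to keep the sketch self-contained (it must follow the
// declaration).
template struct DemoSerializer<int>;
```

The saving comes from the split across files: each translation unit that only sees the `extern template` line skips instantiating and optimizing those members.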
I also suggest using a custom .ini file for https://github.com/aras-p/ClangBuildAnalyzer, such as:
`ClangBuildAnalyzer.ini`:
```ini
# ClangBuildAnalyzer reads ClangBuildAnalyzer.ini file from the working directory
# when invoked, and various aspects of reporting can be configured this way.
# This file example is setup to be exactly like what the defaults are.
# How many of most expensive things are reported?
[counts]
# files that took most time to parse
fileParse = 20
# files that took most time to generate code for
fileCodegen = 20
# functions that took most time to generate code for
function = 50
# header files that were most expensive to include
header = 20
# for each expensive header, this many include paths to it are shown
headerChain = 10
# templates that took longest to instantiate
template = 50
# Minimum times (in ms) for things to be recorded into trace
[minTimes]
# parse/codegen for a file
file = 20
[misc]
# Maximum length of symbol names printed; longer names will get truncated
maxNameLength = 1000
# Only print "root" headers in expensive header report, i.e.
# only headers that are directly included by at least one source file
onlyRootHeaders = true
```
Interesting read from the author of fmt: https://www.zverovich.net/2017/12/09/improving-compile-times.html.
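One general way to cut template cost, which fmt itself uses (its public `fmt::format` forwards to the non-template `fmt::vformat`), is type erasure: a thin variadic template converts its arguments to a uniform representation and hands them to a single non-template function. The sketch below is a hypothetical, dependency-free illustration of that pattern, not fmt's or vt's code; `demoFormat`, `demoVFormat`, and `demoToString` are invented names, and converting every argument to `std::string` is a simplification of fmt's much cheaper argument storage:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Non-template core: in a real project this would live in one .cc file and
// be compiled once. It receives arguments already converted to strings and
// replaces each "{}" with the next argument.
std::string demoVFormat(std::string const& pattern,
                        std::vector<std::string> const& args) {
  std::string out;
  std::size_t next = 0;
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    bool const isField = pattern[i] == '{' && i + 1 < pattern.size() &&
                         pattern[i + 1] == '}' && next < args.size();
    if (isField) {
      out += args[next++];
      ++i;  // skip the closing '}'
    } else {
      out += pattern[i];
    }
  }
  return out;
}

// Per-type conversion helpers (small, cheap to instantiate).
inline std::string demoToString(std::string const& s) { return s; }

template <typename T>
std::string demoToString(T const& value) { return std::to_string(value); }

// Thin variadic front end: its instantiations only convert arguments,
// while all formatting logic stays in the non-template demoVFormat.
template <typename... Args>
std::string demoFormat(std::string const& pattern, Args const&... args) {
  return demoVFormat(pattern, std::vector<std::string>{demoToString(args)...});
}
```

For example, `demoFormat("Dummy string: val={}, val2={}\n", 10, 2)` produces `"Dummy string: val=10, val2=2\n"`, and only the small front end is instantiated per call signature.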
Reopening for #1347.
JD's work on build-stats is merged, so we can see the graphs in all their glory:
The last drop came after #1374 was merged. Even if we take it with a grain of salt (the build environment is not isolated), this looks pretty decent :slightly_smiling_face:
We've explored this quite a bit and will continue to try to reduce compile times in more specific ways.
What Needs to be Done?
Research what happened here to solve the problem.