Closed nickdesaulniers closed 2 years ago
@glandium's https://glandium.org/blog/?p=2467 is vaguely reminiscent of parts of BOLT.
Yes, this would be a great addition to the LLVM script, and also for the kernel!
https://www.phoronix.com/scan.php?page=news_item&px=Facebook-BOLTing-The-Kernel
I have had good experiences with a self-made PGO+ThinLTO optimized LLVM-13 toolchain on Debian/unstable AMD64. It reduces my build time by approx. 30% on an Intel Sandy Bridge system when building Linux v5.15 RCs.
According to a recent talk about BOLT at Linux Plumbers Conference 2021 an Intel Ivy Bridge system had a boost of 50% with a BOLT plus PGO+ThinLTO optimized LLVM toolchain.
BOLT support in tc-build would be highly appreciated. Hints about hardware (CPU, RAM, etc.) and software requirements (which LLVM/Clang version) from people who have experience building with BOLT are welcome, as is some information about build time and disk usage.
I will have a look at bringing PGO/ThinLTO to apt.llvm.org (I have access to a way more powerful system). BOLT will follow after this :)
https://github.com/ptr1337/llvm-bolt-builds
I've updated an old script here. Works like a charm. I tested it some hours ago; sadly I have an AMD CPU and can't BOLT the applications, but the performance increase is about 25-30%.
If someone wants, I can share my compiled toolchain. I'm going to compile it on my server with a 5900X with nearly all projects enabled.
@sylvestre
Cool :-).
Will PGO+ThinLTO be integrated into official packages or will you create something new like a new meta-package llvm-toolchain-NN-pgo-thinlto?
@ptr1337 Which LLVM/Clang version? Performance increase compared to what... Normal clang-NN or an optimized clang-NN?
Might want to update your results [1]?
[1] https://github.com/ptr1337/llvm-bolt-builds/blob/master/results.md
@ptr1337 how long does it take https://github.com/ptr1337/llvm-bolt-builds/blob/master/full_workflow.bash ?
@dileks integrate it. I don't see any reason why not ?
@sylvestre You are welcome to do so :-).
@sylvestre
Will you announce this change?
I haven't tested the full workflow yet, so I can't say whether it works. It mostly does not matter which system it is compiled on, since you only need the binaries.
It depends on the system, flags and so on, but I think around 4 hrs with my 5950X.
@dileks yeah, i will ;)
For the 3 supported releases on apt.llvm.org (12, 13 & 14 currently)
@dileks
I will do a full compile tonight. It is hard to time. Afterwards I will benchmark it in several ways.
Oh, sorry, I did not read that correctly; the last compile like this took around 1.5 hrs for me.
@sylvestre @ptr1337 Thank you very much!
Some fixes needed in the script.
gonna figure it out, local worked without a problem :x
@sylvestre @dileks
All scripts are now working, also the big one.
Looks like BOLT is getting ready to be added to the monorepo: https://lists.llvm.org/pipermail/llvm-dev/2021-November/153551.html
Once that is done, I will explore adding support for it to tc-build, although given that it requires a processor with LBR support, I will have to find a way to restrict it in the script to avoid weird errors.
@nathanchance, you can use BOLT with cycles only, but the performance gains wouldn't be as high. There's also x86-only instrumentation-based profile collection: https://github.com/facebookincubator/BOLT#with-instrumentation.
@maksfb thanks a lot for the information, I will digest that and see how it can be integrated once BOLT is upstream :)
Yes, sadly only with Intel-based CPUs. I have a patch for AMD that allows using perf record, but sadly not with "-j any,u". Maybe the patch will be ready for BOLT at some point, or there is some other solution.
Here the patch: https://github.com/ptr1337/kernel-patches/blob/master/5.15/AMD/0001-AMD-PERF-PATCH.patch
Alright, I have gotten this all wired up with the instrumentation mode as best as I can tell. I have opened a draft pull request if people want to take a look.
I did not wire up perf-based sampling yet, as my main workstation has a Threadripper 3990X, which does not support it:
$ perf record -e cycles:u -j any,u -- sleep 1
Error:
cycles:u: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
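A script that wants to enable perf-based sampling only where it works could run the same probe first. The following is a minimal sketch, assuming that `perf record` exits non-zero when the PMU rejects `-j any,u` (as the error above suggests); the helper name `supports_branch_sampling` and the injectable `run` parameter are hypothetical and not part of tc-build.

```python
import subprocess
import tempfile

# Probe command quoted in the thread; it fails on hardware without branch sampling.
PROBE_CMD = ["perf", "record", "-e", "cycles:u", "-j", "any,u", "--", "sleep", "1"]


def supports_branch_sampling(run=subprocess.run):
    """Return True if the probe command exits with status 0 (hypothetical helper)."""
    # Run in a scratch directory so the probe's perf.data is discarded afterwards.
    with tempfile.TemporaryDirectory() as scratch:
        try:
            result = run(
                PROBE_CMD,
                cwd=scratch,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                check=False,
            )
        except FileNotFoundError:
            # perf is not installed at all.
            return False
    return result.returncode == 0
```

A caller could fall back to instrumentation mode, or skip BOLT entirely, when this returns False.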
However, there are a couple of issues I have noticed and reported upstream:
This instrumentation adds a significant amount of overhead at build time on the couple of machines I tested on and I see next to no improvement at run time over regular PGO. I tested this by building LLVM at https://github.com/llvm/llvm-project/commit/3de29ad20955eb8ed68e831795bf55bfe9fbe58b with PGO, then with PGO and BOLT (with assertions for the time being), and using those toolchains to build ARCH=arm, ARCH=arm64, and ARCH=x86_64 kernels (defconfig and allmodconfig) from 5.18-rc3. The host machine that I used to gather these results has an AMD EPYC 7502P (as I could not have my main machine tied up for this amount of time).
I only ran build-llvm.py once for this benchmark, which is mainly meant to show that BOLT's instrumentation is much heavier at run time than the instrumentation for PGO; otherwise I would still be waiting for results :^)
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 1843.697 | 1843.697 | 1843.697 | 1.00 |
PGO + BOLT | 10552.749 | 10552.749 | 10552.749 | 5.72 |
Each kernel was built ten times with the toolchains built above.
ARCH=arm defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 81.692 ± 0.026 | 81.638 | 81.725 | 1.00 |
PGO + BOLT | 81.835 ± 0.043 | 81.784 | 81.935 | 1.00 ± 0.00 |
ARCH=arm64 defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 114.339 ± 0.048 | 114.248 | 114.409 | 1.00 |
PGO + BOLT | 114.787 ± 0.036 | 114.726 | 114.833 | 1.00 ± 0.00 |
ARCH=x86_64 defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 52.140 ± 0.053 | 52.057 | 52.218 | 1.00 |
PGO + BOLT | 52.203 ± 0.065 | 52.138 | 52.317 | 1.00 ± 0.00 |
ARCH=arm allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 389.725 ± 0.100 | 389.536 | 389.884 | 1.00 |
PGO + BOLT | 390.591 ± 0.116 | 390.398 | 390.786 | 1.00 ± 0.00 |
ARCH=arm64 allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 500.086 ± 0.271 | 499.482 | 500.472 | 1.00 |
PGO + BOLT | 501.744 ± 0.218 | 501.494 | 502.087 | 1.00 ± 0.00 |
ARCH=x86_64 allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 487.722 ± 0.132 | 487.565 | 487.939 | 1.00 |
PGO + BOLT | 489.218 ± 0.061 | 489.136 | 489.346 | 1.00 ± 0.00 |
Thanks for all your testing.
I did not wire up perf-based sampling yet, as my main workstation has a Threadripper 3990X, which does not support it:
Actually, BOLT does not really improve binaries without branch sampling. I have tested it several times myself and saw the same result: maybe a few seconds more or less, but that is within "tolerance". With branch sampling you get the real performance gain out of it.
I have an Intel server and can run some workloads. In the coming days I'll post results comparing the STAGE 1 | PGO+LTO | PGO+BOLT | PGO+LTO+BOLT compilers.
My workstation also has an AMD CPU, which is a bit annoying, but yeah. Maybe it will soon be possible to do branch sampling on Zen 3 based CPUs; this can already be tested with linux-next or by patching 5.18-rc. https://www.phoronix.com/scan.php?page=news_item&px=AMD-Branch-Sampling-v5.19
I'll take a look at your PR.
Hey @nathanchance
I just used your commit and built the toolchain on my AMD 5900X. It could even be instrumented via BOLT. Here is the output from llvm-bolt when bolting it:
$ llvm-bolt --data /home/ptr1337/repo/tc-build/build/llvm/clang.fdata /home/ptr1337/repo/tc-build/install/bin/clang-15 -o /home/ptr1337/repo/tc-build/install/bin/clang-15.bolt \
-reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 \
-split-all-cold -dyno-stats -icf=1 -use-gnu-stack
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f595b51f502b2f3c97d30e826784159438bde9c4
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2424 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm3sys4path11is_absoluteERKNS_5TwineENS1_5StyleE
BOLT-INFO: 1405 out of 96706 functions in the binary (1.5%) have non-empty execution profile
BOLT-INFO: 3 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 255 (dynamic count : 870) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 19226 instructions were shortened
BOLT-INFO: removed 13 empty blocks
BOLT-INFO: ICF folded 332 out of 96983 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 94.36 KB of code space. Folded functions were called 1672 times based on profile.
BOLT-INFO: basic block reordering modified layout of 367 (0.38%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 621000 hot bytes from 540455 cold bytes (53.47% of split functions is hot).
BOLT-INFO: 4 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 1076 to 586
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
205673 : executed forward branches
22162 : taken forward branches
76422 : executed backward branches
37249 : taken backward branches
22987 : executed unconditional branches
107760 : all function calls
45821 : indirect calls
42814 : PLT calls
2270751 : executed instructions
489495 : executed load instructions
260256 : executed store instructions
271 : taken jump table branches
0 : taken unknown indirect branches
305082 : total branches
82398 : taken branches
222684 : non-taken conditional branches
59411 : taken conditional branches
282095 : all conditional branches
195217 : executed forward branches (-5.1%)
7344 : taken forward branches (-66.9%)
86878 : executed backward branches (+13.7%)
33314 : taken backward branches (-10.6%)
11876 : executed unconditional branches (-48.3%)
107760 : all function calls (=)
45821 : indirect calls (=)
42814 : PLT calls (=)
2247420 : executed instructions (-1.0%)
489495 : executed load instructions (=)
260256 : executed store instructions (=)
271 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
293971 : total branches (-3.6%)
52534 : taken branches (-36.2%)
241437 : non-taken conditional branches (+8.4%)
40658 : taken conditional branches (-31.6%)
282095 : all conditional branches (=)
BOLT-INFO: SCTC: patched 4 tail calls (4 forward) tail calls (0 backward) from a total of 4 while removing 0 double jumps and removing 4 basic blocks totalling 20 bytes of code. CTCs total execution count is 18 and the number of times CTCs are taken is 4.
BOLT-INFO: padding code to 0x5000000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x4e00000
BOLT-INFO: setting __hot_end to 0x4ea46e8
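The optimization step above can be wrapped so that a build script assembles the same argument list for any binary. This is a sketch only: the flags are copied from the invocation quoted in this thread and may be spelled differently in other BOLT releases, and the function name `bolt_optimize_cmd` and the example file names are made up for illustration.

```python
def bolt_optimize_cmd(binary, profile, output, llvm_bolt="llvm-bolt"):
    """Build the llvm-bolt argv used in this thread (hypothetical helper)."""
    return [
        llvm_bolt,
        binary,
        "-o", output,
        f"--data={profile}",
        # Layout and splitting options exactly as quoted in the thread.
        "--reorder-blocks=cache+",
        "--reorder-functions=hfsort+",
        "--split-functions=3",
        "--split-all-cold",
        "--dyno-stats",
        "--icf=1",
        "--use-gnu-stack",
    ]


# Example with placeholder paths; pass the list to subprocess.run(..., check=True).
print(" ".join(bolt_optimize_cmd("clang-15", "clang.fdata", "clang-15.bolt")))
```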
Here are also some quick benchmarks:
meassure_script.sh:
#!/bin/bash
# Create and enter a scratch build directory. Braces (not a subshell) are used
# so that "exit 1" actually aborts the script.
mkdir -p measure-build-time || { echo "Could not create build-directory!"; exit 1; }
cd measure-build-time || { echo "Could not enter build-directory!"; exit 1; }
echo "== Clean old build-artifacts"
rm -rf ./*
echo "== Configure reference Clang-build with tools from ${CPATH}"
CC=clang CXX=clang++ LD=lld \
cmake -G Ninja \
    -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD="X86" \
    -DCMAKE_INSTALL_PREFIX="$(pwd)/install" \
    -DLLVM_USE_LINKER=lld \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_PARALLEL_COMPILE_JOBS="$(nproc)" \
    -DLLVM_PARALLEL_LINK_JOBS="$(nproc)" \
    ../llvm-project/llvm || { echo "Could not configure project!"; exit 1; }
echo
echo "== Start Build"
# Time only the clang build itself.
time ninja clang || { echo "Could not build project!"; exit 1; }
LLVM-BOLT PGO+LTO Instrumented without perf:
== Start Build
[2529/2529] Creating executable symlink bin/clang
real 4m26,656s
user 96m17,866s
sys 3m41,900s
LLVM-BOLT PGO+LTO Instrumented with perf
[2529/2529] Creating executable symlink bin/clang
real 4m12,346s
user 88m10,566s
sys 3m28,353s
LLVM 13 STOCK ARCH LINUX:
== Start Build
[2529/2529] Creating executable symlink bin/clang
real 7m9,064s
user 154m7,218s
sys 4m20,882s
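To put the three wall-clock figures side by side, the `real` values can be converted to seconds and compared against the stock compiler. A small sketch follows; the helper name `parse_real` is hypothetical, and the labels are shortened from the headings above.

```python
import re


def parse_real(value):
    """Convert a bash `time` value such as '4m26,656s' to seconds."""
    # The locale in the logs above uses a decimal comma; accept a dot as well.
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


timings = {
    "LLVM 13 stock Arch Linux": parse_real("7m9,064s"),
    "PGO+LTO+BOLT, instrumented without perf": parse_real("4m26,656s"),
    "PGO+LTO+BOLT, instrumented with perf": parse_real("4m12,346s"),
}

baseline = timings["LLVM 13 stock Arch Linux"]
for name, seconds in timings.items():
    saved = 100 * (1 - seconds / baseline)
    print(f"{name}: {seconds:.3f} s ({saved:.1f}% less wall time than stock)")
```

These are single runs, so the percentages are only a rough indication.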
Full log can be found here: https://pastebin.com/L859SqaQ
I saw your perf BOLT branch; I will let my server build it overnight.
BOLT-INFO: 1405 out of 96706 functions in the binary (1.5%) have non-empty execution profile
2270751 : executed instructions
The number of profiled functions and executed instructions is low. Try to increase the sampling frequency. If you cannot increase the sampling frequency, profile the same compiler invocation in a loop (10+x) and merge multiple converted profiles with merge-fdata.
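The merge step suggested here could be scripted roughly as follows. This is a sketch under the assumption that `merge-fdata` writes the merged profile to standard output (the form `merge-fdata *.fdata > combined.fdata`); the helper names and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def merge_cmd(profiles, merge_fdata="merge-fdata"):
    """Build the merge-fdata argv for a list of .fdata files."""
    if not profiles:
        raise ValueError("no .fdata profiles to merge")
    return [merge_fdata, *map(str, profiles)]


def merge_profiles(profile_dir, merged):
    """Merge every *.fdata file in profile_dir into the file `merged`."""
    merged = Path(merged)
    # Skip a merged file left over from an earlier run in the same directory.
    profiles = sorted(
        p for p in Path(profile_dir).glob("*.fdata")
        if p.resolve() != merged.resolve()
    )
    cmd = merge_cmd(profiles)  # raises before the output file is touched
    with merged.open("wb") as out:
        subprocess.run(cmd, stdout=out, check=True)
    return merged
```

The merged file can then be passed to llvm-bolt as the profile in place of a single run's output.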
This instrumentation adds a significant amount of overhead at build time on the couple of machines I tested on and I see next to no improvement at run time over regular PGO.
@nathanchance, how did you invoke llvm-bolt? Could you share its output with -dyno-stats option added (if you already don't have it)?
According to rafaelauler, the recommended pipeline is to use sampling with LBR if you can, instead of instrumenting. According to him most of the time LBR profiles tend to win.
@nathanchance, how did you invoke llvm-bolt?
The invocations are here:
They were shamelessly stolen from Optimizing Clang : A Practical Example of Applying BOLT :) if there is something different I should be doing, please let me know!
Could you share its output with -dyno-stats option added (if you already don't have it)?
Sure, let me do a fresh set of benchmarks, as I used hyperfine for the stats above, which does not show the output of a command by default (for performance reasons). I should have those done in a couple of hours if all goes well.
@nathanchance
I can confirm that your perf branch also works with branch sampling. I will also run benchmarks with hyperfine now and post them here. The perf record profile is actually a bit small at around 6.5 GB; when building clang and profiling that, the profile came to around 15 GB.
Here is the output of the llvm-bolt process:
[ perf record: Captured and wrote 6540.474 MB /home/ptr1337/tc-build-perf/build/llvm/perf.data (8781714 samples) ]
$ /home/ptr1337/tc-build-perf/build/llvm/stage1/bin/perf2bolt -p /home/ptr1337/tc-build-perf/build/llvm/perf.data -o /home/ptr1337/tc-build-perf/build/llvm/clang.fdata /home/ptr1337/tc-build-perf/install/bin/clang-15
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for /home/ptr1337/tc-build-perf/build/llvm/perf.data
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 074abdcc60fabe5aa7b301bccf9676e1cbcc1df5
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x4e00000, offset 0x4e00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-WARNING: Failed to analyze 2630 relocations
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 2997 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 7586584 samples and 240561575 LBR entries
PERF2BOLT: 1195130 samples (13.6%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 86327 (0.0%)
PERF2BOLT: out of range traces involving unknown regions: 23566706 (10.1%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE
BOLT-WARNING: 10 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events...
PERF2BOLT: wrote 543512 objects and 0 memory objects to /home/ptr1337/tc-build-perf/build/llvm/clang.fdata
$ /home/ptr1337/tc-build-perf/build/llvm/stage1/bin/llvm-bolt --data=/home/ptr1337/tc-build-perf/build/llvm/clang.fdata --reorder-blocks=cache+ --reorder-functions=hfsort+ --split-functions=3 --split-all-cold --dyno-stats --icf=1 --use-gnu-stack -o /home/ptr1337/tc-build-perf/install/bin/clang.bolt /home/ptr1337/tc-build-perf/install/bin/clang-15
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 074abdcc60fabe5aa7b301bccf9676e1cbcc1df5
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2630 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE
BOLT-WARNING: 4 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 11604 out of 96577 functions in the binary (12.0%) have non-empty execution profile
BOLT-INFO: 377 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 7490 (dynamic count : 5041440) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 615051 instructions were shortened
BOLT-INFO: removed 1334 empty blocks
BOLT-INFO: ICF folded 1300 out of 96853 functions in 4 passes. 2 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 149.79 KB of code space. Folded functions were called 897050 times based on profile.
BOLT-INFO: basic block reordering modified layout of 6659 (6.97%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 7003491 hot bytes from 11571105 cold bytes (37.70% of split functions is hot).
BOLT-INFO: 222 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 10680 to 1642
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
287756641 : executed forward branches
43936574 : taken forward branches
60807613 : executed backward branches
37693565 : taken backward branches
20564369 : executed unconditional branches
27539550 : all function calls
7137129 : indirect calls
3989435 : PLT calls
2386833259 : executed instructions
560479928 : executed load instructions
293238455 : executed store instructions
3811792 : taken jump table branches
0 : taken unknown indirect branches
369128623 : total branches
102194508 : taken branches
266934115 : non-taken conditional branches
81630139 : taken conditional branches
348564254 : all conditional branches
283691311 : executed forward branches (-1.4%)
16601215 : taken forward branches (-62.2%)
64872943 : executed backward branches (+6.7%)
30465450 : taken backward branches (-19.2%)
17384504 : executed unconditional branches (-15.5%)
27539550 : all function calls (=)
7137129 : indirect calls (=)
3989435 : PLT calls (=)
2375107716 : executed instructions (-0.5%)
560479928 : executed load instructions (=)
293238455 : executed store instructions (=)
3811792 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
365948758 : total branches (-0.9%)
64451169 : taken branches (-36.9%)
301497589 : non-taken conditional branches (+12.9%)
47066665 : taken conditional branches (-42.3%)
348564254 : all conditional branches (=)
BOLT-INFO: SCTC: patched 53 tail calls (47 forward) tail calls (6 backward) from a total of 53 while removing 6 double jumps and removing 40 basic blocks totalling 200 bytes of code. CTCs total execution count is 2240 and the number of times CTCs are taken is 1368.
BOLT-INFO: setting __hot_start to 0x4e00000
BOLT-INFO: setting __hot_end to 0x57eafca
Regards.
BOLT-INFO: 11604 out of 96577 functions in the binary (12.0%) have non-empty execution profile
2386833259 : executed instructions
Looks much better!
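The two numbers being compared here can be pulled out of a saved llvm-bolt log automatically, for example to warn when a profile is too sparse. A minimal sketch, assuming the log line keeps the wording shown in this thread; the function name `profile_coverage` is hypothetical.

```python
import re

# Matches the summary line printed by llvm-bolt in the logs above.
PROFILE_LINE = re.compile(
    r"BOLT-INFO: (\d+) out of (\d+) functions in the binary \(([\d.]+)%\) "
    r"have non-empty execution profile"
)


def profile_coverage(log_text):
    """Return (profiled, total) function counts, or None if the line is absent."""
    match = PROFILE_LINE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


instrumented = profile_coverage(
    "BOLT-INFO: 1405 out of 96706 functions in the binary (1.5%) "
    "have non-empty execution profile"
)
sampled = profile_coverage(
    "BOLT-INFO: 11604 out of 96577 functions in the binary (12.0%) "
    "have non-empty execution profile"
)
for label, (profiled, total) in (("instrumented", instrumented), ("LBR", sampled)):
    print(f"{label}: {profiled}/{total} = {100 * profiled / total:.1f}%")
```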
BOLT-INFO: 11604 out of 96577 functions in the binary (12.0%) have non-empty execution profile
2386833259 : executed instructions
Looks much better!
@maksfb That run was with LBR, recorded with perf. In his PR, though, he uses a kernel compile to record the profile, which should be changed. The other output, where you said the numbers were too low, was without branch sampling (AMD CPU), with instrumentation only.
I also ran the instrumentation twice, and it doubled the executed instructions. Sadly, the instrumented binary is really, really slow.
Output after two instrumentation runs, combined with merge-fdata:
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f595b51f502b2f3c97d30e826784159438bde9c4
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2424 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm3sys4path11is_absoluteERKNS_5TwineENS1_5StyleE
BOLT-INFO: 1315 out of 96706 functions in the binary (1.4%) have non-empty execution profile
BOLT-INFO: 3 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 176 (dynamic count : 5974) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 13166 instructions were shortened
BOLT-INFO: removed 11 empty blocks
BOLT-INFO: ICF folded 316 out of 96983 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 94.04 KB of code space. Folded functions were called 3316 times based on profile.
BOLT-INFO: basic block reordering modified layout of 330 (0.34%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 560689 hot bytes from 346146 cold bytes (61.83% of split functions is hot).
BOLT-INFO: 3 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 1002 to 552
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
553848 : executed forward branches
64210 : taken forward branches
191942 : executed backward branches
100756 : taken backward branches
59984 : executed unconditional branches
251734 : all function calls
103854 : indirect calls
96734 : PLT calls
5738078 : executed instructions
1228106 : executed load instructions
632336 : executed store instructions
988 : taken jump table branches
0 : taken unknown indirect branches
805774 : total branches
224950 : taken branches
580824 : non-taken conditional branches
164966 : taken conditional branches
745790 : all conditional branches
525170 : executed forward branches (-5.2%)
25848 : taken forward branches (-59.7%)
220620 : executed backward branches (+14.9%)
95176 : taken backward branches (-5.5%)
30714 : executed unconditional branches (-48.8%)
251734 : all function calls (=)
103854 : indirect calls (=)
96734 : PLT calls (=)
5676670 : executed instructions (-1.1%)
1228106 : executed load instructions (=)
632336 : executed store instructions (=)
988 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
776504 : total branches (-3.6%)
151738 : taken branches (-32.5%)
624766 : non-taken conditional branches (+7.6%)
121024 : taken conditional branches (-26.6%)
745790 : all conditional branches (=)
BOLT-INFO: SCTC: patched 4 tail calls (4 forward) tail calls (0 backward) from a total of 4 while removing 0 double jumps and removing 4 basic blocks totalling 20 bytes of code. CTCs total execution count is 36 and the number of times CTCs are taken is 8.
BOLT-INFO: padding code to 0x5000000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x4e00000
BOLT-INFO: setting __hot_end to 0x4e956f4
Do you know a way to make perf record split the file at, for example, 15 GB? Using perf record while compiling clang results in a 35 GB file, and perf2bolt then errors out because the server has just 32 GB of RAM. Even with a big swapfile it fails.
Regardless of the means of profiling (perf LBR, sampling, or instrumentation), the number of profiled functions should be more or less the same. It's possible that different processes stomp over each other's instrumentation files. That could be the case when the clang driver output is the same as "core" clang. Could you add -instrumentation-file-append-pid for instrumentation and see if more than one file gets generated?
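For reference, an instrumentation invocation with that option might be assembled like this. It is a sketch, assuming the `-instrument` and `-instrumentation-file=` options described in the BOLT README linked earlier in the thread; the function name and the paths are placeholders.

```python
def bolt_instrument_cmd(binary, output, profile_path,
                        append_pid=True, llvm_bolt="llvm-bolt"):
    """Build an llvm-bolt argv that produces an instrumented binary."""
    cmd = [
        llvm_bolt,
        binary,
        "-instrument",
        "-o", output,
        f"-instrumentation-file={profile_path}",
    ]
    if append_pid:
        # One profile file per process, so parallel compiler processes
        # do not overwrite each other's data.
        cmd.append("-instrumentation-file-append-pid")
    return cmd


print(" ".join(bolt_instrument_cmd("clang-15", "clang-15.inst", "/tmp/clang.fdata")))
```

With the option enabled, each process should write its own file, and the results then need to be merged before the optimization step.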
A 35 GB perf.data is too much. You can sample with less frequency. If you want to use the existing file, add -max-samples=100000000 to perf2bolt. If you are still running out of memory, try adding -strict=0.
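Both suggestions can be expressed as small command builders. In this sketch the `perf record` options mirror the probe command shown earlier in the thread, with perf's `-c` (sample period) added as one way to sample less often; the perf2bolt arguments follow the invocation quoted above. The function names and the example period are illustrative, not recommendations.

```python
def perf_record_cmd(workload, perf_data="perf.data", period=None):
    """Build a perf record argv for branch sampling of `workload` (a list)."""
    cmd = ["perf", "record", "-e", "cycles:u", "-j", "any,u", "-o", perf_data]
    if period is not None:
        # A larger period means fewer samples and a smaller perf.data.
        cmd += ["-c", str(period)]
    return cmd + ["--", *workload]


def perf2bolt_cmd(binary, perf_data, fdata,
                  max_samples=None, strict=True, perf2bolt="perf2bolt"):
    """Build a perf2bolt argv, optionally capping samples or relaxing strict mode."""
    cmd = [perf2bolt, "-p", perf_data, "-o", fdata, binary]
    if max_samples is not None:
        cmd.append(f"-max-samples={max_samples}")
    if not strict:
        cmd.append("-strict=0")
    return cmd


print(perf_record_cmd(["ninja", "clang"], period=10000))
print(perf2bolt_cmd("clang-15", "perf.data", "clang.fdata",
                    max_samples=100000000, strict=False))
```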
@maksfb Unfortunately, I lost access to the machine that I ran the original results on (good ol' spot market), so they won't really be comparable to the previous results, but I was able to get access to a more powerful one (EPYC 7502P, 32c/64t) and I was able to see some small improvements.
The output from llvm-bolt while optimizing:
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 794a0bb547484ec33c13bd6c7c04b1dbd03d040a
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 1551 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm11raw_ostream5writeEPKcm
BOLT-INFO: 1846 out of 119068 functions in the binary (1.6%) have non-empty execution profile
BOLT-INFO: 4 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 213 (dynamic count : 1193) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 18828 instructions were shortened
BOLT-INFO: removed 9 empty blocks
BOLT-INFO: ICF folded 331 out of 119342 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 78.94 KB of code space. Folded functions were called 1334 times based on profile.
BOLT-INFO: basic block reordering modified layout of 380 (0.32%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 778357 hot bytes from 462225 cold bytes (62.74% of split functions is hot).
BOLT-INFO: 3 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 1519 to 639
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
360388 : executed forward branches
26808 : taken forward branches
119537 : executed backward branches
63573 : taken backward branches
29252 : executed unconditional branches
240580 : all function calls
89273 : indirect calls
85711 : PLT calls
3585797 : executed instructions
791739 : executed load instructions
393839 : executed store instructions
260 : taken jump table branches
0 : taken unknown indirect branches
509177 : total branches
119633 : taken branches
389544 : non-taken conditional branches
90381 : taken conditional branches
479925 : all conditional branches
358569 : executed forward branches (-0.5%)
22037 : taken forward branches (-17.8%)
121356 : executed backward branches (+1.5%)
62130 : taken backward branches (-2.3%)
17110 : executed unconditional branches (-41.5%)
240580 : all function calls (=)
89273 : indirect calls (=)
85711 : PLT calls (=)
3570780 : executed instructions (-0.4%)
791739 : executed load instructions (=)
393839 : executed store instructions (=)
260 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
497035 : total branches (-2.4%)
101277 : taken branches (-15.3%)
395758 : non-taken conditional branches (+1.6%)
84167 : taken conditional branches (-6.9%)
479925 : all conditional branches (=)
BOLT-INFO: SCTC: patched 5 tail calls (5 forward) tail calls (0 backward) from a total of 5 while removing 0 double jumps and removing 5 basic blocks totalling 25 bytes of code. CTCs total execution count is 22 and the number of times CTCs are taken is 4.
BOLT-INFO: padding code to 0x6400000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x6200000
BOLT-INFO: setting __hot_end to 0x62cdabb
and the results of building the following kernels ten times with each toolchain:
ARCH=arm64 defconfig:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 114.623 ± 0.060 | 114.499 | 114.695 | 1.02 ± 0.00 |
PGO + BOLT | 112.425 ± 0.062 | 112.353 | 112.548 | 1.00 |
ARCH=x86_64 defconfig:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 52.455 ± 0.084 | 52.357 | 52.616 | 1.02 ± 0.00 |
PGO + BOLT | 51.535 ± 0.056 | 51.450 | 51.627 | 1.00 |
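As a quick check of the "Relative" column, each mean can be divided by the fastest mean in its table, which is how these tables appear to be normalized. The sketch below uses only the means reported above; the helper name `relative` is hypothetical.

```python
def relative(means):
    """Scale each mean by the fastest one, rounded to two decimals."""
    fastest = min(means.values())
    return {name: round(mean / fastest, 2) for name, mean in means.items()}


arm64 = relative({"PGO": 114.623, "PGO + BOLT": 112.425})
x86_64 = relative({"PGO": 52.455, "PGO + BOLT": 51.535})
print("arm64 defconfig:", arm64)
print("x86_64 defconfig:", x86_64)
```

In both cases the PGO-only toolchain comes out about 2% slower than PGO + BOLT, matching the tables.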
Additionally, the output of time -v doing just PGO:
Command being timed: "/home/nathan/cbl/github/tc-build/build-llvm.py --assertions --build-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/build/llvm --check-targets clang lld llvm llvm-unit --llvm-folder /home/nathan/cbl/src/llvm-project --install-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/install/llvm/pgo --pgo kernel-defconfig --projects clang;lld --show-build-commands --targets AArch64;ARM;X86"
User time (seconds): 88892.02
System time (seconds): 5557.36
Percent of CPU this job got: 4632%
Elapsed (wall clock) time (h:mm:ss or m:ss): 33:58.65
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2191052
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3656163
Minor (reclaiming a frame) page faults: 1293436132
Voluntary context switches: 24439135
Involuntary context switches: 17089464
Swaps: 0
File system inputs: 112
File system outputs: 213101104
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
and doing PGO + BOLT:
Command being timed: "/home/nathan/cbl/github/tc-build/build-llvm.py --assertions --bolt --build-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/build/llvm --check-targets clang lld llvm llvm-unit --llvm-folder /home/nathan/cbl/src/llvm-project --install-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/install/llvm/pgo-bolt --pgo kernel-defconfig --projects clang;lld --show-build-commands --targets AArch64;ARM;X86"
User time (seconds): 106368.10
System time (seconds): 402880.84
Percent of CPU this job got: 5551%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:53
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 15886208
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 50641167
Minor (reclaiming a frame) page faults: 1698451290
Voluntary context switches: 84779639
Involuntary context switches: 19818841
Swaps: 0
File system inputs: 64
File system outputs: 1828643504
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
I will see if your suggestion around -instrumentation-file-append-pid improves it even further.
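For readers comparing the two time -v reports above, here is a minimal sketch that converts the "Elapsed (wall clock)" fields to seconds and takes their ratio. The helper name elapsed_to_seconds is hypothetical; the two input strings are copied from the reports. By this arithmetic the PGO + BOLT run took about 4.5 times as long as the PGO-only run.

```python
def elapsed_to_seconds(elapsed: str) -> float:
    """Convert GNU time's 'h:mm:ss' or 'm:ss' elapsed field to seconds."""
    seconds = 0.0
    # Each colon-separated field is worth 60x the one to its right.
    for part in elapsed.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the two reports above.
pgo_only = elapsed_to_seconds("33:58.65")
pgo_bolt = elapsed_to_seconds("2:32:53")
slowdown = pgo_bolt / pgo_only

print(f"PGO only:   {pgo_only:.2f} s")
print(f"PGO + BOLT: {pgo_bolt:.2f} s")
print(f"Ratio:      {slowdown:.2f}x")
```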
@nathanchance Actually, maybe it could be an issue that you PGO LLVM with the kernel? Maybe try --pgo llvm; that is how I do it, and there was a lot of improvement even without LBR. Also, according to their docs, they just target X86.
@maksfb With sampling at -c 2500 everything went fine; maybe this should be changed in the docs.
I have now done a run with -instrumentation-file-append-pid and it resulted in over 190 GB of disk usage when instrumenting clang. The combindend.fdata came to 165 MB, but the result shown is a lot better than before (on an AMD CPU without LBR):
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f595b51f502b2f3c97d30e826784159438bde9c4
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2424 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE
BOLT-WARNING: 7 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 16889 out of 96706 functions in the binary (17.5%) have non-empty execution profile
BOLT-INFO: 495 functions with profile could not be optimized
BOLT-WARNING: 9 (0.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-WARNING: 695817 out of 2165380220666 samples in the binary (0.0%) belong to functions with invalid (possibly stale) profile.
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 9616 (dynamic count : 19121657995) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 782089 instructions were shortened
BOLT-INFO: removed 1458 empty blocks
BOLT-INFO: ICF folded 1687 out of 96983 functions in 5 passes. 2 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 192.87 KB of code space. Folded functions were called 6708861534 times based on profile.
BOLT-INFO: basic block reordering modified layout of 9503 (9.97%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 11629561 hot bytes from 11335899 cold bytes (50.64% of split functions is hot).
BOLT-INFO: 223 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 15697 to 7487
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
1345852442580 : executed forward branches
177695950568 : taken forward branches
279784019546 : executed backward branches
165552345783 : taken backward branches
80707139554 : executed unconditional branches
111711292644 : all function calls
38436135059 : indirect calls
20540549655 : PLT calls
10555075375899 : executed instructions
2627819906958 : executed load instructions
1279424679765 : executed store instructions
16846108018 : taken jump table branches
0 : taken unknown indirect branches
1706343601680 : total branches
423955435905 : taken branches
1282388165775 : non-taken conditional branches
343248296351 : taken conditional branches
1625636462126 : all conditional branches
1280317632631 : executed forward branches (-4.9%)
78974541817 : taken forward branches (-55.6%)
345318829495 : executed backward branches (+23.4%)
155608698163 : taken backward branches (-6.0%)
60656090585 : executed unconditional branches (-24.8%)
111711292644 : all function calls (=)
38436135059 : indirect calls (=)
20540549655 : PLT calls (=)
10505269253372 : executed instructions (-0.5%)
2627819906958 : executed load instructions (=)
1279424679765 : executed store instructions (=)
16846108018 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
1686292552711 : total branches (-1.2%)
295239330565 : taken branches (-30.4%)
1391053222146 : non-taken conditional branches (+8.5%)
234583239980 : taken conditional branches (-31.7%)
1625636462126 : all conditional branches (=)
BOLT-INFO: SCTC: patched 78 tail calls (70 forward) tail calls (8 backward) from a total of 78 while removing 12 double jumps and removing 65 basic blocks totalling 325 bytes of code. CTCs total execution count is 54542385 and the number of times CTCs are taken is 51777916.
BOLT-INFO: setting __hot_start to 0x4e00000
Actually, maybe it could be an issue that you PGO LLVM with the kernel? Maybe try --pgo llvm; that is how I do it, and there was a lot of improvement even without LBR.
Sure, I can try that, but the primary purpose of this script is generating a version of LLVM that is most optimized for compiling the kernel, so I would think that running LLVM against the kernel would be better than running it against LLVM. Yet another hypothesis :)
I have now done a run with -instrumentation-file-append-pid and it resulted in over 190 GB of disk usage when instrumenting clang. The combindend.fdata came to 165 MB, but the result shown is a lot better than before (on an AMD CPU without LBR)
Okay. This confirms that in previous runs the "normal" clang profile was overwritten by the driver profile.
With instrumentation, you will likely get a good-quality profile after compiling <10% of the code. I don't have a clear idea how such a process could be automated though.
To speed up instrumented binaries, in BOLT we can introduce instrumentation "sampling". For each function, we will have to emit two versions, one instrumented and one not. Non-instrumented code will be executed most of the time, but functions will be redirected to the instrumented version on every Nth invocation (either deterministically or randomly).
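As a purely conceptual illustration of the sampling idea described above (not BOLT's implementation, which would operate on machine code), the following Python sketch dispatches to an "instrumented" path on every Nth call and scales the recorded count back up. The decorator name sampled_instrumentation, the counters dictionary, and the example function parse_token are all hypothetical.

```python
import functools


def sampled_instrumentation(every_n, counters):
    """Take the instrumented path on every Nth call (deterministic variant)."""
    def decorate(func):
        calls = 0

        @functools.wraps(func)
        def dispatcher(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % every_n == 0:
                # Instrumented version: record one sample for this function.
                counters[func.__name__] = counters.get(func.__name__, 0) + 1
            # Both versions perform the same real work.
            return func(*args, **kwargs)

        return dispatcher
    return decorate


counters = {}


@sampled_instrumentation(every_n=100, counters=counters)
def parse_token(value):
    return value + 1


for i in range(1000):
    parse_token(i)

# Scale the sampled count by N to estimate the true call count.
estimated_calls = counters["parse_token"] * 100
print(counters, estimated_calls)
```

The trade-off is the usual one for sampling: a larger N means less overhead per call but a noisier estimate for functions that are called rarely.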
@maksfb
Yes, it was definitely overwritten several times. I also saw that the profile was regenerated again and again without the -instrumentation-file-append-pid option. While compiling, the profile first grew to around 100 MB, but once the build was done it had shrunk to 2 MB, so this bug definitely needs to be fixed.
The compile with the instrumented binary was also faster this time. Without -instrumentation-file-append-pid I saw many processes in disk sleep, which could be the reason.
@nickdesaulniers I actually did two kernel compiles with modprobed-db:
LLVM15 BOLT+PGO+LTO (without LBR)
________________________________________________________
Executed in 131.14 secs fish external
usr time 31.12 mins 242.00 micros 31.12 mins
sys time 2.77 mins 98.00 micros 2.77 mins
LLVM 13 Stock
==> Leaving fakeroot environment.
==> Finished making: linux-cachyos-lto 5.17.4-2 (Sa 23 Apr 2022 00:56:39 CEST)
________________________________________________________
Executed in 189.55 secs fish external
usr time 50.26 mins 220.00 micros 50.26 mins
sys time 3.48 mins 125.00 micros 3.48 mins
Based on the above discussion, I have added support for perf to build-llvm.py's BOLT support and I added --instrumentation-file-append-pid for the instrumentation command. To avoid generating too much data from either perf or BOLT's instrumentation (as --instrumentation-file-append-pid will generate one file for each invocation of clang during a kernel build), we will just build one kernel (either the host target or the first target in the user's list if the host is not supported), which seems to be good enough to see some gains along the lines of what @nickdesaulniers reported on our mailing list, which is around 5-7% across the board. I'll tidy up these changes and push to the pull request for review tomorrow.
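The "build just one kernel" rule above can be sketched roughly as follows. This is not the code from the pull request: the function name pick_bolt_training_target and the machine-to-target mapping are assumptions made for illustration, and "host is not supported" is read here as "the host's target is not in the user's target list".

```python
import platform

# Hypothetical mapping from platform.machine() values to LLVM target names.
MACHINE_TO_LLVM_TARGET = {
    "x86_64": "X86",
    "AMD64": "X86",
    "aarch64": "AArch64",
    "arm64": "AArch64",
    "armv7l": "ARM",
}


def pick_bolt_training_target(user_targets, host_machine=None):
    """Prefer the host's target; otherwise fall back to the first listed one."""
    if not user_targets:
        raise ValueError("at least one target is required")
    if host_machine is None:
        host_machine = platform.machine()
    host_target = MACHINE_TO_LLVM_TARGET.get(host_machine)
    if host_target in user_targets:
        return host_target
    return user_targets[0]


targets = "AArch64;ARM;X86".split(";")
print(pick_bolt_training_target(targets, host_machine="x86_64"))
print(pick_bolt_training_target(targets, host_machine="riscv64"))
```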
For the below benchmarks, this is the "base" build-llvm.py invocation:
$ build-llvm.py --no-ccache --pgo kernel-defconfig --projects "clang;lld" --targets "AArch64;ARM;X86"
The BOLT and (assertions) labels in the tables correspond to adding --bolt and --assertions to that invocation, respectively.
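To make the mapping between table labels and flags explicit, here is a small sketch that expands the base invocation into the four benchmarked variants. Only the flags quoted above are used; the BASE and VARIANTS names and the order in which the extra flags are appended are assumptions.

```python
import shlex

# Base invocation as quoted above.
BASE = [
    "build-llvm.py", "--no-ccache",
    "--pgo", "kernel-defconfig",
    "--projects", "clang;lld",
    "--targets", "AArch64;ARM;X86",
]

# Table label -> extra flags appended to the base invocation.
VARIANTS = {
    "PGO": [],
    "PGO + BOLT": ["--bolt"],
    "PGO (assertions)": ["--assertions"],
    "PGO + BOLT (assertions)": ["--assertions", "--bolt"],
}

# shlex.join re-quotes arguments such as "clang;lld" for safe shell use.
commands = {label: shlex.join(BASE + extra) for label, extra in VARIANTS.items()}

for label, command in commands.items():
    print(f"{label}: {command}")
```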
n2.xlarge.x86
Equinix's n2.xlarge.x86 has an Intel Xeon Gold 5218 (32C/64T), 384GB of RAM, and NVMe storage, which supports the perf approach:
build-llvm.py
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 2022.866 ± 4.092 | 2012.303 | 2026.556 | 1.00 |
PGO + BOLT | 2537.742 ± 4.482 | 2528.943 | 2544.133 | 1.25 ± 0.00 |
PGO (assertions) | 2219.303 ± 4.937 | 2210.261 | 2225.482 | 1.10 ± 0.00 |
PGO + BOLT (assertions) | 2833.222 ± 6.157 | 2825.017 | 2842.768 | 1.40 ± 0.00 |
ARCH=arm defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 89.767 ± 0.067 | 89.644 | 89.853 | 1.05 ± 0.00 |
PGO + BOLT | 85.146 ± 0.059 | 85.051 | 85.250 | 1.00 |
PGO (assertions) | 105.353 ± 0.095 | 105.183 | 105.500 | 1.24 ± 0.00 |
PGO + BOLT (assertions) | 99.211 ± 0.078 | 99.133 | 99.355 | 1.17 ± 0.00 |
ARCH=arm64 defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 125.606 ± 0.061 | 125.518 | 125.691 | 1.06 ± 0.00 |
PGO + BOLT | 118.282 ± 0.057 | 118.227 | 118.401 | 1.00 |
PGO (assertions) | 147.560 ± 0.077 | 147.454 | 147.671 | 1.25 ± 0.00 |
PGO + BOLT (assertions) | 138.358 ± 0.078 | 138.227 | 138.503 | 1.17 ± 0.00 |
ARCH=x86_64 defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 58.281 ± 0.115 | 58.129 | 58.498 | 1.05 ± 0.00 |
PGO + BOLT | 55.411 ± 0.069 | 55.348 | 55.548 | 1.00 |
PGO (assertions) | 67.350 ± 0.085 | 67.235 | 67.479 | 1.22 ± 0.00 |
PGO + BOLT (assertions) | 63.839 ± 0.099 | 63.657 | 63.951 | 1.15 ± 0.00 |
ARCH=arm allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 441.529 ± 0.112 | 441.395 | 441.722 | 1.05 ± 0.00 |
PGO + BOLT | 420.460 ± 0.214 | 420.121 | 420.845 | 1.00 |
PGO (assertions) | 510.643 ± 0.118 | 510.460 | 510.837 | 1.21 ± 0.00 |
PGO + BOLT (assertions) | 482.406 ± 0.309 | 482.078 | 483.173 | 1.15 ± 0.00 |
ARCH=arm64 allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 563.066 ± 0.138 | 562.897 | 563.236 | 1.06 ± 0.00 |
PGO + BOLT | 533.187 ± 0.191 | 532.798 | 533.584 | 1.00 |
PGO (assertions) | 651.277 ± 0.154 | 651.095 | 651.528 | 1.22 ± 0.00 |
PGO + BOLT (assertions) | 612.805 ± 0.145 | 612.613 | 613.141 | 1.15 ± 0.00 |
ARCH=x86_64 allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 558.869 ± 0.160 | 558.588 | 559.043 | 1.06 ± 0.00 |
PGO + BOLT | 529.673 ± 0.124 | 529.509 | 529.875 | 1.00 |
PGO (assertions) | 641.870 ± 0.196 | 641.636 | 642.297 | 1.21 ± 0.00 |
PGO + BOLT (assertions) | 604.896 ± 0.215 | 604.660 | 605.259 | 1.14 ± 0.00 |
m3.large.x86
Equinix's m3.large.x86 has an AMD EPYC 7502P (32C/64T), 256GB of RAM, and NVMe storage, which does not support the perf approach, instead relying on instrumentation:
build-llvm.py
NOTE: Due to https://github.com/llvm/llvm-project/issues/55004, these builds have assertions enabled, so they should not be compared with the PGO and PGO + BOLT times above.
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 1915.534 ± 1.204 | 1913.470 | 1917.758 | 1.00 |
PGO + BOLT | 2844.596 ± 5.504 | 2836.409 | 2851.857 | 1.49 ± 0.00 |
ARCH=arm defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 81.965 ± 0.036 | 81.914 | 82.006 | 1.07 ± 0.00 |
PGO + BOLT | 76.696 ± 0.036 | 76.627 | 76.737 | 1.00 |
ARCH=arm64 defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 114.519 ± 0.036 | 114.459 | 114.567 | 1.07 ± 0.00 |
PGO + BOLT | 106.899 ± 0.062 | 106.779 | 107.001 | 1.00 |
ARCH=x86_64 defconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 52.413 ± 0.081 | 52.262 | 52.514 | 1.06 ± 0.00 |
PGO + BOLT | 49.313 ± 0.088 | 49.128 | 49.427 | 1.00 |
ARCH=arm allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 390.632 ± 0.140 | 390.418 | 390.854 | 1.06 ± 0.00 |
PGO + BOLT | 367.258 ± 0.125 | 367.039 | 367.494 | 1.00 |
ARCH=arm64 allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 500.528 ± 0.092 | 500.425 | 500.675 | 1.07 ± 0.00 |
PGO + BOLT | 469.376 ± 0.110 | 469.242 | 469.527 | 1.00 |
ARCH=x86_64 allmodconfig
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
PGO | 489.216 ± 0.160 | 488.974 | 489.471 | 1.07 ± 0.00 |
PGO + BOLT | 458.317 ± 0.134 | 458.179 | 458.545 | 1.00 |
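For anyone reading the "Relative" column in the tables above: the values are consistent with each row's mean divided by the fastest mean in the same table. The sketch below reproduces that for the m3.large.x86 ARCH=arm defconfig numbers. The function name relative_to_fastest is hypothetical, and this is a reading of the data rather than a statement about how the benchmarking tool computes it internally.

```python
def relative_to_fastest(means):
    """Return each mean divided by the smallest mean in the table."""
    fastest = min(means.values())
    return {label: mean / fastest for label, mean in means.items()}


# Means copied from the m3.large.x86 ARCH=arm defconfig table above.
arm_defconfig = {"PGO": 81.965, "PGO + BOLT": 76.696}
relative = relative_to_fastest(arm_defconfig)

for label, ratio in relative.items():
    # Time saved relative to this row if the fastest configuration is used.
    saving = (1 - 1 / ratio) * 100
    print(f"{label}: {ratio:.2f} (fastest row saves {saving:.1f}%)")
```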
We should look into @maksfb and @rafaelauler's docs on building clang with BOLT (post-link optimization), which seems to get some performance improvements on top of LTO+PGO.
https://github.com/facebookincubator/BOLT/blob/rebased/bolt/docs/OptimizingClang.md#optimizing-clang-with-bolt https://research.fb.com/wp-content/uploads/2019/02/BOLT-A-Practical-Binary-Optimizer-for-Data-Centers-and-Beyond.pdf