ClangBuiltLinux / tc-build

A set of scripts to build LLVM and binutils
Apache License 2.0

BOLT clang #155

Closed nickdesaulniers closed 2 years ago

nickdesaulniers commented 3 years ago

We should look into @maksfb and @rafaelauler's docs on building clang with BOLT (post-link optimization), which seems to get some performance improvements on top of LTO+PGO.

https://github.com/facebookincubator/BOLT/blob/rebased/bolt/docs/OptimizingClang.md#optimizing-clang-with-bolt
https://research.fb.com/wp-content/uploads/2019/02/BOLT-A-Practical-Binary-Optimizer-for-Data-Centers-and-Beyond.pdf

nickdesaulniers commented 3 years ago

@glandium's https://glandium.org/blog/?p=2467 is vaguely reminiscent of parts of BOLT.

ptr1337 commented 3 years ago

Yes, this would be a great addition to the LLVM script, and also for the kernel!

https://www.phoronix.com/scan.php?page=news_item&px=Facebook-BOLTing-The-Kernel

dileks commented 3 years ago

I have had good experiences with a self-built PGO+ThinLTO optimized LLVM-13 toolchain on Debian/unstable AMD64. It reduces my build time by approx. 30% on an Intel Sandy Bridge system when building Linux v5.15 RCs.

According to a recent talk about BOLT at Linux Plumbers Conference 2021, an Intel Ivy Bridge system saw a boost of 50% with a BOLT plus PGO+ThinLTO optimized LLVM toolchain.

BOLT support in tc-build would be highly appreciated. Some hints on hardware (CPU, RAM, etc.) and software requirements (which LLVM/Clang version) from people who have experience building with BOLT are welcome, as is some information about build time and disk usage.

sylvestre commented 3 years ago

I will have a look at bringing PGO/ThinLTO to apt.llvm.org (I have access to a way more powerful system). BOLT will follow after this :)

ptr1337 commented 3 years ago

https://github.com/ptr1337/llvm-bolt-builds

I've updated an old script here. Works like a charm. Tested it some hours ago; sadly I have an AMD CPU and can't BOLT the applications, but the performance increase is about 25-30%.

ptr1337 commented 3 years ago

If someone wants, I can share my compiled toolchain. I'm going to compile it on my server with a 5900X with almost all projects enabled.

dileks commented 3 years ago

@sylvestre Cool :-). Will PGO+ThinLTO be integrated into official packages or will you create something new like a new meta-package llvm-toolchain-NN-pgo-thinlto?

@ptr1337 Which LLVM/Clang version? Performance increase compared to what... normal clang-NN or an optimized clang-NN?

Might want to update your results [1]?

[1] https://github.com/ptr1337/llvm-bolt-builds/blob/master/results.md

sylvestre commented 3 years ago

@ptr1337 how long does it take https://github.com/ptr1337/llvm-bolt-builds/blob/master/full_workflow.bash ?

@dileks integrate it. I don't see any reason why not ?

dileks commented 3 years ago

@sylvestre You are welcome to do so :-).

dileks commented 3 years ago

@sylvestre Will you announce this change to the LLVM-dev and/or ClangBuiltLinux mailing lists? I mean, how will someone find out about the PGO+ThinLTO optimized llvm-toolchain-NN for Debian/Ubuntu systems? Will this affect all currently supported llvm-toolchain-NN (where NN might be versions like 11/12/13/14)?

ptr1337 commented 3 years ago

I haven't tested the full workflow yet, so I don't know whether it works. It mostly does not matter which system it is compiled on, since you only need the binaries.

It depends on the system, flags and so on, but I think around 4 hrs with my 5950X.

sylvestre commented 3 years ago

@dileks yeah, i will ;)

For the 3 supported releases on apt.llvm.org (12, 13 & 14 currently)

ptr1337 commented 3 years ago

@dileks

I will do a full compile tonight. It is hard to time. Afterwards I will benchmark it in several ways.

Oh, sorry, I did not read that correctly; the last compile like this took around 1.5 hrs for me.

dileks commented 3 years ago

@sylvestre @ptr1337 Thank you very much!

ptr1337 commented 3 years ago

Some fixes are needed in the script.

I'm going to figure it out; locally it worked without a problem :x

ptr1337 commented 3 years ago

@sylvestre @dileks

All scripts are now working, also the big one.

nathanchance commented 3 years ago

Looks like BOLT is getting ready to be added to the monorepo: https://lists.llvm.org/pipermail/llvm-dev/2021-November/153551.html

Once that is done, I will explore adding support for it to tc-build, although given that it requires a processor with LBR support, I will have to find a way to restrict it in the script to avoid weird errors.

maksfb commented 3 years ago

@nathanchance, you can use BOLT with cycles only, but the performance gains wouldn't be as high. There's also x86-only instrumentation-based profile collection: https://github.com/facebookincubator/BOLT#with-instrumentation.
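The three collection options mentioned here could be turned into a simple selection rule in a build script. This is only a sketch: the function name and the returned strings are made up for illustration, and preferring instrumentation over cycles-only sampling on x86 is an assumption, not something stated in this thread.

```python
def pick_profile_method(has_lbr: bool, is_x86: bool) -> str:
    """Pick a BOLT profile collection method (hypothetical helper)."""
    if has_lbr:
        # Branch sampling (perf record -j any,u) needs LBR support.
        return "perf-lbr"
    if is_x86:
        # Instrumentation-based collection is described as x86-only.
        return "instrumentation"
    # Cycles-only sampling works without LBR, but with smaller gains.
    return "perf-cycles"


if __name__ == "__main__":
    print(pick_profile_method(has_lbr=False, is_x86=True))
```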

nathanchance commented 3 years ago

@maksfb thanks a lot for the information, I will digest that and see how it can be integrated once BOLT is upstream :)

ptr1337 commented 3 years ago

Yes, sadly only with Intel-based CPUs. I have a patch for AMD to use perf record, but sadly not with "-j any,u". Maybe the patch will be ready for BOLT at some point, or there is another solution.

Here is the patch: https://github.com/ptr1337/kernel-patches/blob/master/5.15/AMD/0001-AMD-PERF-PATCH.patch

nathanchance commented 2 years ago

Alright, I have gotten this all wired up with the instrumentation mode as best as I can tell. I have opened a draft pull request (#187) if people want to take a look.

I did not wire up perf-based sampling yet, as my main workstation has a Threadripper 3990X, which does not support it:

$ perf record -e cycles:u -j any,u -- sleep 1
Error:
cycles:u: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
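A script could run the same probe before attempting perf-based sampling. The following is a minimal sketch, not the code from the pull request: the function name is hypothetical, and it assumes that perf exits with a non-zero status when the hardware rejects "-j any,u", as in the error above.

```python
import os
import subprocess
import tempfile


def supports_branch_sampling(run=subprocess.run) -> bool:
    """Return True if 'perf record -j any,u' succeeds on this machine."""
    with tempfile.TemporaryDirectory() as tmp:
        cmd = [
            "perf", "record", "-e", "cycles:u", "-j", "any,u",
            # Keep the throwaway profile out of the working directory.
            "-o", os.path.join(tmp, "perf.data"),
            "--", "sleep", "1",
        ]
        try:
            result = run(cmd, capture_output=True, check=False)
        except FileNotFoundError:
            # perf itself is not installed.
            return False
    return result.returncode == 0


if __name__ == "__main__":
    print("branch sampling available:", supports_branch_sampling())
```

The `run` parameter only exists so the check can be exercised without perf installed.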

However, there are a couple of issues I have noticed and reported upstream:

This instrumentation adds a significant amount of overhead at build time on the couple of machines I tested on and I see next to no improvement at run time over regular PGO. I tested this by building LLVM at https://github.com/llvm/llvm-project/commit/3de29ad20955eb8ed68e831795bf55bfe9fbe58b with PGO then PGO and BOLT (with assertions for the time being) and using those toolchains to build ARCH=arm, ARCH=arm64, and ARCH=x86_64 kernels (defconfig and allmodconfig) from 5.18-rc3. The host machine that I used to gather these results on has an AMD EPYC 7502P (as I could not have my main machine tied up for this amount of time).

LLVM build times

I only ran build-llvm.py once for this benchmark, which is mainly meant to show that BOLT's instrumentation is much heavier at run time than the instrumentation for PGO, otherwise I would still be waiting for results :^)

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| PGO | 1843.697 | 1843.697 | 1843.697 | 1.00 |
| PGO + BOLT | 10552.749 | 10552.749 | 10552.749 | 5.72 |

Kernel build times

Each kernel was built ten times with the toolchains built above.

ARCH=arm defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| PGO | 81.692 ± 0.026 | 81.638 | 81.725 | 1.00 |
| PGO + BOLT | 81.835 ± 0.043 | 81.784 | 81.935 | 1.00 ± 0.00 |

ARCH=arm64 defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| PGO | 114.339 ± 0.048 | 114.248 | 114.409 | 1.00 |
| PGO + BOLT | 114.787 ± 0.036 | 114.726 | 114.833 | 1.00 ± 0.00 |

ARCH=x86_64 defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| PGO | 52.140 ± 0.053 | 52.057 | 52.218 | 1.00 |
| PGO + BOLT | 52.203 ± 0.065 | 52.138 | 52.317 | 1.00 ± 0.00 |

ARCH=arm allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| PGO | 389.725 ± 0.100 | 389.536 | 389.884 | 1.00 |
| PGO + BOLT | 390.591 ± 0.116 | 390.398 | 390.786 | 1.00 ± 0.00 |

ARCH=arm64 allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| PGO | 500.086 ± 0.271 | 499.482 | 500.472 | 1.00 |
| PGO + BOLT | 501.744 ± 0.218 | 501.494 | 502.087 | 1.00 ± 0.00 |

ARCH=x86_64 allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| PGO | 487.722 ± 0.132 | 487.565 | 487.939 | 1.00 |
| PGO + BOLT | 489.218 ± 0.061 | 489.136 | 489.346 | 1.00 ± 0.00 |
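Because the "Relative" column rounds to two decimals, the kernel build differences all show up as 1.00. A small sketch like the following, using the mean times copied from the tables above, makes the size of the differences visible; the helper name is only illustrative.

```python
# Mean wall-clock times in seconds, copied from the tables above.
llvm_build = {"PGO": 1843.697, "PGO + BOLT": 10552.749}
kernel_builds = {  # config: (PGO mean, PGO + BOLT mean)
    "arm defconfig": (81.692, 81.835),
    "arm64 defconfig": (114.339, 114.787),
    "x86_64 defconfig": (52.140, 52.203),
    "arm allmodconfig": (389.725, 390.591),
    "arm64 allmodconfig": (500.086, 501.744),
    "x86_64 allmodconfig": (487.722, 489.218),
}


def percent_slower(baseline: float, candidate: float) -> float:
    """How much longer 'candidate' took than 'baseline', in percent."""
    return (candidate / baseline - 1.0) * 100.0


llvm_ratio = llvm_build["PGO + BOLT"] / llvm_build["PGO"]
print(f"LLVM build: {llvm_ratio:.2f}x the PGO-only time")
for name, (pgo, bolt) in kernel_builds.items():
    print(f"{name}: {percent_slower(pgo, bolt):+.2f}%")
```

With these numbers, every kernel build with the PGO + BOLT toolchain comes out slightly slower than with PGO alone, each by less than half a percent.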

ptr1337 commented 2 years ago

Thanks for all your testing.

I did not wire up perf-based sampling yet, as my main workstation has a Threadripper 3990X, which does not support it:

Actually, BOLT does not really improve binaries without branch sampling. I have personally tested it several times and got the same result. Maybe a few seconds more or less, but that is within tolerance. With branch sampling you get the real performance out of it.

I have an Intel server and can run some workloads. In the coming days I'll post the results comparing the stage 1 | PGO+LTO | PGO+BOLT | PGO+LTO+BOLT compilers.

My workstation also has an AMD CPU, which is a bit annoying. Maybe it will soon be possible to do branch sampling with Zen 3 based CPUs; this can already be tested with linux-next or by patching 5.18-rc. https://www.phoronix.com/scan.php?page=news_item&px=AMD-Branch-Sampling-v5.19

I'll take a look at your PR.

ptr1337 commented 2 years ago

Hey @nathanchance

I just used your commit and built the toolchain on my AMD 5900X. It could even be instrumented via BOLT. Here is the output from llvm-bolt when bolting it:

$ llvm-bolt --data /home/ptr1337/repo/tc-build/build/llvm/clang.fdata /home/ptr1337/repo/tc-build/install/bin/clang-15 -o /home/ptr1337/repo/tc-build/install/bin/clang-15.bolt \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 \
    -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f595b51f502b2f3c97d30e826784159438bde9c4
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2424 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm3sys4path11is_absoluteERKNS_5TwineENS1_5StyleE
BOLT-INFO: 1405 out of 96706 functions in the binary (1.5%) have non-empty execution profile
BOLT-INFO: 3 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 255 (dynamic count : 870) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 19226 instructions were shortened
BOLT-INFO: removed 13 empty blocks
BOLT-INFO: ICF folded 332 out of 96983 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 94.36 KB of code space. Folded functions were called 1672 times based on profile.
BOLT-INFO: basic block reordering modified layout of 367 (0.38%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 621000 hot bytes from 540455 cold bytes (53.47% of split functions is hot).
BOLT-INFO: 4 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 1076 to 586
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

              205673 : executed forward branches
               22162 : taken forward branches
               76422 : executed backward branches
               37249 : taken backward branches
               22987 : executed unconditional branches
              107760 : all function calls
               45821 : indirect calls
               42814 : PLT calls
             2270751 : executed instructions
              489495 : executed load instructions
              260256 : executed store instructions
                 271 : taken jump table branches
                   0 : taken unknown indirect branches
              305082 : total branches
               82398 : taken branches
              222684 : non-taken conditional branches
               59411 : taken conditional branches
              282095 : all conditional branches

              195217 : executed forward branches (-5.1%)
                7344 : taken forward branches (-66.9%)
               86878 : executed backward branches (+13.7%)
               33314 : taken backward branches (-10.6%)
               11876 : executed unconditional branches (-48.3%)
              107760 : all function calls (=)
               45821 : indirect calls (=)
               42814 : PLT calls (=)
             2247420 : executed instructions (-1.0%)
              489495 : executed load instructions (=)
              260256 : executed store instructions (=)
                 271 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
              293971 : total branches (-3.6%)
               52534 : taken branches (-36.2%)
              241437 : non-taken conditional branches (+8.4%)
               40658 : taken conditional branches (-31.6%)
              282095 : all conditional branches (=)

BOLT-INFO: SCTC: patched 4 tail calls (4 forward) tail calls (0 backward) from a total of 4 while removing 0 double jumps and removing 4 basic blocks totalling 20 bytes of code. CTCs total execution count is 18 and the number of times CTCs are taken is 4.
BOLT-INFO: padding code to 0x5000000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x4e00000
BOLT-INFO: setting __hot_end to 0x4ea46e8
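The percentages in parentheses in the second dyno-stats block can be reproduced from the two blocks, assuming the first block describes the input binary and the second the optimized one. A minimal sketch with three of the counters from the log above:

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs taken from the log above.
stats = {
    "taken forward branches": (22162, 7344),
    "executed unconditional branches": (22987, 11876),
    "taken branches": (82398, 52534),
}

for name, (before, after) in stats.items():
    print(f"{name}: {percent_change(before, after):+.1f}%")
```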

Here are also some quick benchmarks:

meassure_script.sh:

#!/bin/bash
# Times a reference clang build using the clang/clang++ found in PATH.
# Error handlers use { ...; } because "exit" inside ( ... ) only leaves the subshell.
mkdir -p measure-build-time || { echo "Could not create build directory!"; exit 1; }
cd measure-build-time || { echo "Could not enter build directory!"; exit 1; }
echo "== Clean old build artifacts"
# ${PWD:?} aborts instead of expanding to "/*" should PWD ever be empty.
rm -rf -- "${PWD:?}"/*

echo "== Configure reference Clang build with tools from ${CPATH}"

CC=clang CXX=clang++ LD=lld \
  cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD="X86" \
  -DCMAKE_INSTALL_PREFIX="$(pwd)/install" \
  -DLLVM_USE_LINKER=lld \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_PARALLEL_COMPILE_JOBS="$(nproc)" \
  -DLLVM_PARALLEL_LINK_JOBS="$(nproc)" \
  ../llvm-project/llvm || { echo "Could not configure project!"; exit 1; }

echo
echo "== Start Build"
time ninja clang || { echo "Could not build project!"; exit 1; }

LLVM-BOLT PGO+LTO Instrumented without perf:

== Start Build
[2529/2529] Creating executable symlink bin/clang

real    4m26,656s
user    96m17,866s
sys     3m41,900s

LLVM-BOLT PGO+LTO Instrumented with perf

[2529/2529] Creating executable symlink bin/clang

real    4m12,346s
user    88m10,566s
sys     3m28,353s

LLVM 13 STOCK ARCH LINUX:

== Start Build
[2529/2529] Creating executable symlink bin/clang

real    7m9,064s
user    154m7,218s
sys     4m20,882s
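To compare the three runs, the "real" times above can be converted to seconds and expressed relative to the stock build. This is a rough sketch; the parser is a hypothetical helper that accepts the comma decimal separator used in this output. Note that the runs also differ in LLVM version (clang-15 versus stock LLVM 13), so the ratios are not purely the effect of PGO, LTO and BOLT.

```python
import re


def parse_real(text: str) -> float:
    """Convert a bash 'time' value such as '4m26,656s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


# "real" values copied from the three runs above.
runs = {
    "LLVM-BOLT PGO+LTO instrumented without perf": "4m26,656s",
    "LLVM-BOLT PGO+LTO instrumented with perf": "4m12,346s",
    "LLVM 13 stock Arch Linux": "7m9,064s",
}

baseline = parse_real(runs["LLVM 13 stock Arch Linux"])
for name, value in runs.items():
    seconds = parse_real(value)
    print(f"{name}: {seconds:.1f} s, {baseline / seconds:.2f}x vs. stock")
```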

Full log can be found here: https://pastebin.com/L859SqaQ

I saw your perf BOLT branch; I will let my server build it overnight.

maksfb commented 2 years ago

BOLT-INFO: 1405 out of 96706 functions in the binary (1.5%) have non-empty execution profile

2270751 : executed instructions

The number of profiled functions and executed instructions is low. Try to increase the sampling frequency. If you cannot increase the sampling frequency, profile the same compiler invocation in a loop (10 or more times) and merge the converted profiles with merge-fdata.
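The merge step could be scripted roughly as follows. This is a sketch under assumptions: the function name and file names are hypothetical, merge-fdata is expected to be in PATH, and it relies on merge-fdata printing the merged profile to standard output, which is how the BOLT documentation uses it.

```python
import subprocess


def merge_profiles(profiles, output, run=subprocess.run):
    """Merge several .fdata profiles into 'output' using merge-fdata."""
    profiles = [str(p) for p in profiles]
    if not profiles:
        raise ValueError("no profiles to merge")
    with open(output, "w") as merged:
        # merge-fdata writes the merged profile to stdout.
        run(["merge-fdata", *profiles], stdout=merged, check=True)


if __name__ == "__main__":
    # Hypothetical file names, one profile per repeated compiler run.
    merge_profiles([f"run{i}.fdata" for i in range(10)], "combined.fdata")
```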

maksfb commented 2 years ago

This instrumentation adds a significant amount of overhead at build time on the couple of machines I tested on and I see next to no improvement at run time over regular PGO.

@nathanchance, how did you invoke llvm-bolt? Could you share its output with the -dyno-stats option added (if you don't already have it)?

insilications commented 2 years ago

According to rafaelauler, the recommended pipeline is to use sampling with LBR if you can, instead of instrumenting. According to him most of the time LBR profiles tend to win.

nathanchance commented 2 years ago

@nathanchance, how did you invoke llvm-bolt?

The invocations are here:

https://github.com/ClangBuiltLinux/tc-build/pull/187/commits/159b6cb7f2c51970804eaf0e769d443147ad43d4#diff-ff8184afddb42c587485db0cff5989631826a9725d16ec5fa2ccb001c8061948R1292-R1296

https://github.com/ClangBuiltLinux/tc-build/pull/187/commits/159b6cb7f2c51970804eaf0e769d443147ad43d4#diff-ff8184afddb42c587485db0cff5989631826a9725d16ec5fa2ccb001c8061948R1304-R1309

They were shamelessly stolen from Optimizing Clang : A Practical Example of Applying BOLT :) if there is something different I should be doing, please let me know!

Could you share its output with the -dyno-stats option added (if you don't already have it)?

Sure, let me do a fresh set of benchmarks, as I used hyperfine for the stats above, which does not show the output of a command by default (for performance reasons). I should have those done in a couple of hours if all goes well.

ptr1337 commented 2 years ago

@nathanchance

I can confirm that your perf branch also works with branch sampling. I will also run benchmarks with hyperfine now and post them here. The perf record profile is actually a bit small at around 6.5 GB; when building clang and profiling that, the profile came to around 15 GB.

Here the output of the llvm-bolt process:

[ perf record: Captured and wrote 6540.474 MB /home/ptr1337/tc-build-perf/build/llvm/perf.data (8781714 samples) ]
$ /home/ptr1337/tc-build-perf/build/llvm/stage1/bin/perf2bolt -p /home/ptr1337/tc-build-perf/build/llvm/perf.data -o /home/ptr1337/tc-build-perf/build/llvm/clang.fdata /home/ptr1337/tc-build-perf/install/bin/clang-15
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for /home/ptr1337/tc-build-perf/build/llvm/perf.data
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 074abdcc60fabe5aa7b301bccf9676e1cbcc1df5
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x4e00000, offset 0x4e00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-WARNING: Failed to analyze 2630 relocations
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 2997 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 7586584 samples and 240561575 LBR entries
PERF2BOLT: 1195130 samples (13.6%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 86327 (0.0%)
PERF2BOLT: out of range traces involving unknown regions: 23566706 (10.1%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE
BOLT-WARNING: 10 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events...
PERF2BOLT: wrote 543512 objects and 0 memory objects to /home/ptr1337/tc-build-perf/build/llvm/clang.fdata
$ /home/ptr1337/tc-build-perf/build/llvm/stage1/bin/llvm-bolt --data=/home/ptr1337/tc-build-perf/build/llvm/clang.fdata --reorder-blocks=cache+ --reorder-functions=hfsort+ --split-functions=3 --split-all-cold --dyno-stats --icf=1 --use-gnu-stack -o /home/ptr1337/tc-build-perf/install/bin/clang.bolt /home/ptr1337/tc-build-perf/install/bin/clang-15
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 074abdcc60fabe5aa7b301bccf9676e1cbcc1df5
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2630 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE
BOLT-WARNING: 4 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 11604 out of 96577 functions in the binary (12.0%) have non-empty execution profile
BOLT-INFO: 377 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 7490 (dynamic count : 5041440) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 615051 instructions were shortened
BOLT-INFO: removed 1334 empty blocks
BOLT-INFO: ICF folded 1300 out of 96853 functions in 4 passes. 2 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 149.79 KB of code space. Folded functions were called 897050 times based on profile.
BOLT-INFO: basic block reordering modified layout of 6659 (6.97%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 7003491 hot bytes from 11571105 cold bytes (37.70% of split functions is hot).
BOLT-INFO: 222 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 10680 to 1642
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

           287756641 : executed forward branches
            43936574 : taken forward branches
            60807613 : executed backward branches
            37693565 : taken backward branches
            20564369 : executed unconditional branches
            27539550 : all function calls
             7137129 : indirect calls
             3989435 : PLT calls
          2386833259 : executed instructions
           560479928 : executed load instructions
           293238455 : executed store instructions
             3811792 : taken jump table branches
                   0 : taken unknown indirect branches
           369128623 : total branches
           102194508 : taken branches
           266934115 : non-taken conditional branches
            81630139 : taken conditional branches
           348564254 : all conditional branches

           283691311 : executed forward branches (-1.4%)
            16601215 : taken forward branches (-62.2%)
            64872943 : executed backward branches (+6.7%)
            30465450 : taken backward branches (-19.2%)
            17384504 : executed unconditional branches (-15.5%)
            27539550 : all function calls (=)
             7137129 : indirect calls (=)
             3989435 : PLT calls (=)
          2375107716 : executed instructions (-0.5%)
           560479928 : executed load instructions (=)
           293238455 : executed store instructions (=)
             3811792 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
           365948758 : total branches (-0.9%)
            64451169 : taken branches (-36.9%)
           301497589 : non-taken conditional branches (+12.9%)
            47066665 : taken conditional branches (-42.3%)
           348564254 : all conditional branches (=)

BOLT-INFO: SCTC: patched 53 tail calls (47 forward) tail calls (6 backward) from a total of 53 while removing 6 double jumps and removing 40 basic blocks totalling 200 bytes of code. CTCs total execution count is 2240 and the number of times CTCs are taken is 1368.
BOLT-INFO: setting __hot_start to 0x4e00000
BOLT-INFO: setting __hot_end to 0x57eafca

Regards.

maksfb commented 2 years ago

BOLT-INFO: 11604 out of 96577 functions in the binary (12.0%) have non-empty execution profile

2386833259 : executed instructions

Looks much better!
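The profile coverage line being compared here can be pulled out of a log automatically, which makes it easy to flag a thin profile. A minimal sketch; the function name is hypothetical, and the pattern is written against the message format seen in the logs in this thread.

```python
import re

PROFILE_LINE = re.compile(
    r"BOLT-INFO: (\d+) out of (\d+) functions in the binary "
    r"\([\d.]+%\) have non-empty execution profile"
)


def profiled_functions(log: str):
    """Return (profiled, total) function counts, or None if not found."""
    match = PROFILE_LINE.search(log)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    line = ("BOLT-INFO: 11604 out of 96577 functions in the binary "
            "(12.0%) have non-empty execution profile")
    profiled, total = profiled_functions(line)
    print(f"{profiled}/{total} functions profiled")
```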

ptr1337 commented 2 years ago

BOLT-INFO: 11604 out of 96577 functions in the binary (12.0%) have non-empty execution profile

2386833259 : executed instructions

Looks much better!

@maksfb Actually, this was with LBR, recorded with perf, but in his PR he uses a kernel compile to record it, which should be changed. The other output, where you said the numbers were too low, was without branch sampling (AMD CPU), just with the instrumentation.

I actually ran the instrumentation twice, and it doubled the executed instructions. But sadly the instrumented binary is really, really slow.

Output after instrumenting twice and then combining the profiles with merge-fdata:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f595b51f502b2f3c97d30e826784159438bde9c4
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2424 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm3sys4path11is_absoluteERKNS_5TwineENS1_5StyleE
BOLT-INFO: 1315 out of 96706 functions in the binary (1.4%) have non-empty execution profile
BOLT-INFO: 3 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 176 (dynamic count : 5974) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 13166 instructions were shortened
BOLT-INFO: removed 11 empty blocks
BOLT-INFO: ICF folded 316 out of 96983 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 94.04 KB of code space. Folded functions were called 3316 times based on profile.
BOLT-INFO: basic block reordering modified layout of 330 (0.34%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 560689 hot bytes from 346146 cold bytes (61.83% of split functions is hot).
BOLT-INFO: 3 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 1002 to 552
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

              553848 : executed forward branches
               64210 : taken forward branches
              191942 : executed backward branches
              100756 : taken backward branches
               59984 : executed unconditional branches
              251734 : all function calls
              103854 : indirect calls
               96734 : PLT calls
             5738078 : executed instructions
             1228106 : executed load instructions
              632336 : executed store instructions
                 988 : taken jump table branches
                   0 : taken unknown indirect branches
              805774 : total branches
              224950 : taken branches
              580824 : non-taken conditional branches
              164966 : taken conditional branches
              745790 : all conditional branches

              525170 : executed forward branches (-5.2%)
               25848 : taken forward branches (-59.7%)
              220620 : executed backward branches (+14.9%)
               95176 : taken backward branches (-5.5%)
               30714 : executed unconditional branches (-48.8%)
              251734 : all function calls (=)
              103854 : indirect calls (=)
               96734 : PLT calls (=)
             5676670 : executed instructions (-1.1%)
             1228106 : executed load instructions (=)
              632336 : executed store instructions (=)
                 988 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
              776504 : total branches (-3.6%)
              151738 : taken branches (-32.5%)
              624766 : non-taken conditional branches (+7.6%)
              121024 : taken conditional branches (-26.6%)
              745790 : all conditional branches (=)

BOLT-INFO: SCTC: patched 4 tail calls (4 forward) tail calls (0 backward) from a total of 4 while removing 0 double jumps and removing 4 basic blocks totalling 20 bytes of code. CTCs total execution count is 36 and the number of times CTCs are taken is 8.
BOLT-INFO: padding code to 0x5000000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x4e00000
BOLT-INFO: setting __hot_end to 0x4e956f4

Do you know a way to split the perf record file at, for example, 15 GB? Using perf record while compiling clang results in a 35 GB file, and perf2bolt then errors out because the server only has 32 GB of RAM. Even with a big swap file it fails.

maksfb commented 2 years ago

Regardless of the means of profiling (perf LBR, sampling, or instrumentation), the number of profiled functions should be more or less the same. It's possible that different processes stomp over each other's instrumentation files. That could be the case when the clang driver output is the same as "core" clang. Could you add -instrumentation-file-append-pid for instrumentation and see if more than one file gets generated?
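One way to try this from a script is sketched below. The helper names and paths are hypothetical. The command uses llvm-bolt's -instrument and -instrumentation-file options together with the flag suggested here; counting files by prefix assumes that the per-process profiles keep the configured file name as a prefix, which should be verified against the actual output.

```python
import glob


def instrument_cmd(binary, instrumented, profile, append_pid=True):
    """Build an llvm-bolt command line that instruments 'binary'."""
    cmd = [
        "llvm-bolt", str(binary), "-instrument",
        f"-instrumentation-file={profile}",
        "-o", str(instrumented),
    ]
    if append_pid:
        # One profile per process, so processes do not overwrite each other.
        cmd.append("-instrumentation-file-append-pid")
    return cmd


def count_profiles(profile) -> int:
    """Count files whose names start with the configured profile path."""
    return len(glob.glob(glob.escape(str(profile)) + "*"))


if __name__ == "__main__":
    print(" ".join(instrument_cmd("clang-15", "clang-15.inst", "/tmp/clang.fdata")))
```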

35GB perf.data is too much. You can sample with less frequency. If you want to use the existing file, add -max-samples=100000000 to perf2bolt. If you are still running out of memory, try adding -strict=0.
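For the existing file, the two suggested options can be added to the perf2bolt invocation. The helper below is a hypothetical sketch that only assembles the argument list, using the same -p and -o options as the perf2bolt command shown earlier in the thread; the paths in the usage example are placeholders.

```python
def perf2bolt_cmd(binary, perf_data, fdata, max_samples=None, strict=None):
    """Build a perf2bolt command line with optional memory-saving flags."""
    cmd = ["perf2bolt", "-p", str(perf_data), "-o", str(fdata)]
    if max_samples is not None:
        # Cap the number of samples read from perf.data.
        cmd.append(f"-max-samples={max_samples}")
    if strict is not None:
        # -strict=0 is the fallback suggested if memory still runs out.
        cmd.append(f"-strict={int(strict)}")
    cmd.append(str(binary))
    return cmd


if __name__ == "__main__":
    print(" ".join(perf2bolt_cmd("clang-15", "perf.data", "clang.fdata",
                                 max_samples=100000000, strict=0)))
```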

nathanchance commented 2 years ago

@maksfb Unfortunately, I lost access to the machine that I ran the original results on (good ol' spot market), so they won't really be comparable to the previous results, but I was able to get access to a more powerful one (EPYC 7502P, 32c/64t) and I was able to see some small improvements.

The output from llvm-bolt while optimizing:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 794a0bb547484ec33c13bd6c7c04b1dbd03d040a
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 1551 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm11raw_ostream5writeEPKcm
BOLT-INFO: 1846 out of 119068 functions in the binary (1.6%) have non-empty execution profile
BOLT-INFO: 4 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 213 (dynamic count : 1193) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 18828 instructions were shortened
BOLT-INFO: removed 9 empty blocks
BOLT-INFO: ICF folded 331 out of 119342 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 78.94 KB of code space. Folded functions were called 1334 times based on profile.
BOLT-INFO: basic block reordering modified layout of 380 (0.32%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 778357 hot bytes from 462225 cold bytes (62.74% of split functions is hot).
BOLT-INFO: 3 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 1519 to 639
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

              360388 : executed forward branches
               26808 : taken forward branches
              119537 : executed backward branches
               63573 : taken backward branches
               29252 : executed unconditional branches
              240580 : all function calls
               89273 : indirect calls
               85711 : PLT calls
             3585797 : executed instructions
              791739 : executed load instructions
              393839 : executed store instructions
                 260 : taken jump table branches
                   0 : taken unknown indirect branches
              509177 : total branches
              119633 : taken branches
              389544 : non-taken conditional branches
               90381 : taken conditional branches
              479925 : all conditional branches

              358569 : executed forward branches (-0.5%)
               22037 : taken forward branches (-17.8%)
              121356 : executed backward branches (+1.5%)
               62130 : taken backward branches (-2.3%)
               17110 : executed unconditional branches (-41.5%)
              240580 : all function calls (=)
               89273 : indirect calls (=)
               85711 : PLT calls (=)
             3570780 : executed instructions (-0.4%)
              791739 : executed load instructions (=)
              393839 : executed store instructions (=)
                 260 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
              497035 : total branches (-2.4%)
              101277 : taken branches (-15.3%)
              395758 : non-taken conditional branches (+1.6%)
               84167 : taken conditional branches (-6.9%)
              479925 : all conditional branches (=)

BOLT-INFO: SCTC: patched 5 tail calls (5 forward) tail calls (0 backward) from a total of 5 while removing 0 double jumps and removing 5 basic blocks totalling 25 bytes of code. CTCs total execution count is 22 and the number of times CTCs are taken is 4.
BOLT-INFO: padding code to 0x6400000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x6200000
BOLT-INFO: setting __hot_end to 0x62cdabb

and the results of building the following kernels ten times with each toolchain:

ARCH=arm64 defconfig:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 114.623 ± 0.060 | 114.499 | 114.695 | 1.02 ± 0.00 |
| PGO + BOLT | 112.425 ± 0.062 | 112.353 | 112.548 | 1.00 |

ARCH=x86_64 defconfig:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 52.455 ± 0.084 | 52.357 | 52.616 | 1.02 ± 0.00 |
| PGO + BOLT | 51.535 ± 0.056 | 51.450 | 51.627 | 1.00 |

Additionally, the output of time -v doing just PGO:

        Command being timed: "/home/nathan/cbl/github/tc-build/build-llvm.py --assertions --build-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/build/llvm --check-targets clang lld llvm llvm-unit --llvm-folder /home/nathan/cbl/src/llvm-project --install-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/install/llvm/pgo --pgo kernel-defconfig --projects clang;lld --show-build-commands --targets AArch64;ARM;X86"
        User time (seconds): 88892.02
        System time (seconds): 5557.36
        Percent of CPU this job got: 4632%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 33:58.65
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2191052
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3656163
        Minor (reclaiming a frame) page faults: 1293436132
        Voluntary context switches: 24439135
        Involuntary context switches: 17089464
        Swaps: 0
        File system inputs: 112
        File system outputs: 213101104
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

and doing PGO + BOLT:

        Command being timed: "/home/nathan/cbl/github/tc-build/build-llvm.py --assertions --bolt --build-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/build/llvm --check-targets clang lld llvm llvm-unit --llvm-folder /home/nathan/cbl/src/llvm-project --install-folder /home/nathan/tmp/llvm-pgo-bolt-benchmarking/install/llvm/pgo-bolt --pgo kernel-defconfig --projects clang;lld --show-build-commands --targets AArch64;ARM;X86"
        User time (seconds): 106368.10
        System time (seconds): 402880.84
        Percent of CPU this job got: 5551%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:32:53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 15886208
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 50641167
        Minor (reclaiming a frame) page faults: 1698451290
        Voluntary context switches: 84779639
        Involuntary context switches: 19818841
        Swaps: 0
        File system inputs: 64
        File system outputs: 1828643504
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
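
To compare the two wall-clock figures above directly, the elapsed strings can be converted to seconds. This is a minimal sketch; `parse_elapsed` is a hypothetical helper written for illustration and is not part of tc-build.

```python
def parse_elapsed(text: str) -> float:
    """Convert GNU time's 'h:mm:ss' or 'm:ss' elapsed format to seconds."""
    seconds = 0.0
    # Each colon-separated field is worth 60x the field to its right.
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Elapsed values copied from the two `time -v` outputs above.
pgo = parse_elapsed("33:58.65")
pgo_bolt = parse_elapsed("2:32:53")

print(f"PGO:        {pgo:.2f} s")
print(f"PGO + BOLT: {pgo_bolt:.2f} s")
print(f"Ratio:      {pgo_bolt / pgo:.2f}x")
```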

I will see if your suggestion around -instrumentation-file-append-pid improves it even further.

ptr1337 commented 2 years ago

@nathanchance Actually, maybe it could be an issue since you PGO LLVM with the kernel? Maybe try --pgo llvm; that is how I do it, and there was a lot of improvement even without LBR. Also, according to their docs, they only target X86.

@maksfb With sampling at -c 2500 everything went fine; maybe this should be changed in the docs.

I have now done a run with -instrumentation-file-append-pid, and it resulted in over 190 GB of disk usage when instrumenting clang. The combined.fdata came to 165 MB, but the result shown is a lot better than before (on an AMD CPU without LBR):

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f595b51f502b2f3c97d30e826784159438bde9c4
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2424 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE
BOLT-WARNING: 7 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 16889 out of 96706 functions in the binary (17.5%) have non-empty execution profile
BOLT-INFO: 495 functions with profile could not be optimized
BOLT-WARNING: 9 (0.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-WARNING: 695817 out of 2165380220666 samples in the binary (0.0%) belong to functions with invalid (possibly stale) profile.
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 9616 (dynamic count : 19121657995) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 782089 instructions were shortened
BOLT-INFO: removed 1458 empty blocks
BOLT-INFO: ICF folded 1687 out of 96983 functions in 5 passes. 2 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 192.87 KB of code space. Folded functions were called 6708861534 times based on profile.
BOLT-INFO: basic block reordering modified layout of 9503 (9.97%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 11629561 hot bytes from 11335899 cold bytes (50.64% of split functions is hot).
BOLT-INFO: 223 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 15697 to 7487
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

       1345852442580 : executed forward branches
        177695950568 : taken forward branches
        279784019546 : executed backward branches
        165552345783 : taken backward branches
         80707139554 : executed unconditional branches
        111711292644 : all function calls
         38436135059 : indirect calls
         20540549655 : PLT calls
      10555075375899 : executed instructions
       2627819906958 : executed load instructions
       1279424679765 : executed store instructions
         16846108018 : taken jump table branches
                   0 : taken unknown indirect branches
       1706343601680 : total branches
        423955435905 : taken branches
       1282388165775 : non-taken conditional branches
        343248296351 : taken conditional branches
       1625636462126 : all conditional branches

       1280317632631 : executed forward branches (-4.9%)
         78974541817 : taken forward branches (-55.6%)
        345318829495 : executed backward branches (+23.4%)
        155608698163 : taken backward branches (-6.0%)
         60656090585 : executed unconditional branches (-24.8%)
        111711292644 : all function calls (=)
         38436135059 : indirect calls (=)
         20540549655 : PLT calls (=)
      10505269253372 : executed instructions (-0.5%)
       2627819906958 : executed load instructions (=)
       1279424679765 : executed store instructions (=)
         16846108018 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
       1686292552711 : total branches (-1.2%)
        295239330565 : taken branches (-30.4%)
       1391053222146 : non-taken conditional branches (+8.5%)
        234583239980 : taken conditional branches (-31.7%)
       1625636462126 : all conditional branches (=)

BOLT-INFO: SCTC: patched 78 tail calls (70 forward) tail calls (8 backward) from a total of 78 while removing 12 double jumps and removing 65 basic blocks totalling 325 bytes of code. CTCs total execution count is 54542385 and the number of times CTCs are taken is 51777916.
BOLT-INFO: setting __hot_start to 0x4e00000
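
The percentages in the second dynostats block are the relative change of each counter against the first block. A minimal sketch of that arithmetic follows; `dynostat_delta` is a hypothetical helper, and BOLT's own rounding and formatting code may differ in detail.

```python
def dynostat_delta(before: int, after: int) -> str:
    """Format the relative change between two dynostat counters."""
    if after == before:
        # Also covers counters that are zero in both blocks.
        return "(=)"
    return f"({(after - before) / before * 100:+.1f}%)"


# Counter values copied from the log above.
print("taken branches", dynostat_delta(423955435905, 295239330565))
print("taken forward branches", dynostat_delta(177695950568, 78974541817))
print("executed backward branches", dynostat_delta(279784019546, 345318829495))
print("all function calls", dynostat_delta(111711292644, 111711292644))
```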

nathanchance commented 2 years ago

Actually, maybe it could be an issue since you PGO LLVM with the kernel? Maybe try --pgo llvm; that is how I do it, and there was a lot of improvement even without LBR.

Sure, I can try that, but the primary purpose of this script is generating a version of LLVM that is most optimized for compiling the kernel, so I would think that running LLVM against the kernel would be better than running it against LLVM. Yet another hypothesis :)

maksfb commented 2 years ago

I have now done a run with -instrumentation-file-append-pid, and it resulted in over 190 GB of disk usage when instrumenting clang. The combined.fdata came to 165 MB, but the result shown is a lot better than before (on an AMD CPU without LBR)

Okay. This confirms that in previous runs the "normal" clang profile was overwritten by the driver profile.
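
With one profile per process, the per-PID files have to be merged before they are handed to llvm-bolt. BOLT ships a `merge-fdata` tool that writes the merged profile to standard output. A minimal sketch of driving it from Python follows; the directory layout, the `*.fdata` naming, and both function names are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def merge_fdata_command(profile_dir: Path, merge_tool: str = "merge-fdata") -> list:
    """Build the merge-fdata command line for every profile in profile_dir."""
    # Assumption: each instrumented process left one '*.fdata' file here.
    profiles = sorted(str(path) for path in profile_dir.glob("*.fdata"))
    if not profiles:
        raise FileNotFoundError(f"no .fdata profiles found in {profile_dir}")
    return [merge_tool, *profiles]


def merge_profiles(profile_dir: Path, output: Path) -> None:
    """Run merge-fdata and redirect its standard output into `output`."""
    cmd = merge_fdata_command(profile_dir)
    # Keep `output` outside profile_dir so a rerun does not merge the
    # previous merged profile back into itself.
    with output.open("w") as merged:
        subprocess.run(cmd, stdout=merged, check=True)
```

A call such as `merge_profiles(Path("profiles"), Path("combined.fdata"))` would then produce the single profile to pass to llvm-bolt.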

maksfb commented 2 years ago

With instrumentation, you will likely get a good-quality profile after compiling <10% of the code. I don't have a clear idea how such a process could be automated, though.

To speed up instrumented binaries, we can introduce instrumentation "sampling" in BOLT. For each function, we will have to emit two versions, one instrumented and one not. The non-instrumented code will be executed most of the time, but calls will be redirected to the instrumented version on every Nth invocation (either deterministically or randomly).
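
As a rough analogy only, the deterministic variant of this idea can be sketched in Python with a decorator that keeps a plain and a counting version of a function and routes every Nth call to the counting one. BOLT would do this at the machine-code level; none of the names below exist in BOLT.

```python
import functools
from collections import Counter

# Profile data gathered by the instrumented versions.
call_counts = Counter()


def sampled_instrumentation(every_nth: int):
    """Route every Nth call to an instrumented copy of the function."""
    def decorator(func):
        invocations = 0

        @functools.wraps(func)
        def instrumented(*args, **kwargs):
            # The slow path: record the call, then do the real work.
            call_counts[func.__name__] += 1
            return func(*args, **kwargs)

        @functools.wraps(func)
        def dispatch(*args, **kwargs):
            nonlocal invocations
            invocations += 1
            if invocations % every_nth == 0:
                return instrumented(*args, **kwargs)
            # The fast path: the uninstrumented version.
            return func(*args, **kwargs)

        return dispatch
    return decorator


@sampled_instrumentation(every_nth=10)
def parse_token(value: int) -> int:
    return value * 2


for i in range(100):
    parse_token(i)

# Scale the sampled count back up to estimate the true call count.
print(call_counts["parse_token"] * 10)
```

The sampled counts are only estimates, so cold functions that are called fewer than N times would not show up in the profile at all.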

ptr1337 commented 2 years ago

@maksfb Yes, it definitely gets overwritten several times. I also saw that the profile was regenerated again and again without the -instrumentation-file-append-pid option. While compiling, the profile first grew to around 100 MB, but once the build was done it had shrunk to 2 MB, so this bug definitely needs to be fixed.

The compile was actually faster this time with the instrumented binary. Without -instrumentation-file-append-pid I saw many processes in disk sleep, so that could be the reason.

@nickdesaulniers I actually did two kernel compiles with modprobed-db:

LLVM 15 BOLT+PGO+LTO (without LBR)

________________________________________________________
Executed in  131.14 secs    fish           external
   usr time   31.12 mins  242.00 micros   31.12 mins
   sys time    2.77 mins   98.00 micros    2.77 mins

LLVM 13 Stock

==> Leaving fakeroot environment.
==> Finished making: linux-cachyos-lto 5.17.4-2 (Sa 23 Apr 2022 00:56:39 CEST)

________________________________________________________
Executed in  189.55 secs    fish           external
   usr time   50.26 mins  220.00 micros   50.26 mins
   sys time    3.48 mins  125.00 micros    3.48 mins

nathanchance commented 2 years ago

Based on the above discussion, I have added perf support to build-llvm.py's BOLT support and added --instrumentation-file-append-pid to the instrumentation command. To avoid generating too much data from either perf or BOLT's instrumentation (--instrumentation-file-append-pid generates one file for each invocation of clang during a kernel build), we will build just one kernel: the host target, or the first target in the user's list if the host is not supported. That seems to be good enough to see gains along the lines of what @nickdesaulniers reported on our mailing list, which is around 5-7% across the board. I'll tidy up these changes and push them to the pull request for review tomorrow.
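
The target selection described above could look roughly like the following. This is a sketch, not the code in the pull request: the machine-name mapping and `pick_bolt_kernel_target` are hypothetical.

```python
import platform
from typing import List, Optional

# Hypothetical mapping from platform.machine() values to LLVM target names.
MACHINE_TO_LLVM_TARGET = {
    "aarch64": "AArch64",
    "arm64": "AArch64",
    "armv7l": "ARM",
    "x86_64": "X86",
    "AMD64": "X86",
}


def pick_bolt_kernel_target(user_targets: List[str],
                            machine: Optional[str] = None) -> str:
    """Pick the single kernel target to build while profiling for BOLT."""
    if not user_targets:
        raise ValueError("at least one LLVM target is required")
    machine = machine or platform.machine()
    host_target = MACHINE_TO_LLVM_TARGET.get(machine)
    # Prefer the host target; otherwise fall back to the first one listed.
    if host_target in user_targets:
        return host_target
    return user_targets[0]


print(pick_bolt_kernel_target(["AArch64", "ARM", "X86"]))
```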

For the below benchmarks, this is the "base" build-llvm.py invocation:

$ build-llvm.py --no-ccache --pgo kernel-defconfig --projects "clang;lld" --targets "AArch64;ARM;X86"

In the tables, BOLT and (assertions) correspond to adding --bolt and --assertions to that invocation, respectively.
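
For reference, the four benchmarked configurations can be generated from the base invocation. A minimal sketch, assuming only the flags named above; `benchmark_invocations` is a hypothetical helper.

```python
import shlex

# The "base" invocation quoted above, split into arguments.
BASE = [
    "build-llvm.py", "--no-ccache", "--pgo", "kernel-defconfig",
    "--projects", "clang;lld", "--targets", "AArch64;ARM;X86",
]


def benchmark_invocations() -> dict:
    """Map each table label to its full build-llvm.py command line."""
    configs = {}
    for assertions in (False, True):
        for bolt in (False, True):
            label = "PGO"
            label += " + BOLT" if bolt else ""
            label += " (assertions)" if assertions else ""
            cmd = list(BASE)
            if bolt:
                cmd.append("--bolt")
            if assertions:
                cmd.append("--assertions")
            configs[label] = cmd
    return configs


for label, cmd in benchmark_invocations().items():
    # shlex.join quotes the ';'-separated lists so the line is shell-safe.
    print(f"{label}: {shlex.join(cmd)}")
```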

Benchmarks with n2.xlarge.x86

Equinix's n2.xlarge.x86 has an Intel Xeon Gold 5218 (32C/64T), 384GB of RAM, and NVMe storage, which supports the perf approach:

build-llvm.py

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 2022.866 ± 4.092 | 2012.303 | 2026.556 | 1.00 |
| PGO + BOLT | 2537.742 ± 4.482 | 2528.943 | 2544.133 | 1.25 ± 0.00 |
| PGO (assertions) | 2219.303 ± 4.937 | 2210.261 | 2225.482 | 1.10 ± 0.00 |
| PGO + BOLT (assertions) | 2833.222 ± 6.157 | 2825.017 | 2842.768 | 1.40 ± 0.00 |
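
The Relative column in these tables is each mean divided by the fastest mean in the same table. A minimal sketch that reproduces it for the build-llvm.py table above; `relative_to_fastest` is a hypothetical helper, and the uncertainty term is ignored.

```python
# Mean build times in seconds, copied from the table above.
means = {
    "PGO": 2022.866,
    "PGO + BOLT": 2537.742,
    "PGO (assertions)": 2219.303,
    "PGO + BOLT (assertions)": 2833.222,
}


def relative_to_fastest(times: dict) -> dict:
    """Express each mean as a multiple of the fastest mean."""
    fastest = min(times.values())
    return {name: mean / fastest for name, mean in times.items()}


for name, ratio in relative_to_fastest(means).items():
    print(f"{name}: {ratio:.2f}")
```

Read this way, building the toolchain with BOLT costs about 25% more time on this machine, in exchange for the roughly 5-6% faster kernel builds shown in the tables that follow.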

ARCH=arm defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 89.767 ± 0.067 | 89.644 | 89.853 | 1.05 ± 0.00 |
| PGO + BOLT | 85.146 ± 0.059 | 85.051 | 85.250 | 1.00 |
| PGO (assertions) | 105.353 ± 0.095 | 105.183 | 105.500 | 1.24 ± 0.00 |
| PGO + BOLT (assertions) | 99.211 ± 0.078 | 99.133 | 99.355 | 1.17 ± 0.00 |

ARCH=arm64 defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 125.606 ± 0.061 | 125.518 | 125.691 | 1.06 ± 0.00 |
| PGO + BOLT | 118.282 ± 0.057 | 118.227 | 118.401 | 1.00 |
| PGO (assertions) | 147.560 ± 0.077 | 147.454 | 147.671 | 1.25 ± 0.00 |
| PGO + BOLT (assertions) | 138.358 ± 0.078 | 138.227 | 138.503 | 1.17 ± 0.00 |

ARCH=x86_64 defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 58.281 ± 0.115 | 58.129 | 58.498 | 1.05 ± 0.00 |
| PGO + BOLT | 55.411 ± 0.069 | 55.348 | 55.548 | 1.00 |
| PGO (assertions) | 67.350 ± 0.085 | 67.235 | 67.479 | 1.22 ± 0.00 |
| PGO + BOLT (assertions) | 63.839 ± 0.099 | 63.657 | 63.951 | 1.15 ± 0.00 |

ARCH=arm allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 441.529 ± 0.112 | 441.395 | 441.722 | 1.05 ± 0.00 |
| PGO + BOLT | 420.460 ± 0.214 | 420.121 | 420.845 | 1.00 |
| PGO (assertions) | 510.643 ± 0.118 | 510.460 | 510.837 | 1.21 ± 0.00 |
| PGO + BOLT (assertions) | 482.406 ± 0.309 | 482.078 | 483.173 | 1.15 ± 0.00 |

ARCH=arm64 allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 563.066 ± 0.138 | 562.897 | 563.236 | 1.06 ± 0.00 |
| PGO + BOLT | 533.187 ± 0.191 | 532.798 | 533.584 | 1.00 |
| PGO (assertions) | 651.277 ± 0.154 | 651.095 | 651.528 | 1.22 ± 0.00 |
| PGO + BOLT (assertions) | 612.805 ± 0.145 | 612.613 | 613.141 | 1.15 ± 0.00 |

ARCH=x86_64 allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 558.869 ± 0.160 | 558.588 | 559.043 | 1.06 ± 0.00 |
| PGO + BOLT | 529.673 ± 0.124 | 529.509 | 529.875 | 1.00 |
| PGO (assertions) | 641.870 ± 0.196 | 641.636 | 642.297 | 1.21 ± 0.00 |
| PGO + BOLT (assertions) | 604.896 ± 0.215 | 604.660 | 605.259 | 1.14 ± 0.00 |

Benchmarks with m3.large.x86

Equinix's m3.large.x86 has an AMD EPYC 7502P (32C/64T), 256GB of RAM, and NVMe storage, which does not support the perf approach, instead relying on instrumentation:

build-llvm.py

NOTE: Due to https://github.com/llvm/llvm-project/issues/55004, these builds have assertions enabled, so they should not be compared with the PGO and PGO + BOLT times above.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 1915.534 ± 1.204 | 1913.470 | 1917.758 | 1.00 |
| PGO + BOLT | 2844.596 ± 5.504 | 2836.409 | 2851.857 | 1.49 ± 0.00 |

ARCH=arm defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 81.965 ± 0.036 | 81.914 | 82.006 | 1.07 ± 0.00 |
| PGO + BOLT | 76.696 ± 0.036 | 76.627 | 76.737 | 1.00 |

ARCH=arm64 defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 114.519 ± 0.036 | 114.459 | 114.567 | 1.07 ± 0.00 |
| PGO + BOLT | 106.899 ± 0.062 | 106.779 | 107.001 | 1.00 |

ARCH=x86_64 defconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 52.413 ± 0.081 | 52.262 | 52.514 | 1.06 ± 0.00 |
| PGO + BOLT | 49.313 ± 0.088 | 49.128 | 49.427 | 1.00 |

ARCH=arm allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 390.632 ± 0.140 | 390.418 | 390.854 | 1.06 ± 0.00 |
| PGO + BOLT | 367.258 ± 0.125 | 367.039 | 367.494 | 1.00 |

ARCH=arm64 allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 500.528 ± 0.092 | 500.425 | 500.675 | 1.07 ± 0.00 |
| PGO + BOLT | 469.376 ± 0.110 | 469.242 | 469.527 | 1.00 |

ARCH=x86_64 allmodconfig

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| PGO | 489.216 ± 0.160 | 488.974 | 489.471 | 1.07 ± 0.00 |
| PGO + BOLT | 458.317 ± 0.134 | 458.179 | 458.545 | 1.00 |