crate-ci / typos

Source code spell checker
Apache License 2.0

Profile-Guided Optimization (PGO) and LLVM BOLT results #827

Open zamazan4ik opened 11 months ago

zamazan4ik commented 11 months ago

Hi!

I did a lot of Profile-Guided Optimization (PGO) benchmarks recently on different kinds of software - all currently available results are located at https://github.com/zamazan4ik/awesome-pgo . According to those tests, PGO usually helps achieve better performance, which is why testing PGO would be a good idea for typos. I did some benchmarks on my local machine and want to share the results.

Test environment

Test workload

As a test scenario, I used the LLVM sources from https://github.com/llvm/llvm-project at commit 11db162db07d6083b79f4724e649a8c2c69913e1. All runs were performed on the same hardware, with the same operating system and the same background workload. The command to run typos is taskset -c 0 ./typos -q --threads 1 llvm_project. One thread was used to reduce the influence of multi-threaded scheduling on the results. All PGO optimizations were done with cargo-pgo.
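For reference, the cargo-pgo workflow behind these numbers can be sketched as follows (a sketch, not the exact commands I ran; the target triple and binary paths are assumptions for a typical x86-64 Linux setup):

```shell
# One-time setup: cargo-pgo drives the instrument/run/optimize cycle,
# and llvm-tools-preview provides llvm-profdata for merging profiles.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. Build an instrumented binary (written under target/<triple>/release).
cargo pgo build

# 2. Run the training workload to collect profiles, pinned to one core
#    and single-threaded, as described above.
taskset -c 0 ./target/x86_64-unknown-linux-gnu/release/typos -q --threads 1 llvm_project

# 3. Rebuild with the collected profiles applied.
cargo pgo optimize
```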

Results

Here are the results. I also posted the Instrumentation results so you can estimate how much slower typos is in Instrumentation mode. The results are in the format of the time utility.

Some conclusions

Further steps

I can suggest the following:

epage commented 11 months ago

Thanks for running these numbers!

iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.

Optimize the binaries provided by the Typos project on CI (like it's already done for other projects such as rustc), if any

If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.

zamazan4ik commented 11 months ago

iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.

Well, that's partially true. Yes, BOLT can perform some optimizations even without a runtime profile. However, most of BOLT's optimizations are only applied with a runtime profile. The profile can be collected with Linux's perf (in sampling mode) or via BOLT's instrumentation mode (as I did for typos).
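For illustration, a BOLT-instrumentation run driven by cargo-pgo might look like this (a sketch assuming llvm-bolt is installed; the -bolt-instrumented binary name follows cargo-pgo's convention):

```shell
# 1. Build a BOLT-instrumented binary.
cargo pgo bolt build

# 2. Run the training workload; BOLT writes its profile data to disk.
taskset -c 0 ./target/x86_64-unknown-linux-gnu/release/typos-bolt-instrumented \
    -q --threads 1 llvm_project

# 3. Re-link the binary with BOLT using the collected profile.
cargo pgo bolt optimize
```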

If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.

Sure! I have multiple examples of different PGO and/or BOLT integration into different projects:

These examples are not only for Rust-based projects - I hope they help.

epage commented 11 months ago

Thanks for pydantic-core, that is exactly what I was looking for!

The next question is what is a minimal reasonable use case to profile. We're already going to blow up our build times with this, and I'd like not to make it worse - particularly because our GitHub Action has a race condition: if you specify master, it starts using the new version even if the binary isn't built yet, which fails.

zamazan4ik commented 11 months ago

The next question is what is a minimal reasonable use case to profile.

I have some (hopefully helpful) thoughts on that:

epage commented 11 months ago

There is almost no need to run a PGO-optimized build for each commit/PR. I suggest using PGO only for Release builds - by "Release" I mean binaries delivered to users (or something similar). So building PGO-optimized binaries per release, or once, should be OK.

That was my expectation. Even so, build times have an impact: there is a gap between master being updated and the binary becoming available, during which any actions living at HEAD are broken.

You can generate a profile once and reuse it continuously in CI, so there is no need to perform 2-stage builds (the most time-consuming part of PGO builds). There is a possible issue here: profile skew. If the profile was collected on an old enough typos version, it will probably be less effective (code refactorings, missing profiles for new code, slightly changed hot/cold splitting in the workloads). So it should be good enough to recollect profiles at some frequency, not per build.
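As a sketch of how a stored profile could be reused without the instrumentation stage (the profile path here is hypothetical, and the profile must be a merged .profdata file):

```shell
# Single-stage build that consumes a previously collected profile,
# e.g. one committed to the repository or cached in CI.
RUSTFLAGS="-Cprofile-use=$(pwd)/pgo-profiles/typos.profdata" \
    cargo build --release
```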

So to verify, the code doesn't need to be 1:1 but it handles skew between the profile and PGO? Where can I read more about this so I understand the technical limitations?

zamazan4ik commented 11 months ago

Even so, build times have an impact: there is a gap between master being updated and the binary becoming available, during which any actions living at HEAD are broken.

Yeah. I think that for the actions depending on HEAD you can use a Release build without PGO, and simply not make PGO builds for the HEAD version.

So to verify, the code doesn't need to be 1:1 but it handles skew between the profile and PGO? Where can I read more about this so I understand the technical limitations?

That's an excellent question! Unfortunately, I have no resources about profile-skew handling in rustc's PGO implementation. Maybe @Kobzol has something. You can read about this question in the PGO documentation for the Go compiler - https://go.dev/doc/pgo - I hope it helps (though there is no guarantee that the information holds for rustc's PGO implementation).

Kobzol commented 11 months ago

I don't think that rustc currently promises anything regarding skew; for ideal results, the code should be the same for both instrumentation and optimization. That being said, I think that as long as most functions keep the same symbol name (the important thing for PGO), it should be mostly fine, and probably still better than no PGO at all. So reprofiling only e.g. every 100 commits, or every week or so, should be OK. Of course, if any build flags or the compiler change, then new profiles have to be gathered.

However, even if you reprofile the binary on every release workflow, I don't think that the CI cost would have to be so large. I think that running on some input that takes ~30s in CI should be enough for this project. So you'd have to pay for one additional (re)build of the crate + 30s-1m of profile gathering. You could try to use cargo-pgo to make the PGO workflow simpler.

By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).

epage commented 11 months ago

By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).

Huh, I had thought those were on. I enabled CGU=1 because it offered a big gain but didn't enable ThinLTO because it slowed down compile times (iirc) for little gain. See 12506092722e5dc8dc33a94a976616087223aa6c
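For context, the release-profile knobs being discussed live in Cargo.toml and might look like this (a sketch; typos' actual profile settings may differ):

```toml
[profile.release]
codegen-units = 1  # CGU=1: one codegen unit, better optimization, slower builds
lto = "thin"       # ThinLTO; lto = true would enable full ("fat") LTO
```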

epage commented 11 months ago

With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?

zamazan4ik commented 11 months ago

With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?

I think so, since CGU=1 and PGO enable different sets of optimizations. And enabling CGU=1 with LTO is a good thing to do before enabling PGO.

epage commented 11 months ago

There are trade-offs with this. What I'm trying to weigh is how much of a gain going from CGU=1 to CGU=1 + PGO offers, compared to any analysis time we'd have to add to our release pipeline.

zamazan4ik commented 11 months ago

The only way to estimate the benefit is to benchmark CGU=1 vs CGU=1 + PGO :)
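Such a comparison could be sketched with hyperfine (binary names and paths are illustrative; both builds are assumed to exist already):

```shell
# A/B benchmark of the two release builds on the same pinned core.
hyperfine --warmup 3 \
    'taskset -c 0 ./typos-cgu1 -q --threads 1 llvm_project' \
    'taskset -c 0 ./typos-cgu1-pgo -q --threads 1 llvm_project'
```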