crate-ci / typos

Source code spell checker
Apache License 2.0

Profile-Guided Optimization (PGO) and LLVM BOLT results #827

Open zamazan4ik opened 11 months ago

zamazan4ik commented 11 months ago

Hi!

I did a lot of Profile-Guided Optimization (PGO) benchmarks recently on different kinds of software - all currently available results are located at https://github.com/zamazan4ik/awesome-pgo . According to those tests, PGO usually helps achieve better performance, which is why testing PGO would be a good idea for typos. I did some benchmarks on my local machine and want to share the results.

Test environment

Test workload

As a test scenario, I used the LLVM sources from https://github.com/llvm/llvm-project at commit 11db162db07d6083b79f4724e649a8c2c69913e1. All runs were performed on the same hardware, with the same operating system and the same background workload. The command to run typos is taskset -c 0 ./typos -q --threads 1 llvm_project. One thread was used to reduce the influence of multi-threaded scheduling on the results. All PGO optimizations were done with cargo-pgo.
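For reference, the cargo-pgo workflow behind these numbers can be sketched as follows (a sketch, not the exact commands I ran; the target triple and binary paths are assumptions for a typical x86-64 Linux setup):

```shell
# One-time setup: cargo-pgo drives the instrument/run/optimize cycle,
# and llvm-tools-preview provides llvm-profdata for merging profiles.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. Build an instrumented binary (written under target/<triple>/release).
cargo pgo build

# 2. Run the training workload to collect profiles, pinned to one core
#    and single-threaded, as described above.
taskset -c 0 ./target/x86_64-unknown-linux-gnu/release/typos -q --threads 1 llvm_project

# 3. Rebuild with the collected profiles applied.
cargo pgo optimize
```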

Results

Here are the results. I also posted the Instrumentation results so you can estimate how much slower typos is in Instrumentation mode. The results are in the format of the time utility.

Some conclusions

Further steps

I can suggest the following:

epage commented 11 months ago

Thanks for running these numbers!

iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.

Optimize the binaries provided by the Typos project on CI (like it's already done for other projects such as rustc), if any

If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.

zamazan4ik commented 11 months ago

iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.

Well, that's partially true. Yes, BOLT can perform some optimizations even without a runtime profile. However, most of BOLT's optimizations are only applied with a runtime profile. The profile can be collected with Linux's perf (in sampling mode) or via BOLT's instrumentation mode (as I did for typos).
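For illustration, a BOLT-instrumentation run driven by cargo-pgo might look like this (a sketch assuming llvm-bolt is installed; the -bolt-instrumented binary name follows cargo-pgo's convention):

```shell
# 1. Build a BOLT-instrumented binary.
cargo pgo bolt build

# 2. Run the training workload; BOLT writes its profile data to disk.
taskset -c 0 ./target/x86_64-unknown-linux-gnu/release/typos-bolt-instrumented \
    -q --threads 1 llvm_project

# 3. Re-link the binary with BOLT using the collected profile.
cargo pgo bolt optimize
```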

If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.

Sure! I have multiple examples of different PGO and/or BOLT integration into different projects:

These examples are not only for Rust-based projects - I hope they help.

epage commented 11 months ago

Thanks for pydantic-core, that is exactly what I was looking for!

The next question is what is a minimal reasonable use case to profile. We're already going to blow up our build times with this, and I'd like not to make it worse - particularly because our GitHub Action has a race condition: if you specify master, it starts using the new version even if the binary isn't built yet, which fails.

zamazan4ik commented 11 months ago

The next question is what is a minimal reasonable use case to profile.

I have some (hopefully helpful) thoughts on that:

epage commented 11 months ago

There is almost no need to run a PGO-optimized build for each commit/PR. I suggest using PGO only for Release builds - by "Release" I mean binaries delivered to users (or something similar). So building PGO-optimized binaries per release, or once, should be OK.

That was my expectation. Even so, build times have an impact: there is a gap between master being updated and the binary becoming available, during which any actions living at HEAD are broken.

You can generate a profile once and reuse it continuously in CI, so there is no need to perform 2-stage builds (the most time-consuming part of PGO builds). There is a possible issue here: profile skew. If the profile was collected on an old enough typos version, it will probably be less effective (code refactorings, missing profiles for new code, slightly changed hot/cold splitting in the workloads). So it should be good enough to recollect profiles at some frequency, not per build.
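As a sketch of how a stored profile could be reused without the instrumentation stage (the profile path here is hypothetical, and the profile must be a merged .profdata file):

```shell
# Single-stage build that consumes a previously collected profile,
# e.g. one committed to the repository or cached in CI.
RUSTFLAGS="-Cprofile-use=$(pwd)/pgo-profiles/typos.profdata" \
    cargo build --release
```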

So to verify, the code doesn't need to be 1:1 but it handles skew between the profile and PGO? Where can I read more about this so I understand the technical limitations?

zamazan4ik commented 11 months ago

Even so, build times have an impact: there is a gap between master being updated and the binary becoming available, during which any actions living at HEAD are broken.

Yeah. I think that for the actions depending on HEAD you can use a Release build without PGO, and simply not make PGO builds for the HEAD version.

So to verify, the code doesn't need to be 1:1 but it handles skew between the profile and PGO? Where can I read more about this so I understand the technical limitations?

That's an excellent question! Unfortunately, I have no resources about profile-skew handling in rustc's PGO implementation. Maybe @Kobzol has something. You can read about this question in the PGO documentation for the Go compiler - https://go.dev/doc/pgo - I hope it helps (though there is no guarantee that the information holds for rustc's PGO implementation).

Kobzol commented 11 months ago

I don't think that rustc currently promises anything regarding skew; for ideal results, the code should be the same for both instrumentation and optimization. That being said, I think that as long as most functions keep the same symbol name (the important thing for PGO), it should be mostly fine, and probably still better than no PGO at all. So reprofiling only e.g. every 100 commits, or every week or so, should be OK. Of course, if any build flags or the compiler change, then new profiles have to be gathered.

However, even if you reprofile the binary on every release workflow, I don't think that the CI cost would have to be so large. I think that running on some input that takes ~30s in CI should be enough for this project. So you'd have to pay for one additional (re)build of the crate + 30s-1m of profile gathering. You could try to use cargo-pgo to make the PGO workflow simpler.

By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).

epage commented 11 months ago

By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).

Huh, I had thought those were on. I enabled CGU=1 because it offered a big gain but didn't enable ThinLTO because it slowed down compile times (iirc) for little gain. See 12506092722e5dc8dc33a94a976616087223aa6c
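For context, the release-profile knobs being discussed live in Cargo.toml and might look like this (a sketch; typos' actual profile settings may differ):

```toml
[profile.release]
codegen-units = 1  # CGU=1: one codegen unit, better optimization, slower builds
lto = "thin"       # ThinLTO; lto = true would enable full ("fat") LTO
```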

epage commented 11 months ago

With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?

zamazan4ik commented 11 months ago

With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?

I think so, since CGU=1 and PGO enable different sets of optimizations. And enabling CGU=1 with LTO is a good thing to do before enabling PGO.

epage commented 11 months ago

There are trade-offs with this. What I'm trying to weigh is how much of a gain going from CGU=1 to CGU=1 + PGO offers, compared to any analysis time we'd have to add to our release pipeline.

zamazan4ik commented 11 months ago

The only way to estimate the benefit is to benchmark CGU=1 vs CGU=1 + PGO :)
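Such a comparison could be sketched with hyperfine (binary names and paths are illustrative; both builds are assumed to exist already):

```shell
# A/B benchmark of the two release builds on the same pinned core.
hyperfine --warmup 3 \
    'taskset -c 0 ./typos-cgu1 -q --threads 1 llvm_project' \
    'taskset -c 0 ./typos-cgu1-pgo -q --threads 1 llvm_project'
```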