Open zamazan4ik opened 11 months ago
Thanks for running these numbers!
iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.
Optimize the binaries provided by the Typos project on the CI (as is already done for other projects like Rustc), if any
If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.
iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.
Well, it's partially true. Yes, BOLT can perform some optimizations even without a runtime profile. However, most of BOLT's optimizations are only done with runtime profiles. The runtime profile can be collected with Linux's perf (sampling mode) or via BOLT's instrumentation (as I did for typos).
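For reference, a raw llvm-bolt instrumentation cycle looks roughly like the sketch below. The build flags, output paths, and the workload directory are assumptions for illustration; the binary must keep relocations so BOLT can rewrite it.

```shell
# Keep relocations in the release binary so llvm-bolt can rewrite it
# (hypothetical build invocation; adjust to your setup)
RUSTFLAGS="-C link-args=-Wl,--emit-relocs" cargo build --release

# 1. Insert instrumentation counters into the binary
llvm-bolt target/release/typos -instrument -o typos-instrumented

# 2. Run a representative workload; by default BOLT instrumentation
#    writes its profile to /tmp/prof.fdata
./typos-instrumented -q --threads 1 some_large_repo

# 3. Apply BOLT optimizations using the collected profile
llvm-bolt target/release/typos -o typos-bolt \
    -data /tmp/prof.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold
```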
If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.
Sure! I have multiple examples of PGO and/or BOLT integrations into different projects:
The examples are not only for Rust-based projects - I hope they help somehow.
Thanks for pydantic-core, that is exactly what I was looking for!
The next question is what is a minimal reasonable use case to profile. We're already going to be blowing up our build times with this and I'd like to not make it worse, particularly because our GitHub Action has a race condition: if you specify master, it will start using the new version even if the binary isn't built yet, which will fail.
The next question is what is a minimal reasonable use case to profile.
I have some (hopefully helpful) thoughts on that:
- You can generate a profile once and use it continuously in the CI, so there will be no need to perform 2-stage builds (that's the most time-consuming thing in PGO builds). There is a possible issue here - profile skewing. If the profile is collected for an old enough typos version, the profile will probably be less efficient (code reformattings, missing profiles for the new code, slightly changed hot/cold splitting in the workloads). So it should be good enough to just re-collect profiles and rebuild with some frequency, not per build.
- There is almost no need to run a PGO-optimized build for each commit/PR. I suggest using PGO only for Release builds. Under "Release" I mean binaries delivered to the users (or something similar). So building PGO-optimized builds per release, or once, should be ok.
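As an illustration of the "PGO only for releases" idea, a GitHub Actions gate could look like the fragment below. The workflow name, job name, and workload path are hypothetical; the point is only the tag-based trigger, which keeps regular pushes to master on plain release builds.

```yaml
# Hypothetical release workflow: PGO builds run only for version tags
name: release
on:
  push:
    tags: ["v*"]

jobs:
  pgo-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Two-stage PGO build via cargo-pgo
        run: |
          cargo install cargo-pgo
          cargo pgo build
          # exercise the instrumented binary on a representative workload
          ./target/x86_64-unknown-linux-gnu/release/typos -q some_workload || true
          cargo pgo optimize
```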
That was my expectation. Even still, build times are an impact because we have a gap between master being updated and the binary being available, so any actions living at HEAD will be broken.
You can generate a profile once and use it continuously in the CI, so there will be no need to perform 2-stage builds (that's the most time-consuming thing in PGO builds). There is a possible issue here - profile skewing. If the profile is collected for an old enough typos version, the profile will probably be less efficient (code reformattings, missing profiles for the new code, slightly changed hot/cold splitting in the workloads). So it should be good enough to just re-collect profiles and rebuild with some frequency, not per build.
So to verify, the code doesn't need to be 1:1 but it handles skew between the profile and PGO? Where can I read more about this so I understand the technical limitations?
Even still, build times are an impact because we have a gap between master being updated and the binary being available that for any actions living at HEAD will be broken
Yeah. I think for the actions that depend on HEAD you can use a Release build without PGO, and just not make PGO builds for the HEAD version.
So to verify, the code doesn't need to be 1:1 but it handles skew between the profile and PGO? Where can I read more about this so I understand the technical limitations?
That's an excellent question! Unfortunately, I have no resources regarding PGO profile-skew handling in the rustc compiler. Maybe @kobzol has something. You can read something about this question in the PGO documentation for the Go compiler - https://go.dev/doc/pgo . I hope it helps somehow (but there is no guarantee that this information holds for rustc's PGO implementation).
I don't think that rustc currently promises anything regarding skew; for ideal results, the code should be the same both for instrumentation and for optimization. That being said, I think that as long as most functions still have the same symbol name (this is the important thing for PGO), it should be mostly fine, and probably better than no PGO at all. So reprofiling only e.g. every 100 commits, or every week or so, should be OK. Of course, if any build flags or the compiler change, then new profiles have to be gathered.
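For context, the underlying rustc PGO workflow that wrapper tools automate is roughly the sketch below (the profile directory and workload path are placeholders; the flags themselves come from rustc's PGO support).

```shell
# 1. Build with instrumentation; runs write .profraw files into ./pgo-data
RUSTFLAGS="-Cprofile-generate=./pgo-data" cargo build --release

# 2. Exercise the binary on a representative workload
./target/release/typos -q --threads 1 some_large_repo

# 3. Merge the raw profiles (llvm-profdata ships with the
#    llvm-tools rustup component)
llvm-profdata merge -o pgo-data/merged.profdata pgo-data/*.profraw

# 4. Rebuild using the merged profile
RUSTFLAGS="-Cprofile-use=$(pwd)/pgo-data/merged.profdata" cargo build --release
```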
However, even if you reprofile the binary on every release workflow, I don't think that the CI cost would have to be so large. I think that running on some input that takes ~30s in CI should be enough for this project. So you'd have to pay for one additional (re)build of the crate + 30s-1m of profile gathering. You could try to use cargo-pgo to make the PGO workflow simpler.
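With cargo-pgo the whole cycle collapses to a few commands, as sketched below. The workload argument is a placeholder, and the exact target directory depends on the host triple.

```shell
# One-time setup; cargo-pgo also needs the llvm-tools rustup component
cargo install cargo-pgo

cargo pgo build        # instrumented build
# Gather profiles by running the instrumented binary on a workload
./target/x86_64-unknown-linux-gnu/release/typos -q some_large_repo
cargo pgo optimize     # rebuild using the gathered profiles
```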
By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).
By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).
Huh, I had thought those were on. I enabled CGU=1 because it offered a big gain but didn't enable ThinLTO because it slowed down compile times (iirc) for little gain. See 12506092722e5dc8dc33a94a976616087223aa6c
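For reference, the Cargo.toml knobs under discussion look like this (a sketch; the exact values in typos' own manifest may differ from these illustrative settings):

```toml
[profile.release]
codegen-units = 1   # better cross-unit optimization, slower builds
lto = "thin"        # ThinLTO; use lto = true (or "fat") for full LTO
```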
With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?
With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?
I think so, since CGU=1 and PGO cover different optimization sets. And enabling CGU=1 with LTO is a good thing to do before enabling PGO.
There are trade-offs with this. What I'm trying to weigh is how much of a gain there is going from CGU=1 to CGU=1 + PGO, compared to any analysis time we have to add to our release pipeline.
The only way to estimate the benefits is testing CGU=1 vs CGU=1 + PGO in the benchmarks :)
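One low-effort way to run that comparison is hyperfine, pinned to a single core as in the benchmarks above. The binary names below are placeholders for the two builds being compared.

```shell
# Compare a CGU=1 build against a CGU=1 + PGO build on the same workload
hyperfine --warmup 2 \
    'taskset -c 0 ./typos-cgu1 -q --threads 1 llvm_project' \
    'taskset -c 0 ./typos-cgu1-pgo -q --threads 1 llvm_project'
```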
Hi!
I did a lot of Profile-Guided Optimization (PGO) benchmarks recently on different kinds of software - all currently available results are located at https://github.com/zamazan4ik/awesome-pgo . According to the tests, PGO usually helps with achieving better performance. That's why testing PGO would be a good idea for Typos. I did some benchmarks on my local machine and want to share my results.
Test environment

- typos version: master branch (commit da2759161fbf9ac2840d6955f120bc3c6f24405f)

Test workload
As a test scenario, I used LLVM sources from https://github.com/llvm/llvm-project at commit 11db162db07d6083b79f4724e649a8c2c69913e1. All runs are performed on the same hardware, operating system, and the same background workload. The command to run typos is: taskset -c 0 ./typos -q --threads 1 llvm_project . One thread was used to reduce the multi-threading scheduler's influence on the results. All PGO optimizations are done with cargo-pgo.

Results
Here are the results. I also posted Instrumentation results so you can estimate how slow typos is in Instrumentation mode. The results are in the time utility format:

48,86s user 3,44s system 99% cpu 52,628 total
30,09s user 3,23s system 99% cpu 33,616 total
128,16s user 3,55s system 99% cpu 2:12,23 total
92,05s user 3,60s system 99% cpu 1:36,08 total
29,09s user 3,16s system 98% cpu 32,585 total
Some conclusions

- PGO improves typos performance

Further steps
I can suggest doing the following things: