getsentry / symbolicator

Native Symbolication as a Service
https://getsentry.github.io/symbolicator/

Evaluate using LTO, Profile-Guided Optimization (PGO) and Post Link Optimization (PLO) #1334

Closed zamazan4ik closed 6 months ago

zamazan4ik commented 7 months ago

Hi!

Recently I have been evaluating the impact of optimizations like Link-Time Optimization (LTO), Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) on multiple projects. The results are available here. According to the tests, all of these optimizations can achieve better performance for many applications in many cases. I think trying to enable them for Symbolicator could be a good idea.

I already did some benchmarks and want to share my results here. Hopefully, they will be helpful.

Test environment

Benchmark

For benchmarking purposes, I used this WRK-based scenario for minidumps. As a minidump, I use this Linux dump. The WRK command is the same for all benchmarks and for the PGO/PLO training phases: WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump. Before each WRK benchmark, I run cargo run -p process-event -- ../tests/fixtures/linux.dmp once, as recommended.

All PGO and PLO optimizations are done with cargo-pgo (I highly recommend using this tool). For the PLO phase I use the LLVM BOLT tool.
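For reference, the overall cargo-pgo workflow looks roughly like the sketch below. The subcommands follow cargo-pgo's documented workflow as far as I know (check its README for the exact flags); the training script name is just a hypothetical wrapper around the wrk command above.

# PGO: instrumented build, training run, then an optimized rebuild with the profiles
cargo pgo build                       # build with PGO instrumentation
./run-wrk-training-workload.sh        # hypothetical wrapper around the wrk command above
cargo pgo optimize                    # rebuild using the collected profiles

# PLO: apply LLVM BOLT on top of the PGO-optimized binary
cargo pgo bolt build --with-pgo       # BOLT-instrumented build on top of PGO
./run-wrk-training-workload.sh
cargo pgo bolt optimize --with-pgo    # final PGO + BOLT-optimized binary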

LTO is enabled by the following changes to the Release profile in the root Cargo.toml file:

[profile.release]
# For release builds, we do want line-only debug information to be able to symbolicate panic stack traces.
debug = 1
codegen-units = 1
lto = true

For all benchmarks, the binaries are stripped with the strip tool.
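For example, something like strip ./target/release/symbolicator (assuming the default cargo output path).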

All tests were done on the same machine, repeated multiple times, with the same background "noise" (as much as I can guarantee, of course). The results are reproducible, at least on my machine.

Tricky moment with PGO dumps

For some unknown reason, Symbolicator does not dump the PGO profile to disk on Ctrl+C. I guess it's somehow related to custom signal handling somewhere in the code. So I modified Symbolicator a little bit to dump the PGO profile to disk manually. As a reference implementation, I used this piece of code from YugabyteDB. I made the following changes to main.rs:

// Provided by the LLVM PGO instrumentation runtime; only linked in instrumented builds.
extern "C" {
    fn __llvm_profile_write_file();
}

use signal_hook::{consts::SIGINT, iterator::Signals};
use std::thread;

fn main() {
    let mut signals = Signals::new(&[SIGINT]).unwrap();

    // Flush the PGO profile to disk before exiting on Ctrl+C.
    thread::spawn(move || {
        for sig in signals.forever() {
            println!("Received signal {:?}", sig);
            unsafe { __llvm_profile_write_file(); }
            std::process::exit(0);
        }
    });

    match cli::execute() {
        Ok(()) => std::process::exit(0),
        Err(error) => {
            logging::ensure_log_error(&error);
            std::process::exit(1);
        }
    }
}

I use the signal_hook dependency. Please note that the __llvm_profile_write_file symbol is only linked into the program when you build with PGO instrumentation (this is done automatically by the Rust compiler). Because of this, you need to disable or comment out this code during the PGO optimization phase (otherwise you get a linker error).

There is probably a better way to implement this logic, but for the tests' purposes it's good enough.
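One possible way to avoid commenting the code in and out (just a sketch, not something I tested as part of these benchmarks) would be to gate the workaround behind a hypothetical Cargo feature, say pgo-instrumentation, that is only enabled for the instrumented build:

// Hypothetical `pgo-instrumentation` Cargo feature, enabled only for the instrumented
// build, so the optimized build never references the missing symbol.
#[cfg(feature = "pgo-instrumentation")]
extern "C" {
    fn __llvm_profile_write_file();
}

#[cfg(feature = "pgo-instrumentation")]
fn install_pgo_profile_flush_handler() {
    use signal_hook::{consts::SIGINT, iterator::Signals};

    let mut signals = Signals::new(&[SIGINT]).unwrap();
    std::thread::spawn(move || {
        for _sig in signals.forever() {
            // Flush the collected PGO profile before terminating.
            unsafe { __llvm_profile_write_file() };
            std::process::exit(0);
        }
    });
}

#[cfg(not(feature = "pgo-instrumentation"))]
fn install_pgo_profile_flush_handler() {
    // No-op in regular builds: nothing to flush, no extra symbol to link.
}

main() would then call install_pgo_profile_flush_handler() unconditionally before cli::execute().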

Results

Here I post the benchmark results for the following Symbolicator configurations:

Release:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.60ms   10.37ms  41.83ms   57.72%
    Req/Sec   269.77     45.57   434.00     67.50%
  80627 requests in 30.02s, 470.12MB read
Requests/sec:   2685.82
Transfer/sec:     15.66MB

Release + LTO:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    16.09ms    8.95ms  35.94ms   57.91%
    Req/Sec   312.10     43.03   440.00     70.03%
  93266 requests in 30.03s, 543.81MB read
Requests/sec:   3106.16
Transfer/sec:     18.11MB

Release + LTO + PGO optimized:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.71ms    7.63ms  31.15ms   57.75%
    Req/Sec   366.18     51.08   545.00     67.53%
  109422 requests in 30.02s, 638.01MB read
Requests/sec:   3644.55
Transfer/sec:     21.25MB

Release + LTO + PGO optimized + PLO optimized:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    12.82ms    7.14ms  30.18ms   57.86%
    Req/Sec   391.64     58.28   555.00     62.60%
  117034 requests in 30.02s, 682.40MB read
Requests/sec:   3897.93
Transfer/sec:     22.73MB

According to the tests above, I see measurable improvements from enabling LTO, PGO and PLO with LLVM BOLT.

Additionally, below I post results for the PGO and PLO instrumentation phases, so you can estimate Symbolicator's slowdown during instrumentation.

Release + LTO + PGO instrumentation:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.10ms   10.63ms  42.48ms   57.75%
    Req/Sec   262.82     44.15   454.00     68.50%
  78545 requests in 30.02s, 457.98MB read
Requests/sec:   2616.14
Transfer/sec:     15.25MB

Release + LTO + PGO optimized + PLO instrumentation:

WRK_MINIDUMP="../tests/fixtures/linux.dmp" ./wrk --threads 10 --connections 50 --duration 30s --script ../tests/wrk/minidump.lua http://127.0.0.1:3021/minidump
Running 30s test @ http://127.0.0.1:3021/minidump
  10 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.72ms   12.66ms  50.54ms   57.68%
    Req/Sec   220.82     40.15   380.00     66.73%
  66012 requests in 30.02s, 384.90MB read
Requests/sec:   2198.81
Transfer/sec:     12.82MB

Further steps

I can suggest the following action points:

Here are some examples of how PGO optimization is integrated in other projects:

I have some examples of how PGO information looks in the documentation:

Regarding LLVM BOLT integration, I have the following examples:

Swatinem commented 7 months ago

Wow, this is a very thorough analysis, thank you for doing this! I had previously been reading with great interest all the Rust-related posts about enabling LTO/PGO/PLO for the compiler itself.

It looks like you get about ~15% for each of the steps, which is very impressive. I would be curious what the impact on build times is. Especially with PGO/PLO, you have to build once with instrumentation and then a second time, so build times are at least 2x. We have to find a reasonable middle ground here, as we are building Docker images for every commit on master, and we would like to have reasonable turnaround times in case we have to deploy some quick fixes, though that is rarely the case.

Then there is the question of getting some halfway complete and realistic test cases to run on instrumented builds, which should include:

Curating a good corpus of tests with edge cases would be challenging, as would hooking that up to the build.


A side note, purely on benchmarking:

There is symbolicator-stress, which runs entirely in-process and does not require an external tool like wrk to push payloads to it. So if it is just about measuring potential wins, I would recommend that tool.

But yes, in the end you would want to build/instrument/optimize the final HTTP server, so ultimately you would want to use that.


Again, huge thanks for doing all these measurements. At the very least, LTO should provide us with a good 10-15% win with reasonable effort. The wins for PGO/PLO look amazing as well, but the effort to get there is a lot higher, so I'm not sure yet if it would be worth doing.

untitaker commented 7 months ago

Especially with PGO/PLO, you have to build once with instrumentation and then a second time, so build times are at least 2x.

Is it possible to build once with instrumentation to produce a profile, and then build the release with the profile of a previous commit (and just disable PGO if there is none)? That would allow you to parallelize.

zamazan4ik commented 7 months ago

We have to find a reasonable middle ground here. As we are building Docker images for every commit on master. And we would like to have reasonable turnaround times in case we might have to deploy some quick fixes, though that is rarely the case.

Yep. This could be mitigated by building the heavily-optimized Docker images (LTO + PGO + PLO) not for every commit, but, say, every night. Of course, some changes to the current CI pipelines would be needed in this case. That's just an idea of how it could be implemented.

The wins for PGO/PLO look amazing as well, but the effort to get there is a lot higher so I’m not sure yet it if would be worth doing.

I agree. If you decide not to invest much in enabling PGO/PLO in your build scripts, I can suggest at least adding a note somewhere in the Symbolicator documentation about the opportunity to increase Symbolicator's performance with PGO/PLO. That way, users who want to optimize Symbolicator as much as possible for their use cases, and who are not afraid of building their Symbolicator binaries from source, will be aware of such optimizations.

is it possible to build once with instrumentation to produce a profile, and then build the release with the profile of a previous commit (and just disabling PGO if there's none)? it would allow you to parallelize.

Yes, it's possible. Just be aware that with PGO profile caching you can run into a "profile skew" problem, where the PGO profile is slightly outdated and does not optimize some new code (since the profile was created for older source code). In general, this is not critical if the source code is not changing much (so there is no huge difference between the commits). Another possible trap is changing compiler options: if you change a compiler option or upgrade the compiler version, it's highly recommended to regenerate the PGO profiles, since they will most likely be incompatible. A rough sketch of the fallback idea is shown below.
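A minimal sketch of that fallback, assuming the merged profile from an earlier commit is cached at a hypothetical path pgo-profiles/merged.profdata (the -Cprofile-use rustc flag is real; the paths and CI wiring are made up for illustration, and cargo-pgo can also manage this for you):

# Reuse a cached PGO profile from a previous commit if one exists,
# otherwise fall back to a plain release build.
PROFILE="$(pwd)/pgo-profiles/merged.profdata"   # hypothetical cache location
if [ -f "$PROFILE" ]; then
    RUSTFLAGS="-Cprofile-use=$PROFILE" cargo build --release
else
    cargo build --release
fi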

Swatinem commented 6 months ago

We now build Symbolicator Docker images with 1 CGU (codegen-units = 1) + LTO. Unfortunately, we needed to throw a beefier Cloud Build machine at it, and build times suffered a little bit in general as well.

Though the results on CPU usage in production were underwhelming. I expected to see at least a tiny dent in the graphs, but nothing is visible.

So I will call this closed, thank you very much for doing the initial assessment on this 🎉