Link-Time Optimization (LTO), Profile-Guided Optimization (PGO), Post-Link Optimization (PLO) benchmark results

Hi!

As was proposed here, I ran several tests on optimizing resvg with more advanced compiler optimizations: LTO, PGO, and PLO. I have recently tested Profile-Guided Optimization (PGO) on many projects across different software domains, and all the results are available at https://github.com/zamazan4ik/awesome-pgo . Here are my results for this project. I hope they will be helpful to someone.

Test environment

Fedora 39
Linux kernel 6.8.9
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.78.0
resvg version: the latest for now from the master branch on commit 4b4e8970de29407e6257aac3d2f501b60e88236a
Disabled Turbo boost

Benchmark

For benchmark purposes, I use a simple scenario: converting an SVG file to a PNG file with the resvg input.svg output.png command. For PGO I use the cargo-pgo tool. The Release build is done with cargo build --release, the PGO-instrumented build with cargo pgo build, and the PGO-optimized build with cargo pgo optimize build.

taskset -c 0 is used for reducing the OS scheduler's influence on the results during all measurements. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).

As the training input for the resvg input.svg output.png command, I use this file.

Additionally, I decided to re-enable LTO for the tool. You disabled this optimization nearly 5 years ago due to some compiler bugs. I guess the LTO implementation in the compiler has become much more stable over those 5 years, so we can consider enabling it once again. For the benchmarks, I enabled it for resvg with the following addition to the Cargo.toml file:

[profile.release]
codegen-units = 1
lto = true

Post-Link Optimization is also done with cargo-pgo with the same training workload as for the PGO step.
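
For anyone who wants to reproduce the builds, here is a minimal sketch of the whole PGO + BOLT pipeline. It is not the exact script used for this report. The target triple, the temporary output paths, and the DRY_RUN switch are my assumptions, and the -bolt-instrumented binary suffix is what I recall from the cargo-pgo README. It also assumes cargo-pgo is installed, the llvm-tools-preview rustup component and llvm-bolt are available, and the [profile.release] settings above are already in Cargo.toml.

```shell
#!/usr/bin/env bash
# Sketch of the PGO + BOLT build pipeline with cargo-pgo.
# Assumptions (not from the report): the target triple, the /tmp output paths,
# and the DRY_RUN switch.
set -euo pipefail

TARGET_DIR="target/x86_64-unknown-linux-gnu/release"   # assumed target triple
TRAINING_SVG="${TRAINING_SVG:-large_svg_file_1.svg}"   # training input
DRY_RUN="${DRY_RUN:-1}"                                # 1 = only print the commands

# Print a command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# 1. Build the PGO-instrumented binary and collect a profile with it.
run cargo pgo build
run "$TARGET_DIR/resvg" "$TRAINING_SVG" /tmp/pgo_train.png

# 2. Rebuild using the collected profile.
run cargo pgo optimize build

# 3. BOLT: instrument the PGO-optimized binary, train again, then optimize.
run cargo pgo bolt build --with-pgo
run "$TARGET_DIR/resvg-bolt-instrumented" "$TRAINING_SVG" /tmp/bolt_train.png
run cargo pgo bolt optimize --with-pgo
```

By default the script only prints the commands; run it with DRY_RUN=0 from the repository root to actually execute them.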

Results

Firstly, let's check the scenario when the training workload and the benchmark workload are the same. Such a benchmark is still useful for scenarios where you need to convert the same file many times (like a part of CI without caching):

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.349 s ±  0.011 s    [User: 3.082 s, System: 0.257 s]
  Range (min … max):    3.333 s …  3.368 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.062 s ±  0.018 s    [User: 2.802 s, System: 0.250 s]
  Range (min … max):    3.040 s …  3.120 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      2.631 s ±  0.008 s    [User: 2.368 s, System: 0.255 s]
  Range (min … max):    2.622 s …  2.644 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      2.611 s ±  0.007 s    [User: 2.347 s, System: 0.256 s]
  Range (min … max):    2.598 s …  2.622 s    15 runs

Summary
  taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png ran
    1.01 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
    1.17 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
    1.28 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png

where:

resvg_release - regular Release build
resvg_release_lto - Release + LTO
resvg_lto_optimized - Release + LTO + PGO optimized
resvg_lto_bolt_optimized - Release + LTO + PGO optimized + BOLT optimized

According to the results, LTO and PGO measurably improve performance. However, BOLT adds only about 1% on top of PGO.
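
To put each step in terms of the regular Release build, the ratios can be recomputed from the mean times reported by hyperfine above. A small sketch (the speedup helper is a hypothetical name, not part of any tool):

```shell
# Speedup of each build relative to the regular Release build,
# computed from the mean wall-clock times in the hyperfine output above.
speedup() {
    # $1 = baseline mean (s), $2 = optimized mean (s)
    awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", base / opt }'
}

speedup 3.349 3.062   # Release + LTO               -> 1.09
speedup 3.349 2.631   # Release + LTO + PGO         -> 1.27
speedup 3.349 2.611   # Release + LTO + PGO + BOLT  -> 1.28
```

So LTO alone gives roughly 9% over the regular Release build, and most of the remaining gain comes from PGO.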

What if the training and benchmarking workloads are different files? For this test, I use the same training file as above but a different file for the benchmark. Here we go:

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      2.398 s ±  0.006 s    [User: 2.260 s, System: 0.131 s]
  Range (min … max):    2.391 s …  2.414 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      2.130 s ±  0.008 s    [User: 1.991 s, System: 0.133 s]
  Range (min … max):    2.123 s …  2.157 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      1.846 s ±  0.006 s    [User: 1.707 s, System: 0.134 s]
  Range (min … max):    1.838 s …  1.859 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      1.864 s ±  0.021 s    [User: 1.723 s, System: 0.135 s]
  Range (min … max):    1.851 s …  1.935 s    15 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png ran
    1.01 ± 0.01 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
    1.15 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
    1.30 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png

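We got a performance boost once again for a different file, although BOLT makes no real difference here (it is about 1% slower than PGO alone, which is within the noise). I suppose it's because these two files execute similar paths inside the tool, but I cannot say more since I am not an SVG expert at all :)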

However, there are cases that show that training on only one file is not sufficient - e.g. let's use this file for the benchmark (the training file remains the same as in the tests above):

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.415 s ±  0.003 s    [User: 1.040 s, System: 0.357 s]
  Range (min … max):    1.409 s …  1.421 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.439 s ±  0.004 s    [User: 1.055 s, System: 0.365 s]
  Range (min … max):    1.429 s …  1.445 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.488 s ±  0.002 s    [User: 1.107 s, System: 0.361 s]
  Range (min … max):    1.483 s …  1.491 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.497 s ±  0.002 s    [User: 1.116 s, System: 0.363 s]
  Range (min … max):    1.493 s …  1.502 s    15 runs

Summary
  taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png ran
    1.02 ± 0.00 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
    1.05 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
    1.06 ± 0.00 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png

Here we see a small performance decrease from every optimization (even from LTO, which is strange): about 2% for LTO, 5% for LTO + PGO, and 6% for LTO + PGO + BOLT. It shows that the PGO training set should be wider.
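
One possible way to widen the training set is to run the instrumented binary over a whole directory of SVG files before the optimize step. This is only a sketch: the train_on_corpus function and the directory layout are hypothetical, and it relies on my understanding that cargo-pgo merges all profiles gathered since the instrumented build.

```shell
# Run an instrumented binary over every SVG in a directory so that the
# collected profile covers more than one kind of input.
# Usage: train_on_corpus <instrumented-binary> <svg-directory>
train_on_corpus() {
    local bin="$1" dir="$2" svg out
    out="$(mktemp -d)"                       # throwaway directory for the PNGs
    for svg in "$dir"/*.svg; do
        [ -e "$svg" ] || continue            # the directory had no .svg files
        "$bin" "$svg" "$out/$(basename "$svg" .svg).png"
    done
    rm -rf "$out"
}
```

For example, train_on_corpus ./target/x86_64-unknown-linux-gnu/release/resvg ./training_svgs (both paths are assumptions), followed by cargo pgo optimize build.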

Just for reference, I also measured the tool slowdown during the PGO and PLO training phases:

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png'
Benchmark 1: taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.670 s ±  0.062 s    [User: 3.397 s, System: 0.262 s]
  Range (min … max):    3.638 s …  3.891 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      4.593 s ±  0.010 s    [User: 4.223 s, System: 0.338 s]
  Range (min … max):    4.572 s …  4.610 s    15 runs

Summary
  taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png ran
    1.25 ± 0.02 times faster than taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png

where:

resvg_lto_instrumented - Release + LTO + PGO instrumentation
resvg_lto_bolt_instrumented - Release + LTO + PGO optimization + BOLT instrumentation
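
The slowdown factors can be derived from these numbers and the first benchmark on the same file. The choice of reference builds is my assumption: PGO instrumentation is compared against Release + LTO (3.062 s), and BOLT instrumentation against the PGO-optimized binary it instruments (2.631 s). The slowdown helper is a hypothetical name.

```shell
# Slowdown of an instrumented build relative to its non-instrumented
# counterpart, from the mean times on large_svg_file_1.svg.
slowdown() {
    # $1 = instrumented mean (s), $2 = reference mean (s)
    awk -v instr="$1" -v ref="$2" 'BEGIN { printf "%.2f\n", instr / ref }'
}

slowdown 3.670 3.062   # PGO-instrumented vs. Release + LTO          -> 1.20
slowdown 4.593 2.631   # BOLT-instrumented vs. Release + LTO + PGO   -> 1.75
```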

Also, I want to report the binary size changes (without stripping, which can influence the binary size a lot):

Release: 3.6 MiB
Release + LTO: 3.1 MiB
Release + LTO + PGO instrumentation: 7.8 MiB
Release + LTO + PGO optimization: 4.8 MiB
Release + LTO + PGO optimization + BOLT instrumentation: 20 MiB
Release + LTO + PGO optimization + BOLT optimization: 8.7 MiB
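
Relative to the regular Release build, these sizes work out as follows. A small sketch (size_change is a hypothetical helper name):

```shell
# Size change of each optimized build relative to the regular Release
# build (3.6 MiB), as a signed percentage.
size_change() {
    # $1 = baseline size (MiB), $2 = size after the optimization (MiB)
    awk -v base="$1" -v after="$2" \
        'BEGIN { printf "%+.0f%%\n", (after - base) / base * 100 }'
}

size_change 3.6 3.1   # Release + LTO               -> -14%
size_change 3.6 4.8   # Release + LTO + PGO         -> +33%
size_change 3.6 8.7   # Release + LTO + PGO + BOLT  -> +142%
```

Only LTO makes the binary smaller. The PGO and BOLT builds are larger than the regular Release build.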

Further steps

I can suggest the following action points:

Enable LTO. In general, I expect a performance boost "for free" plus a binary size reduction.
Perform more PGO benchmarks with other datasets (if you are interested enough in it). If they show improvements, add a note to the documentation (the README file, I guess) about the possible improvements in resvg's performance with PGO.
You can probably get some insights into how the code can be optimized further by looking at what the compiler changed with PGO (more aggressive inlining, for example). This can be done by comparing flamegraphs before and after applying PGO.
Testing Post-Link Optimization techniques (like LLVM BOLT) with wider datasets would be interesting too (Clang and Rustc already use BOLT as an addition to PGO). However, I recommend starting with regular PGO since it's a much more stable technology with far fewer limitations.

I would be happy to answer your questions about PGO.

P.S. Please do not treat the issue like a bug or something like that - it's just a benchmark report. Since the "Discussions" functionality is disabled in this repo, I created the Issue instead.
