RazrFalcon / resvg

An SVG rendering library.
Mozilla Public License 2.0

Link-Time Optimization (LTO), Profile-Guided Optimization (PGO), Post-Link Optimization (PLO) benchmark results #765

Open zamazan4ik opened 4 months ago

zamazan4ik commented 4 months ago

Hi!

As was proposed here, I decided to test resvg with more advanced compiler optimizations: LTO, PGO, and PLO. I have recently tested Profile-Guided Optimization (PGO) on projects across different software domains; all the results are available at https://github.com/zamazan4ik/awesome-pgo . Here are my results for resvg. I hope they will be helpful to someone.

Test environment

Benchmark

For the benchmark, I use a simple scenario: converting an SVG file to a PNG file with the resvg input.svg output.png command. For PGO I use the cargo-pgo tool. The release build is done with cargo build --release, the PGO-instrumented build with cargo pgo build, and the PGO-optimized build with cargo pgo optimize build.
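The three builds above form a cycle: instrument, train, rebuild. A minimal sketch of that cycle follows. The pgo_cycle function and the CARGO override are hypothetical helpers introduced here for illustration, and the path of the instrumented binary is left as a parameter because it depends on the target triple.

```shell
#!/bin/sh
# Hypothetical wrapper around the cargo-pgo commands named in this report.
# Set CARGO=echo to preview the cargo invocations without running them.
CARGO=${CARGO:-cargo}

# pgo_cycle BIN SVG PNG
#   BIN: path to the instrumented binary produced by step 1 (assumption:
#        cargo-pgo prints this path; it depends on the target triple)
#   SVG: training input, PNG: throwaway output of the training run
pgo_cycle() {
  bin=$1; svg=$2; png=$3
  "$CARGO" pgo build || return 1            # 1. instrumented build
  "$bin" "$svg" "$png" || return 1          # 2. training run collects the profile
  "$CARGO" pgo optimize build || return 1   # 3. rebuild using the collected profile
}
```

On an x86-64 Linux host the call would probably look like pgo_cycle target/x86_64-unknown-linux-gnu/release/resvg input.svg output.png, but that path is an assumption and should be checked against what cargo pgo build reports.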

taskset -c 0 is used for reducing the OS scheduler's influence on the results during all measurements. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).

As the training input for the resvg input.svg output.png command, I use this file.

Additionally, I decided to re-enable LTO for the tool. You disabled this optimization nearly 5 years ago due to some compiler bugs. I guess the LTO implementation in the compiler has become much more stable over those 5 years, so we can consider enabling it once again. For the benchmarks, I enabled it with the following addition to the Cargo.toml file:

[profile.release]
codegen-units = 1
lto = true

Post-Link Optimization is also done with cargo-pgo with the same training workload as for the PGO step.

Results

First, let's check the scenario where the training workload and the benchmark workload are the same. Such a benchmark is still useful for scenarios where you need to convert the same file many times (for example, as part of a CI pipeline without caching):

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.349 s ±  0.011 s    [User: 3.082 s, System: 0.257 s]
  Range (min … max):    3.333 s …  3.368 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.062 s ±  0.018 s    [User: 2.802 s, System: 0.250 s]
  Range (min … max):    3.040 s …  3.120 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      2.631 s ±  0.008 s    [User: 2.368 s, System: 0.255 s]
  Range (min … max):    2.622 s …  2.644 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      2.611 s ±  0.007 s    [User: 2.347 s, System: 0.256 s]
  Range (min … max):    2.598 s …  2.622 s    15 runs

Summary
  taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png ran
    1.01 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
    1.17 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
    1.28 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png

where:

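According to the results, LTO and PGO measurably improve performance: LTO alone is about 1.09x faster than the plain release build (3.062 s vs 3.349 s), and LTO plus PGO is about 1.27x faster (2.631 s). However, BOLT adds only about 1% on top of that.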
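The ratios in the hyperfine summary above are simply the quotients of the mean times. A small sketch to reproduce them by hand (the speedup helper is a hypothetical name; the numbers are the means reported above):

```shell
#!/bin/sh
# speedup BASE CANDIDATE: ratio of two mean wall-clock times, two decimals.
speedup() {
  awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

speedup 3.349 2.611   # release         vs LTO + PGO + BOLT
speedup 3.062 2.611   # release + LTO   vs LTO + PGO + BOLT
speedup 2.631 2.611   # LTO + PGO       vs LTO + PGO + BOLT
```

These three calls should print 1.28, 1.17 and 1.01, matching the summary.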

What if the training and benchmark workloads are different files? For this, I used the same training file as above but ran the benchmark on another file. Here we go:

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      2.398 s ±  0.006 s    [User: 2.260 s, System: 0.131 s]
  Range (min … max):    2.391 s …  2.414 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      2.130 s ±  0.008 s    [User: 1.991 s, System: 0.133 s]
  Range (min … max):    2.123 s …  2.157 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      1.846 s ±  0.006 s    [User: 1.707 s, System: 0.134 s]
  Range (min … max):    1.838 s …  1.859 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
  Time (mean ± σ):      1.864 s ±  0.021 s    [User: 1.723 s, System: 0.135 s]
  Range (min … max):    1.851 s …  1.935 s    15 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png ran
    1.01 ± 0.01 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
    1.15 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
    1.30 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png

We got a performance boost once again for a different file. I suppose it's because these two files exercise similar paths inside the tool, but I cannot say more since I am not an SVG expert at all :)

However, there are cases that show that training on only one file is not sufficient - e.g. let's use this file for the benchmark (the training file remains the same as in the tests above):

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.415 s ±  0.003 s    [User: 1.040 s, System: 0.357 s]
  Range (min … max):    1.409 s …  1.421 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.439 s ±  0.004 s    [User: 1.055 s, System: 0.365 s]
  Range (min … max):    1.429 s …  1.445 s    15 runs

Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.488 s ±  0.002 s    [User: 1.107 s, System: 0.361 s]
  Range (min … max):    1.483 s …  1.491 s    15 runs

Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png
  Time (mean ± σ):      1.497 s ±  0.002 s    [User: 1.116 s, System: 0.363 s]
  Range (min … max):    1.493 s …  1.502 s    15 runs

Summary
  taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png ran
    1.02 ± 0.00 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
    1.05 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
    1.06 ± 0.00 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png

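Here we see a small performance decrease from all optimizations, even from LTO, which is strange: the slowdown ranges from about 2% with LTO alone to about 6% with LTO, PGO, and BOLT combined. It shows that the PGO training set should be broader.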
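One way to broaden the training set is to run the instrumented binary over a whole directory of SVG files instead of a single one before building the optimized binary. A rough sketch, where train_on_dir, the corpus directory and the output directory are all hypothetical names, and the instrumented binary is assumed to come from cargo pgo build:

```shell
#!/bin/sh
# train_on_dir BIN CORPUS OUT
#   BIN:    path to the instrumented resvg binary (assumed: built by `cargo pgo build`)
#   CORPUS: directory with training SVG files (hypothetical corpus)
#   OUT:    scratch directory for the PNG outputs, which are not needed afterwards
train_on_dir() {
  bin=$1; corpus=$2; out=$3
  mkdir -p "$out"
  for svg in "$corpus"/*.svg; do
    [ -e "$svg" ] || continue              # glob matched nothing
    name=$(basename "$svg" .svg)
    # Keep going if one file fails: a partial profile is still useful.
    "$bin" "$svg" "$out/$name.png" || echo "warning: failed on $svg" >&2
  done
}
```

After the loop finishes, cargo pgo optimize build should pick up the profiles from all of the runs. Whether a mixed corpus removes the regression seen on this file is not something these measurements show; it would have to be benchmarked again.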

Just for reference, I also measured the tool slowdown during the PGO and PLO training phases:

hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png'
Benchmark 1: taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      3.670 s ±  0.062 s    [User: 3.397 s, System: 0.262 s]
  Range (min … max):    3.638 s …  3.891 s    15 runs

Benchmark 2: taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png
  Time (mean ± σ):      4.593 s ±  0.010 s    [User: 4.223 s, System: 0.338 s]
  Range (min … max):    4.572 s …  4.610 s    15 runs

Summary
  taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png ran
    1.25 ± 0.02 times faster than taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png

where:

Also, I want to report the binary size changes (without stripping, which can influence the binary size a lot):

Further steps

I can suggest the following action points:

I would be happy to answer your questions about PGO.

P.S. Please do not treat the issue like a bug or something like that - it's just a benchmark report. Since the "Discussions" functionality is disabled in this repo, I created the Issue instead.

RazrFalcon commented 4 months ago

Oh wow, that's a much bigger improvement than I was expecting. Thanks for looking into it!

I will try to find time to learn cargo-pgo.

Enable LTO. I expect in general performance boost "for free" and the binary size reduction.

Yep, will do in the next release.

Perform more PGO benchmarks with other datasets

The only dataset available in CI is the resvg test suite.

And I will probably add build instructions with a PGO section.