Profile-Guided Optimization (PGO) benchmarks

Hi!

I tried applying Profile-Guided Optimization (PGO) to llrt to see whether it improves performance further (I have already done this for many other projects; see all current results here). I ran some basic benchmarks and want to share the results.

Test environment

Fedora 39
Linux kernel 6.7.3
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler: rustc 1.76
llrt version: latest main branch at commit c040bfd05a2be8d3300e7a1bbfc9405c42a865fa
Turbo Boost disabled (for more stable results across benchmark runs)

Benchmark

For the benchmark, I use the same command I found in the Makefile: llrt fixtures/hello.js. The same scenario is used for the PGO training phase. All PGO steps are done with the cargo-pgo tool: the instrumented build is produced with cargo pgo build, and the optimized build with cargo pgo optimize build. taskset -c 0 pins the process to a single core to reduce the influence of CPU scheduling on the results.
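For anyone who wants to reproduce this, the steps above can be sketched roughly as the script below. This is a minimal sketch, not the exact script I ran: the `run` helper, the `DRY_RUN` switch, and the `TARGET_TRIPLE` and `LLRT_BIN` variables are illustrative names, and the binary path assumes cargo-pgo places its output under target/<triple>/release/. It also assumes cargo-pgo and the llvm-tools-preview rustup component are already installed, and it ignores any extra steps the llrt Makefile may perform.

```shell
#!/usr/bin/env bash
# Sketch of the PGO steps described above. By default the commands are only
# printed; set DRY_RUN=0 to execute them from the llrt repository root.
DRY_RUN="${DRY_RUN:-1}"

# Assumption: cargo-pgo writes binaries to target/<triple>/release/.
TARGET_TRIPLE="${TARGET_TRIPLE:-x86_64-unknown-linux-gnu}"
LLRT_BIN="target/${TARGET_TRIPLE}/release/llrt"

# Echo a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

pgo_workflow() {
    # 1. Build the instrumented binary.
    # 2. Training run: same scenario as the benchmark, produces the profiles.
    # 3. Rebuild using the collected profiles.
    run cargo pgo build &&
    run "$LLRT_BIN" fixtures/hello.js &&
    run cargo pgo optimize build
}

pgo_workflow
```

Printing the commands first makes it easy to check the binary path before anything is built.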

Results

I got the following results:

hyperfine -u microsecond -N --warmup=2000 --min-runs 10000 "taskset -c 0 ./llrt_optimized ../fixtures/hello.js" "taskset -c 0 ./llrt_release ../fixtures/hello.js"
Benchmark 1: taskset -c 0 ./llrt_optimized ../fixtures/hello.js
  Time (mean ± σ):     2664.8 µs ±  78.8 µs    [User: 590.1 µs, System: 1943.3 µs]
  Range (min … max):   2478.1 µs … 4486.1 µs    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./llrt_release ../fixtures/hello.js
  Time (mean ± σ):     2796.1 µs ±  63.6 µs    [User: 601.4 µs, System: 2068.9 µs]
  Range (min … max):   2647.5 µs … 4495.0 µs    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./llrt_optimized ../fixtures/hello.js ran
    1.05 ± 0.04 times faster than taskset -c 0 ./llrt_release ../fixtures/hello.js

Here, llrt_release is the regular release build and llrt_optimized is the PGO-optimized build.
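As a sanity check on the summary line, the ratio and its uncertainty can be recomputed from the means and standard deviations in the hyperfine output above. The sketch below assumes the uncertainty comes from first-order error propagation (relative errors added in quadrature); the `speedup` function name is just an illustrative helper.

```shell
#!/bin/sh
# Means and standard deviations (in µs) copied from the hyperfine output above.
release_mean=2796.1
release_sd=63.6
pgo_mean=2664.8
pgo_sd=78.8

# speedup MEAN_SLOW SD_SLOW MEAN_FAST SD_FAST
# Prints "ratio uncertainty", combining the relative errors in quadrature.
speedup() {
    awk -v ms="$1" -v ss="$2" -v mf="$3" -v sf="$4" 'BEGIN {
        r = ms / mf
        rel_slow = ss / ms
        rel_fast = sf / mf
        u = r * sqrt(rel_slow * rel_slow + rel_fast * rel_fast)
        printf "%.2f %.2f\n", r, u
    }'
}

speedup "$release_mean" "$release_sd" "$pgo_mean" "$pgo_sd"   # prints: 1.05 0.04
```

In absolute terms, the mean drops by about 131 µs per run, roughly 4.7% of the release build's mean.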

I ran the benchmark multiple times, varying the command order among other things; in all cases, the PGO-optimized version was faster than the regular release version. However, it would be great to run some more precise benchmarks.

Further steps

I suggest the following:

Perform more PGO benchmarks with some more precise performance measurements.
If PGO proves worthwhile, add a note about it to the documentation and, possibly, add an option to the build scripts so that llrt can be optimized more easily with the existing build infrastructure.
Experiment with Post-Link Optimization (PLO) using tools like LLVM BOLT.

I hope these benchmark results are interesting to someone.
