chmln / sd

Intuitive find & replace CLI (sed alternative)

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #237

Open zamazan4ik opened 11 months ago

zamazan4ik commented 11 months ago

Hi!

Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects - the results are available here. That's why I think it's worth trying to apply PGO to sd. I already performed some benchmarks and want to share my results here.

Test environment

Benchmark setup

As a test file, I use this sufficiently large JSON file. sd is tested with this command line: sd -p "(\w+)" "\$1\$1" dump.json > /dev/null. I took these arguments from the issue https://github.com/chmln/sd/issues/52 . For PGO profile collection, the same arguments and test file were used.

PGO optimization is done with cargo-pgo.
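
For readers who have not used cargo-pgo, the three steps (instrumented build, profile collection, optimized build) can be sketched roughly as below. This is only an illustration, not the exact script used for the numbers in this issue. It assumes cargo-pgo and the llvm-tools-preview rustup component are installed and that dump.json is in the current directory. The run helper, the pgo_workflow function, and the DRY_RUN variable are hypothetical names introduced for this sketch. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only. Assumes cargo-pgo is installed (cargo install cargo-pgo) together with
# the llvm-tools-preview rustup component, and that dump.json is in the current directory.

# Trace a command on stderr; execute it only when DRY_RUN=0 (default: print only).
run() {
    printf '+ %s\n' "$*" >&2
    [ "${DRY_RUN:-1}" = "0" ] || return 0
    "$@"
}

# $1: path to the instrumented sd binary (hypothetical argument; the real
# location depends on the target directory cargo-pgo builds into).
pgo_workflow() {
    instrumented_sd="$1"
    # 1. Build sd with PGO instrumentation.
    run cargo pgo build || return 1
    # 2. Collect a profile by running the benchmark workload from this issue.
    run "$instrumented_sd" -p '(\w+)' '$1$1' dump.json > /dev/null || return 1
    # 3. Rebuild sd using the collected profile.
    run cargo pgo optimize
}
```

Calling pgo_workflow with the path to the instrumented binary prints the three commands. Setting DRY_RUN=0 executes them instead.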

All benchmarks are done multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course).

Results

I got the following results:

hyperfine --warmup 10 --min-runs 100 'sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null' 'sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null'
Benchmark 1: sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null
  Time (mean ± σ):     916.7 ms ±  21.3 ms    [User: 881.4 ms, System: 33.1 ms]
  Range (min … max):   875.5 ms … 1032.8 ms    100 runs

Benchmark 2: sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null
  Time (mean ± σ):     745.3 ms ±   9.4 ms    [User: 710.3 ms, System: 33.1 ms]
  Range (min … max):   713.1 ms … 782.3 ms    100 runs

Summary
  sd_pgo_optimized -p "(\w+)" "\$1\$1" dump.json > /dev/null ran
    1.23 ± 0.03 times faster than sd_release -p "(\w+)" "\$1\$1" dump.json > /dev/null

Just for reference, sd in the Instrumentation mode (during the PGO profile collection) shows the following results (as reported by time):

time sd_pgo_instrumented -p "(\w+)" "\$1\$1" dump.json > /dev/null
sd_pgo_instrumented -p "(\w+)" "\$1\$1" dump.json  1,49s user 0,04s system 99% cpu 1,534 total

At least according to the simple benchmark above, PGO has a measurable positive effect on sd performance.
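
To make the size of the effect concrete, the speedup factor in the hyperfine summary can be recomputed from the two mean times reported above. The snippet below is a small arithmetic check, not part of the original benchmark run. The function names speedup and reduction_pct are made up for this illustration.

```shell
# Mean wall-clock times in milliseconds, copied from the hyperfine output above.
release_ms=916.7
pgo_ms=745.3

# Ratio of baseline time to optimized time ("N times faster").
speedup() {
    LC_ALL=C awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", base / opt }'
}

# Percentage of wall-clock time saved relative to the baseline.
reduction_pct() {
    LC_ALL=C awk -v base="$1" -v opt="$2" 'BEGIN { printf "%.1f\n", (base - opt) / base * 100 }'
}

speedup "$release_ms" "$pgo_ms"        # prints 1.23
reduction_pct "$release_ms" "$pgo_ms"  # prints 18.7
```

So the PGO build needs about 18.7% less wall-clock time on this workload, which matches the 1.23x figure from hyperfine.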

Further steps

I can suggest the following things to do:

Here are some examples of how PGO is already integrated into other projects' build scripts:

I can also suggest evaluating LLVM BOLT as an additional optimization step after PGO.
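
One possible way to try BOLT on top of PGO is through cargo-pgo's bolt subcommands, sketched below. This is an assumption about tooling, not something already tested in this issue. It presumes a Linux machine with llvm-bolt and merge-fdata installed, and that a PGO profile has already been collected. The run helper, the pgo_bolt_workflow function, and the DRY_RUN variable are hypothetical names. By default the sketch only prints the commands.

```shell
#!/bin/sh
# Sketch only. Assumes cargo-pgo, llvm-bolt and merge-fdata are installed and that
# PGO profiles were already gathered with an instrumented build.

# Trace a command on stderr; execute it only when DRY_RUN=0 (default: print only).
run() {
    printf '+ %s\n' "$*" >&2
    [ "${DRY_RUN:-1}" = "0" ] || return 0
    "$@"
}

# $1: path to the BOLT-instrumented sd binary (hypothetical argument).
pgo_bolt_workflow() {
    bolt_instrumented_sd="$1"
    # 1. Build a BOLT-instrumented binary that already uses the PGO profile.
    run cargo pgo bolt build --with-pgo || return 1
    # 2. Gather a BOLT profile on the same workload as the PGO benchmark.
    run "$bolt_instrumented_sd" -p '(\w+)' '$1$1' dump.json > /dev/null || return 1
    # 3. Produce the final binary optimized with both PGO and BOLT.
    run cargo pgo bolt optimize --with-pgo
}
```

The resulting binary could then be added as a third command to the hyperfine comparison above.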