HigherOrderCO / Bend

A massively parallel, high-level programming language
https://higherorderco.com
Apache License 2.0

Evaluate using Profile-Guided Optimization (PGO) for the compiler #517

Open zamazan4ik opened 3 months ago

zamazan4ik commented 3 months ago

Is your feature request related to a problem? Please describe.
Not a problem - just an idea for improving the compiler's performance.

Describe alternatives you've considered
Leave things as they are.

Describe the solution you'd like
I have checked Profile-Guided Optimization (PGO) on many projects (including compilers) - all the results are available at https://github.com/zamazan4ik/awesome-pgo . Based on those results, I decided to test PGO on the Bend compiler as well. The numbers are below.

Test environment

Benchmark

I decided to perform the PGO benchmarks on a simple scenario - bend check sorter.bend (sorter.bend taken from here). For the PGO optimization I used the cargo-pgo tool. The release build is done with cargo build --release, the PGO-instrumented build with cargo pgo build, the PGO-optimized build with cargo pgo optimize build, the BOLT-instrumented build (on top of the PGO optimization) with cargo pgo bolt build --with-pgo, and the BOLT-optimized build with cargo pgo bolt optimize --with-pgo. The training workload is the same for PGO and PLO: bend check sorter.bend.
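The whole cargo-pgo pipeline described above can be sketched as a script. This is a hedged sketch, not a verbatim transcript of the original runs: cargo-pgo must be installed, and the exact paths and names of the instrumented binaries under target/ can differ between cargo-pgo versions and target triples.

```shell
# Sketch of the PGO + BOLT build pipeline described above (assumes cargo-pgo is installed).

# 1. Plain release build - the baseline.
cargo build --release

# 2. PGO: build an instrumented compiler, run the training workload, rebuild with the profiles.
cargo pgo build                     # PGO-instrumented build
./target/x86_64-unknown-linux-gnu/release/bend check sorter.bend   # training run (path may differ)
cargo pgo optimize build            # PGO-optimized build

# 3. BOLT on top of PGO: instrument, train again, then produce the final binary.
cargo pgo bolt build --with-pgo     # BOLT-instrumented (PGO-optimized) build
./target/x86_64-unknown-linux-gnu/release/bend-bolt-instrumented check sorter.bend  # training run (name may differ)
cargo pgo bolt optimize --with-pgo  # BOLT-optimized build
```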

taskset -c 0 is used to reduce the influence of the OS scheduler on the results during the benchmarks. All measurements were done on the same machine, with the same background "noise" (as far as I can guarantee).

Results

I got the following results:

hyperfine --warmup 200 --min-runs 1000 -i -N 'taskset -c 0 ./bend_release check sorter.bend' 'taskset -c 0 ./bend_optimized check sorter.bend' 'taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend'
Benchmark 1: taskset -c 0 ./bend_release check sorter.bend
  Time (mean ± σ):       4.4 ms ±   0.1 ms    [User: 3.0 ms, System: 1.3 ms]
  Range (min … max):     4.3 ms …   4.7 ms    1000 runs

Benchmark 2: taskset -c 0 ./bend_optimized check sorter.bend
  Time (mean ± σ):       3.8 ms ±   0.1 ms    [User: 2.5 ms, System: 1.3 ms]
  Range (min … max):     3.7 ms …   4.1 ms    1000 runs

Benchmark 3: taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend
  Time (mean ± σ):       3.8 ms ±   0.1 ms    [User: 2.4 ms, System: 1.2 ms]
  Range (min … max):     3.7 ms …   4.1 ms    1000 runs

Summary
  taskset -c 0 ./bend_pgo_and_bolt_optimized check sorter.bend ran
    1.01 ± 0.02 times faster than taskset -c 0 ./bend_optimized check sorter.bend
    1.18 ± 0.03 times faster than taskset -c 0 ./bend_release check sorter.bend

where:

- bend_release - the usual Release build
- bend_optimized - the PGO-optimized build
- bend_pgo_and_bolt_optimized - the PGO- and BOLT-optimized build

According to the results, PGO measurably improves the compiler's performance, at least in the simple benchmark above. However, PLO (with LLVM BOLT) brings no measurable improvement, at least in this scenario.

For reference, I also measured the slowdown from PGO and PLO instrumentation:

hyperfine --warmup 100 --min-runs 500 -i -N 'taskset -c 0 ./bend_release check sorter.bend' 'taskset -c 0 ./bend_instrumented check sorter.bend' 'taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend'
Benchmark 1: taskset -c 0 ./bend_release check sorter.bend
  Time (mean ± σ):       4.4 ms ±   0.1 ms    [User: 3.1 ms, System: 1.2 ms]
  Range (min … max):     4.3 ms …   5.3 ms    676 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./bend_instrumented check sorter.bend
  Time (mean ± σ):       6.5 ms ±   0.1 ms    [User: 4.2 ms, System: 2.1 ms]
  Range (min … max):     6.3 ms …   6.8 ms    500 runs

Benchmark 3: taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend
  Time (mean ± σ):     157.7 ms ±   5.1 ms    [User: 30.4 ms, System: 108.6 ms]
  Range (min … max):   147.7 ms … 259.6 ms    500 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./bend_release check sorter.bend ran
    1.47 ± 0.03 times faster than taskset -c 0 ./bend_instrumented check sorter.bend
   35.57 ± 1.30 times faster than taskset -c 0 ./bend_optimized_bolt_instrumented check sorter.bend
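As a quick sanity check, the ratios that hyperfine reports in both runs can be recomputed from the mean times quoted above (the values are copied from this issue):

```shell
# Recompute hyperfine's ratios from the reported mean times (in ms).
awk 'BEGIN {
  printf "PGO speedup over release:      %.2f\n", 4.4 / 3.8    # hyperfine reports 1.18 +- 0.03
  printf "PGO instrumentation slowdown:  %.2f\n", 6.5 / 4.4    # hyperfine reports 1.47 +- 0.03
  printf "BOLT instrumentation slowdown: %.2f\n", 157.7 / 4.4  # hyperfine reports 35.57 +- 1.30
}'
```

The first ratio comes out as 1.16 rather than the reported 1.18 because hyperfine computes the ratio from the unrounded means; the other two match within the stated error bars.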

where:

- bend_release - the usual Release build
- bend_instrumented - the PGO-instrumented build
- bend_optimized_bolt_instrumented - the PGO-optimized, BOLT-instrumented build

Also, I measured binary size changes in different compilation modes (without stripping):

Further steps

I can suggest the following action points:

I would be happy to answer your questions about PGO and PLO.

By the way, it could be interesting to evaluate the applicability of PGO to Bend programs themselves, not just to the compiler and similar tooling (formatters, etc.).

P.S. Please do not treat this issue as a bug report - it's just an improvement idea. Since the "Discussions" functionality is disabled in this repo, I created an issue instead.

developedby commented 3 months ago

I had no idea that this was a built-in option in the Rust compiler, very cool.

Some thoughts about this:

zamazan4ik commented 3 months ago

> Bend goes out of the way to have as few dependencies as possible, especially at its core. So I doubt link-time optimizations will improve too much.

Probably - at the very least it should improve the binary size. This optimization is already enabled (so I changed nothing for it in the build scripts).

> The Bend compiler is NOT optimized, almost at all. Our focus so far has been on correctness and simplicity. Before investigating compile optimizations, we'll have to profile and optimize the logic of Bend itself, which I imagine will bring much greater improvements.

Yep. However, even after such manual optimizations are performed, PGO will most likely still be useful. All modern "mature" compilers like Clang, Rustc, GCC, and many others still get large performance benefits from enabling PGO.

> I'm much more interested in the performance this can bring to the runtime. We should try doing this to the C runtime. I imagine that for CUDA, PGO is not really a thing?

It depends. For the GPU code itself - yes, PGO is not a thing for it yet. For the CPU part - PGO should work just as it does for regular non-CUDA code.
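For the C runtime, the same idea can be tried with plain clang flags. This is a hypothetical sketch only: runtime.c, the binary names, and the training workload are placeholders, not real Bend/HVM file names.

```shell
# Hypothetical PGO pipeline for a C runtime with clang (all file names are placeholders).
clang -O2 -fprofile-instr-generate -o runtime_instr runtime.c            # instrumented build
./runtime_instr training_workload                                        # run a representative workload
llvm-profdata merge -o runtime.profdata default.profraw                  # merge the raw profiles
clang -O2 -fprofile-instr-use=runtime.profdata -o runtime_opt runtime.c  # optimized rebuild
```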