zamazan4ik opened this issue 3 months ago
I had no idea that this was a built-in option in the Rust compiler, very cool.
Some thoughts about this:
Bend goes out of its way to have as few dependencies as possible, especially at its core, so I doubt link-time optimization will improve things much.
Probably - at least it should reduce the binary size. This optimization is already enabled (so I changed nothing for this in the build scripts).
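For reference, enabling LTO for release builds is typically done in the `[profile.release]` section of `Cargo.toml`. This is a generic sketch of what such a configuration looks like, not Bend's actual build configuration:

```toml
# Cargo.toml - generic example of enabling link-time optimization.
# Bend's actual settings may differ.
[profile.release]
lto = true          # "fat" LTO across all crates; lto = "thin" is a faster alternative
codegen-units = 1   # a single codegen unit gives the optimizer the whole crate at once
```

With `lto = true` the linker-stage optimizer can inline and dead-code-eliminate across crate boundaries, which usually shrinks the binary even when the speedup is small.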
The Bend compiler is NOT optimized, almost at all. Our focus so far has been on correctness and simplicity. Before investigating compile optimizations, we'll have to profile and optimize the logic of Bend itself, which I imagine will bring much greater improvements.
Yep. However, even after manual optimizations are performed, PGO will most likely still be useful. All modern "mature" compilers like Clang, rustc, GCC, and many others still see large performance benefits from enabling PGO.
I'm much more interested in the performance this can bring to the runtime. We should try doing this to the C runtime. I imagine that for CUDA, PGO is not really a thing?
It depends. If we talk about the GPU code - yes, PGO is not a thing for it yet. If we talk about the CPU part - PGO should work just as it does for regular non-CUDA code.
Is your feature request related to a problem? Please describe.
Not a problem - just an idea of how the compiler's performance can be improved.

Describe alternatives you've considered
Leave things as they are.

Describe the solution you'd like
I checked Profile-Guided Optimization (PGO) on many projects (including compilers) - all the results are available at https://github.com/zamazan4ik/awesome-pgo . Based on these results, I decided to test PGO with the Bend compiler. The numbers are below.
Test environment

Bend version: the latest for now from the `main` branch, at commit `0b6c3350a4e556df665ef146ac6d04e6b73bbae9`.
Benchmark

I decided to perform the PGO benchmarks on a simple scenario: `bend check sorter.bend` (`sorter.bend` is taken from here). For the PGO optimization I use the cargo-pgo tool. The builds are done as follows:

- Release build: `cargo build --release`
- PGO instrumented: `cargo pgo build`
- PGO optimized: `cargo pgo optimize build`
- BOLT instrumented (on top of PGO optimization): `cargo pgo bolt build --with-pgo`
- BOLT optimized: `cargo pgo bolt optimize --with-pgo`

The training workload is the same for PGO and PLO: `bend check sorter.bend`. `taskset -c 0` is used to reduce the OS scheduler's influence on the results during the benchmarks. All measurements are done on the same machine, with the same background "noise" (as far as I can guarantee).

Results
I got the following results:
where:

- `bend_release` - Release + LTO build
- `bend_optimized` - Release + LTO + PGO optimized
- `bend_pgo_and_bolt_optimized` - Release + LTO + PGO optimized + BOLT optimized

According to the results, PGO measurably improves the compiler's performance, at least in the simple benchmark above. However, PLO (with LLVM BOLT) doesn't bring measurable improvements, at least in this scenario.
For reference, I also measured the slowdown from PGO and PLO instrumentation:

where:

- `bend_release` - Release + LTO build
- `bend_instrumented` - Release + LTO + PGO instrumented
- `bend_optimized_bolt_instrumented` - Release + LTO + PGO optimized + BOLT instrumented

Also, I measured binary size changes in the different compilation modes (without stripping):
Further steps
I can suggest the following action points:
I would be happy to answer your questions about PGO and PLO.
By the way, it could be an interesting idea to evaluate the applicability of PGO to Bend programs themselves, not just to the compiler and similar tools (formatters, etc.).
P.S. Please do not treat this issue as a bug report or anything like that - it's just an improvement idea. Since the "Discussions" functionality is disabled in this repo, I created an Issue instead.