dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Usage of BOLT (the LLVM post-optimizer) #93626

Open deeprobin opened 1 year ago

deeprobin commented 1 year ago

In .NET, we have already made great strides in the area of PGO (profile-guided optimization).

The Rust compiler (rustc) has been optimized with BOLT since last week (see https://github.com/rust-lang/rust/pull/94381). BOLT (https://github.com/llvm/llvm-project/tree/main/bolt) originated as a Facebook Incubator project and has since been merged into LLVM.

They apparently report a ~2% performance improvement from it.

Of course, it is debatable whether this makes much difference in JIT scenarios, since BOLT would only optimize the statically compiled parts (e.g. apphost, corerun, ...).

However, I see potential here, especially for NativeAOT.

The only additional input required would be profile data, such as perf (https://perf.wiki.kernel.org/) output.

What sets BOLT apart from other optimizers is that it is a post-link optimizer: it rewrites already-linked binaries rather than working on source code or compiler IR. One limitation I have found so far is that, as far as I can tell, it only supports ELF binaries.
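
To make the workflow concrete, here is a minimal Python sketch of a perf-driven BOLT pass. It assumes an x86 machine with hardware branch records (LBR) and a native ELF executable such as corerun that was linked with -Wl,--emit-relocs, as the BOLT documentation recommends. The `./corerun` path, the `HelloWorld.dll` workload, and the helper names `is_elf`/`bolt_pipeline` are hypothetical. The sketch only builds the command lines and checks the ELF magic number (since BOLT only handles ELF); the flag spellings follow the BOLT README and may differ between LLVM versions.

```python
from __future__ import annotations

import shlex
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def is_elf(path: Path) -> bool:
    """Return True if the file starts with the ELF magic number (BOLT only handles ELF)."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


def bolt_pipeline(binary: str, workload_args: list[str]) -> list[list[str]]:
    """Build (but do not run) the three commands of a perf-driven BOLT pass.

    binary: path to a native executable (hypothetical, e.g. a corerun build).
    workload_args: arguments for a representative run that produces the profile.
    """
    perf_data = "perf.data"
    fdata = "perf.fdata"
    optimized = binary + ".bolt"
    return [
        # 1. Sample user-space cycles with branch stacks (LBR) while the workload runs.
        ["perf", "record", "-e", "cycles:u", "-j", "any,u", "-o", perf_data,
         "--", binary, *workload_args],
        # 2. Convert the perf samples into BOLT's .fdata profile format.
        ["perf2bolt", "-p", perf_data, "-o", fdata, binary],
        # 3. Rewrite the already-linked binary using the profile; -dyno-stats
        #    prints before/after dynamic statistics.
        ["llvm-bolt", binary, "-o", optimized, "-data=" + fdata,
         "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
         "-split-functions", "-split-all-cold", "-dyno-stats"],
    ]


if __name__ == "__main__":
    for cmd in bolt_pipeline("./corerun", ["HelloWorld.dll"]):
        print(shlex.join(cmd))
```

Running the printed commands in order writes an optimized `corerun.bolt` next to the original, which could then be benchmarked against the unmodified binary.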


I would be very interested in any thoughts, challenges, or obstacles you see in optimizing the .NET binaries with BOLT.

If nothing stands in the way, I would sit down and build a PoC of how we could use it in the runtime.

/type:feature /area:PGO

Perksey commented 1 year ago

This does raise an interesting point about .NET using PGO to optimise itself. For example, we could collect static PGO data from .NET compiling itself and bundle it with the R2R images of the various .NET binaries; if we also go the BOLT route, we could instrument corerun while it is compiling itself, giving us a high-quality set of PGO data that is ready to use. It looks like Rust just uses a few common crates to optimise itself.

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details
Author: deeprobin
Assignees: -
Labels: `area-CodeGen-coreclr`, `untriaged`, `needs-area-label`
Milestone: -

JulieLeeMSFT commented 1 year ago

We discussed this internally recently. CC @kunalspathak, @EgorBo, @AndyAyersMS.

JulieLeeMSFT commented 1 year ago

In .NET 9, we are starting with https://github.com/dotnet/runtime/issues/92915. Opportunistically, we might look into utilizing BOLT.

EgorBo commented 1 year ago

BOLT might make sense to try for, e.g., the GC and VM native code, but it seems unlikely to result in anything meaningful (even -march=native and -O3 have no visible impact compared with our defaults).

For NativeAOT/generated managed code, it's very unlikely to be usable at all due to GC info, etc., so we're sceptical about it. I'm sure we have simpler ways to get a 2% improvement than hooking a heavy/slow lib into CI 🙂