dotnet / runtimelab

This repo is for experimentation and exploring new ideas that may or may not make it into the main dotnet/runtime repo.

[NativeAOT-LLVM] Multi-threaded compiler #2293

Open SingleAccretion opened 1 year ago

SingleAccretion commented 1 year ago

This issue is a collection of notes collected so far while thinking about how to make LLVM compilation amenable to multithreading.

Recall, for context, that we have two major contributors to compile time right now: the IL -> LLVM phase (RyuJit) and the LLVM -> WASM phase (Clang). For both of these, LLVM's design constrains us to work on Modules. For maximum throughput, we need to work on the same number of modules as the parallelism setting.

The hard problem to solve, then, is determinism. There are two possible levels:

1) Deterministic w.r.t. parallelism only, i.e. builds with different parallelism settings are allowed to produce different results. This is believed to be achievable with a fairly simple scheme not too far from what we currently do. I don't think it is viable, however: if parallelism is not specified, it is implicitly set to the number of logical cores, so this only achieves a weak form of local determinism.
2) Fully deterministic, i.e. all builds, regardless of parallelism, produce identical results. This is what ILC proper has and what we should have as well.
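To make level 2 concrete, here is a minimal sketch (hypothetical, not ILC's actual scheme) of a parallelism-independent partitioning: each function is assigned to one of a *fixed* number of partitions by a stable hash of its name, so any number of worker threads sees identical partition contents. The names `PARTITION_COUNT`, `partition_of`, and `partition_all` are illustrative only.

```python
import hashlib

# Hypothetical sketch: assign each function to one of a FIXED number of
# partitions using a stable hash of its (mangled) name. Neither the hash nor
# the partition count depends on the parallelism setting, so every build
# produces the same partition contents (determinism level 2).
PARTITION_COUNT = 16  # fixed up front, NOT derived from the core count


def partition_of(function_name: str) -> int:
    # SHA-256 is stable across runs and machines (unlike Python's hash()).
    digest = hashlib.sha256(function_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % PARTITION_COUNT


def partition_all(function_names):
    # Sort first so the per-partition order is independent of input order.
    partitions = [[] for _ in range(PARTITION_COUNT)]
    for name in sorted(function_names):
        partitions[partition_of(name)].append(name)
    return partitions
```

The key property is that the partition assignment is a pure function of the input, never of the machine.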

Let's review what other toolchains do to achieve 2 and how it applies to us:

1) MonoAOT: compiles one module per managed assembly. Downside: large assemblies like SPC take a disproportionate amount of time. This is a variation of "partition the input based on something inherent in the input itself". For ILC, partitioning by assembly is a bit unnatural.
2) ILC proper: sorts the outputs before writing them out to an object file. For us, that would mean sorting the resulting .wasm, as we otherwise appear to have no control over the order LLVM emits things in. This would allow using the most efficient parallelism <-> Module mapping during compilation, but it would be "hard" (the WASM binary format is a fragile one; changing a function's index, for example, may invalidate all references to it from everywhere), expensive to maintain, and would itself add to compile time.
3) C++ compilers once again use a variant of partitioning the input based on itself (i.e. translation units). Translation units are generally smaller than .NET assemblies, and there are usually more of them than the parallelism setting. It is known that compiling larger Modules is better than compiling many small ones (especially when done naively with a new compiler process for each one), though both are obviously better than compiling everything on a single thread.

Another thing to consider is that a Module is not just a threading isolation unit, but an optimization unit as well. This means that for the best results, it is actually better to compile everything on a single thread into one module, though this can be alleviated with techniques like thin LTO. So by supporting a multi-threaded mode at all under determinism model 2 above, we will be giving up some CQ (hopefully not much). It also means that sorting the .wasm is not good enough and can be discarded as a possible solution altogether.

So at the end of the day, it seems the only viable approach is to partition the input in some parallelism-independent manner (TBD which) and use a worklist-like algorithm for the LLVM -> WASM compilation of modules. The first iteration of it can remain out-of-proc like it is right now, but long term I expect we will move everything LLVM-related into the ILC process, to save on file system traffic and get better diagnostic capabilities.
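A minimal sketch of the worklist idea, under the assumptions stated in the comments (the function names are illustrative, not ILC's): partitions are fixed up front, workers pull them in any order, and results are emitted sorted by partition index, so the final output depends on neither the worker count nor the completion order.

```python
import concurrent.futures

# Hypothetical worklist scheme: the partitioning is fixed before any worker
# starts, so parallelism only affects WHO compiles a partition, never WHAT
# the partition contains or where its output lands.


def compile_partition(index, functions):
    # Stand-in for the real per-Module IL -> LLVM -> WASM pipeline; here we
    # just concatenate names to simulate an object-file blob.
    return (index, b"".join(f.encode("utf-8") for f in functions))


def compile_all(partitions, parallelism):
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(lambda p: compile_partition(*p),
                                enumerate(partitions)))
    # Emit in deterministic partition order, regardless of completion order.
    results.sort(key=lambda r: r[0])
    return [blob for _, blob in results]
```

Because the output list is keyed by partition index, running with 1 worker or 16 yields byte-identical results.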

SingleAccretion commented 1 year ago

#2297 will put the basic version in place; we need more (in no particular order):

1) Investigate why the ILC compile doesn't saturate all cores.
2) Investigate using a more efficient allocator by default. The ILC profile is quite heavy on CRT allocator routines (not terribly surprising, as that's pretty much what our codegen does). See https://reviews.llvm.org/D71786.
3) Size the number of modules based on the input so that scaling on machines with more than 8 logical cores is possible.
4) Move the LLVM -> WASM compilation in-process.
5) Utilize thin LTO by default, full LTO with an opt-in option (perhaps under '$(OptimizationPreference)' != '').
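For item 3, one possible sizing rule (a hedged sketch, not the planned implementation; `target_per_module` and the clamp bounds are invented for illustration): derive the module count from the input size, never from the core count, so partitioning stays deterministic while still producing enough modules to scale past 8 cores.

```python
import math

# Hypothetical sizing rule: module count is a pure function of the INPUT
# (here, the total number of functions), clamped to a sane range. Because
# the machine's core count never enters the formula, builds stay
# deterministic across machines.


def module_count(total_functions, target_per_module=2000, lo=1, hi=64):
    return max(lo, min(hi, math.ceil(total_functions / target_per_module)))
```

Small inputs collapse to one module (preserving CQ), while large inputs fan out far enough to feed many cores.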