dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[mono] Improve inliner, re-use RyuJIT heuristics #34286

Open EgorBo opened 4 years ago

EgorBo commented 4 years ago

RyuJIT's heuristics (at least some of them) for inlining could be re-used in Mono, since our inliner is quite conservative and only takes IL size into account (20 bytes for the JIT, 30 bytes for the LLVM backend). RyuJIT also estimates native code size and is able to inline methods above the IL threshold: it uses so-called observations (about performance and size impact) and increases/decreases a benefit multiplier based on them.

We could start from some simple heuristics, e.g. "Inline candidate has an arg that feeds a constant test", to inline methods above our IL size limit. For example:

public bool IsValid(IntPtr handle)
{
    return handle != IntPtr.Zero; // will IntPtr.op_Inequality be inlined here?
}

IL for System.IntPtr:op_Inequality(long,long):bool:

IL_0000  ldarga.s     0x0
IL_0002  ldfld        0x4000402
IL_0007  ldarga.s     0x1
IL_0009  ldfld        0x4000402
IL_000e  ceq
IL_0010  ldc.i4.0
IL_0011  ceq
IL_0013  ret

RyuJIT's log:

multiplier in methods of promotable struct increased to 3.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 4.
Inline candidate is mostly loads and stores.  Multiplier increased to 7.
Inline candidate has const arg that feeds a conditional.  Multiplier increased to 10. <-- !!!
Inline candidate callsite is boring.  Multiplier increased to 11.3.
calleeNativeSizeEstimate=112
callsiteNativeSizeEstimate=115
benefit multiplier=11.3
threshold=1299
Native estimate for function size is within threshold for inlining 11.2 <= 129.9 (multiplier = 11.3)
(success!)
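The decision in that log can be modeled roughly as follows. This is a simplified sketch: the function name, parameter names, and the tenths-based scaling are inferred from the log output above, not taken from RyuJIT's actual source.

```python
# Rough model of the inline-profitability check implied by the log above.
# All names are illustrative; RyuJIT's real implementation differs.

def should_inline(callee_native_size_estimate: int,
                  callsite_native_size_estimate: int,
                  benefit_multiplier: float) -> bool:
    # The log prints sizes in tenths (112 -> "11.2"), and the threshold
    # appears to be the callsite size estimate scaled by the multiplier:
    # threshold = 115 * 11.3 ~= 1299 -> "129.9".
    threshold = callsite_native_size_estimate * benefit_multiplier
    return callee_native_size_estimate <= threshold

# Numbers from the log above:
print(should_inline(112, 115, 11.3))  # -> True (11.2 <= 129.9 in the log's units)
```

Note how a high multiplier lets a fairly large callee (112 units of estimated native code) sail under the threshold.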

Mono's JIT (without LLVM), by contrast, simply refuses to inline it based on IntPtr.op_Inequality's IL size (here). So even a simple "candidate has a const arg that feeds a conditional" multiplier could help it jump over the limit and produce faster and smaller codegen for this case.
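For contrast, Mono's current policy can be sketched as a pure IL-size cutoff. This is a hypothetical model: the limits come from the numbers quoted above, and the strict comparison is an assumption, not a quote from Mono's source.

```python
# Hypothetical model of Mono's IL-size-only inlining check.
INLINE_LIMIT_JIT = 20   # bytes of IL, per the issue text
INLINE_LIMIT_LLVM = 30  # bytes of IL for the LLVM backend

def mono_can_inline(il_size: int, llvm: bool = False) -> bool:
    limit = INLINE_LIMIT_LLVM if llvm else INLINE_LIMIT_JIT
    # Assumed strict comparison: IntPtr.op_Inequality is exactly 20 bytes
    # of IL (the listing above ends with a one-byte ret at IL_0013, so
    # 0x14 = 20 bytes), which would explain the rejection at a 20-byte limit.
    return il_size < limit

print(mono_can_inline(20))             # -> False (rejected by the JIT)
print(mono_can_inline(20, llvm=True))  # -> True  (fits the LLVM limit)
```

Whatever the exact comparison, the point stands: no property of the callee other than IL size can rescue the decision.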

/cc @BrzVlad @vargaz @lewurm @marek-safar

I recently tried to implement my own RyuJIT heuristic just to learn how it works: https://github.com/dotnet/runtime/issues/33338#issuecomment-596149604

BrzVlad commented 4 years ago

I'm not sure how long this would take to do, but as a quick workaround we could also mark the methods of the IntPtr class as AggressiveInlining.

AndyAyersMS commented 4 years ago

FWIW, I have ambitions of rewriting all the RyuJIT inlining heuristics at some point. Some of what's there is worth keeping, but there are lots of areas that need improvement.

Pros:

Cons:

Challenges:

GSPP commented 4 years ago

It seems to be a recurring theme that some optimizations are skipped because the necessary analysis is not available yet at that point in the pipeline, or is too expensive altogether. For example, the design document says:

Currently the jit walks its IR in linear fashion, deciding whether to inline each time it sees a candidate. If the decision is yes, then the inlined code is spliced in place of the call and (because of the order of the walk) immediately scanned for inlining candidates. Thus the inlining is performed "depth first" and is done without much knowledge of the number or location of other candidates in the code stream. Inlining is done very early on, before any significant analysis has been done to the IR -- there is a flow graph, but no loop nesting, dataflow, or profile estimates are generally available.
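The depth-first order described in that quote can be sketched like this. It is a toy model, not the jit's actual IR walk; the call graph and names are invented for illustration.

```python
# Toy model of depth-first inlining: each inlined body is immediately
# re-scanned for candidates, before the rest of the caller is visited.
CALL_GRAPH = {           # method -> call sites in its body (illustrative)
    "Caller": ["A", "B"],
    "A": ["C"],
    "B": [],
    "C": [],
}

def inline_walk(method, order):
    for callee in CALL_GRAPH[method]:
        order.append(callee)          # decide at the call site...
        inline_walk(callee, order)    # ...then immediately scan the
                                      # spliced-in body for more candidates

order = []
inline_walk("Caller", order)
print(order)  # -> ['A', 'C', 'B']: C is reached before B (depth first)
```

The point of the model: when deciding about A, the walk knows nothing about B yet, which is exactly the limitation the design document describes.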

Could precomputation help with this? There could be a build step that analyzes a set of assemblies and writes a summary useful for optimizations to disk. The summary could contain a call graph, for example. It could also contain profile data.

The analysis tool could also concretely evaluate every possible inlining decision by performing the inlining and then simplifying. This would expose inlining opportunities that result in code size reduction but that are not obvious to any simple heuristic.

Data useful for optimizations other than inlining could be included as well. A data flow summary comes to mind (e.g. "this variable can never be null; this other variable always is greater than zero").
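A precomputed summary of the kind proposed here could be as simple as a per-method record persisted next to the assembly. This is a sketch with invented field names; it does not reflect any actual .NET tool or file format.

```python
# Sketch of an offline analysis summary that a build step could emit
# and the jit could consult when making inlining decisions.
import json

summary = {
    "IntPtr.op_Inequality": {
        "il_size": 20,
        "callees": [],                      # leaf method: easy to inline
        "facts": ["mostly_loads_and_stores"],
    },
    "Program.IsValid": {
        "il_size": 12,                      # illustrative number
        "callees": ["IntPtr.op_Inequality"],
        "facts": ["arg0_feeds_conditional"],
    },
}

blob = json.dumps(summary)          # persisted by the analysis build step
loaded = json.loads(blob)           # looked up by the jit at compile time
print(loaded["Program.IsValid"]["callees"])  # -> ['IntPtr.op_Inequality']
```

The "facts" entries mirror the kind of observations RyuJIT's log prints, but computed once ahead of time instead of per-jit.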

This technique would not amount to AOT compilation: the data would be neither machine-specific nor versioning-brittle.

AndyAyersMS commented 4 years ago

Could precomputation help with this? Data useful for optimizations other than inlining could be included as well...

Yes, it is one of the things we plan to look into -- using precompilation analysis to leave information for jitting, either as "facts" or "hints".

The analysis tool could also concretely evaluate every possible inlining decision

Combinatorics makes this infeasible: if there are N top-level call sites, there are 2^N possible ways to inline them, all of which may produce different codegen.
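The blow-up is easy to make concrete; this is plain arithmetic, assuming nothing beyond the 2^N claim above.

```python
# Each of N top-level call sites can independently be inlined or not,
# so there are 2**N distinct inlining configurations to evaluate.
for n in (10, 20, 40):
    print(n, 2 ** n)
# Even 40 call sites give 2**40 = 1099511627776 (~1.1e12) configurations,
# far too many to compile and compare exhaustively.
```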

SamMonoRT commented 1 year ago

@jandupej - something for 9.0