dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[mono] Improve inliner, re-use RyuJIT heuristics #34286

Open EgorBo opened 4 years ago

EgorBo commented 4 years ago

RyuJIT's heuristics (at least some of them) for inlining could be re-used in Mono, since our inliner is quite conservative and only takes IL size into account (20 bytes for the JIT, 30 bytes for the LLVM backend). RyuJIT also estimates native code size and is able to inline methods above the IL threshold: it uses so-called observations (about performance and size impact) and increases/decreases a benefit multiplier based on them.

We could start from some simple heuristics, e.g. "Inline candidate has an arg that feeds a constant test", to inline methods above our IL size limit. For example:

public bool IsValid(IntPtr handle)
{
    return handle != IntPtr.Zero; // will IntPtr.op_Inequality be inlined here?
}

IL for System.IntPtr:op_Inequality(long,long):bool:

IL_0000  ldarga.s     0x0
IL_0002  ldfld        0x4000402
IL_0007  ldarga.s     0x1
IL_0009  ldfld        0x4000402
IL_000e  ceq
IL_0010  ldc.i4.0
IL_0011  ceq
IL_0013  ret

RyuJIT's log:

multiplier in methods of promotable struct increased to 3.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 4.
Inline candidate is mostly loads and stores.  Multiplier increased to 7.
Inline candidate has const arg that feeds a conditional.  Multiplier increased to 10. <-- !!!
Inline candidate callsite is boring.  Multiplier increased to 11.3.
calleeNativeSizeEstimate=112
callsiteNativeSizeEstimate=115
benefit multiplier=11.3
threshold=1299
Native estimate for function size is within threshold for inlining 11.2 <= 129.9 (multiplier = 11.3)
(success!)
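The decision in that log can be modeled roughly as follows. This is a simplified sketch: the function name, parameter names, and the tenths-based scaling are inferred from the log output above, not taken from RyuJIT's actual source.

```python
# Rough model of the inline-profitability check implied by the log above.
# All names are illustrative; RyuJIT's real implementation differs.

def should_inline(callee_native_size_estimate: int,
                  callsite_native_size_estimate: int,
                  benefit_multiplier: float) -> bool:
    # The log prints sizes in tenths (112 -> "11.2"), and the threshold
    # appears to be the callsite size estimate scaled by the multiplier:
    # threshold = 115 * 11.3 ~= 1299 -> "129.9".
    threshold = callsite_native_size_estimate * benefit_multiplier
    return callee_native_size_estimate <= threshold

# Numbers from the log above:
print(should_inline(112, 115, 11.3))  # -> True (11.2 <= 129.9 in the log's units)
```

Note how a high multiplier lets a fairly large callee (112 units of estimated native code) sail under the threshold.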

Mono's JIT (without LLVM), by contrast, simply refuses to inline it based on IntPtr.op_Inequality's IL size (here). So even a simple "candidate has a const arg that feeds a conditional" multiplier could help it jump over the limit and produce faster and smaller codegen for this case.
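For contrast, Mono's current policy can be sketched as a pure IL-size cutoff. This is a hypothetical model: the limits come from the numbers quoted above, and the strict comparison is an assumption, not a quote from Mono's source.

```python
# Hypothetical model of Mono's IL-size-only inlining check.
INLINE_LIMIT_JIT = 20   # bytes of IL, per the issue text
INLINE_LIMIT_LLVM = 30  # bytes of IL for the LLVM backend

def mono_can_inline(il_size: int, llvm: bool = False) -> bool:
    limit = INLINE_LIMIT_LLVM if llvm else INLINE_LIMIT_JIT
    # Assumed strict comparison: IntPtr.op_Inequality is exactly 20 bytes
    # of IL (the listing above ends with a one-byte ret at IL_0013, so
    # 0x14 = 20 bytes), which would explain the rejection at a 20-byte limit.
    return il_size < limit

print(mono_can_inline(20))             # -> False (rejected by the JIT)
print(mono_can_inline(20, llvm=True))  # -> True  (fits the LLVM limit)
```

Whatever the exact comparison, the point stands: no property of the callee other than IL size can rescue the decision.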

/cc @BrzVlad @vargaz @lewurm @marek-safar

I recently tried to implement my own RyuJIT heuristic just to learn how it works: https://github.com/dotnet/runtime/issues/33338#issuecomment-596149604

BrzVlad commented 4 years ago

I'm not sure how long this would take to do, but as a quick workaround we could also mark the methods of the IntPtr class as AggressiveInlining.

AndyAyersMS commented 4 years ago

FWIW, I have ambitions of rewriting all the RyuJIT inlining heuristics at some point. Some of what's there is worth keeping, but there are lots of areas that need improvement.

Pros:

Cons:

Challenges:

GSPP commented 4 years ago

It seems to be a recurring theme that some optimizations are skipped because the necessary analysis is not available yet at that point in the pipeline, or is too expensive altogether. For example, the design document says:

Currently the jit walks its IR in linear fashion, deciding whether to inline each time it sees a candidate. If the decision is yes, then the inlined code is spliced in place of the call and (because of the order of the walk) immediately scanned for inlining candidates. Thus the inlining is performed "depth first" and is done without much knowledge of the number or location of other candidates in the code stream. Inlining is done very early on, before any significant analysis has been done to the IR -- there is a flow graph, but no loop nesting, dataflow, or profile estimates are generally available.
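The depth-first order described in that quote can be sketched like this. It is a toy model, not the jit's actual IR walk; the call graph and names are invented for illustration.

```python
# Toy model of depth-first inlining: each inlined body is immediately
# re-scanned for candidates, before the rest of the caller is visited.
CALL_GRAPH = {           # method -> call sites in its body (illustrative)
    "Caller": ["A", "B"],
    "A": ["C"],
    "B": [],
    "C": [],
}

def inline_walk(method, order):
    for callee in CALL_GRAPH[method]:
        order.append(callee)          # decide at the call site...
        inline_walk(callee, order)    # ...then immediately scan the
                                      # spliced-in body for more candidates

order = []
inline_walk("Caller", order)
print(order)  # -> ['A', 'C', 'B']: C is reached before B (depth first)
```

The point of the model: when deciding about A, the walk knows nothing about B yet, which is exactly the limitation the design document describes.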

Could precomputation help with this? There could be a build step that analyzes a set of assemblies and writes a summary useful for optimizations to disk. The summary could contain a call graph, for example. It could also contain profile data.

The analysis tool could also concretely evaluate every possible inlining decision by performing the inlining and then simplifying. This would expose inlining opportunities that result in code size reduction but that are not obvious to any simple heuristic.

Data useful for optimizations other than inlining could be included as well. A data flow summary comes to mind (e.g. "this variable can never be null; this other variable always is greater than zero").
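A precomputed summary of the kind proposed here could be as simple as a per-method record persisted next to the assembly. This is a sketch with invented field names; it does not reflect any actual .NET tool or file format.

```python
# Sketch of an offline analysis summary that a build step could emit
# and the jit could consult when making inlining decisions.
import json

summary = {
    "IntPtr.op_Inequality": {
        "il_size": 20,
        "callees": [],                      # leaf method: easy to inline
        "facts": ["mostly_loads_and_stores"],
    },
    "Program.IsValid": {
        "il_size": 12,                      # illustrative number
        "callees": ["IntPtr.op_Inequality"],
        "facts": ["arg0_feeds_conditional"],
    },
}

blob = json.dumps(summary)          # persisted by the analysis build step
loaded = json.loads(blob)           # looked up by the jit at compile time
print(loaded["Program.IsValid"]["callees"])  # -> ['IntPtr.op_Inequality']
```

The "facts" entries mirror the kind of observations RyuJIT's log prints, but computed once ahead of time instead of per-jit.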

This technique would not amount to AOT compilation: the data would be neither machine-specific nor versioning-brittle.

AndyAyersMS commented 4 years ago

Could precomputation help with this? Data useful for optimizations other than inlining could be included as well...

Yes, it is one of the things we plan to look into -- using precompilation analysis to leave information for jitting, either as "facts" or "hints".

The analysis tool could also concretely evaluate every possible inlining decision

Combinatorics makes this infeasible: if there are N top-level call sites, there are 2^N possible ways to inline them, all of which may produce different codegen.
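The blow-up is easy to make concrete; this is plain arithmetic, assuming nothing beyond the 2^N claim above.

```python
# Each of N top-level call sites can independently be inlined or not,
# so there are 2**N distinct inlining configurations to evaluate.
for n in (10, 20, 40):
    print(n, 2 ** n)
# Even 40 call sites give 2**40 = 1099511627776 (~1.1e12) configurations,
# far too many to compile and compare exhaustively.
```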

SamMonoRT commented 1 year ago

@jandupej - something for 9.0