dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Increase JIT compiler throughput #6857

Open pkukol opened 7 years ago

pkukol commented 7 years ago

The following is a list of areas being considered for throughput improvements in the near future. If anyone wants to help with these, just add a note here; if we get lots of volunteers we can establish a simple tracking system. Note that this issue addresses the specific goal of speeding up IL-to-machine-code conversion in RyuJIT. In other words, things like better heuristics for deciding when to use (or not use) MinOpts, or finding more ways to avoid compilation at runtime (via crossgen or otherwise), are not covered here.

  1. Importing IL (typically takes 25% of overall JIT time):
    1. Cache results of calling through ICorJitInfo, such as IL token resolution:
      1. Option 1 - do this only for large methods, no caching across compilations.
      2. Option 2 - cache things "globally", across methods; requires retention policy.
    2. Cache the internal format of carefully chosen methods that are frequently inlined:
      1. This only helps "normal opts" unless we extend MinOpts to do "fast" inlining.
      2. Needs retention policy so memory (and any overhead for serializing the IR) is not wasted.
      3. Closely tied to the inliner so this should probably be integrated into the inline policy / etc logic.
    3. Recognize a tiny subset of trivial IL body patterns; for matched methods:
      1. Spit out "canned" IR to bypass the "full" import logic.
      2. Mark "trivial" bodies (or parts of bodies?) and add simplified processing downstream for such bodies (e.g. no jumps, no stores to locals, or whatever).
    4. Find the most frequent paths through the most expensive JIT -> EE calls, and try to speed them up:
      1. For some calls the JIT may not need all the information the EE is currently returning -> add shortcuts and/or subset versions?
      2. Batch / overlap / defer EE calls:
        1. This has ramifications for class loading order / etc, so feasibility is an open question.
        2. When the importer asks about tokens / methods / fields, it doesn't need all the info immediately. Split the relevant (expensive) EE calls into two parts: the first - hopefully much faster than the whole - would return only the minimal info the importer must have right away; the rest could come on a separate thread, via an async callback, or something like that.
        3. If we do any "quick look at IL" processing (e.g. to do some of the stuff above for "trivial" methods) we could fire off calls to ask about the tokens we encounter.
  2. Speed up genGenerateCode, i.e. the far back end (takes about 10-20% of total time):
    1. Speed up GC info gathering and encoding:
      1. GC encoder: avoid sorting when possible, speed up bit-twiddling, etc.
      2. Instruction encoding: try to speed up the most frequent emitXxxxx methods, such as emitter::emitGCregLiveUpd().
      3. Speed up GCInfo::gcMakeRegPtrTable() and related logic.
      4. Other things, such as the scope tracking stuff (CodeGen::siBeginBlock etc).
      5. Note: probably ignore CodeGen::genCodeForTreeNode() even though it's up to 5% - way too many little pieces.
  3. Speed up LSRA (around 20% of total compile time) and make it consume a lot less memory:
    1. Punting this entirely to the RA specialists.
  4. Slim down Lowering::DoPhase (currently up to 10% of JIT time):
    1. Completely bypass fgInterBlockLocalVarLiveness() - up to 2% of total time.
    2. Avoid doing lvaSortByRefCount() for very small numbers of variables (0.5% of total time).
    3. Other things? Returns probably diminish rapidly.
  5. Speed up Morph (around 8% of total compile time):
    1. Spend less time recursively walking trees for MinOpts.
  6. Global improvements (few percent):
    1. Shrink GenTree nodes.
    2. Speed up tree walks.
    3. Allocate memory in larger chunks.
  7. Skip more things for MinOpts (few percent):
    1. Skip (parts of) liveness analysis (also see above)?
    2. Bypass some tree optimizations (just do the simplest / easiest thing).
    3. Skip "ordering" passes (evalOrder/blockOrder) for trivial or reducible CFGs or some such.
    4. Short-cut things like lvaSortByRefCount() for small numbers of variables, etc.
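To make item 1.1 concrete, a per-compilation token cache ("Option 1", no cross-compilation retention) could look roughly like the sketch below. `ResolvedToken`, `resolveTokenViaEE`, and `TokenCache` are illustrative names, not actual RyuJIT/ICorJitInfo types:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for the data returned by the EE's token-resolution call.
struct ResolvedToken {
    uint32_t token;
    void*    handle;
};

// Hypothetical expensive JIT -> EE round-trip we want to avoid repeating.
static int g_eeCalls = 0;
ResolvedToken resolveTokenViaEE(uint32_t token) {
    ++g_eeCalls;                          // count round-trips for the demo
    return ResolvedToken{ token, nullptr };
}

// Option 1 from the issue: the cache lives only for one compilation,
// so no cross-compilation retention policy is needed.
class TokenCache {
    std::unordered_map<uint32_t, ResolvedToken> map_;
public:
    const ResolvedToken& resolve(uint32_t token) {
        auto it = map_.find(token);
        if (it == map_.end())
            it = map_.emplace(token, resolveTokenViaEE(token)).first;
        return it->second;
    }
};
```

The "do this only for large methods" variant would just skip constructing the cache below some IL-size threshold, since small methods rarely resolve the same token twice.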
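For item 1.3, one of the "tiny subset of trivial IL body patterns" is the canonical instance field getter: `ldarg.0; ldfld <token>; ret`, which is exactly 7 bytes of IL. The opcode values below are the real ECMA-335 encodings; the function name and surrounding plumbing are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pattern-match one trivial IL shape (an instance field getter) so the
// importer could emit canned IR instead of running the full import loop.
bool isTrivialFieldGetter(const uint8_t* il, size_t len, uint32_t* fieldToken) {
    if (len != 7) return false;
    if (il[0] != 0x02) return false;      // ldarg.0
    if (il[1] != 0x7B) return false;      // ldfld (followed by 4-byte token)
    if (il[6] != 0x2A) return false;      // ret
    // IL metadata tokens are stored little-endian.
    *fieldToken = (uint32_t)il[2] | ((uint32_t)il[3] << 8)
                | ((uint32_t)il[4] << 16) | ((uint32_t)il[5] << 24);
    return true;
}
```

A matched body could also be flagged as "trivial" so downstream phases (no jumps, no stores to locals) can take simplified paths, per item 1.3.2.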
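Item 6.3 ("allocate memory in larger chunks") is the classic arena/bump-pointer pattern: grab big pages up front and hand out aligned slices, so each per-node allocation is a pointer bump rather than a heap call. This is a minimal sketch; the names and the 64 KB page size are illustrative, not the JIT's actual allocator:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

class Arena {
    static constexpr size_t kPageSize = 64 * 1024;  // illustrative page size
    std::vector<void*> pages_;
    char*  cur_  = nullptr;
    size_t left_ = 0;
public:
    ~Arena() { for (void* p : pages_) std::free(p); }

    void* alloc(size_t size) {
        size = (size + 7) & ~size_t(7);             // round up to 8-byte alignment
        if (size > left_) {                          // current page exhausted
            size_t page = size > kPageSize ? size : kPageSize;
            cur_ = (char*)std::malloc(page);
            pages_.push_back(cur_);
            left_ = page;
        }
        void* p = cur_;                              // bump-pointer allocation
        cur_  += size;
        left_ -= size;
        return p;
    }
};
```

Freeing everything at once when the compilation finishes (the destructor here) is what makes this cheap: there is no per-node bookkeeping at all.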
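Items 4.2 and 7.4 both suggest short-cutting lvaSortByRefCount() when there are very few locals. The sketch below shows the shape of that short-cut under a simplifying assumption (that downstream code only needs *an* order, and with so few locals every one of them gets tracked anyway); the threshold value and function signature are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

static const size_t kSmallVarThreshold = 4;        // illustrative cutoff

// Returns variable indices ordered by descending ref count, except that
// below the threshold the identity order is kept and the O(n log n)
// sort is skipped entirely.
std::vector<int> sortByRefCount(const std::vector<int>& refCounts) {
    std::vector<int> order(refCounts.size());
    for (size_t i = 0; i < order.size(); ++i)
        order[i] = (int)i;
    if (refCounts.size() <= kSmallVarThreshold)
        return order;                              // short-cut: skip the sort
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return refCounts[a] > refCounts[b];        // descending by ref count
    });
    return order;
}
```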
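Item 6.1 ("shrink GenTree nodes") is mostly about packing: folding bools and small enums into bit-fields so each IR node takes fewer bytes and more nodes fit per cache line (which also helps the tree walks in 6.2). The two structs below illustrate the idea only; the field names and widths are made up, not RyuJIT's layout:

```cpp
#include <cassert>
#include <cstdint>

struct FatNode {                 // naive layout: each flag burns a byte plus padding
    uint32_t oper;
    uint32_t type;
    bool     hasSideEffects;
    bool     isCallArg;
    void*    op1;
    void*    op2;
};

struct PackedNode {              // same information, packed into one 32-bit word
    uint32_t oper           : 10;   // plenty for the opcode space (assumed)
    uint32_t type           : 5;
    uint32_t hasSideEffects : 1;
    uint32_t isCallArg      : 1;
    void*    op1;
    void*    op2;
};
```

The trade-off is that bit-field reads/writes cost a mask-and-shift, so this only pays off where node count (allocation and cache pressure) dominates field access frequency.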

category:throughput theme:throughput skill-level:expert cost:extra-large

mazong1123 commented 7 years ago

Count me in. I'd like to help on this.

TIHan commented 8 months ago

It's worth revisiting each item in this list; there may still be possible throughput wins to be had.