dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[wasm] Jiterpreter tracking issue #78428

Open kg opened 1 year ago

kg commented 1 year ago

The jiterpreter (#76477) has pending work needed:

Archived items

ghost commented 1 year ago

Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.

Issue Details
The jiterpreter has pending work needed:

- [ ] Migrate configuration to options.h (requires improvements to the API)
- [ ] Enable jiterpreter features by default
- [ ] Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
- [ ] Run statistics on blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
- [ ] Investigate integrating jit calls directly into compiled traces
- [ ] Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths, as seen here: ![image](https://user-images.githubusercontent.com/198130/202034944-c1fb3439-b564-4fcc-9210-0f02be89c864.png)
- [ ] Threading support (incomplete draft to-do list)
  * [ ] Synchronize wasm function pointer table growth across threads
  * [ ] Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
  * [ ] When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
  * [ ] Thread-safe interpreter opcode patching
  * [ ] Thread-safe do_jit_call pointer/cache updates
- [ ] Caching
  * [ ] Cache jitted traces across page loads
  * [ ] Cache do_jit_call trampolines across page loads
  * [ ] Cache interp_entry wrappers across page loads

Author: kg
Assignees: kg
Labels: `arch-wasm`
Milestone: -
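
The error-handling item in the list above (shut the jiterpreter off after a handful of JIT failures) amounts to a failure counter with a cutoff. The Python sketch below illustrates that idea only. The class name `JitFailureGuard`, the failure limit, and the `compile_fn` callback are hypothetical and are not taken from the runtime's source.

```python
class JitFailureGuard:
    """Hypothetical sketch: stop attempting JIT compiles after repeated failures."""

    def __init__(self, max_failures=4):
        # max_failures is an illustrative value, not the runtime's actual limit.
        self.max_failures = max_failures
        self.failures = 0
        self.enabled = True
        self.messages = []  # stands in for console output

    def try_compile(self, compile_fn, trace_id):
        """Return the compiled result, or None so the caller stays in the interpreter."""
        if not self.enabled:
            return None  # disabled: no CPU time spent and nothing logged
        try:
            return compile_fn(trace_id)
        except Exception as exc:
            self.failures += 1
            self.messages.append(f"jit failed for trace {trace_id}: {exc}")
            if self.failures >= self.max_failures:
                self.enabled = False
                self.messages.append("too many jit failures; disabling jit")
            return None


def always_fails(trace_id):
    # Assumed stand-in for a compile step that hits an unsupported case.
    raise RuntimeError("unsupported opcode")


guard = JitFailureGuard(max_failures=3)
results = [guard.try_compile(always_fails, i) for i in range(10)]
```

Once the limit is reached, later calls return immediately, so the log stops growing and no further compile attempts are made.
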
ghost commented 1 year ago

Tagging subscribers to this area: @brzvlad, @kotlarmilos See info in area-owners.md if you want to be subscribed.

Issue Details
The jiterpreter (#76477) has pending work needed:

- [ ] Introduce a Jiterpreter CI lane that sets all the tiering thresholds low so that we flush out any issues with obscure interp opcodes or cold code
- [ ] Investigate integrating jit calls directly into compiled traces
- [ ] Investigate integrating icalls directly into compiled traces
- [ ] Run statistics on blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
- [ ] Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths, as seen here: ![image](https://user-images.githubusercontent.com/198130/202034944-c1fb3439-b564-4fcc-9210-0f02be89c864.png)
  * [ ] Maintain a table of vtable slots containing interp_entry _in wrappers, then patch the vtables (design pending)
  * [x] When generating dedicated do_jit_call routine, punch through the _out wrapper (model on the mini-generic-sharing.c code generator)
  * [x] Optimize direct jit calls to turn the common ldloca sp + offset -> tnn.load pair into tnn_load offset
  * [x] Optimize out passing of ftndesc arg to direct jit call wrappers, target and rgctx can be compiled in (not possible due to generic sharing)
- [ ] Cache interpreter stack locals in wasm locals, then flush them back to the interpreter stack on exit
- [ ] Cache non-volatile fields in wasm locals, then flush them back to the heap on exit
- [ ] Threading support (incomplete draft to-do list)
  * [ ] Synchronize wasm function pointer table growth across threads
  * [ ] Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
  * [ ] When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
  * [ ] Thread-safe interpreter opcode patching
  * [ ] Thread-safe do_jit_call pointer/cache updates
- [ ] Multi-trace optimizations
  * [ ] For traces with an offset other than 0 (large ones only?) attempt to reuse other existing traces?
  * [ ] Stop compiling traces when we encounter an already-compiled trace (likely function prologue -> loop body)
  * [ ] When we encounter an already compiled trace, call it directly from the current trace
- [ ] Heuristic improvements
  * [x] Don't put trace entry points too close together
  * [ ] If a trace is likely to conditionally abort early in its execution, don't insert an entry point (requires interpreter to mark blocks as unlikely if they contain a throw)
  * [x] Identify causes of heuristic accuracy only being ~95% on S.R.T.
  * [ ] Add 'estimated cost' value for each opcode to mintops.def that estimates cost of running it in interp
  * [x] Add estimated cost value for each jiterpreter opcode that estimates the quality of generated wasm code
  * [ ] Instead of using trace length heuristic, only keep traces where estimated jiterpreter cost <= interp cost
  * [ ] Ensure new system keeps short high value traces like Vector128.Add-with-SIMD
  * [ ] Improve estimated jiterpreter cost by factoring in (measured on v8 and/or spidermonkey) cost of entering a trace
  * [ ] Factor in the lack of branch prediction when estimating cost of jiterpreter branches like null checks
- [ ] Control flow improvements
  * [x] Basic backwards branch implementation
  * [x] Implement CFG tracker that assembles module at the end
  * [x] Eliminate branch block comparison(s) for forward branches
  * [ ] Eliminate branch block comparison(s) for backward branches
  * [x] Don't generate dispatch table entries for branch targets that cannot be reached by backward branches
  * [x] Don't generate a dispatch table if all back branches in a trace go to a single place
  * [ ] Identify cases where each back branch target is independent, and generate separate loops
  * [x] Record each CALL_HANDLER target and use that to implement ENDFINALLY
  * [x] When we emit an unconditional bailout, set a 'prune opcodes' flag and don't translate any unreachable opcodes after it until we hit a branch target block
  * [ ] Outline bailouts and exits to a shared return at the end of traces
  * [ ] Change all bailouts to be the form `if (cond) { br bailout_block }` or `br_if bailout_block`
- [ ] Monitoring phase improvements
  * [ ] Tune threshold
  * [x] Fix `Span.Reverse` regression
  * [x] Generate a mapping table from return values (we know the possible set) to executed opcode or uop count
  * [x] Set threshold in terms of opcodes or uops
  * [x] Discard mapping table after monitoring phase
- [ ] Load-to-store forwarding
  * [x] If a series of opcodes r/w overwrite a dreg, drop the store/load pair for the leading opcodes, i.e. `a = b * 2; a = a + 1;` (this turns out to make things slower in v8 for some reason, so prototype won't land)
  * [ ] Use a wasm local instead of leave-on-stack
  * [ ] Fully optimize out stores and loads for cases where the dreg is only read once by leaving it on the wasm stack
- [ ] Re-enable early trace abort with back branches active but only once a trace is long enough to justify it
- [ ] Add typecheck-free version of stelem_ref (only possible for sealed types, must be generated in interp)
- [ ] Update the msbuild targets to generate a single export arg to emcc instead of one per exported function
- [ ] Ensure IEEE spec compliance for the f32 and f64 opcodes that rely on libc or wasm opcodes
- [ ] Zero region optimizations
  * [x] Fuse null check and length check for arrays
  * [x] Fuse null check and length check for strings
  * [ ] Fuse null check and length check for spans
  * [x] Fuse null check and type check for MINT_CASTCLASS/MINT_ISINST
- [ ] Interpreter migration
  * [ ] Move cpblk unrolling into interpreter superinsn pass as mint_cpblk_imm
  * [ ] Add new null-check-free versions of hot field opcodes
  * [ ] Add new information table tracking things like known not-null state per local that are exposed to jiterpreter
  * [x] Consume information table from jiterpreter to do null check elimination
  * [ ] Optimize size of null check bitset as described in https://github.com/dotnet/runtime/pull/84058#discussion_r1155691712
  * [ ] Investigate migrating the trace generator into transform.c and doing it during the tiering process
- [ ] SIMD
  * [x] Implement interpreter V128 intrinsics
  * [x] Implement PackedSimd intrinsics
  * [x] Implement PackedSimd in interpreter or implement a jiterpreter passthrough mechanism
  * [x] Identify and fix the simd issue that causes testResults XML truncation on CI
  * [x] Enable interpreter V128 support on WASM by default
  * [x] Enable PackedSimd in interpreter mode by default
  * [x] Implement I2 and I4 shuffles
  * [ ] Use splat encoding for v128.const 0 once v8 ships optimization for it, or use an implicitly zero-initialized local
  * [x] Optimize constant I2 and I4 shuffle vectors
  * [ ] Implement the rest of PackedSimd
- [x] Raise interpreter inlining limit to 30
  * [ ] Investigate raising it a bit further
- [ ] Caching
  * [ ] Record a list of which methods are tiered in the interp so they can tier immediately on future runs
  * [ ] Record a list of which traces we compile so that we can compile them early on future runs
  * [ ] Cache jitted traces across page loads
  * [ ] Cache do_jit_call trampolines across page loads
  * [ ] Cache interp_entry wrappers across page loads
- [ ] Interp integration
  * [ ] If interpreter verbose is set for a method the jiterpreter should honor that

Archived items

- [x] Write a custom assembler and use it to generate and inline do-jit-call and simd detect modules ([#81691](https://github.com/dotnet/runtime/issues/81691))
- [x] Also unroll memcpy like memset
- [x] Investigate possible startup time regressions
- [x] Investigate possible .wasm size regressions
- [x] Update memmove unroller to ensure it does the correct thing for overlapping src/dest
- [x] Enable jiterpreter jitcall and interp_entry JITs by default
- [x] Enable jiterpreter traces by default
- [x] Don't bail out for safepoints
  * [x] Do the 'is a safepoint needed' check inline in the trace instead of in the import
- [x] Inline strlen into traces
- [x] Inline getchr_ref into traces
- [x] Inline getitem_span into traces
- [x] Inline get_element_address_with_size_ref into traces
- [x] Optimize out the eip local and initialization for traces containing no branches
- [x] Generate import section after generating function body and omit unused imports
- [x] Do another pass over intrinsics and superinsns to add any missing ones (like the log2 used for vectorization)
- [x] Remove generated opcode info table and fetch opcode info from the interpreter's tables on demand to reduce file size
- [x] Don't discard known not-null / known constant information when crossing branches, only branch targets
- [x] Migrate configuration to options.h (requires improvements to the API)
- [x] Verify that no debugging scenarios regress
- [x] Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
- [x] Optimize out memory.fill for common sizes (it produces an expensive function call on x86 and x64)
- [x] Handle jiterpreter opcodes in non-wasm interp using the same path as other unreachable opcodes
- [x] Fix floating point compares in jiterpreter

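
The "Heuristic improvements" items above propose replacing the trace-length heuristic with a comparison of estimated costs, including a fixed cost for entering a trace. The Python sketch below shows one way such a comparison could look. The cost tables, the opcode names, and `TRACE_ENTRY_COST` are invented for illustration and do not reflect values in mintops.def or the jiterpreter.

```python
# Hypothetical per-opcode cost estimates; every number here is made up.
INTERP_COST = {"ldloc": 2, "add": 2, "br": 3, "v128_add": 12}
JITERP_COST = {"ldloc": 1, "add": 1, "br": 2, "v128_add": 1}
TRACE_ENTRY_COST = 6  # assumed fixed overhead of entering a compiled trace


def should_keep_trace(opcodes):
    """Keep a trace only if its estimated compiled cost does not exceed interp cost."""
    interp = sum(INTERP_COST[op] for op in opcodes)
    jiterp = TRACE_ENTRY_COST + sum(JITERP_COST[op] for op in opcodes)
    return jiterp <= interp


short_scalar = ["ldloc", "add"]                # interp 4, jiterp 6 + 2 = 8
short_simd = ["ldloc", "ldloc", "v128_add"]    # interp 16, jiterp 6 + 3 = 9
long_scalar = ["ldloc", "add"] * 6 + ["br"]    # interp 27, jiterp 6 + 14 = 20

decisions = {
    "short_scalar": should_keep_trace(short_scalar),
    "short_simd": should_keep_trace(short_simd),
    "long_scalar": should_keep_trace(long_scalar),
}
```

Under these made-up numbers the short scalar trace is rejected because the entry cost dominates, while the equally short SIMD trace is kept. That is the behavior the "keep short high value traces" item asks for, which a pure length threshold cannot express.
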
Author: kg
Assignees: kg
Labels: `arch-wasm`, `area-Codegen-Interpreter-mono`
Milestone: 8.0.0
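
The archived memmove item above concerns unrolled copies where source and destination overlap. The Python sketch below shows why a naively unrolled load/store sequence goes wrong in that case, and one generic way to preserve memmove semantics (perform every load before any store). Both function names are hypothetical, and the thread does not say which approach the runtime's unroller actually takes.

```python
def unrolled_copy_interleaved(mem, dest, src, count):
    """Naive unrolling: load a byte, store it, repeat. Breaks when dest overlaps after src."""
    for i in range(count):
        mem[dest + i] = mem[src + i]


def unrolled_copy_load_first(mem, dest, src, count):
    """Perform every load before any store, so stores cannot clobber unread source bytes."""
    loaded = [mem[src + i] for i in range(count)]
    for i, value in enumerate(loaded):
        mem[dest + i] = value


# Copy 4 bytes from offset 0 to offset 2 within the same 6-byte buffer.
naive = bytearray(b"abcdef")
unrolled_copy_interleaved(naive, 2, 0, 4)

safe = bytearray(b"abcdef")
unrolled_copy_load_first(safe, 2, 0, 4)
```

Here `naive` ends up as `ababab` because the later loads read bytes that earlier stores already overwrote, while `safe` ends up as `ababcd`, which matches memmove semantics.
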
SamMonoRT commented 11 months ago

Moving tracking issues to 9.0