[wasm] Jiterpreter tracking issue

kg commented 1 year ago

The jiterpreter (#76477) has pending work needed:

[ ] Introduce a Jiterpreter CI lane that sets all the tiering thresholds low so that we flush out any issues with obscure interp opcodes or cold code
[ ] Investigate integrating jit calls directly into compiled traces
[ ] Investigate integrating icalls directly into compiled traces
[ ] Run statistics on blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
[ ] Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths (see the screenshot in the original issue):
- [ ] Maintain a table of vtable slots containing interp_entry _in wrappers, then patch the vtables (design pending)
- [x] When generating dedicated do_jit_call routine, punch through the _out wrapper (model on the mini-generic-sharing.c code generator)
- [x] Optimize direct jit calls to turn the common ldloca sp + offset -> tnn.load pair into tnn_load offset
- [x] Optimize out passing of ftndesc arg to direct jit call wrappers, target and rgctx can be compiled in (not possible due to generic sharing)
[ ] Cache interpreter stack locals in wasm locals, then flush them back to the interpreter stack on exit
[ ] Cache non-volatile fields in wasm locals, then flush them back to the heap on exit
[ ] Threading support (incomplete draft to-do list)
- [x] Pre-grow function pointer table to a set size at startup in each thread
- [x] Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
- [ ] When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
- [x] Thread-safe interpreter opcode patching
- [ ] Thread-safe do_jit_call pointer/cache updates
[ ] Multi-trace optimizations
- [ ] For traces with an offset other than 0 (large ones only?) attempt to reuse other existing traces?
- [ ] Stop compiling traces when we encounter an already-compiled trace (likely function prologue -> loop body)
- [ ] When we encounter an already compiled trace, call it directly from the current trace
[ ] Heuristic improvements
- [x] Don't put trace entry points too close together
- [ ] If a trace is likely to conditionally abort early in its execution, don't insert an entry point (requires interpreter to mark blocks as unlikely if they contain a throw)
- [x] Add 'estimated cost' value for each opcode to mintops.def that estimates cost of running it in interp
- [x] Add estimated cost value for each jiterpreter opcode that estimates the quality of generated wasm code
- [ ] Instead of using trace length heuristic, only keep traces where estimated jiterpreter cost <= interp cost
- [x] Ensure new system keeps short high value traces like Vector128.Add-with-SIMD
- [ ] Improve estimated jiterpreter cost by factoring in (measured on v8 and/or spidermonkey) cost of entering a trace
- [ ] Factor in the lack of branch prediction when estimating cost of jiterpreter branches like null checks
- [ ] Insert entry points periodically in very large basic blocks so that the jiterp can resume when a trace ends due to being too large
[ ] Control flow improvements
- [x] Basic backwards branch implementation
- [x] Implement CFG tracker that assembles module at the end
- [x] Eliminate branch block comparison(s) for forward branches
- [ ] Eliminate branch block comparison(s) for backward branches
- [x] Don't generate dispatch table entries for branch targets that cannot be reached by backward branches
- [x] Don't generate a dispatch table if all back branches in a trace go to a single place
- [ ] Identify cases where each back branch target is independent, and generate separate loops
- [x] Record each CALL_HANDLER target and use that to implement ENDFINALLY
- [x] When we emit an unconditional bailout, set a 'prune opcodes' flag and don't translate any unreachable opcodes after it until we hit a branch target block
- [ ] Outline bailouts and exits to a shared return at the end of traces
- [x] Change all bailouts to be of the form if (cond) { br bailout_block } or br_if bailout_block
[ ] Monitoring phase improvements
- [ ] Tune threshold
- [x] Generate a mapping table from return values (we know the possible set) to executed opcode or uop count
- [x] Set threshold in terms of opcodes or uops
- [x] Discard mapping table after monitoring phase
[ ] Store-to-load forwarding
- [x] If a series of opcodes repeatedly read and overwrite the same dreg, drop the store/load pair for the leading opcodes, e.g. a = b * 2; a = a + 1; (this turns out to make things slower in v8 for some reason, so the prototype won't land)
- [ ] Use a wasm local instead of leave-on-stack
- [ ] Fully optimize out stores and loads for cases where the dreg is only read once by leaving it on the wasm stack
- [x] Forward constants from their most recent store to load(s) that use them (https://github.com/dotnet/runtime/pull/99706)
[ ] Re-enable early trace abort with back branches active but only once a trace is long enough to justify it
[ ] Add typecheck-free version of stelem_ref (only possible for sealed types, must be generated in interp) (https://github.com/dotnet/runtime/pull/99829)
[ ] Update the msbuild targets to generate a single export arg to emcc instead of one per exported function
[ ] Ensure IEEE spec compliance for the f32 and f64 opcodes that rely on libc or wasm opcodes
[ ] Cache the this-reference (locals[0]) in a wasm local since it can't change
[ ] Zero region optimizations
- [x] Fuse null check and length check for arrays
- [x] Fuse null check and length check for strings
- [ ] Fuse null check and length check for spans
- [x] Fuse null check and type check for MINT_CASTCLASS/MINT_ISINST
[ ] Interpreter integration
- [ ] Move cpblk unrolling into interpreter superinsn pass as mint_cpblk_imm
- [ ] Add new null-check-free versions of hot field opcodes
- [ ] Add new information table tracking things like known not-null state per local that are exposed to jiterpreter
- [x] Consume information table from jiterpreter to do null check elimination
- [ ] Optimize size of null check bitset as described in https://github.com/dotnet/runtime/pull/84058#discussion_r1155691712
- [ ] Investigate migrating the trace generator into transform.c and doing it during the tiering process
- [x] If interpreter verbose is set for a method the jiterpreter should honor that
[x] SIMD
[x] Raise interpreter inlining limit to 30
- [ ] Investigate raising it a bit further
[ ] Caching / PGO
- [ ] Record a list of which methods are tiered in the interp so they can tier immediately on future runs
- [ ] Record a list of which traces we compile so that we can compile them early on future runs
- [ ] Cache jitted traces across page loads
- [ ] Cache do_jit_call trampolines across page loads
- [ ] Cache interp_entry wrappers across page loads
[x] Make sure that call_handler/leave work correctly in the event that we bail out from a trace into the interp (https://github.com/dotnet/runtime/issues/98577)
[ ] Cleanup
- [ ] Remove most jiterp cprop once we can rely on the interpreter to do it, for correctness reasons
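
For the 'Heuristic improvements' items above, here is a minimal sketch of what a cost-based keep/discard check could look like. Everything in it is an assumption made for illustration: the opcode names, the cost numbers, and the shouldKeepTrace helper are hypothetical. The real estimates would come from mintops.def and the jiterpreter's own opcode tables, and the entry cost would have to be measured on v8 and/or spidermonkey.

```typescript
// Hypothetical per-opcode estimates (placeholder names and numbers).
interface OpcodeCost {
    interp: number; // estimated cost of running the opcode in the interpreter
    jiterp: number; // estimated cost of the wasm the jiterpreter generates for it
}

const costs: Record<string, OpcodeCost> = {
    add: { interp: 4, jiterp: 1 },
    null_check: { interp: 3, jiterp: 2 },
    simd_add: { interp: 40, jiterp: 2 },
};

// Keep a trace only if its estimated jiterpreter cost, including the fixed
// cost of entering the trace, does not exceed the estimated interpreter cost.
function shouldKeepTrace(opcodes: string[], entryCost: number): boolean {
    let interpCost = 0;
    let jiterpCost = entryCost;
    for (const op of opcodes) {
        const c = costs[op];
        if (c === undefined) {
            // No estimate for this opcode: be conservative and reject the trace.
            return false;
        }
        interpCost += c.interp;
        jiterpCost += c.jiterp;
    }
    return jiterpCost <= interpCost;
}
```

With these placeholder numbers and an entry cost of 5, a single add is rejected (6 > 4), three adds are kept (8 <= 12), and a single simd_add is kept (7 <= 40). The last case is the "short high value trace" situation that a pure length threshold would throw away.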

Archived items

[x] Write a custom assembler and use it to generate and inline do-jit-call and simd detect modules (#81691)
[x] Also unroll memcpy like memset
[x] Investigate possible startup time regressions
[x] Investigate possible .wasm size regressions
[x] Update memmove unroller to ensure it does the correct thing for overlapping src/dest
[x] Enable jiterpreter jitcall and interp_entry JITs by default
[x] Enable jiterpreter traces by default
[x] Don't bail out for safepoints
- [x] Do the 'is a safepoint needed' check inline in the trace instead of in the import
[x] Inline strlen into traces
[x] Inline getchr_ref into traces
[x] Inline getitem_span into traces
[x] Inline get_element_address_with_size_ref into traces
[x] Optimize out the eip local and initialization for traces containing no branches
[x] Generate import section after generating function body and omit unused imports
[x] Do another pass over intrinsics and superinsns to add any missing ones (like the log2 used for vectorization)
[x] Remove generated opcode info table and fetch opcode info from the interpreter's tables on demand to reduce file size
[x] Don't discard known not-null / known constant information when crossing branches, only branch targets
[x] Migrate configuration to options.h (requires improvements to the API)
[x] Verify that no debugging scenarios regress
[x] Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
[x] Optimize out memory.fill for common sizes (it produces an expensive function call on x86 and x64)
[x] Handle jiterpreter opcodes in non-wasm interp using the same path as other unreachable opcodes
[x] Fix floating point compares in jiterpreter
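
On the archived memmove item: one way to get overlap-safe behavior from a fixed-size copy is to perform every load before any store. The sketch below models that on a Uint8Array standing in for linear memory. It is not the jiterpreter's actual unroller, and the overlapSafeMove name is hypothetical.

```typescript
// Copy `count` bytes from `src` to `dest` inside the same buffer.
// All source bytes are read into a temporary copy before anything is written,
// so the result is correct even when the two ranges overlap.
function overlapSafeMove(mem: Uint8Array, dest: number, src: number, count: number): void {
    const loaded = mem.slice(src, src + count); // "loads": snapshot of the source range
    mem.set(loaded, dest);                      // "stores": write the snapshot to dest
}

const mem = new Uint8Array([1, 2, 3, 4, 5, 0, 0, 0]);
overlapSafeMove(mem, 2, 0, 4); // the ranges [0, 4) and [2, 6) overlap
console.log(Array.from(mem).join(",")); // 1,2,1,2,3,4,0,0
```

A naive byte-by-byte forward copy of the same ranges would instead produce 1,2,1,2,1,2,0,0, because it reads bytes it has already overwritten.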

ghost commented 1 year ago

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Author:	kg
Assignees:	kg
Labels:	`arch-wasm`
Milestone:	-

ghost commented 1 year ago

Tagging subscribers to this area: @brzvlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

Author:	kg
Assignees:	kg
Labels:	`arch-wasm`, `area-Codegen-Interpreter-mono`
Milestone:	8.0.0

SamMonoRT commented 11 months ago

Moving tracking issues to 9.0

dotnet / runtime

[wasm] Jiterpreter tracking issue #78428