dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.25k stars 4.73k forks source link

Proposal: evolve RyuJIT emitter to improve flexibility #80589

Open BruceForstall opened 1 year ago

BruceForstall commented 1 year ago

The RyuJIT emitter uses a machine-level program representation which is not designed for modification after it is generated, such as modification due to peephole optimization or other low-level code transformations. It has long been known that certain "easy" optimizations could be made by adding peephole optimization capability, and a few specific cases have been added with severe limitations due to the current representation. This issue describes some of the problems and proposes incremental changes to increase its flexibility.

Current status

RyuJIT IR (internal representation) for representing a program has three basic forms:

In RyuJIT, MIR is a single-linked list of IGs, each of which contains a packed memory buffer of variable-sized instrDescs. There is no facility to navigate to a previous instrDesc from an instrDesc pointer, delete an instrDesc, or insert a new one. An IG roughly corresponds to a BasicBlock in LIR and stores the GC state that exists at the beginning of the IG. Each instrDesc encodes enough information to describe how the instruction affects GC state.

During code generation (codegen), MIR is incrementally created by iterating over LIR and generating machine instructions as IG/ID lists. After codegen, branch tightening and alignment processing occurs. After this point, we know how much memory to request from the VM, and the final memory space for the generated code is allocated. MIR is traversed, encoding instructions into this memory, and also computing and updating GC information, collecting GC state changes at final instruction offsets to hand back to the VM.

Final instruction offsets are needed for other needs as well: (1) IP->native mapping, (2) variable live ranges, (3) unwind information.

One key mechanism used to compute final instruction offsets is the emitLocation type. During code generation, any time a location in the final code stream is needed, an emitLocation is created. It's basically a pointer to the current or specified location of generated instrDescs. Crucially, it can give the actual code offset of this location after branch tightening has changed the initial estimated instruction offsets and we have final offsets.

However, making changes to the instrDesc instruction stream after an emitLocation has captured a particular location might invalidate the emitLocation, thus requiring anyone making such modification to find and update all the appropriate captured emitLocation, which is currently difficult and costly. Thus, we currently do not have any machine-level optimizations that remove any already-generated instructions.

Proposal: Remove emitLocation

To improve emitter flexibility, eliminate the use of emitLocation. These are all cases where the code needs a final instruction offset, but during codegen, that is not available. Instead of emitLocation, create new instrDesc representing the thing for which location information is needed, such as an IP->native mapping or the birth or death of a variable. These "pseudo-instructions" (or "MIR nodes") would be represented within the instrDesc list but would not generate any code. Branch tightening would leave them alone, but they would retain their relative position. Then, either before or after instruction encoding, as needed, walk the MIR to process these pseudo-instructions as necessary, such as building up and reporting variable live range data structures.

With this, a peephole optimization would be free to delete an instruction without worrying about updating any "pointers" to the instruction being removed. The optimizations would need to be aware of non-code MIR nodes, however, and adjust behavior appropriately. This is good, though, as those nodes represent semantics that would need to be respected by the optimization.

Proposal: Support backwards navigation

To support peephole optimizations, the JIT currently maintains a "last instruction" pointer (and soon might maintain a larger, though fixed-size and limited, history of "last instruction" pointers to support looking backwards more than a single instruction).

An alternative is to introduce a "previous" pointer to the instrDesc.

Currently, given an instrDesc*, you can use emitAdvanceInstrDesc() or emitNextID() to move to the next instrDesc in the list. To move backwards, we could add a byte to an instrDesc that specifies the number of bytes to subtract from an instrDesc* to reach the previous one, or zero if there is no previous one in the ID list for this IG. This presumes that all instrDesc are less than 256 bytes (which I believe is true, but otherwise we could use a short). To support going backwards across IGs, we would need to add an insGroup* igPrev "previous" pointer to insGroup. Then, add emitter::emitPrevInstrDesc(insGroup** ig, instrDesc** id) for navigation.

Alternatives

This proposal is an incremental change over the existing RyuJIT data structures. An interesting alternative is to scrap using IG/ID as MIR altogether, and reuse the existing linear GenTree representation, and LIR operations, instead. This would be done by creating new machine-level GenTree nodes, with machine-level instructions / opcodes and formats, much as is currently stored in instrDescs. Codegen would transform from one kind of GenTree to another.

This would re-use much existing mechanism, and possibly make it easier to do something like re-use our SSA package on machine-level IR. However, transforming the current code to this would require care to avoid rewriting a significant amount of very complicated instruction encoding logic. Also, memory usage and throughput would need consideration.

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak See info in area-owners.md if you want to be subscribed.

Issue Details
The RyuJIT emitter uses a machine-level program representation which is not designed for modification after it is generated, such as modification due to peephole optimization or other low-level code transformations. It has long been known that certain "easy" optimizations could be made by adding peephole optimization capability, and a few specific cases have been added with severe limitations due to the current representation. This issue describes some of the problems and proposes incremental changes to increase its flexibility. # Current status RyuJIT IR (internal representation) for representing a program has three basic forms: - tree-based GenTrees, sometimes referred to as HIR (high-level IR) to distinguish it from LIR and reuse the terminology used in some compiler literature - linear GenTree LIR (linear IR, or low-level IR), used after Lowering - Instruction Groups + Instruction Descriptors (IG + ID; in code: insGroup + instrDesc). I propose to call this MIR for machine-level IR, since it represents machine instructions and arguments that can be used for binary encoding. (Note that some compiler literature instead uses MIR to stand for "medium-level IR".) In RyuJIT, MIR is a single-linked list of IGs, each of which contains a packed memory buffer of variable-sized instrDescs. There is no facility to navigate to a previous instrDesc from an instrDesc pointer, delete an instrDesc, or insert a new one. An IG roughly corresponds to a BasicBlock in LIR and stores the GC state that exists at the beginning of the IG. Each instrDesc encodes enough information to describe how the instruction affects GC state. During code generation (codegen), MIR is incrementally created by iterating over LIR and generating machine instructions as IG/ID lists. After codegen, branch tightening and alignment processing occurs. After this point, we know how much memory to request from the VM, and the final memory space for the generated code is allocated. MIR is traversed, encoding instructions into this memory, and also computing and updating GC information, collecting GC state changes at final instruction offsets to hand back to the VM. Final instruction offsets are needed for other needs as well: (1) IP->native mapping, (2) variable live ranges, (3) unwind information. One key mechanism used to compute final instruction offsets is the `emitLocation` type. During code generation, any time a location in the final code stream is needed, an emitLocation is created. It's basically a pointer to the current or specified location of generated instrDescs. Crucially, it can give the actual code offset of this location after branch tightening has changed the initial estimated instruction offsets and we have final offsets. However, making changes to the instrDesc instruction stream after an `emitLocation` has captured a particular location might invalidate the `emitLocation`, thus requiring anyone making such modification to find and update all the appropriate captured `emitLocation`, which is currently difficult and costly. Thus, we currently do not have any machine-level optimizations that remove any already-generated instructions. # Proposal: Remove `emitLocation` To improve emitter flexibility, eliminate the use of `emitLocation`. These are all cases where the code needs a final instruction offset, but during codegen, that is not available. Instead of `emitLocation`, create new instrDesc representing the thing for which location information is needed, such as an IP->native mapping or the birth or death of a variable. These "pseudo-instructions" (or "MIR nodes") would be represented within the instrDesc list but would not generate any code. Branch tightening would leave them alone, but they would retain their relative position. Then, either before or after instruction encoding, as needed, walk the MIR to process these pseudo-instructions as necessary, such as building up and reporting variable live range data structures. With this, a peephole optimization would be free to delete an instruction without worrying about updating any "pointers" to the instruction being removed. The optimizations would need to be aware of non-code MIR nodes, however, and adjust behavior appropriately. This is good, though, as those nodes represent semantics that would need to be respected by the optimization. # Proposal: Support backwards navigation To support peephole optimizations, the JIT currently maintains a "last instruction" pointer (and soon might maintain a larger, though fixed-size and limited, history of "last instruction" pointers to support looking backwards more than a single instruction). An alternative is to introduce a "previous" pointer to the instrDesc. Currently, given an `instrDesc*`, you can use `emitAdvanceInstrDesc()` to move to the next instrDesc in the list (it currently can only move within a single IG, and not move to subsequent IGs). To move backwards, we could add a `byte` to an `instrDesc` that specifies the number of bytes to subtract from an `instrDesc*` to reach the previous one, or zero if there is no previous one in the ID list for this IG. This presumes that all `instrDesc` are less than 256 bytes (which I believe is true, but otherwise we could use a `short`). To support going backwards across IGs, we would need to add an `insGroup* igPrev` "previous" pointer to `insGroup`. Then, add `emitter::emitPrevInstrDesc(insGroup** ig, instrDesc** id)` for navigation. # Alternatives This proposal is an incremental change over the existing RyuJIT data structures. An interesting alternative is to scrap using IG/ID as MIR altogether, and reuse the existing linear GenTree representation, and LIR operations, instead. This would be done by creating new machine-level GenTree nodes, with machine-level instructions / opcodes and formats, much as is currently stored in instrDescs. Codegen would transform from one kind of GenTree to another. This would re-use much existing mechanism, and possibly make it easier to do something like re-use our SSA package on machine-level IR. However, transforming the current code to this would require care to avoid rewriting a significant amount of very complicated instruction encoding logic. Also, memory usage and throughput would need consideration.
Author: BruceForstall
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: -
BruceForstall commented 1 year ago

@dotnet/jit-contrib

markples commented 1 year ago

Thanks for writing this helpful summary and proposal, Bruce.

One thing that stands out to me is the specific issue of stale pointers (emitLocation) vs other concerns, and this leads me to wonder what kinds of abstractions could help in general, be incremental changes, etc. For example, an abstract location could be implemented today by a pointer, then changed to something more symbolic, potentially back to a pointer to a linked list node (if we ended up there), and so on. If it's abstract, then the group could control it (update it when necessary). A sufficient abstraction could even allow backwards traversal to be implemented by rewalking from the group start (not really suggesting exactly that, but it's an example of a time/space tradeoff that could happen).

Whether those abstract locations point to real instructions or a kind of pseudo-instruction would be TBD. Note that this framing is retaining the "external reference to the IR view" - so the pseudo-instructions basically become labels. The need for a location is reduced or goes away completely if everything is in the IR itself.

I think the reason I've seen it this way is that it feels like the stale pointer and "it's hard to update" from become intertwined. In a very abstract sense, it doesn't matter to a peephole optimizer if the "additional information" is on an instruction or in a separate one or even in a side table - assuming that the optimizer has access to the info in order to manipulate it. I suppose some cases "just work" (ldr -> ldp might be able to ignore/persist the effects of the first load, add the effects of the second, delete the ldr, add a ldp), but is that generally true?

Sufficient abstraction would allow additional storage tricks to be used. For example, if updates are fairly few, insertion can be implemented by replacing an instrDesc with an "outline" instrDesc that points to a buffer (or whatever) with N instructions. If next, prev, traversal, etc., all just work, then this is possible. If every piece of code needs to watch out for the "outline" case, then it becomes quite messy.

BruceForstall commented 1 year ago

Thanks for the comments Mark.

emitLocation today is actually fairly abstract: it holds a <IG*, instruction number within IG> tuple. The problem we have is that these are effectively "external pointers" to the MIR (that is, IG/ID stream) that might need to be updated if MIR modifications/optimizations take place. E.g., for the "ldr(1)+ldr(2) => ldp" optimization, if the ldr(1) instruction is deleted, and some emitLocation pointed after it, we need to not only find that emitLocation, but we need to determine what the appropriate action is w.r.t. the optimization: update it to point before ldr(1), update it to point after the replaced ldp, or delete it entirely? There is a semantic question to be answered. If the locations represented today by emitLocation are instead interleaved in the IR as "labels", it is certainly easier to find them, but the same semantic questions need to be addressed.

AndyAyersMS commented 1 year ago

I was chatting with Bruce and think it would be helpful to keep possible longer-term ambitions in mind. Some examples:

These all involve (to lesser/greater degree) the ability to analyze and perhaps modify instructions very late.

The current implementation keeps a lot of implicit ambient state as it streams through codegen and emission, making it tricky to alter what's already been done, or anticipate what is coming next. Moving towards a more explicit representation seems like a good way to open the door to more flexible working arrangements.

I also am not sure how much emphasis should be placed on saving memory. Presumably all the jit-allocated memory gets reclaimed/recycled so any additional footprint is transient. Back in the days of full framework and especially 32 bits it could be that private bytes / address space was at a premium. It might make sense to opt for a simpler but less memory efficient design, if it leads to easier to reason about data or faster overall processing.