Cranelift: let regalloc manage callee-saved registers

bytecodealliance / wasmtime

A fast and secure runtime for WebAssembly

https://wasmtime.dev/

Apache License 2.0

15.15k stars 1.27k forks source link

Cranelift: let regalloc manage callee-saved registers #7727

Open jameysharp opened 9 months ago

jameysharp commented 9 months ago

Feature

Instead of writing target/ABI-specific code in each backend to spill callee-saved registers to the stack in the function prologue and restore them in the epilogue, we've been discussing that we should let the register allocator decide whether, when, and where to move those registers.

Benefit

The register allocator has enough information about register pressure to be able to decide whether it needs to make use of the callee-saved registers at all, or to delay spilling them in case there's a simple early-return path with low register pressure.

In addition, when the callee-saved registers must be spilled, the register allocator already knows how to do that, so we can delete a bunch of code from the backends which handles the special case of spilling callee-saved registers.

Implementation

Currently, every backend has fake "instructions" called args and rets. These pseudo-instructions don't generate any real instructions when they're emitted to machine code. They represent the ABI requirements for how arguments and return values are passed between functions, and they only exist to record information for register allocation. They work by pairing up a virtual register representing an argument with a physical register that is where that argument must live on entry to the function. The virtual register is then used everywhere else in the function so that the register allocator is allowed to change which physical register the value lives in, or even move it to the stack if necessary.

We can use those pseudo-instructions to describe how callee-saved registers work too. Callee-saved registers act like extra arguments to a function, which are never used inside the function but must be returned in the same registers at the end. By tying them to virtual registers, we can let the register allocator decide whether to leave those values in-place or move them out of the way to make those registers available for other uses.

I don't think we can correctly use the rets instruction before return_call instructions used for tail-calls. So a first version of this probably shouldn't try to change our tail-call ABI, which currently has no callee-saved registers. However, once the basic idea is working, it should be straightforward to add the functionality of the rets instruction into the return_call instruction, and then we can fix #6759.

Some backends (aarch64 at least) are currently able to make use of special "store-multiple" and "load-multiple" instructions for pushing and popping multiple callee-saved registers at once. A good enhancement here would be to find a way to let RA2 make use of those instructions any time it needs to spill or reload multiple registers at once, even if they're not the callee-saved register spills. However that shouldn't be necessary for a first version of this idea.

Alternatives

The current implementation works fine, it just seems more complex than necessary. In particular, @fitzgen and I found callee-saved registers were difficult to deal with when implementing tail-call support in Cranelift, and that led to the issue in #6759.

Credits

This started from a suggestion by @sunfishcode and I've discussed it further with @elliottt and others.

cfallin commented 9 months ago

I really like this idea! It would definitely simplify things overall. A few thoughts to add:

The main sticking point is going to be unwind-info, I think. Alongside the dedicated save sequence in the prologue, we emit Unwind pseudoinsts that note the locations of callee-saves so that the unwinder can restore them. This doesn't matter much for Wasm execution currently, because we never expect to continue executing in Wasm context after an unwind happens (the unwinder is load-bearing only to generate backtraces), but does matter much more for other uses of Cranelift (e.g. rustc), and will matter once we have Wasm exceptions.

DWARF at least I think is fully general enough to note "at this point, r12 is now saved at [rbp-16]", so in theory this is possible to work around on Linux/macOS/BSD (still a bit tricky, as the Unwind pseudoinst mechanism isn't designed to handle arbitrary control flow merge/join, just the straightline prologue, currently). On Windows however there is a hard ABI constraint that callee-saves are saved in the prologue, so we might not be able to use this technique there.
The "ignore clobbers" mechanism (which would go away because it's part of the current dedicated callee-save handling) is load-bearing for an important optimization: AArch64 specifies that only the lower half of vector registers v8-v15 are callee-saved. We otherwise have to approximate this conservatively both in the clobber-set computation and as effects on call instructions, so a function that calls another function will always save these 8 regs in the prologue, no matter what. When caller and callee at a callsite have the same ABI, we can "pass through" the clobbered-ness (if f calls g, f never has to save any callee-saves solely because of the call to g, because g would save any that it clobbers).

We'd need to find a way to preserve that (very important) optimization with the pseudoinst approach. Right now regalloc can't reason about "half-registers", and that would be a fairly big change to its semantics. Maybe there's a way to continue to treat callsites specially though. Basically they need to be seen differently by surrounding code (v8 is always clobbered) and the prologue (v8 is only clobbered as much as I'm allowed to clobber) and so there needs to be something different about the pseudoinsts.

I don't think either of those are necessarily fundamental (or, well, maybe the Windows part is), just two potential issues we'd have to work through!

ghostway0 commented 8 months ago

I think I've got a solution after talking with Jamey for a bit. Where does Cranelift push the clobbered callee-saved regs?

cfallin commented 8 months ago

This is implemented in the machinst::abi module and the corresponding trait impls for each ISA; grep for gen_clobber_save. Could you say more about your idea?

ghostway0 commented 8 months ago

Thanks! Sure, I plan to add a method for the ABI trait to get the callee-saved registers, save the ArgPairs and add it to args and rets. How does that sound? I think I've implemented it already - as in, correctly - just didn't test it yet. Sorry for the delayed response

cfallin commented 8 months ago

Ah, that could work. My main concerns would be:

This creates a bunch of liveranges that span the whole function; I'd be worried about nontrivial impact on compile time, and also on the generated code (the large conflict sets might not play well with splitting heuristics, either directly or indirectly by creating more pressure). Especially for small functions (the common case!) this inflates the LR set by perhaps 2x or more. (Sorry I didn't think of this sooner -- it's an issue with any regalloc-based approach, not just your idea) We'd want to carefully benchmark both. In essence one can see the current mechanism as a "special case" (morally part of regalloc, if not actually) for whole-func LRs carrying pregs.
The generated spill code may still not be Windows ABI compliant; we're supposed to save clobbers all in line with no moves, and RA2 doesn't guarantee that. Actually it may split and save the value in a reg instead of stack, which the Windows unwinder can't handle. So I think we'd need to keep the current mechanism for Windows regardless.

Thoughts?

ghostway0 commented 8 months ago

(a) Isn't the argument that RA2 is smart enough to handle those? (b) That was the plan, I have a TODO there. How should I handle that? Windows is annoying :)

cfallin commented 8 months ago

(a) Isn't the argument that RA2 is smart enough to handle those?

It certainly can handle function-long liveranges, sure. My main concern described above is performance (of the compiler, and of its output). The appeal of the more general solution is its simplicity but it may present a tradeoff because it uses more general code (the main RA2 solver) to replace a very specific and hardcoded mechanism. Fortunately it's an empirical question and one can benchmark it! :-)

Another aspect of generated-code perf that just occurred to me: on aarch64, we can do more efficient "bulk saves" with STP/LDP (store/load pair instructions). RA2 isn't able to coalesce adjacent spills into combined store instructions. Not impossible to do, but then we're just moving complexity rather than eliminating it.

(b) That was the plan, I have a TODO there. How should I handle that? Windows is annoying :)

Unfortunately I haven't been able to come up with a good idea here. The fastcall ABI is pretty specific about what the prologue should look like -- save frame pointer, save clobbered callee-saves, with nothing else in the way (because its unwinder does abstract interpretation of the prologue code, rather than interpreting something like DWARF metadata, which is pretty wild).

We could get that form by putting specific code generation for it in the args pseudoinst, but then we've just moved the explicit codegen for saving clobbers to a different place, rather than eliminated it (while the original motivation was to eliminate it).

So I guess what I'm coming around to is: Windows fastcall support may force the issue; and if that weren't the case, there are some performance concerns I hypothesize, but those can be answered empirically. Maybe all this does motivate why we have specific code for clobber-saves: it's just different enough in requirements that sharing with general spills doesn't reduce complexity; and also it's just different enough that we can address it with a different algorithm, possibly with better performance. I'm curious if anyone else has ideas on this though, especially the fastcall bit!

ghostway0 commented 8 months ago

Another aspect of generated-code perf that just occurred to me: on aarch64, we can do more efficient "bulk saves" with STP/LDP (store/load pair instructions). RA2 isn't able to coalesce adjacent spills into combined store instructions. Not impossible to do, but then we're just moving complexity rather than eliminating it.

Doesn't RA2 use parallel moves everywhere for these? If those bulk saves would be in the same bulk mov...

Unfortunately I haven't been able to come up with a good idea here. The fastcall ABI is pretty specific about what the prologue should look like -- save frame pointer, save clobbered callee-saves, with nothing else in the way (because its unwinder does abstract interpretation of the prologue code, rather than interpreting something like DWARF metadata, which is pretty wild).

Oof. The easiest solution would probably be to check the abi wherever machinst does clobber saves, no?

In a backtracking allocator you could set live ranges' priority, right? Why not have callee-saved registers' bundles at a lower priority? I think there's a constraint forcing live ranges to live on the stack, too. What do you think?

cfallin commented 8 months ago

Another aspect of generated-code perf that just occurred to me: on aarch64, we can do more efficient "bulk saves" with STP/LDP (store/load pair instructions). RA2 isn't able to coalesce adjacent spills into combined store instructions. Not impossible to do, but then we're just moving complexity rather than eliminating it.

Doesn't RA2 use parallel moves everywhere for these? If those bulk saves would be in the same bulk mov...

Yes, a parallel move is used for all spills at the same program point (at the same priority); the resolver could try to group moves together to make use of store/load-pair, at the cost of more complexity.

Unfortunately I haven't been able to come up with a good idea here. The fastcall ABI is pretty specific about what the prologue should look like -- save frame pointer, save clobbered callee-saves, with nothing else in the way (because its unwinder does abstract interpretation of the prologue code, rather than interpreting something like DWARF metadata, which is pretty wild).

Oof. The easiest solution would probably be to check the abi wherever machinst does clobber saves, no?

In a backtracking allocator you could set live ranges' priority, right? Why not have callee-saved registers' bundles at a lower priority? I think there's a constraint forcing live ranges to live on the stack, too. What do you think?

Yes, one could probably construct constraints to get things mostly into the right shape, but my main concern is that this is a fragile "just-so" construction: the necessary prologue shape would then be an emergent behavior of constraints attached to the pseudoinsts, handling of those constraints, and RA2's particular implementation choices for spills. Changes anywhere in that stack would break things, and in a way that could be somewhat hard to detect and debug.

Both of these actually get to the heart of my concern that has developed after working through these details: we're coming up with complex ways to adjust the general solver and RA backend (e.g. parallel move resolver), and feed it the right constraints, to emit code in just the right shape; to replace a small handwritten prologue generation function that emits code in just the right shape.

The original idea had a lot of appeal to me because it truly seemed simpler: remove the open-coded sequence, add some RA2 operands instead. But we've quickly gone the other way here to recover all the necessary properties: careful sorting and coalescing of adjacent spillslot pairs in parallel moves (for aarch64 ldp/stp), including a notion of pair stores/loads in the RA2 API where we only had generic moves before; careful use of subpriorities to push spills to a certain location and make the prologue into the right shape for fastcall; perhaps some other as-yet-unknown optimizations to make this match the original performance (e.g., special-casing these LRs in RA2's main loop and trying not to split them -- actually I think we need to ensure we don't split them, given that we describe their location only once, and that creates additional special cases in the solver core); and then reliance on all of these pieces continuing to interoperate in order to satisfy the ABI. In contrast, today we have ~100 LoC of code per backend to generate clobber-saves. From a "least global complexity" standpoint the latter seems better to me, at least.

jameysharp commented 8 months ago

It sounds to me like at minimum we shouldn't try to make RA2 handle callee-saved register spills on Windows. In fact this makes me think that tail-calls need very careful thought on Windows, since it seems we need callee-saved registers for general function-call performance (per #6759) but I bet unwinding isn't prepared to deal with tail-calls; but that's a separate concern. Also I wonder whether s390x would cause trouble for this plan.

I still think it's worth testing this approach on the other calling conventions, primarily because giving regalloc room to consider alternate plans should allow generating better code, and it would be nice to characterize how much better and what it costs in compile time. Even if we can't have that win on Windows, we do have a lot of people using Cranelift on non-Windows.

Also, if we could make RA2 use stp/ldp for parallel moves, that could be a win for code outside the prologue/epilogue too. If so it would be a good experiment independent of this one.

I certainly was hoping that this idea would let us delete all the special handling of callee-saved registers and I'm disappointed that Windows sinks that plan, but I'm not yet fully convinced that means we should throw out this idea entirely.

cfallin commented 8 months ago

For sure, I'd be interested in the possible performance upside of this too -- @ghostway0, would you be able to do some compile+runtime tests with your prototype? We're happy to provide details on how to do this if you need.

Also, if we could make RA2 use stp/ldp for parallel moves, that could be a win for code outside the prologue/epilogue too. If so it would be a good experiment independent of this one.

In principle that'd be nice, yep -- but since the paired memory accesses need adjacent spillslots, RA2's effectively random assignment of slots (depending on priorities and processing order and what happens to split or backtrack and ...) would eliminate most opportunistic uses of these, I suspect. To make this work well one would need to influence both the assignment of spillslots and the order of emitted moves, so that stores/loads to adjacent pairs of spillslots happen next to each other. That seems like a painfully complex change to RA2... actually I think possibly exponential complexity (we have the n! permutations of spillslot assignments, and for each permutation, certain spill/reload pairs will or won't be ldp/stp depending on whether adjacent addresses. Perhaps a greedy algorithm would do OK, I'm not sure). It also has weird phasing implications: spillslots are assigned before move insertion, but we need to know which moves are adjacent if we want this pair-aware assignment. All that to say, I'm not sure how to do this; I'd love to be proven wrong though.