Simplify stack-to-stack moves

jameysharp commented 1 year ago

Most of the changes here are simply refactoring.

Resolving stack-to-stack moves first fills in the scratch register for the parallel moves, if one is needed. So we can split that into two phases, inlining the stack_to_stack function which was not used anywhere else. That lets us establish the invariant that we only continue into the second phase if there is at least one stack-to-stack move.

Because compute consumes self, we know that the lazily-allocated stack-to-stack scratch register has not been allocated yet at entry to the loop, but because the loop only runs if there are stack-to-stack moves, we know the scratch register will definitely be allocated. So we can remove the corresponding state from the MoveVecWithScratch struct and keep it in locals with shorter names instead, eagerly initialized before the start of the loop.

The above changes should have no effect on the output. However I then added state tracking for whether the scratch register or the save slot already contain the right value and therefore don't need to be the target of another move. This should remove the need for the redundant-move eliminator to clean up after parallel moves.

My first approach assumed the correct value was in exactly one of these places, but both allocations can be correct, and we can save more copies by acknowledging that.

cfallin commented 1 year ago

(Ah, and I just saw the review request for Trevor as this tab was stale, sorry -- feel free to take over if you'd prefer otherwise I'm happy to review followup!)

jameysharp commented 1 year ago

Yeah, I thought I'd give you a break, but your review is welcome anyway Chris. :laughing:

I felt like the two bools were the most clear way to express their state: 1) do we need to restore the scratch register after trashing it? and 2) do we need to save the scratch register's contents to the save slot? It might be nice to encode that they must not both be dirty at the same time, which would mean we've lost the value that was supposed to be in the scratch register and can't recover. I'm not completely sold that it's worth defining an enum for that, but I could be convinced if you feel strongly.

I do find it a little unsatisfying that the presence or absence of the save_slot only guards one of the blocks. I've convinced myself (and fuzzing confirms) that the other if statements can only be entered if that first block runs, but that's not obvious at a glance.

So I've just pushed a little rework to make it more clear how save_slot and scratch_dirty are related. Does that help?

Other thoughts: It would be nice to not require the caller to specify a "victim" register, partly for convenience and partly because I don't want to think about what to rename that to since I don't like the current terminology. So when find_free_reg fails it would be nice to pick a register from those already used by the moves.

However we can't require that there are any registers used in moves. It could happen that all the current register contents need to be preserved in-place, so there are no free registers but also no register moves. So the caller needs to provide a last-resort register in any case. Sigh.

There is potentially some room for reducing the number of generated moves though. For one thing, if we sort the moves so that moves into registers usually come after moves to the stack (except when dependencies prevent that), then we improve the odds that our selected scratch register will be overwritten after the last stack-to-stack move and we won't have to restore it from the save-slot.

If the selected register is not used as a source between the first stack-to-stack move and being used as a destination, we don't actually need the save-slot at all, because the needed values are all live in other places when we need them. Finding a register in the set of moves that satisfies this property would be a nice optimization, though I don't know how often it would come up.

A way to think about this is one could make the optimal choice in linear time by simulating the stack-to-stack move-resolution algorithm for each choice of register, counting the number of moves that would be inserted, and picking the register with the minimum count. No, I'm not proposing that we actually do that, but it does mean you can fairly easily evaluate ideas for heuristics by simulating the algorithm in your head.

cfallin commented 1 year ago

Cool, yeah, I don't feel super-strongly about reworking into a formal state machine; this is clear enough, thanks!

Re: finding a scratch register (and I agree, I didn't think hard enough about the current naming, sorry; let's call it "borrowed scratch register" or somesuch maybe?) -- I agree in principle it'd be nice to not require the caller to make the arbitrary choice; allowing the resolver the freedom to use more complex heuristics as you say. It strikes me that at this point we're several niche conditions into a rabbit hole though, and I might suggest seeing how often the case in question occurs before trying to optimize further. At least the thinking when originally building the current solution was: (i) not requiring a dedicated scratch in the MachineEnv was a potentially significant impact, because register pressure; but (ii) everything that happens when we hit the last-resort logic is a rare case that just needs correctness, not performance. Happy to change my mind with data of course!

jameysharp commented 1 year ago

"Borrowed scratch register" turns out to fit nicely, done!

To be clear, I wasn't suggesting we optimize this any further now; I just wanted to note down my thoughts from making sure I understood how this works. Data indicating whether this has any impact would absolutely be good to have; I think the starting point is to look for the frequency of trace log messages saying "scratch resolver: stack-to-stack borrowing" (or prior to this PR, "scratch resolver: stack-to-stack using victim"). I would have guessed that running out of free registers is pretty common though; is it that stack-to-stack moves are rare?

cfallin commented 1 year ago

That's my intuition, yeah, though to be honest it's at best anecdata; I never did a detailed study. I'm satisfied of course that this PR maintains the status quo so there's no immediate need to collect better data but it would be interesting!

bytecodealliance / regalloc2

Simplify stack-to-stack moves #159