bytecodealliance / regalloc2

A new register allocator
Apache License 2.0
217 stars 38 forks source link

Disallow blockparams on branches with multiple successors #170

Open Amanieu opened 11 months ago

Amanieu commented 11 months ago

Since the input CFG is required to not have critical edges, such blockparams are useless. The incoming blockparams can simply be replaced with the vreg from the (unique) predecessor.

It's better to let the client's lowering code handle this so regalloc2 doesn't need to.

elliottt commented 11 months ago

I think that this is a reasonable change to make, but cranelift is still generating code that would violate this new invariant. I believe that it would be pretty straightforward to fix that, but we probably won't be able to get to that until next year. Here's the issue that I added describing where the fix would need to be made: https://github.com/bytecodealliance/wasmtime/issues/7639

cfallin commented 11 months ago

(Popping in to give my thoughts on this in a timely manner, though I'm on parental leave at the moment)

IMHO, there is some philosophical question here of where to place complexity: in the embedder (user of regalloc2) or in regalloc2 itself. Right now, regalloc2 accepts a well-defined IR: SSA, with blockparams, used in any way that SSA allows. Introducing further restrictions on that could make sense if we were developing the allocator from scratch, or if it otherwise were a substantial improvement in complexity or speed, but re: complexity we see a "+90 -92" diff, and re: speed there is no improvement cited here. (Perhaps some improvement does occur and I'd be curious if you've measured any!) Finally, existing consumers (notably Cranelift) generate code without keeping this extra restriction in mind, and it creates additional work (and potential for bugs) to adapt to the new requirement.

Such blockparams are indeed "useless" if one simplifies completely, but generating the code now requires additional care, and there is value in having components or stages of a compiler that accept a more general input without required canonicalization, especially when offering a library.

So: since the allocator already exists and supports this more general input, why remove that capability? (More succinctly: is there a motivation for this PR, beyond the "useless" value judgment above?)

elliottt commented 11 months ago

My argument in favor would be that any simplifications we can make to regalloc2 lessen the maintenance burden, and adapting cranelift as it stands is not a huge effort. Certainly we can leave things as they are, but given how few users of regalloc2 there are, I think that simplification is a good goal.

cfallin commented 11 months ago

Right, agreed with that; I guess my question is really whether this is a simplification, in a global sense; it doesn't seem to have removed any logic in RA2 here, and I don't think it simplifies the allocation problem in any appreciable way. (Please do correct me if I'm missing some aspect here though!) Additional constraints/requirements to keep in mind on the user side are also complexity, so from my perspective at the moment, this change pushes us into higher complexity overall.

(Disappearing back into my cave now, but happy to discuss more in February when I'm back; just wanted to inject this perspective in case this were to move forward otherwise)

cfallin commented 11 months ago

To amend the above a bit: please don't let me block this if it is indeed a simplification to the allocation problem, for reasons I'm missing; I just wasn't sure what the motivation was, and wanted to ask. (The only changes to ion/ are in liveranges.rs where we change iteration over blockparams as appropriate; nothing in the solver core changes.) If there were additional simplifications that could follow on, or new algorithms that this enables, or performance optimizations we can now make, or something like that, that would be very good additional context to have!

cfallin commented 11 months ago

Chatted a bit offline with @elliottt about this and we surmised that there may be indirect benefits from this: the changes to CL necessary to fit these constraints plausibly could improve allocation time and/or runtime. In any case, @Amanieu I'm still curious to hear more of your motivation or backstory for this -- were you working toward other simplifications or seeing some direct benefits or ...?

Amanieu commented 11 months ago

This isn't intended to improve performance, it is mainly about simplifying the logic a bit. The motivating code for this is actually on a branch that I'm working on, specifically this commit.

bjorn3 commented 11 months ago

My branch for adding unwinding support to cranelift makes use of blockparams for terminators with multiple successors. The invoke terminator has multiple successors, each of which has block params. For the regular return case it uses the blockparams to pass the call return value and for the unwinding case to pass the exception data.

elliottt commented 11 months ago

My branch for adding unwinding support to cranelift makes use of blockparams for terminators with multiple successors. The invoke terminator has multiple successors, each of which has block params. For the regular return case it uses the blockparams to pass the call return value and for the unwinding case to pass the exception data.

Does your branch require blockparams on branches after lowering?

bjorn3 commented 11 months ago

Currently it doesn't do that, but I think it will be necessary for regalloc to not insert moves between the call and the end of the block. The unwind path would skip those moves. I haven't seen any miscompilation from this yet though.

Amanieu commented 11 months ago

At the regalloc2 level you should be able to model the invoke as a block terminator with 2 successor blocks. AFAIK branch instructions are allowed to produce output values as long as blockparams are not used.

Amanieu commented 11 months ago

I added a check in the fuzzer (and fixed the checker to handle it) that branch instructions that produce outputs work properly. This would allow you to model invoke as a branch instruction that produces outputs in registers. This allows subsequent blocks (normal return & unwind) to use these values.