Cranelift: reuse compare results (flags) when possible

cfallin commented 2 years ago

Currently, in both the x64 and aarch64 backends, we generate flags "locally" whenever needed. So, for example, the sequence

    v0 = icmp ...
    v1 = select v0, ...
    brnz v0, blockN
    jmp blockM

would generate the compare for v0 twice.

This is generally a simplification over the old iflags approach, which we prefer for some of the reasons described in #3249. In particular, iflags-typed values are "weird" (cannot be stored or loaded, only one can be live at a time, cannot be directly observed) and these restrictions complicate other analyses/transforms. The tradeoff of effectively regenerating them on each use has been reasonable so far.

However, locally in cases where we do use the results of a compare more than once, we should be able to share a single compare operation. We might be able to reason about this by building a forward pass that tracks the last generated flags and using this information from within the backend's pattern-matching. For extra credit, we might be able to factor this information into the "unique use" framework, allowing a compare with multiple uses but only one codegen'd occurrence to merge load operations directly on x64.

sparker-arm commented 2 years ago

Are there any plans to enable machine-level optimizations, or scheduling? I feel like there's a high possibility of similar sub-optimal codegen cases where machine-level CSE would be useful for a post-isel cleanup.

Of course, at least for AArch64, we'd have to change the way we define the flag setting instructions to actually define something. I imagine this will also be necessary if we want to verify flag setting isel patterns too.

cfallin commented 2 years ago

I definitely think there's a place for post-lowering opts of some sort! I've noticed cases where reuse would be possible; for example, address mode lowering will sometimes give up pattern matching and emit a new add, and those adds could be GVN'd together. This issue's topic (reuse flags) is another significant one.

I'm not sure what the framework would look like yet, though. There is likely a need for both architecture-specific opts and generic ones. Actually I could see most being generic if we're careful: one could define trait methods on MachInst that mean something like "is a pure binary op", "produces flags", "consumes flags", etc, and then the logic to (i) GVN redundant ops at the MachInst level and (ii) remove redundant identical flags producers could be shared.

As an interesting historical footnote, IIRC a big part of the motivation for the single-level IR in the "old-style backend" design was exactly this, that one could optimize the common bits of heap-address computation, etc after legalizing (lowering). There are a bunch of other good reasons for the two-level IR now, and it gives us much better codegen in other ways (any N-to-1 matching case, e.g. address modes), but this is a specific benefit we lost and is something we could improve.

sparker-arm commented 2 years ago

Actually I could see most being generic if we're careful: one could define trait methods on MachInst that mean something like "is a pure binary op", "produces flags", "consumes flags"

This is basically how it is done in LLVM. It doesn't matter what the opcode is, as long as it's not marked as doing something other than producing a value, 'pure' in your terms I guess, then a generic pass can do a lot of the work. As long as we still have SSA, we should be able to do quite a bit - but only if the instructions encode enough information.

Flags tend to be a bit more of a pain though, I've seen it get ugly, quickly, trying to optimise these during isel.

bytecodealliance / wasmtime

Cranelift: reuse compare results (flags) when possible #4124