Support architectural delay slots

xen0n commented 5 years ago

Some architectures make use of branch delay slots, such as MIPS and SPARC; some even more exotic ones have load delay slots as well. Often enough there are also instructions with results only usable after certain number of cycles.

I'm trying to port Cranelift to MIPS64. I glanced over the codebase and found no support for filling the delay slots; I had to emit NOPs after every branch in the meantime. (Indeed I didn't expect any because all currently supported architectures have no delay slots.)

So what's the current plan regarding this? A roadmap or implementation guide would be great!

sunfishcode commented 5 years ago

We don't have a plan yet :-}. Are delay slots something that can be handled by a very late fixup pass, or do they require phases such as register allocation to be aware of them?

xen0n commented 5 years ago

The LLVM MIPS delay slot filler implementation works on MachineInstr level. From quick inspection of the code, it only reorders instructions and does no register (re-)allocation, as it should. The GCC delay slot filling is similarly late.

sunfishcode commented 5 years ago

Ok, great. So here's a rough roadmap then: first add an Instruction constructor argument and attribute to cranelift-codegen/meta-python/cdsl/instructions.py similar to can_store etc. named has_delay_slot or so, so that your instruction descriptions can set that flag to indicate instructions which have delay slots. Then add a method to the TargetIsa trait in cranelift-codegen/src/isa/mod.rs to indicate whether the target has delay slots -- it can default to false, and MIPS can override it to set it to true. The next step is to write a delay-slot filler pass, which can be gated on the TargetIsa flag we just added.

xen0n commented 5 years ago

Okay, I'll stub this out first then. 👍

xen0n commented 5 years ago

On a closer look it seems the emitter doesn't support re-ordering instructions. A simple way of implementing the delay slot filling pass would be doing the re-ordering at IR-level, but wouldn't it make the IR target-dependent (albeit very late to the point that it wouldn't matter anymore)?

bjorn3 commented 5 years ago

but wouldn't it make the IR target-dependent

After legalization the IR is already target-dependent. https://github.com/CraneStation/cranelift/blob/c47ca7bafc8fc48358f1baa72360e61fc1f7a0f2/cranelift-codegen/meta-python/isa/x86/legalize.py has the legalizations for x86. It uses x86 specific instructions from https://github.com/CraneStation/cranelift/blob/c47ca7bafc8fc48358f1baa72360e61fc1f7a0f2/cranelift-codegen/meta-python/isa/x86/instructions.py.

bnjbvr commented 5 years ago

Yes indeed. One way to support this could be to have a legalization MIPS-specific that transforms e.g. jump instructions into nop + jump (or a new delayed_jump instruction). Using a nop for the delayed slot might not be the optimal thing, but I seem to recall that it is what gcc does?

xen0n commented 5 years ago

@bnjbvr That's what Go does (facepalm). GCC and LLVM both have pretty decent delay slot fillers judging from final code appearance...

EDIT: I meant the final effect on generated code... it seems you're referring to the legalization pass instead. I don't exactly remember what GCC does wrt legalization... Personally I prefer the LLVM approach where delay slot instructions are inserted/adjusted really late.

bnjbvr commented 5 years ago

I was actually referring to the final effect on generated code too, but maybe I looked at an older version of GCC or I was looking at code compiled in debug mode. Anyways, it was just a random idea, I would personally prefer of course an approach where we don't have to insert NOPs.

xen0n commented 5 years ago

Ahh I see what you mean. The fact is the current delay-slot-unaware Cranelift would generate incorrect code when ported to MIPS, so at least we must support unconditionally inserting NOPs for the delay slots which BTW is easy (just emit a NOP together with any branch). But I'm obviously aiming for a more-or-less "proper" pass that reorders instructions as much as possible; that's what all state-of-art code generators do and what I'm asking the roadmap for. :grinning:

sunfishcode commented 5 years ago

Yeah, at the very end of codegen, the IR is very much target-dependent, but it still does uphold some basic target-independent invariants. Currently that includes instructions executing sequentially. But I think we can relax that to allow instructions to be moved after branches to fill delay slots. We don't do a lot after regalloc right now, which helps.

The one thing that does come to mind that we do need though is RegDiversions. The way regalloc currently works, while it assigns virtual registers to locations, it also has the ability to locally divert virtual registers into different locations, in order to satisfy local register allocation constraints. This means that after regalloc, anyone who wants to know what registers are used where has to maintain a RegDiversions instance to track the diversions.

Do MIPS delay slots read and write their register operands before or after the branch does? For example, in this:

        beqz    $2, foo
        addiu   $2, $2, 1

Does the beqz see $2 before or after the addiu?

If the beqz sees it first, then in theory, reordering the instructions should be ok with the RegDiversions mechanism, because the register uses and defs happen in the order they appear, even if the actual branching doesn't.

So my tentative answer is: go ahead and move instructions in the IR after the branches to represent their being placed in delay slots. And do this in as late a pass as you can, to avoid affecting other passes :-).

bjorn3 commented 3 years ago

With the new backend framework can this be done for each backend that needs it individually?

cfallin commented 3 years ago

This would probably need to be done at the MachBuffer level in the new backend infrastructure, since it edits branches late, and a disappearing branch could otherwise change semantics unexpectedly. Let's leave this open to track, and we can sketch it out more when needed.

bytecodealliance / wasmtime

Support architectural delay slots #1077