
riscv64: Implementing Fixed Width SIMD with the V Vector extension #6118

afonso360 closed this issue 1 year ago

afonso360 commented 1 year ago

:wave: Hey,

I've been thinking about how to implement SIMD operations on the RISC-V backend. These are sort of my notes / thoughts about this. Hopefully I haven't missed something too big that invalidates all of this.

Feedback would be appreciated!

RISC-V Vector Intro

This is a small introduction to the V extension, in case people aren't familiar with it.

I like this blog post, which explains how this all works much better than I can here.

1. Planned implementation

With some careful orchestration we can drive the vector hardware to perform fixed-width SIMD operations.

The general idea is that we can emulate an iadd.i32x4 by emitting the following code:

vsetivli zero, 4, e32, m1, ta, ma
vadd.vv v0, v1, v2

Here's an explanation of that:

vsetivli      ;; Configure the vector hardware
    zero,     ;; Ignore the number of processed elements (we know we can process them all in one go)
    4,        ;; Process at most 4 elements
    e32,      ;; Each element is 32 bits
    m1,       ;; Do not group registers
    ta,       ;; Tail-agnostic mode (the rest of the register past 128 bits is left undefined)
    ma        ;; Mask-agnostic mode (similar to ta, but for the mask register)

vadd.vv       ;; Vector add (vector-vector)
    v0,       ;; Store the results in v0 (unlike x0, v0 is a usable vector register)
    v1,       ;; LHS is the v1 register
    v2        ;; RHS is the v2 register

vsetivli has an output register that receives the number of elements the hardware will actually process. But since every processor implementing V supports a minimum of 128 bits per vector register, we have a guarantee that all elements will be processed by a single instruction and don't need to check the output register, so we set it to the zero register to ignore it. (There are some asterisks here; see 2.3. Regalloc fun for small vectors implementations for more details!)
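To make that concrete, here's a small model of how the granted vector length relates to VLEN (a sketch in plain Rust; granted_vl is a hypothetical helper, and the rule is simplified from the spec's full vl-setting behavior):

/// Model of the vector length granted by vsetivli at LMUL=1
/// (hypothetical helper, simplified from the spec's full rule).
fn granted_vl(avl: u64, vlen_bits: u64, sew_bits: u64) -> u64 {
    let vlmax = vlen_bits / sew_bits; // max elements per register at LMUL=1
    avl.min(vlmax)                    // requests up to VLMAX are granted in full
}

fn main() {
    // The V extension guarantees VLEN >= 128, so e32 gives VLMAX >= 4 and a
    // request for 4 lanes always succeeds, which is why we can discard the
    // granted length by writing it to the zero register.
    assert_eq!(granted_vl(4, 128, 32), 4);
    assert_eq!(granted_vl(2, 128, 64), 2);
}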

We also only need to reconfigure the vector hardware when we change element-width or element-count. So this CLIF code:

v0 = fadd.f32x4 v1, v2
v3 = iadd.i32x4 v4, v5
v6 = iadd.i64x2 v7, v8

Could be lowered to:

vsetivli zero, 4, e32, m1, ta, ma ;; 4 elements of 32bit size
vfadd.vv v0, v1, v2
;; Here we don't actually need to change the hardware despite it being a different CLIF type!
vadd.vv v3, v4, v5
vsetivli zero, 2, e64, m1, ta, ma ;; 2 elements of 64bit size
vadd.vv v6, v7, v8

Switching vector modes is not done during instruction selection, but in a VCode pass that runs after instruction selection.

Each lowered vector instruction carries the full vector configuration that it needs, and in the VCode Pass we insert vsetvli's as necessary (i.e. between instructions with different vector configurations).

1.1 VCode Instructions

The first step is carrying the full vector configuration in each VCode instruction. Here's how I expect these instructions to look:

vfadd.vv v0, v1, v2  #avl=4 #vtype=(e32, m1, ta, ma)
vadd.vv  v3, v4, v5  #avl=4 #vtype=(e32, m1, ta, ma)
vadd.vv  v6, v7, v8  #avl=2 #vtype=(e64, m1, ta, ma)

I've lifted these names out of the ISA spec.

avl (Application Vector Length) is the maximum number of elements that we want to process. For this SIMD proposal it's always an immediate with the number of lanes; however, for the purposes of VCode it can also be a register, which is required for interoperability with a future dynamic vector implementation.

vtype is the rest of the configuration data for the vector hardware.

There is additional state that I'm ignoring here (e.g. vstart and the fixed-point state in vxrm/vxsat).

Not sure if we need these, but we can handle them in the same manner as vtype and insert their respective mode-switching instructions in the same pass.

Additionally, each instruction has an optional mask register, which is handled as a normal register input to the instruction; when unmasked, it does not show up in the assembly.
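As a sketch, the state carried on each vector VCode instruction could look something like the following. All names here are hypothetical stand-ins, not the actual backend types:

// Hypothetical sketch of the per-instruction vector state.
type Reg = u8; // stand-in for the real register type

#[derive(Clone, Copy, PartialEq)]
enum Avl {
    Imm(u32), // fixed lane count: all the SIMD implementation needs
    Reg(Reg), // element count in a register, for future dynamic vectors
}

#[derive(Clone, Copy, PartialEq)]
struct VType {
    sew: u16,            // element width in bits (e8/e16/e32/e64)
    lmul: u8,            // register grouping factor (m1 for now)
    tail_agnostic: bool, // ta vs. tu
    mask_agnostic: bool, // ma vs. mu
}

struct VecAluInst {
    vd: Reg,
    vs1: Reg,
    vs2: Reg,
    mask: Option<Reg>, // optional mask; omitted from the assembly when unmasked
    avl: Avl,
    vtype: VType,
}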

1.2 The VCode Pass

After Instruction Selection (but before Register Allocation!) we need to run a custom VCode pass.

This pass walks the VCode forwards and keeps track of the "current" vector configuration. Whenever an instruction requests a different one, we emit a vsetvli.

The reason this is done pre-regalloc is dynamic vectors: for an actual dynamic vector implementation, avl is probably not an immediate but a register holding the number of elements that we want to process, so the pass also needs to interact with regalloc. I don't expect to have to do that for SIMD yet, but this pass should probably work for both the current SIMD implementation and a future dynamic vector implementation.

The current calling convention clobbers the vector configuration on all calls. So we also need to keep track of that and query the ABI layer.
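Here's a minimal sketch of that forward walk, using toy Inst/VectorConfig types (the real pass would operate on the backend's VCode and also handle the regalloc interaction mentioned above):

#[derive(Clone, Copy, PartialEq)]
struct VectorConfig { avl: u32, vtype: u32 } // packed stand-in for avl + vtype

#[derive(Clone)]
enum Inst { Call, VecOp(VectorConfig), Vsetvli(VectorConfig), Other }

fn insert_vsetvlis(insts: &[Inst]) -> Vec<Inst> {
    let mut out = Vec::with_capacity(insts.len());
    let mut current: Option<VectorConfig> = None; // None = unknown state
    for inst in insts {
        match inst {
            // Calls clobber vl/vtype, so forget the current configuration.
            Inst::Call => current = None,
            // Emit a vsetvli only when the requested configuration differs.
            Inst::VecOp(cfg) if current != Some(*cfg) => {
                out.push(Inst::Vsetvli(*cfg));
                current = Some(*cfg);
            }
            _ => {}
        }
        out.push(inst.clone());
    }
    out
}

A real pass would also have to reset (or, as described next, inherit) the tracked state at basic-block boundaries.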

A neat idea to further optimize this is to inherit the vector configuration when all blocks dominating the current block end in the same vector configuration. This avoids having to emit a vsetvli in each basic block when the configuration never changes.

A downside of this pass is that we need to run it to get correct codegen, even if we never emit a vector instruction. I don't know the performance implications of this, but it's something to keep in mind.

This approach is quite similar to what LLVM does; see page 41 of this presentation for more details.

Some other ideas are discussed below in 3. Alternative ways of emitting vsetvli.

2. Additional considerations

2.1. Spills and Reloads

We can't do a dynamically sized spill/reload, which is going to be an issue for implementing full dynamic vectors. (See also the discussion here: https://github.com/bytecodealliance/rfcs/pull/19#issuecomment-998999682)

But since that isn't implemented yet, and we don't use vector registers for anything else, maybe we can do a fixed-size 128-bit store/load for now?

This is definitely incompatible with a full dynamic vector implementation, but for that to work we would need to save the full registers anyway, and with full-register saves the scheme above should still work.

2.2. Calling Convention for Vectors

All vector registers are caller-saved; vl and vtype are also caller-saved.

The standard states:

Vector registers are not used for passing arguments or return values; we intend to define a new calling convention variant to allow that as a future software optimization.

Clang does a dynamic stack store and seems to pass everything via the stack. This is the same problem as 2.1. Spills and Reloads.

2.3. Regalloc fun for small vectors implementations

(See §18.1. Zvl*: Minimum Vector Length Standard Extensions of the V Extension Spec)

The minimum vector register width is 32 bits. This means that in the worst case we need to group up 4 registers to process a single 128-bit operation. (This is something you can do with RISC-V vector hardware, but hopefully we won't have to.)

This affects regalloc: if we compile with a minimum vector register width of 32 bits, we effectively only have 8 registers to work with.
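As a quick back-of-the-envelope check (a simple model, assuming each 128-bit operation must be covered by a single power-of-two-sized register group):

// Independently allocatable register groups for a 128-bit operation,
// as a function of the hardware's VLEN (simple model).
fn usable_groups(vlen_bits: u32) -> u32 {
    let lmul = (128 / vlen_bits).max(1); // registers grouped per operation
    32 / lmul                            // out of 32 architectural vregs
}

fn main() {
    assert_eq!(usable_groups(128), 32); // Zvl128b: every register usable
    assert_eq!(usable_groups(64), 16);  // Zvl64b: groups of two
    assert_eq!(usable_groups(32), 8);   // Zvl32b: groups of four
}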

This is a problem because we have to share our regalloc space with the float registers since we don't have space for an additional register class (see: regalloc2/#47). This means that we need to have the same number of float registers as vector registers. (At least I'm not seeing any clever regalloc tricks that we can pull off here)

My solution for this is to ignore Zvl32b and Zvl64b for now.

Additionally §18.3. V: Vector Extension for Application Processors states:

The V vector extension requires Zvl128b.

So it seems like a reasonable expectation that RISC-V CPUs running Linux will have Zvl128b, and if this turns out to be untrue we can change regalloc to deal with it.

3. Alternative ways of emitting vsetvli

This applies to both the SIMD implementation and future Dynamic Vector implementations so we need to keep that in mind.

3.1 Keeping state during instruction selection

It would be neat if we could query the last set element length during instruction selection; that way we could minimize the number of emitted vsetvli instructions.

Instruction selection is done in backwards order, so this seems infeasible. (If anyone has any ideas here, let me know!)

3.2 Post-Lowering VCode Delete pass

I'm including this for completeness because it doesn't seem like a great option.

To avoid making the VCode pass mandatory we could emit a vsetvli + inst pair for every instruction that we lower. That way the initial lowering would be "correct", just not great.

After that we can have an optional forward VCode pass that removes redundant vsetvli instructions.

The advantage here is that this pass is optional! If running it is slower than just emitting (almost) double the instructions, we can skip it when lowering unoptimized code.

I don't like this option very much.

3.3 Using the Register allocator somehow?

While writing this up I kept thinking that all of this seems awfully similar to a small register allocator, where vtype is an implicit register to each instruction. Can we somehow make this interact with regalloc?

We're out of regclasses anyway so I didn't consider this option for too long.

alexcrichton commented 1 year ago

I've got no prior knowledge of RISC-V, and I'm by no means a Cranelift expert, but this all sounds quite good! Two random thoughts I had reading over this:

ta, ;; Tail-agnostic mode (the rest of the register past 128 bits is left undefined)

This reminds me of SSE-vs-AVX on the x64 architecture. SSE instructions only modify the lower 128 bits of each register, leaving the upper bits untouched. AVX instructions, however, all explicitly zero the upper portions of each register. I believe one of the reasons for doing this was to prevent false dependencies between instructions and registers. For example, if an SSE instruction modifies a register and that result is then fed into an AVX instruction, the AVX instruction might mistakenly treat whatever instruction prior to the SSE instruction defined the register as an additional dependency, and stall the processor pipeline until it's done. This isn't necessary when the AVX instruction only operates on the lower 128 bits, though.

Now all of this is x64-specific and also I think it has to do with really specific details of how the processor is implemented (e.g. it assumes the processor has its own means of tracking dependencies between instructions and such). I bring all this up because it sounded pretty similar to this Tail-Agnostic mode? Is this something that the RISC-V spec recommends and/or are we following suit with LLVM here? I would otherwise naively say that the lesson learned from x64 was to explicitly zero upper bits of registers.

That being said, I'm not sure how this fits in RISC-V: if you explicitly configure the hardware with the size of register you're operating over, I'm not sure where the tail end shows up, since we'll configure 128-bit registers and all operations will work over full 128-bit registers.

After Instruction Selection (but before Register Allocation!) we need to run a custom VCode pass.

I actually naively thought that after emission we had lost all block-related structure and loops and such. This control flow is naturally quite important for a pass such as this, however. You also mention though that we can see if configuration can be inherited from dominating blocks so it sounds like this sort of control flow isn't lost? Mostly just wanted to confirm that we've still got all known control flow at this point.

Also is block domination the metric we'd want to use here? If block A dominates block B then that doesn't guarantee that A's vector configuration will make its way to B because all we can say is that A definitely executed before B but we can't say that nothing else executed between A and B?

alexcrichton commented 1 year ago

Oh also something that wasn't clear to me:

A downside of this pass is that we need to run it to get correct codegen, even if we never emit a vector instruction.

Why is this? I'm not able to piece together why we'd run this pass even if we didn't emit vector instructions.

afonso360 commented 1 year ago

I bring all this up because it sounded pretty similar to this Tail-Agnostic mode? Is this something that the RISC-V spec recommends and/or are we following suit with LLVM here? I would otherwise naively say that the lesson learned from x64 was to explicitly zero upper bits of registers.

The other mode we have is tu (tail-undisturbed), which preserves the previous value of the register. My understanding is that tail-agnostic is an optimization: it allows processors to ignore the rest of the register, which has some performance benefits.

Here's the relevant part in the spec

The agnostic policy was added to accommodate machines with vector register renaming. With an undisturbed policy, all elements would have to be read from the old physical destination vector register to be copied into the new physical destination vector register. This causes an inefficiency when these inactive or tail values are not required for subsequent calculations.


That being said, I'm not sure how this fits in RISC-V: if you explicitly configure the hardware with the size of register you're operating over, I'm not sure where the tail end shows up, since we'll configure 128-bit registers and all operations will work over full 128-bit registers.

The tail only shows up if we are mixing different vector sizes, which we don't with the current SIMD implementation.


I actually naively thought that after emission we had lost all block-related structure and loops and such. This control flow is naturally quite important for a pass such as this, however. You also mention though that we can see if configuration can be inherited from dominating blocks so it sounds like this sort of control flow isn't lost? Mostly just wanted to confirm that we've still got all known control flow at this point.

After emission I think we no longer have that, but I was planning on inserting this pass after the initial VCode creation. At that point we still have the full VCode layout, which we can modify before feeding the result to regalloc.

As far as I understand we still have the full block information there. (And I really hope I'm not wrong on this since it would put quite a dent on this plan)


Also is block domination the metric we'd want to use here? If block A dominates block B then that doesn't guarantee that A's vector configuration will make its way to B because all we can say is that A definitely executed before B but we can't say that nothing else executed between A and B?

Oh, that's right! That's not really what we want; I guess we want to check the immediate dominator of this block (i.e. no intermediate blocks).


A downside of this pass is that we need to run it to get correct codegen, even if we never emit a vector instruction.

Why is this? I'm not able to piece together why we'd run this pass even if we didn't emit vector instructions.

I was going to say that we don't know if that happened. But we can always keep track of that? Do we have a way of storing that information in the vcode?

Otherwise I agree, if we emit no vector instructions we shouldn't need that pass.

cfallin commented 1 year ago

Oh, that's right! That's not really what we want; I guess we want to check the immediate dominator of this block (i.e. no intermediate blocks).

A quick note on this before diving into the full writeup (thanks for thinking through all of this, by the way!): idom isn't quite right either, I think. The idom of a block isn't necessarily an immediate predecessor, just the closest block that does dominate (does not dom any other block that doms the block in question). For example, in an if-else diamond, the tail block's idom is the head block (before the if-else), but either of the if-branch or else-branch could have modified state.

So I think what's needed is actually a forward dataflow analysis. We could avoid doing the fixpoint by always assuming state is unknown entering a loop (any block with a pred that is not earlier in RPO) and then do a forward scan, accumulating and meeting (on the semilattice) state at the end of forward edges as we see them.
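As a sketch, the meet on that semilattice is just "agree, or collapse to unknown" (toy types, not actual Cranelift code):

#[derive(Clone, Copy, PartialEq)]
enum VState { Unknown, Known(u32) } // u32 stands in for an (avl, vtype) pair

// Meet of two incoming edge states: agreement is preserved, any
// disagreement (or an already-unknown side) collapses to Unknown.
fn meet(a: VState, b: VState) -> VState {
    match (a, b) {
        (VState::Known(x), VState::Known(y)) if x == y => a,
        _ => VState::Unknown,
    }
}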

I'd actually be curious how much difference there would be with a much simpler approach: assume state is unknown entering every basic block, and configure vector state as needed. This could even be tracked in the EmitState, so would not need a separate pass before VCode::emit (it would just need a "reset at top of basic block" hook on the EmitState). Then any vector inst could potentially emit the "set up state for what I expect" before the actual inst, and update EmitState appropriately. (Detail to take care of: max inst size needs to account for that.)

alexcrichton commented 1 year ago

The other mode we have is tu (tail-undisturbed), which preserves the previous value of the register

Ah nice! Ok so sounds like what you're recommending matches with the "lesson learned" from x64, so sounds good!


The tail only shows up if we are mixing different vector sizes, which we don't with the current SIMD implementation.

Ok, now you've got me curious though. I realize it's probably not relevant for wasm/Cranelift since it only supports 128-bit vectors for now, but in the abstract I'm still not actually sure how this would show up. It sounded like each instruction runs in the previously configured configuration, so in that sense isn't the width of the registers always fully defined, with instructions all operating on full register widths? Or are there some instructions which operate on, for example, half of whatever the current register width is?

Given my curiosity here I should probably also go read over the blog you linked which likely has an example of when ta-vs-tu comes up.


I was going to say that we don't know if that happened. But we can always keep track of that? Do we have a way of storing that information in the vcode?

Ah ok makes sense! I do agree that it might be a bit tricky to track whether vector instructions were emitted, and it wouldn't be the first thing that I would reach for. My gut instinct though would be to implement the pass you've outlined (pending an alternative strategy) and only have this sort of optimization to skip the pass entirely if it's measured to be too expensive (for some threshold of "too expensive")

jameysharp commented 1 year ago

Thank you for the thorough explanation!

Regarding adding more register classes, it looks like Cranelift quit using regalloc2::OperandKind::Mod as of #5073 last October. So hopefully we're on track for fixing bytecodealliance/regalloc2#47 "soon". But it sounds like you can safely get this working with the assumption that vector registers are at least 128 bits wide, and extend to grouped registers later if necessary.

I actually like the "Post-Lowering VCode Delete pass" option better than a pass that inserts instructions. Vec::retain is O(n) in the number of instructions, works in-place, and is in std. By contrast, if you want to insert a bunch of randomly-located elements in linear time, the easiest way is to allocate a new buffer and copy into it. If you know how many instructions you need to insert then you can do it in linear time by working backward, but that isn't conveniently provided by std.
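For instance, the deleting pass could be a single Vec::retain over the instruction buffer (a sketch with toy types; as with the inserting pass, a real version would also reset the tracked state at basic-block boundaries):

#[derive(Clone, Copy, PartialEq)]
struct VectorConfig(u32); // packed stand-in for avl + vtype

enum Inst { Vsetvli(VectorConfig), Call, Other }

// Drop every vsetvli that re-establishes the configuration already in
// effect; calls clobber vl/vtype and so reset the tracked state.
fn remove_redundant_vsetvlis(insts: &mut Vec<Inst>) {
    let mut current: Option<VectorConfig> = None;
    insts.retain(|inst| match inst {
        Inst::Vsetvli(cfg) if current == Some(*cfg) => false, // redundant
        Inst::Vsetvli(cfg) => { current = Some(*cfg); true }
        Inst::Call => { current = None; true }
        Inst::Other => true,
    });
}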

We wouldn't want to track in the vcode whether any vector instructions were used, but I think lowering should be able to carry a flag through the lowering context. So if you do need to do a pass over the vcode I don't think it's hard to avoid paying that cost when you don't need it.

That said, does the register allocator need to see the vsetvli instructions? If you keep some sort of side table about which points require reconfiguring the vector mode, you could look up during instruction emission whether you need to emit a vsetvli first, and avoid actually editing the vcode. There's a similar idea for emitting moves that register allocation inserted, without editing the vcode.

I was thinking that emission happened in backward order, but only lowering does that I guess. So yeah, I like Chris' suggestion about tracking state during emission within only the local basic block. A forward dataflow pass may get better results but it's worth seeing how far we can get with the simpler option first.

cfallin commented 1 year ago

Indeed, with the removal of Mod I reassigned the associated bit in Operand to the class field, so we can have up to four classes now.

Re: extra pass -- to be clear, I think we can get away without any extra pass, as opposed to the two options above of "deleting pass" and "inserting pass". It would work the same way that our pseudoinsts do today: in e.g. the emit impl for a vector ALU instruction, we do something like

// Ask the EmitState whether the vector configuration needs to change;
// if so, it hands back the vsetvli to emit first and records the new state.
if let Some(vsetvli) = emit_state.setup_for_state(lane_config) {
    vsetvli.emit(buf, ...);
}
// Then emit the actual vector instruction as usual.
emit(actual_opcodes);

that avoids the overhead of the pass, and of materializing the insts into a buffer.

afonso360 commented 1 year ago

A quick note on this before diving into the full writeup (thanks for thinking through all of this, by the way!): idom isn't quite right either, I think. The idom of a block isn't necessarily an immediate predecessor, just the closest block that does dominate (does not dom any other block that doms the block in question).

Ha! I need to go read that stuff again more carefully!

I'd actually be curious how much difference there would be with a much simpler approach: assume state is unknown entering every basic block, and configure vector state as needed. This could even be tracked in the EmitState, so would not need a separate pass before VCode::emit (it would just need a "reset at top of basic block" hook on the EmitState). Then any vector inst could potentially emit the "set up state for what I expect" before the actual inst, and update EmitState appropriately. (Detail to take care of: max inst size needs to account for that.)

Not needing a separate pass would be really nice. And I think it might work!

I think this prevents analysis of the previous vector state between blocks, right? But even if it does, we can always move to a separate pass later.

I really like this!


Or are there some instructions which operate on, for example, half of whatever the current register width is?

Yeah, the simplest one I can think of is vmv.s.x, which moves a scalar register into element 0 of a vector and leaves the rest of the vector up to the tail policy.

Also, Vector Reduction Operations reduce the vector into its first element and consider the rest of the vector the "tail".

So a vredsum.vs (Vector Reduce Sum) instruction with tail undisturbed would place the sum in the first element of the vector, but leave the other elements unchanged.
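A tiny model of that behavior in plain Rust (made-up lane values; vredsum_tu is a hypothetical function modeling the semantics, not hardware):

// Models vredsum.vs with vl = 4, e32, tail-undisturbed: the reduction
// lands in element 0 and the remaining lanes keep their old contents.
fn vredsum_tu(dest_old: [i32; 4], src: [i32; 4], acc: i32) -> [i32; 4] {
    let mut out = dest_old;
    out[0] = acc + src.iter().sum::<i32>();
    out
}

fn main() {
    let d = vredsum_tu([9, 9, 9, 9], [1, 2, 3, 4], 0);
    assert_eq!(d, [10, 9, 9, 9]); // with ta, lanes 1..=3 would be undefined
}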

I'm not sure which instruction we have for, for example, f64 -> f32 conversions, but I would assume those would also leave half the register up to the tail policy.

Edit: Actually, Vector Narrowing Operations work the other way around: one of the source registers is accessed at twice the current element width. So we would configure the hardware for 4 elements of 32 bits, and the instruction reads 4x64 bits from one of the source registers.


But it sounds like you can safely get this working with the assumption that vector registers are at least 128 bits wide, and extend to grouped registers later if necessary.

I think we should be ok for now. I just didn't want to outright exclude any implementation of vectors that isn't >=128b.

I was thinking that emission happened in backward order, but only lowering does that I guess. So yeah, I like Chris' suggestion about tracking state during emission within only the local basic block. A forward dataflow pass may get better results but it's worth seeing how far we can get with the simpler option first.

Yeah, I agree with that! It seems like a good idea for a first implementation.

cfallin commented 1 year ago

I think this prevents analysis of the previous vector state between blocks, right? But even if it does, we can always move to a separate pass later.

Right, it needs to be basic-block-local for now. We could definitely do better in several ways if we need to later... e.g. I can see a way to build the single-pass-only variant of the analysis into the emit pass by giving EmitState methods to "observe EmitState from source at destination of forward edge" (merge vector states, go to "unknown" if conflict) and "observe destination of backward edge" (clear known vector state).

jameysharp commented 1 year ago

If it turns out to matter later, I think we could build a summary of inter-block vector state during lowering. We'll have seen all blocks during that phase and we don't need the information until the emission phase. If there's a call or vector instruction in a block, we know that block's final state with certainty. For the remaining blocks, we can either do a forward dataflow analysis to fill in the state, or just treat them as having unknown state. Then determining each block's entry state is a linear pass over all edges.

afonso360 commented 1 year ago

Closing this since the initial part has landed in #6240.