Proof-of-concept draft: Add a simple vector extension to femtorv

Here is a draft of my ideas that I came up with yesterday.

Caveat emptor

Take it for what it is: a sketchbook proof-of-concept experiment, totally untested. I did not even try to build it, and I am pretty sure that some state transitions will not work properly for certain vector operations.

Functionality

As described in the code comment, this change tries to:

Map vector registers on top of the scalar register file.
Add the VSETVL instruction (from the V extension) to manipulate the vector length (VL) register.
Add logic for iterating over the vector register elements while staying in the EXECUTE or EXECUTE+WAIT_ALU_OR_MEM states, until VL vector elements have been processed.
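To make the intended behaviour concrete, here is a small software model in C. It is only a sketch under assumptions that are not confirmed by the actual Verilog: vector register vN is assumed to occupy scalar registers x[4*N] to x[4*N+3] (eight vector registers on top of 32 scalar registers), VSETVL is assumed to clamp the requested length to four elements, and the names Core, vec_elem_reg, vsetvl and vec_add are made up for illustration.

```c
#include <stdint.h>

#define NUM_SCALAR_REGS 32
#define VEC_ELEMS 4 /* assumption: 32 scalar registers / 8 vector registers */

/* Hypothetical model of the core state: scalar register file plus VL. */
typedef struct {
    uint32_t x[NUM_SCALAR_REGS];
    uint32_t vl;
} Core;

/* Assumed mapping: element idx of vector register v lives in x[4*v + idx]. */
static unsigned vec_elem_reg(unsigned v, unsigned idx) {
    return v * VEC_ELEMS + idx;
}

/* Assumed VSETVL behaviour: VL = min(requested, VEC_ELEMS), returned in rd. */
static uint32_t vsetvl(Core *c, uint32_t requested) {
    c->vl = requested < VEC_ELEMS ? requested : VEC_ELEMS;
    return c->vl;
}

/* Vector add: one element per loop iteration, mirroring the idea of the
 * core staying in EXECUTE until VL elements have been processed. */
static void vec_add(Core *c, unsigned vd, unsigned vs1, unsigned vs2) {
    for (uint32_t vecIdx = 0; vecIdx < c->vl; ++vecIdx) {
        unsigned rd = vec_elem_reg(vd, vecIdx);
        uint32_t result = c->x[vec_elem_reg(vs1, vecIdx)] +
                          c->x[vec_elem_reg(vs2, vecIdx)];
        if (rd != 0) { /* x0 stays hard-wired to zero */
            c->x[rd] = result;
        }
    }
}
```

With this assumed mapping, v0 would overlap x0 (the zero register), and v7 would land on x28 to x31.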

Instruction encoding

To encode vector instructions, the two least significant bits of the instruction word are used (in RV32I these bits are always 11, so anything else indicates a vector operation). This is not compatible with the C extension, for instance, which uses exactly those other bit patterns for its 16-bit instructions, so some other encoding trick must be used if you want to support that (I am not well versed in RISC-V instruction encoding, but the CUSTOM_0 - CUSTOM_3 opcode pages could be a possibility).
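The decoding rule can be sketched in a few lines of C. The helper names are hypothetical, and the sketch treats all three non-11 bit patterns simply as "vector", since how the individual patterns are meant to be used is not spelled out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* In RV32I the two low bits of every instruction word are 11, so any other
 * value is taken to mark a vector operation in this draft encoding. */
static bool is_vector_insn(uint32_t word) {
    return (word & 0x3u) != 0x3u;
}

/* Restore the low bits to 11 to recover the scalar instruction that the
 * vector operation is derived from (e.g. so the normal decoder can be reused). */
static uint32_t scalar_equivalent(uint32_t word) {
    return word | 0x3u;
}
```

For example, 0x0045a383 is the regular `lw x7, 4(a1)`, while 0x0045a381 differs only in the low two bits and would be classified as a vector operation.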

Bugs / refactoring

I think that the source register lookup and destination register index (rdId) are broken for multi-cycle instructions (load/store/div). Specifically, vecIdx is not always updated in the right state/cycle.

Furthermore, the source register lookup is currently done in two different places (really it needs to be done in three different places, IIUC). It feels like this part can be refactored to solve the out-of-sync vecIdx problem and possibly reduce LUT usage as well.

Possible improvements

More vector registers

The current implementation only provides eight vector registers, of which 3-5 are usable in practice (V0 can never be used, and some scalar registers must be spared for scalar operations). It would be very simple, and valuable, to add more vector registers. All that is required is to double (or quadruple?) the number of scalar registers in registerFile. It is mostly a matter of balancing the size of the core (e.g. the number of LUTs).

Stride based load/store

Another feature that I have not added, but that is quite powerful, is support for on-the-fly generation of address strides. I think that a feasible solution would be to add special handling of the case when src2IsVec = 1 and src2 is an immediate value (e.g. isLoad | isStore | isALUimm), such that the immediate value is replaced by an incrementing (registered) value as follows: [0, IMM, 2*IMM, 3*IMM, ...].
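As a rough illustration of the stride idea, the following C sketch models the registered value as an accumulator that is reset at the start of a vector instruction and bumped by IMM after each element. StrideGen and the function names are invented for this example, and nothing here reflects existing femtorv code.

```c
#include <stdint.h>

/* Hypothetical model of the registered stride offset. */
typedef struct {
    uint32_t offset; /* current offset: 0, IMM, 2*IMM, ... */
} StrideGen;

/* Reset at the start of each vector instruction. */
static void stride_reset(StrideGen *g) {
    g->offset = 0;
}

/* Return the offset for the current element, then advance by IMM.
 * Unsigned arithmetic is used so that negative strides wrap as in hardware. */
static uint32_t stride_next(StrideGen *g, int32_t imm) {
    uint32_t current = g->offset;
    g->offset += (uint32_t)imm;
    return current;
}

/* Effective address of the current element for a strided load/store. */
static uint32_t element_address(uint32_t base, StrideGen *g, int32_t imm) {
    return base + stride_next(g, imm);
}
```

With base 0x1000 and IMM = 4, successive calls yield 0x1000, 0x1004, 0x1008, 0x100c, which is the access pattern a contiguous word copy needs.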

Writing programs

This is of course a major problem at the moment. No compiler / toolchain supports the new vector instructions (except for VSETVL).

For prototyping purposes I would personally only write vectorized code directly in assembly language (that also gives better control over scalar register allocation). I would first compile the corresponding scalar code, then hand-modify the generated machine code to use the encoding for vector instructions (i.e. modify the 2 LSBs), and emit the results as .word directives.

For instance, the following C code:

void foo(int* dst, const int* src, int num) {
    for (int i = 0; i < num; ++i) {
        dst[i] = src[i];
    }
}

...could be implemented in RISC-V assembly with vector instructions (assuming that stride-based load/store is supported):

foo:
    blez    a2, .L2
.L1:
    vsetvl  a4, a2, zero
    .word   0x0045a381  # lw    v7, +4(a1)
    .word   0x00752221  # sw    v7, +4(a0)
    sub     a2, a2, a4
    addi    a0, a0, 16
    addi    a1, a1, 16
    bnez    a2, .L1
.L2:
    ret

Since hand-encoding is far from convenient, in the long run you probably want to patch some toolchain (e.g. binutils/as) to support these instructions to some degree.
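Until then, a tiny helper could take some of the tedium out of the manual step. The sketch below (hypothetical function names, not part of any toolchain) replaces the two low bits of a scalar instruction word and formats the result as a .word directive. Which bit pattern to pass for a given operand combination depends on the encoding in the patch; the value 01 used in the example is simply what the hand-encoded words above use.

```c
#include <stdint.h>
#include <stdio.h>

/* Replace the two least significant bits of a scalar instruction word. */
static uint32_t with_low_bits(uint32_t scalar_word, uint32_t low_bits) {
    return (scalar_word & ~0x3u) | (low_bits & 0x3u);
}

/* Format an instruction word as an assembler .word directive.
 * Returns the number of characters written (as snprintf does). */
static int format_word_directive(char *buf, size_t size, uint32_t word) {
    return snprintf(buf, size, "    .word   0x%08x", (unsigned)word);
}
```

For instance, the scalar `lw x7, 4(a1)` assembles to 0x0045a383, and `with_low_bits(0x0045a383, 1)` gives 0x0045a381, matching the load in the listing above. Likewise `sw x7, 4(a0)` (0x00752223) becomes 0x00752221.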
