Ravenslofty / blog

The Critical Path - a rambly FPGA blog
49 stars 2 forks source link

04/03/2024: a sketch of a silly risc-v cpu #15

Open Ravenslofty opened 8 months ago

Ravenslofty commented 8 months ago

If somebody opens this hoping that there'll be actual HDL here, they're going to be a bit disappointed. I don't really have the time budget required to develop a CPU, even if for a hobby, but I wanted to sketch out an idea that has been floating around in my head for some time, and maybe it could inspire somebody to develop their own CPU based on this.

What surprised me most about RISC-V is not so much the instruction set (which I have...opinions on), but how people have taken that instruction set and - perhaps as a self-imposed challenge - implemented it in a variety of nonconventional ways. For example, implementing it bit-serially, or with microcode.

Over the Christmas holiday I studied the AT&T Hobbit CPU (which...perhaps deserves its own article), but one thing which amazed me is that the CPU supports a limited form of instruction fusion: in the internal (semi-documented) uop format, every instruction is a branch to the next instruction, which means unconditional branches and certain conditional branches can be fused into the pervious instruction. It allows the CPU to easily handle variable-width instructions, too. I think that's pretty neat for a early-90's embedded CPU.

So, roughly thirty years later, let's ask what a CPU designed for instruction-fusion maximalism might look like.

First, let's set an instruction set. This would arguably be applicable to most of the RISC-V subsets, but let's pick RV32I as the minimal proof of concept. Let's also specify that we're looking for sequences that are at most 4 instructions long (so, 16 bytes at a time).

When looking for vaguely interesting instruction sequences, I will shamelessly admit that I put Dhrystone through Clang and examined the output.

When one thinks about RV instruction fusion, they generally think about the pattern of loading a 32-bit offset for something split over a 20-bit high section and a 12-bit low section; for example:

        lui     a0, %hi(.Lstr.46)
        addi    a0, a0, %lo(.Lstr.46)
        j       .LBB0_7

We could fuse these together into a uop that can load a 32-bit immediate - let's call it u.li32 - turning it into something more like this:

        u.li32  a0, .L.str.46
        j       .LBB0_7

But in our uop format, every uop is a jump, so we're jumping to an unconditional jump. Let's say, I dunno, u.li.j means "load immediate and jump" which is implemented by setting the jump address in the uop as the branch target address.

        u.li.j  a0, .L.str.46, .LBB0_7

Perhaps this looks familiar if you squint: the lij instruction is a generalisation of jal, in that a register is set to an immediate value and then an unconditional branch occurs.

So far, this implies that our uop format has a field for internal operation (add, subtract, shift, whatever); a field for first register operand; a 32-bit field for immediates/second register operand; a field for destination address/register; and a field for next PC.

What about branches? Here you can add a second field for "alternate next PC", ordering fields such that the "alternate next PC" is selected if the next branch is not taken.

I admit that a single post to a relatively obscure blog is unlikely to have much effect, but I wanted to share this concept with people all the same: I think it'd be funny to have a CPU that can squeeze three or four instructions down into a single uop.