Ravenslofty / blog

The Critical Path - a rambly FPGA blog
49 stars 2 forks source link

10/01/2023: vliw software pipelining #10

Open Ravenslofty opened 1 year ago

Ravenslofty commented 1 year ago

I'll be posting a few more articles from cohost, on the basis that one copy is none copy if nothing else.

the fediverse wanted me to post about this, but fedi itself is not so great for long-form content, so, cohost it is.

so, the itanium is generally considered to be quite crazy, but it comes with a surprising number of tricks up its sleeve. one of them is software pipelining, although the documentation also calls it modulo-scheduled loops because they like technical terms.

are you sitting comfortably? this'll be a relatively long explanation.


let's take a relatively simple loop, with four steps that must be executed in sequence. I'm going to omit the "check loop counter and branch" section for simplicity.

a diagram showing four boxes arranged vertically labeled "A", "B", "C", "D" connected with arrows in that order, with big arrows connecting D to A.

imagine that A loads from memory, B and C are logic operations, and then D writes to memory.

if we assume each step takes 1 cycle, then this loop takes 4 cycles to process 1 element. as a VLIW which would like to issue more than one instruction per cycle, this code makes us sad (Itanium can issue up to six instructions per cycle).

a better approach would be to execute N loop iterations simultaneously:

the above diagram, but with six parallel lines of A, B, C and D

this runs into a problem: it supposes we can issue all these instructions at once. if A and D are memory operations, then we need 6 load/store units in hardware; if B and C are logic operations, then we need 6 ALUs as well. this makes the CPU need a lot of silicon (although with SIMD operations in modern processors they did end up paying the silicon cost). the itanium was already a relatively huge chip, so Intel did not pay the silicon cost there.

what if we overlap loop iterations instead of executing them in parallel?

a loop of lines of boxes "A", "B", "C", and "D" arranged horizontally; there are arrows connecting the A of the first iteration to the B of the next iteration, for B to C and for C to D. above the loop is a prologue, where the B/C/D boxes of the first cycle are gone, then the C/D boxes are gone in the second cycle, and so on. below the loop is an epilogue, where the A box of the first cycle is gone, then the A/B boxes are gone in the second cycle, and so on.

I have unrolled the loop a little to make the dependencies clearer, because the values being carried across loop iterations is hard to draw. assuming that A and D are load/store operations and B and C are ALU operations, this code needs only two load/store units and two ALUs, while achieving 4 instructions per cycle at peak. however, a prologue and epilogue are needed to populate and depopulate the registers around the loop.

prologues and epilogues do take up space, however, and Itanium doesn't actually need them; it supports conditional execution, where the result of an operation doesn't take effect if a predicate register is zero.

imagine that A, B, C and D were predicated on their own independent register pA/pB/pC/pD; the prologue consists of setting pA on the first cycle, then pB on the second, then pC on the third. then pD is set on the fourth cycle, and we can loop until "almost done", clear pA on the first cycle, pB on the second, pC on the third. this is what the br.ctop instruction does.

so, on Itanium, the loop looks like this:

a line of boxes "A", "B", "C", and "D" arranged horizontally; there are arrows connecting the bottom to the top to show it's a loop

pretty neat, huh?