OpenXiangShan / XiangShan

Open-source high-performance RISC-V processor
https://xiangshan.cc

Why does only one execution unit support vppu? & vcompress optimization suggestion #3488

Closed: camel-cdr closed this issue 2 months ago

camel-cdr commented 2 months ago


Describe the question

I noticed that the current XiangShan configuration has only one execution unit that can execute the RVV permutation instructions (i.e., only one unit supports vppu), and it's the same unit that handles vector integer operations (vialuFix).

I think the permutation instructions will be used more often in integer workloads than in floating-point ones, so sharing the single vppu with vialuFix creates contention. Having only one vppu-capable execution unit is also not competitive even with existing RISC-V processors: the C910/C920 cores can dual-issue permutation instructions. Since an LMUL>1 vrgather requires LMUL^2 uops (64 at LMUL=8, 16 at LMUL=4), a single unit greatly hurts the viability of LMUL=8 and LMUL=4 vrgather: a second unit would cut those from 64 to 32 and from 16 to 8 cycles, while the surrounding regular vector instructions take only 4 and 2 cycles.

I think it would be better to move the vppu support from VFEX0 to VFEX3, and, if justifiable within the implementation budget, also add vppu support to VFEX4.

Suggestion:

before:                              after:
VFEX0: vfma,vialuFix,vimac,vppu      VFEX0: vfma,vialuFix,vimac
VFEX1: vfalu,vfcvt,vipu,vsetrvfwvf   VFEX1: vfalu,vfcvt,vipu,vsetrvfwvf
VFEX2: vfma,vialuFix            ---> VFEX2: vfma,vialuFix
VFEX3: vfalu,vfcvt                   VFEX3: vfalu,vfcvt,vppu
VFEX4: vfdiv,vidiv                   VFEX4: vfdiv,vidiv,vppu

I had an idea for how to implement LMUL>1 vcompress with cost that scales linearly in LMUL instead of quadratically, as the current implementation's does: https://gist.github.com/camel-cdr/f2cc9cdf6ac9499f069357784f53b324 Does this sound reasonable? I'm not a hardware person, so maybe I've overlooked something.

With just two permutation-capable execution units, you might be able to get LMUL=4 SEW=8 vcompress from the current 18 cycles down to 6 cycles, and LMUL=8 from 117 cycles down to 12. The current cycle counts were measured with an unrolled loop in RTL simulation. Here is how I'd imagine the execution working:

LMUL=4 vcompress:
          VFEX3 vppu                                VFEX4 vppu                           VFEX1 vipu
cycle=1                                                                                  idx, off = advance_idx_and_off(idx, off, m[0])
cycle=2   vd[0]   = vcompress(vs[0], m[0]);         tmp1      = vcompress(vs[1], m[1])   idx, off = advance_idx_and_off(idx, off, m[1])
cycle=3   vd[idx] = vslideup(vd[idx], tmp1, off);   vd[idx+1] = vslidedown(tmp1, off);
cycle=4   tmp1    = vcompress(vs[2], m[2]);         tmp2      = vcompress(vs[3], m[3]);  idx, off = advance_idx_and_off(idx, off, m[2])
cycle=5   vd[idx] = vslideup(vd[idx], tmp1, off);   vd[idx+1] = vslidedown(tmp1, off);   idx, off = advance_idx_and_off(idx, off, m[3])
cycle=6   vd[idx] = vslideup(vd[idx], tmp2, off);   vd[idx+1] = vslidedown(tmp2, off);

Note: idx and off together require 21 bits, so you could pack them into a single temporary integer register.

So in total this requires two vppu-capable execution units, two temporary vector registers, one temporary integer register, and a new uop in the vipu that computes off += vcpop(mask); idx += upper_bit_set(off); off = zero_upper_bit(off);.
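
To make the new uop concrete, here's a minimal scalar sketch of what I have in mind; it assumes elems_per_reg = VLEN/SEW is a power of two and that idx and off are packed into one scalar register as suggested above. All names here are mine, illustrative only, not anything from the XiangShan RTL:

```c
#include <stdint.h>

/* Illustrative scalar model of the proposed vipu uop. Since
 * elems_per_reg (= VLEN/SEW) is a power of two, upper_bit_set and
 * zero_upper_bit reduce to a shift and a mask. idx lives in the upper
 * bits of `packed`, off in the lower bits. */
static uint32_t advance_idx_and_off(uint32_t packed, uint32_t vcpop_mask,
                                    uint32_t elems_per_reg)
{
    uint32_t off = (packed & (elems_per_reg - 1)) + vcpop_mask; /* off += vcpop(mask) */
    uint32_t idx = packed / elems_per_reg   /* old idx */
                 + off / elems_per_reg;     /* idx += upper_bit_set(off): 0 or 1 */
    off &= elems_per_reg - 1;               /* off = zero_upper_bit(off) */
    return idx * elems_per_reg + off;       /* repack into one scalar register */
}
```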

See the gist for more detail, and a proof of concept.
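
In case the gist is too much detail, here's a rough scalar model of the whole decomposition. It's only a sketch: the helpers are placeholders with the usual single-register RVV semantics spelled out in comments, and I write the slide-down amount explicitly as EPR - off, which is what vslidedown(tmp, off) in the schedule above is meant to achieve:

```c
#include <stdint.h>

#define EPR 16  /* elements per register (VLEN/SEW); example value */
typedef struct { uint8_t e[EPR]; } vreg_t;

/* Placeholder single-register helpers with the usual RVV semantics. */
static vreg_t vcompress_one(vreg_t src, uint32_t mask) {      /* pack active elements */
    vreg_t out = {{0}}; int k = 0;
    for (int i = 0; i < EPR; i++)
        if (mask >> i & 1) out.e[k++] = src.e[i];
    return out;
}
static vreg_t vslideup_one(vreg_t dst, vreg_t src, int off) { /* dst[off+i] = src[i] */
    for (int i = 0; i + off < EPR; i++) dst.e[i + off] = src.e[i];
    return dst;
}
static vreg_t vslidedown_one(vreg_t src, int amt) {           /* out[i] = src[i+amt] */
    vreg_t out = {{0}};
    for (int i = 0; i + amt < EPR; i++) out.e[i] = src.e[i + amt];
    return out;
}

/* Linear-in-LMUL vcompress: one vcompress plus one slideup/slidedown pair
 * per source register; (idx, off) track where the next compressed elements
 * land in the destination group. Tail elements end up zeroed here; a real
 * implementation would follow the configured tail policy instead. */
void vcompress_lmul(vreg_t *vd, const vreg_t *vs, const uint32_t *m, int lmul) {
    int idx = 0, off = 0;
    for (int i = 0; i < lmul; i++) {
        vreg_t tmp = vcompress_one(vs[i], m[i]);
        vd[idx] = vslideup_one(vd[idx], tmp, off);            /* head of tmp */
        if (off && idx + 1 < lmul)                            /* overflow of tmp */
            vd[idx + 1] = vslidedown_one(tmp, EPR - off);
        off += __builtin_popcount(m[i]);                      /* advance_idx_and_off */
        idx += off / EPR;
        off &= EPR - 1;
    }
}
```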

Ziyue-Zhang commented 2 months ago

Thanks for the new ideas!

In the current Kunminghu version, we are focusing on the correctness of the vector extension implementation, as well as performance improvements on some of the SPEC CPU workloads. More complex vppu implementations bring timing problems, and the permutation instructions don't appear very often in SPEC CPU workloads, so optimizing the vppu module is not in the plan for the current version.

In the next version, we will focus on the performance of the vector extension and on improvements across SPEC CPU workloads, vector workloads, and AI workloads. Optimizing the vppu module is in our plan: we are going to implement a state machine for LMUL>1 permutation instructions to do the computation, so that we no longer need to split them into so many uops, while also reducing the number of cycles required to compute one permutation instruction. This approach is similar to the current implementation of vector load/store. This work may start next month or at the end of this year.

We are looking forward to working together to improve the performance of the vector extension in XiangShan!

camel-cdr commented 2 months ago

Great to hear! Is the next version a new version of Kunminghu, or another generation (as Kunminghu is to Nanhu)? Also, what's the status of vsetvl* prediction/speculation? I saw the talk at RISC-V Summit China that mentioned it, but I wasn't able to work out what the current status is.

Ziyue-Zhang commented 2 months ago

That's the new version of Kunminghu.

In the current implementation, only vtype speculation for vset* instructions is implemented: we maintain a speculative vtype at the decode stage, which is used to decode all subsequent vector instructions, and it is rolled back if the vtype was updated by a vset* instruction on the wrong path of a mispredicted branch. In the new version, we are going to implement vl prediction to further optimize performance.
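
To illustrate the mechanism described above, here is a minimal sketch (illustrative C, not the actual RTL; the checkpointing scheme is assumed, e.g. piggybacking on the snapshots taken for branches):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of decode-stage vtype speculation. */
typedef struct { uint8_t vsew, vlmul; bool vta, vma, vill; } vtype_t;

static vtype_t spec_vtype;   /* speculative vtype used to decode vector uops */

/* A vset* instruction updates the speculative vtype in program order;
 * a checkpoint of the old value is kept so the update can be undone. */
void decode_vset(vtype_t new_vtype) { spec_vtype = new_vtype; }

/* On branch misprediction, every younger vset* is squashed, so the
 * speculative vtype is restored from the checkpoint taken at the branch. */
void recover_vtype(vtype_t checkpoint) { spec_vtype = checkpoint; }
```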