SpinalHDL / VexiiRiscv

Like VexRiscv, but, Harder, Better, Faster, Stronger
MIT License

Will VexiiRiscv be extended to support configurable multi-issue? #15

Open franktaTian opened 1 month ago

franktaTian commented 1 month ago

Hi, will VexiiRiscv be extended to support configurable multi-issue? For example, 4-issue, not just 1 or 2 issues.

Dolu1990 commented 1 month ago

Hi,

Technically speaking, I think it "already" supports it; it's just that the ParamSimple class (the thing which provides an easy way to configure the CPU) doesn't support more than 2. That is the only place where things related to this are hardcoded. Note that I have never tested anything with more than 2 issues.

franktaTian commented 1 month ago

Hi,

> Technically speaking, I think it "already" supports it; it's just that the ParamSimple class (the thing which provides an easy way to configure the CPU) doesn't support more than 2. That is the only place where things related to this are hardcoded. Note that I have never tested anything with more than 2 issues.

Cool!

Jzjerry commented 4 days ago

I tried to get a 4-issue Vexii simply by copying and adding `if (lanes >= 3)` and `if (lanes >= 4)` blocks in Param.scala, just like: https://github.com/SpinalHDL/VexiiRiscv/blob/05ed94c61b7042b7e5e5f8798a9b9e85f6d4d8c2/src/main/scala/vexiiriscv/Param.scala#L629-L653 and also set the number of decoders to 4. There was no problem with generation and simulation.
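For anyone wanting to reproduce this, the change described above might look roughly like the following sketch. The plugin helper names here are illustrative, patterned after the 2-lane block at the linked lines, and are not copied from upstream Param.scala:

```scala
// Hypothetical extension of the 2-lane pattern in ParamSimple:
// replicate the existing `if (lanes >= 2)` block for lanes 3 and 4.
// Helper names (newExecuteLanePlugin, srcPlugin, ...) are illustrative.
if (lanes >= 3) {
  plugins += newExecuteLanePlugin("lane2")
  plugins += srcPlugin("lane2")
  plugins += intAluPlugin("lane2")
  plugins += shiftPlugin("lane2")
}
if (lanes >= 4) {
  plugins += newExecuteLanePlugin("lane3")
  plugins += srcPlugin("lane3")
  plugins += intAluPlugin("lane3")
  plugins += shiftPlugin("lane3")
}
// The decoder count must also be raised to match the lane count:
decoders = 4
```

The key point from the discussion is that nothing outside ParamSimple hardcodes the 2-issue limit, so widening the configuration is mostly a matter of duplicating the per-lane plugin wiring.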

I benchmarked 3-issue and 4-issue RV32IMC on dhrystone and coremark:

The performance difference will be higher if you toggle more performance options like late-alu. Anyway, I believe there is no big problem with multi-issue; you can modify Param.scala to get even more lanes XD

Dolu1990 commented 4 days ago

> and also set the number of decoders to 4. There was no problem with generation and simulation.

LOL Nice :)

> 2-issue: 16149 Dhrystones/Second, 0.76 DMIPS/MHz. 1.53 Coremark/MHz.

Hmm, that is weird; the performance is well below what it should be.

Did you enable the branch predictors as well? Did you have caches? One thing to note is that by default, most performance-oriented features are disabled.

The one case where I can see that many lanes scale is AES (for instance) and other well-optimized code, as GCC will likely generate coupled code which does not take advantage of in-order execution across all those lanes.

Jzjerry commented 4 days ago

> Hmm, that is weird; the performance is well below what it should be.
>
> Did you enable the branch predictors as well? Did you have caches? One thing to note is that by default, most performance-oriented features are disabled.

I didn't enable anything beyond the defaults, LOL. If those performance features are enabled, we can get a larger gap between 2-issue and 4-issue, like 4.16 Coremark/MHz vs. 4.38 Coremark/MHz (tested with late-alu, lsu-l1, fetch-l1, and predictors).

Dolu1990 commented 4 days ago

There are a few more:

    withDispatcherBuffer = true // may make a big difference
    withAlignerBuffer = true    // will not make a big difference

> lsu-l1, fetch-l1

Did you increase the number of ways to at least 4?

Jzjerry commented 4 days ago

> There are a few more:
>
>     withDispatcherBuffer = true // may make a big difference
>     withAlignerBuffer = true    // will not make a big difference
>
> lsu-l1, fetch-l1
>
> Did you increase the number of ways to at least 4?

Yeah, they do amplify the advantage of multi-issue! I got 4.32 Coremark/MHz vs. 4.85 Coremark/MHz after adding the dispatcher buffer, and 4.51 vs. 5.04 after enabling all of them and increasing the L1 to 4 ways. Looks like the dispatcher buffer matters more🤔.
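To summarize the thread, the best-performing 4-issue configuration would combine the options discussed above. A rough ParamSimple sketch follows; note that only `withDispatcherBuffer` and `withAlignerBuffer` are named verbatim in the thread, and the remaining field names are guesses at the ParamSimple equivalents of the late-alu, lsu-l1, fetch-l1, predictor, and way-count options, so check Param.scala for the real names:

```scala
val p = new ParamSimple()
p.decoders = 4                 // 4-issue front end, as in the experiment above
p.lanes = 4
p.withDispatcherBuffer = true  // mattered most in the measurements above
p.withAlignerBuffer = true
// Hypothetical field names below -- verify against Param.scala:
p.withLateAlu = true           // "late-alu"
p.withFetchL1 = true           // "fetch-l1"
p.withLsuL1 = true             // "lsu-l1"
p.fetchL1Ways = 4              // 4-way L1, per Dolu1990's suggestion
p.lsuL1Ways = 4
p.withBranchPredictors = true  // "predictors"
```

With this kind of configuration the thread reports roughly 4.51 Coremark/MHz at 2-issue vs. 5.04 at 4-issue.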