bluespec / Toooba

RISC-V Core; superscalar, out-of-order, multi-core capable; based on RISCY-OOO from MIT

scheduling and timing #27

Closed jonwoodruff closed 1 year ago

jonwoodruff commented 2 years ago

This is a raft of improvements made on our CHERI branch that apply to general performance.

  1. Scheduling improvements. Toooba (RISCY-OOO) previously required that a great many rules in the pipeline conflict with any rule that could report a misspeculation (specifically, two ALU rules that finished a misspeculated branch, or some commit rules). Many rules in the pipeline that consumed speculative state had to wait dynamically for the conditions of those rules to resolve before they could fire. This caused both timing issues, due to the dynamic dependencies on these rules, and Bluespec scheduling issues, as these rules had to fit into a schedule relative to all the others. This patch greatly alleviates the issue from multiple angles. First, the wrongSpec report is now deterministically always given priority. (Previously this was left up to the scheduler, and the wrong decision would cost ~10% in cycles on CoreMark, for example.) This also simplifies the code. Second, the wrongSpec report is buffered in GlobalSpecUpdate so that all rules that depend on the broadcast depend only on the one broadcast rule. Buffering is also added in the SpecFifo modules, though changes are applied before any reads in the next cycle, so there is no logical change. These changes yielded a small additional performance improvement, presumably by decoupling the scheduling of rules that were previously deemed to conflict.

  2. Timing improvements. Several rounds of timing improvements were made in an attempt to reach 50 MHz on the Stratix X with two cores. These included buffering and pre-calculation in the branch prediction modules, as well as approximating the decode epoch so that prediction could proceed more in parallel.

  3. Area improvements. The main change is using the NonPipelined versions of the Divider and SquareRooter in the FPU, which can reduce area by about 10%. There is also now compression in the BTB: n ways hold compressed targets (for branches where the upper bits (>16) of the target match those of the branch PC), and a single way is dedicated to full branch targets. This also improves performance, as a BTB with an associativity of 2 now has an extra way for "far" branches.
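
To make the BTB compression in item 3 concrete, here is a rough behavioural model in Python. It is only a sketch and not the Bluespec from this patch: the class name `CompressedBtb`, the helper functions, the 16-bit split, the indexing scheme and the round-robin replacement are all assumptions chosen for illustration.

```python
LOW_BITS = 16                      # assumed: only the low 16 target bits are stored
LOW_MASK = (1 << LOW_BITS) - 1


def is_near(pc, target):
    """A target is 'near' when its upper bits match those of the branch PC."""
    return (pc >> LOW_BITS) == (target >> LOW_BITS)


def compress(target):
    """Keep only the low bits; the upper bits are recovered from the PC."""
    return target & LOW_MASK


def expand(pc, low):
    """Rebuild a full target from the PC's upper bits and the stored low bits."""
    return ((pc >> LOW_BITS) << LOW_BITS) | low


class CompressedBtb:
    """Hypothetical model: n ways of compressed targets plus one full-target way."""

    def __init__(self, num_sets=4, near_ways=2):
        self.num_sets = num_sets
        # Each near entry is (tag, low_target_bits) or None.
        self.near = [[None] * near_ways for _ in range(num_sets)]
        # Each far entry is (tag, full_target) or None: a single way per set.
        self.far = [None] * num_sets
        self.next_victim = [0] * num_sets   # round-robin pointer (assumed policy)

    def _index_tag(self, pc):
        word = pc >> 1                      # assumes 2-byte instruction alignment
        return word % self.num_sets, word // self.num_sets

    def update(self, pc, target):
        idx, tag = self._index_tag(pc)
        ways = self.near[idx]
        # Drop any stale entry for this branch so near and far never disagree.
        for w, entry in enumerate(ways):
            if entry is not None and entry[0] == tag:
                ways[w] = None
        if self.far[idx] is not None and self.far[idx][0] == tag:
            self.far[idx] = None
        if is_near(pc, target):
            if None in ways:
                w = ways.index(None)        # prefer an empty way
            else:
                w = self.next_victim[idx]   # otherwise evict round-robin
                self.next_victim[idx] = (w + 1) % len(ways)
            ways[w] = (tag, compress(target))
        else:
            self.far[idx] = (tag, target)   # far targets are stored in full

    def predict(self, pc):
        idx, tag = self._index_tag(pc)
        for entry in self.near[idx]:
            if entry is not None and entry[0] == tag:
                return expand(pc, entry[1])
        entry = self.far[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None                         # BTB miss


btb = CompressedBtb(num_sets=4, near_ways=2)
btb.update(0x80001000, 0x80002000)   # near: upper bits match the PC
btb.update(0x80001008, 0x90000000)   # far: needs the full-target way
```

In this model the near ways cost only `LOW_BITS` of target storage each, while the single far way keeps a set from losing a long-distance branch entirely, which matches the intuition above that a 2-way BTB effectively gains an extra way for "far" branches.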

These changes have been tested with synthesis (dual core) and by booting FreeBSD on the VCU118.

These commits probably need a bit of cleanup, so feel free to comment!