enjoy-digital / litex

Build your hardware, easily!

Rocket: 'mem_axi' <--> LiteDRAM data-width impact on performance #299

Closed: gsomlo closed this 4 years ago

gsomlo commented 5 years ago

Assuming a dedicated Rocket mem_axi <-> LiteDRAM data port link, and a direct, "native" point-to-point AXI connection between the two, will an increased AXI data_width improve performance?

Given that Rocket transfers entire cache lines over 'mem_axi', would shorter, wider bursts actually improve performance as compared to longer, narrower ones?
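To make "shorter, wider" concrete: assuming a 64-byte cache line (Rocket's usual default block size; an assumption on my part, not something I instrumented), the burst length per cache-line refill works out as follows:

```python
# Beats per cache-line refill over 'mem_axi' at each candidate AXI data width.
# Assumes a 64-byte cache line (an assumption, see above).
CACHE_LINE_BYTES = 64

for width_bits in (64, 128, 256):
    beats = CACHE_LINE_BYTES // (width_bits // 8)
    print(f"{width_bits:3d}-bit AXI: burst of {beats} beats per cache line")

# Output:
#  64-bit AXI: burst of 8 beats per cache line
# 128-bit AXI: burst of 4 beats per cache line
# 256-bit AXI: burst of 2 beats per cache line
```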

I built Rocket LiteX/Linux variants using "WithMedCores", with the 'mem_axi' data width set to three different values: standard (64-bit), wide (128-bit), and double-wide (256-bit). By picking subsets of the ('dm', 'dq', 'dqs_p') pins in litex-boards/litex_boards/partner/platforms/trellisboard.py, I forced the LiteDRAM data port to a data width matching the 'mem_axi' data width of the Rocket variant under test, enabling a native point-to-point AXI link between Rocket and LiteDRAM.
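For reference, the pin subsetting amounts to an edit along the lines of the sketch below. The pin locations and IOStandards shown are illustrative placeholders (not the board's real constraints), and the address/command/clock Subsignals stay as in the stock file; the point is simply that keeping 8, 16, or all 32 of the DQ lines (with matching 'dm'/'dqs_p' subsets) yields a 64-, 128-, or 256-bit LiteDRAM native port, respectively.

```python
from litex.build.generic_platform import Subsignal, Pins, IOStandard

# Hypothetical narrowed 'ddram' pad definition for trellisboard.py (pin locations
# and IOStandards are placeholders, NOT the board's real constraints). Keeping 16
# of the 32 'dq' lines, with matching 'dm' and 'dqs_p' subsets, yields a 128-bit
# LiteDRAM native data port instead of the default 256-bit one.
_ddram_128bit = [
    ("ddram", 0,
        # ... address/command/clock Subsignals unchanged from the stock file ...
        Subsignal("dm",    Pins("A1 A2")),                            # 2 of the 4 DM pins
        Subsignal("dq",    Pins("B1 B2 B3 B4 B5 B6 B7 B8 "
                                "C1 C2 C3 C4 C5 C6 C7 C8")),          # 16 of the 32 DQ pins
        Subsignal("dqs_p", Pins("D1 D2"), IOStandard("SSTL135D_I")),  # 2 of the 4 DQS pairs
        IOStandard("SSTL135_I"),
    ),
]
```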

I built a fourth bitstream, in which the data width gap between the default 'mem_axi' (64-bit) and the default trellisboard LiteDRAM port (256-bit) was bridged using the Wishbone data width converter. This is presumably inefficient, and a native AXI converter might do a (much?) better job (to be determined by benchmarking once one is implemented). However, I wanted to throw it in as an additional data point for comparison.
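For completeness, the bridged variant is wired roughly as in the sketch below. This is a simplified illustration, not the actual LiteX integration code; the interface parameters and base address are placeholders. Rocket's 64-bit 'mem_axi' first goes through an AXI-to-Wishbone bridge and then through LiteX's Wishbone data-width converter up to the 256-bit LiteDRAM port.

```python
from migen import Module
from litex.soc.interconnect import axi, wishbone

# Simplified sketch of the 64-bit <-> 256-bit "bridged" variant (not the actual
# LiteX integration code; widths, id_width and base address are illustrative).
class MemAXIWishboneBridge(Module):
    def __init__(self):
        # 64-bit AXI master port standing in for Rocket's 'mem_axi'.
        self.mem_axi = axi.AXIInterface(data_width=64, address_width=32, id_width=4)
        # 64-bit Wishbone bus on the CPU side, 256-bit bus facing the LiteDRAM port.
        wb_cpu = wishbone.Interface(data_width=64)
        self.wb_dram = wb_dram = wishbone.Interface(data_width=256)
        # AXI -> Wishbone bridge on the CPU side.
        self.submodules += axi.AXI2Wishbone(self.mem_axi, wb_cpu, base_address=0)
        # Wishbone data-width conversion (64-bit up to 256-bit) in front of LiteDRAM.
        self.submodules += wishbone.Converter(wb_cpu, wb_dram)
```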

All CPUs are otherwise identical, differing only in the AXI data width (or in the presence of a Wishbone-based width converter). The system clock runs at 65MHz in all four cases.

I then loaded a BBL-wrapped Linux kernel with an embedded, BusyBox-based initrd. The BBL-embedded DT tells Linux that 128MB of RAM are available, which works because that never exceeds the physical memory actually exposed by the trellisboard LiteDRAM (minimum 256MB at 64-bit width, maximum 1GB at 256-bit width).

On top of this setup, I ran Coremark, Linpack (single and double precision), and NBench (integer and fp performance as a fraction of a 90MHz Pentium, and memory, integer, and fp performance as a fraction of a 233MHz AMD K6).

NOTE1: All CPU variants tested use BBL-based FPU emulation, so the fp benchmark results are simply an indirect measure of integer performance, via the execution of BBL's fp emulation trap handler!

NOTE2: Rocket's 'WithMedCores' is designed to fit on the ecp5versa and has a relatively small L1 cache. Performance would almost certainly improve more from a larger L1 cache than from widening the L1 <-> LiteDRAM connection (and shortening the cache-line burst accordingly). I did not vary the L1 cache size for this test, neither increasing it nor decreasing it (the latter might have brought the first-order effect of the L1 <-> LiteDRAM data width into sharper focus!).

NOTE3: Yosys/Trellis/Nextpnr were used to build the trellisboard bitstreams. The 256-bit variant would consistently fail memtest at BIOS boot, and failed both the main clock and Ethernet clock timing requirements during P&R. Adding '-nowidelut' to the Yosys synthesis step fixed that, so I re-ran the tests on bitstreams built with '-nowidelut' for the narrower variants as well. Performance appears to be unaffected by whether '-nowidelut' was used during the bitstream build, which makes sense -- it's the same RTL running at the same clock speed, after all!

The results are shown below:

Data   CoreMark     LinPack              NBench              Remarks
Width           -------------  ---------------------------
(AXI)           single double  p5i   p5f   k6m   k6i   k6f

 64   46.307016 43.128 27.748 0.295 0.002 0.071 0.075 0.001
 64   46.544101 43.202 27.607 0.269 0.002 0.070 0.065 0.001
 64   46.594704 42.691 27.983 0.294 0.002 0.071 0.075 0.001
 64   44.696067 41.410 27.956 0.294 0.002 0.071 0.075 0.001 'nowidelut'
 64   46.346362 43.213 27.951 0.294 0.002 0.071 0.075 0.001 'nowidelut'
 64   46.565774 43.235 27.277 0.293 0.002 0.071 0.075 0.001 'nowidelut'

128   47.058824 47.185 30.716 0.310 0.003 0.076 0.078 0.001
128   45.620438 47.266 30.737 0.310 0.003 0.076 0.078 0.001
128   45.627376 47.264 30.409 0.309 0.003 0.076 0.078 0.001
128   47.329810 47.271 30.725 0.310 0.003 0.076 0.078 0.001 'nowidelut'
128   47.303690 47.235 30.713 0.303 0.003 0.075 0.076 0.001 'nowidelut'
128   47.303690 47.250 30.727 0.309 0.003 0.076 0.078 0.001 'nowidelut'

256   45.672528 47.922 30.842 0.313 0.003 0.077 0.079 0.001 'nowidelut'
256   47.442081 47.919 28.864 0.311 0.003 0.076 0.079 0.001 'nowidelut'
256   47.389622 47.875 30.745 0.313 0.003 0.077 0.079 0.001 'nowidelut'

 WB   35.440047 17.483  8.927 0.150 0.001 0.032 0.043 0.000 64<->256 via WB
 WB   35.402407 17.377  8.991 0.150 0.001 0.032 0.043 0.000 64<->256 via WB
 WB   35.396142 17.374  8.993 0.150 0.001 0.032 0.042 0.000 64<->256 via WB

Legend:
  p5i: integer performance as fraction of 90MHz Pentium
  p5f: floating-point performance as fraction of 90MHz Pentium

  k6m: memory performance as fraction of 233MHz AMD-K6
  k6i: integer performance as fraction of 233MHz AMD-K6
  k6f: floating-point performance as fraction of 233MHz AMD-K6

Analysis: A wider AXI link between Rocket's L1 cache and LiteDRAM is clearly beneficial, at least when going from 64-bit to 128-bit width.

The NBench k6m column increases by about 7% when going from 64-bit to 128-bit width, and by roughly one more percent when going to 256-bit. The other benchmarks show similar improvements.

I have not tested the effects on a system with NO L1 cache (the benefit of a wider link might be larger there, since it would not be obscured by the cache).

It would also be very interesting to see where on this spectrum we'd end up with a native AXI-only data width converter.

gsomlo commented 4 years ago

resolved by commit #321