Assuming a dedicated Rocket mem_axi <-> LiteDRAM data port link, and a
direct, "native" point-to-point AXI connection between the two, will an
increased AXI data_width improve performance?
Given that Rocket transfers entire cache lines over 'mem_axi', would
shorter, wider bursts actually improve performance as compared to longer,
narrower ones?
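For a sense of scale (an illustrative sketch only, assuming Rocket's default
64-byte cache line), the refill burst length shrinks in direct proportion to
the AXI data width:

    # Beats per cache-line refill over mem_axi, assuming a 64-byte
    # (512-bit) Rocket cache line -- a back-of-the-envelope sketch.
    CACHE_LINE_BITS = 64 * 8
    for axi_width in (64, 128, 256):
        beats = CACHE_LINE_BITS // axi_width
        print(f"{axi_width:>3}-bit mem_axi: {beats}-beat burst per cache line")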
I built Rocket LiteX/Linux variants using "WithMedCores", with 'mem_axi'
data width set to three different values: standard (64-bit), wide (128-bit)
and double-wide (256-bit). By picking subsets of ('dm', 'dq', 'dqs_p') pins
in litex-boards/litex_boards/partner/platforms/trellisboard.py, I forced
the LiteDRAM data port to have a data width matching the 'mem_axi' data
width of the Rocket version under test, thus enabling native point-to-point
AXI linkage between Rocket and LiteDRAM.
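To illustrate the pin-subsetting trick (a sketch only -- not the literal
trellisboard.py contents, and the pin names below are placeholders): keeping
8 of the 32 'dq' pins, with the matching 'dm'/'dqs_p' subset, narrows the
DDR3 bus to 8 bits, which at DDR3's 1:4 clock ratio yields a 64-bit LiteDRAM
native port.

    # Sketch of a narrowed DDR3 pin list (placeholder pin names and
    # IOStandard; the real trellisboard.py entries are longer).
    from litex.build.generic_platform import Subsignal, Pins, IOStandard

    ddram_narrow = [
        ("ddram", 0,
            # ... address/command/clock pins unchanged ...
            Subsignal("dm",    Pins("P1")),                       # 1 of 4
            Subsignal("dq",    Pins("P2 P3 P4 P5 P6 P7 P8 P9")),  # 8 of 32
            Subsignal("dqs_p", Pins("P10")),                      # 1 of 4
            IOStandard("SSTL135_I"),
        ),
    ]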
I built a fourth bitstream, where the data width gap between the default
'mem_axi' (64-bit) and the default trellisboard LiteDRAM port (256-bit)
was bridged using the Wishbone data width converter. This is presumably
inefficient, and a native AXI converter might do a (much?) better job
(to be determined by some benchmarking upon implementation). However, I
wanted to throw it in as an additional data point for comparison.
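My understanding of how that bridged path hangs together (a hand-written
sketch built on LiteX's wishbone.Converter, not the exact code LiteX
generates):

    from migen import Module
    from litex.soc.interconnect import wishbone

    class WidthBridge(Module):
        def __init__(self):
            # 64-bit Wishbone bus on the CPU side (mem_axi sits behind an
            # AXI->Wishbone bridge), 256-bit bus toward the LiteDRAM port.
            self.wb_cpu  = wishbone.Interface(data_width=64)
            self.wb_dram = wishbone.Interface(data_width=256)
            # Converter instantiates an up- or down-converter as needed.
            self.submodules.converter = wishbone.Converter(self.wb_cpu,
                                                           self.wb_dram)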
All CPUs are otherwise identical, except for the AXI data width (or for
the presence of a Wishbone-based width converter). The system clock runs
at 65MHz in all four cases.
I then loaded a BBL-wrapped Linux kernel with an embedded, Busybox-based
initrd. The BBL-embedded DT is configured to tell Linux that 128MB of RAM
is available, which works because it never exceeds the actual physical
memory exposed by the trellisboard LiteDRAM (minimum 256MB at 64-bit
width, maximum 1GB at 256-bit width).
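The scaling behind those figures is just arithmetic (assuming capacity grows
linearly with the dq subset, which matches the 256MB and 1GB endpoints
above):

    # Relates the dq subset to the LiteDRAM port width and the physical
    # memory seen by the BIOS, from the figures quoted above (1GB behind
    # the full 32-bit DDR3 bus).
    FULL_DQ_BITS, FULL_CAPACITY_MB = 32, 1024
    for dq in (8, 16, 32):
        port_width  = dq * 8                        # DDR3 at a 1:4 ratio
        capacity_mb = FULL_CAPACITY_MB * dq // FULL_DQ_BITS
        print(f"{dq} dq bits -> {port_width}-bit port, {capacity_mb} MB")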
On top of this setup, I ran Coremark, Linpack (single and double precision),
and NBench (integer and fp performance as a fraction of a 90MHz Pentium,
and memory, integer, and fp performance as a fraction of a 233MHz AMD K6).
NOTE1: All CPU variants tested use BBL-based FPU emulation, so fp benchmark
results are simply an indirect measure of integer performance, via the
execution of BBL's fp emulation trap handler!
NOTE2: Rocket's 'WithMedCores' is designed to fit on the ecp5versa, and has
a relatively small L1 cache. Performance would almost certainly be improved
more by a larger L1 cache than by increasing the L1 <-> LiteDRAM connection
width (and shortening the cache-line burst accordingly). I did not vary the
L1 cache size for this test: I neither increased it, nor decreased it (which
might have brought the first-order effect of the L1 <-> LiteDRAM data width
into sharper focus)!
NOTE3: Yosys/Trellis/Nextpnr were used to build the trellisboard bitstreams.
The 256-bit variant would consistently fail memtest on BIOS boot, and
fail both main clock and ethernet clock timing requirements during p&r.
Adding '-nowidelut' to yosys alleviated that, and so I re-ran the tests on
bitstreams built with '-nowidelut' for the narrower variants as well. There
seems to be no difference in performance depending on whether 'nowidelut'
was used during bitstream build, which makes sense -- it's the same RTL
running at the same clock speed, after all!
Analysis: A wider AXI link between Rocket's L1 cache and LiteDRAM is
clearly beneficial, at least when going from 64-bit to 128-bit width.
The NBench k6m column value increases by 7% when going to 128-bit width,
and by a further percentage point when going to 256-bit. This is also reflected
by improved results in the other benchmark tests.
I have not tested the effects on a system with NO L1 cache (the benefits
of a wider link might be greater there, as they would not be obscured by
the cache).
It would also be very interesting to see where on this spectrum we'd end
up with a native AXI-only data width converter.