enjoy-digital / litedram

Small footprint and configurable DRAM core

Improve crossbar's interface to avoid bottleneck when accessing different banks from the same port. #209

Open enjoy-digital opened 4 years ago

enjoy-digital commented 4 years ago

Identified bottleneck:

The VexRiscv SMP cluster is directly connected to LiteDRAM through 2x 128-bit Instruction/Data native LiteDRAM ports.


During initial tests with VexRiscv SMP, a bottleneck in LiteDRAM's crossbar was identified:

The current BankMachine lock mechanism provides a simple way to avoid data buffering in the crossbar while also preserving transaction ordering, but it is now limiting performance and should be improved.
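
To make the cost concrete, below is a minimal Migen sketch of how such a lock behaves, assuming a simple per-port in-flight counter; this is an illustration of the concept only, not LiteDRAM's actual crossbar code, and all names (PortBankLock, cmd_issue, xfer_done, same_bank) are hypothetical:

from migen import *

class PortBankLock(Module):
    # Conceptual sketch (not LiteDRAM's actual code): while a port has
    # transfers in flight on one bank, it may not send commands to another
    # bank, so data naturally returns in order without crossbar buffering.
    def __init__(self):
        self.cmd_issue   = Signal()  # a command is accepted this cycle
        self.xfer_done   = Signal()  # the matching data transfer completes
        self.same_bank   = Signal()  # next command targets the locked bank
        self.cmd_allowed = Signal()

        # # #

        in_flight = Signal(4)
        self.sync += [
            If(self.cmd_issue & ~self.xfer_done,
                in_flight.eq(in_flight + 1)
            ).Elif(~self.cmd_issue & self.xfer_done,
                in_flight.eq(in_flight - 1)
            )
        ]
        # Switching banks must wait until in_flight reaches 0; alternating
        # accesses between two banks therefore stall on every switch.
        self.comb += self.cmd_allowed.eq((in_flight == 0) | self.same_bank)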

Reproducing the issue with VexRiscv SMP:

In https://github.com/enjoy-digital/litex_vexriscv_smp, apply the following patch to crt0.S:

boot_helper:
    li a0, 0x40000000   # base of main RAM
    li s0, 0x800        # stride chosen so a0 and a1 map to different banks
    add a1, a0, s0
    add a2, a1, s0      # a2..a4 are computed but unused by the loop
    add a3, a2, s0
    add a4, a3, s0
loop_me:
    sw x0, (a0)         # alternate stores between the two banks: with the
    sw x0, (a1)         # current lock, every bank switch stalls the port
    sw x0, (a0)
    sw x0, (a1)
    sw x0, (a0)
    sw x0, (a1)
    sw x0, (a0)
    sw x0, (a1)
    j loop_me           # loop forever; the stores below are never reached
    sw x10, smp_lottery_args  , x14
    sw x11, smp_lottery_args+4, x14

Then run the simulation with traces enabled (--trace); the bottleneck can be observed by looking at the native LiteDRAM port between the VexRiscv SMP cluster and LiteDRAM.

Proposed solution:

To remove this bottleneck, the lock mechanism should probably be removed and other mechanisms introduced for writes and reads:

Write path:

For the write path, each port could maintain cmd_idx and pending_xfers values (up to a configurable N) and update them on each write; one possible scheme is sketched below.
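
As a starting point, here is a minimal Migen sketch of such a write tracker, assuming the mechanism is a simple credit counter; the module and signal names (WritePendingTracker, cmd_issue, data_done, stall) are hypothetical and not part of LiteDRAM's API:

from migen import *

class WritePendingTracker(Module):
    # Hypothetical sketch: allow up to n outstanding writes per port
    # instead of locking the BankMachine (n assumed to be a power of two
    # so cmd_idx wraps naturally).
    def __init__(self, n=8):
        self.cmd_issue = Signal()       # a write command is accepted this cycle
        self.data_done = Signal()       # a write's data phase completes this cycle
        self.stall     = Signal()       # port must wait: n transfers already pending
        self.cmd_idx   = Signal(max=n)  # tag attached to each issued command

        # # #

        pending_xfers = Signal(max=n + 1)
        self.sync += [
            If(self.cmd_issue, self.cmd_idx.eq(self.cmd_idx + 1)),
            If(self.cmd_issue & ~self.data_done,
                pending_xfers.eq(pending_xfers + 1)
            ).Elif(~self.cmd_issue & self.data_done,
                pending_xfers.eq(pending_xfers - 1)
            )
        ]
        self.comb += self.stall.eq(pending_xfers == n)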

Read path:

For the read path, each port could maintain cmd_idx, return_idx and pending_xfers values (up to a configurable N) and update them on each read; one possible scheme is sketched below.
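
Similarly, a minimal sketch for the read side, assuming data must be returned to the port in issue order; again, all names are hypothetical. Data coming back out of order from different banks would additionally need a small reorder buffer indexed by the tag, and stall bounds its required depth to N:

from migen import *

class ReadPendingTracker(Module):
    # Hypothetical sketch: tag each read command with cmd_idx and release
    # data to the port only when its tag equals return_idx, so responses
    # from different banks are presented in issue order.
    def __init__(self, n=8):
        self.cmd_issue   = Signal()       # a read command is accepted this cycle
        self.data_return = Signal()       # in-order read data is returned this cycle
        self.stall       = Signal()       # n transfers already pending
        self.cmd_idx     = Signal(max=n)  # tag for the next command
        self.return_idx  = Signal(max=n)  # tag the port expects next

        # # #

        pending_xfers = Signal(max=n + 1)
        self.sync += [
            If(self.cmd_issue,   self.cmd_idx.eq(self.cmd_idx + 1)),
            If(self.data_return, self.return_idx.eq(self.return_idx + 1)),
            If(self.cmd_issue & ~self.data_return,
                pending_xfers.eq(pending_xfers + 1)
            ).Elif(~self.cmd_issue & self.data_return,
                pending_xfers.eq(pending_xfers - 1)
            )
        ]
        self.comb += self.stall.eq(pending_xfers == n)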

cc @jedrzejboczar, @dolu1990, @kgugala.

Dolu1990 commented 4 years ago

If I remember correctly, the lock was reducing bandwidth by 75%. In practice, when 4 CPUs were doing busy work, it really hurt performance.