enjoy-digital / litedram

Small footprint and configurable DRAM core

Improve crossbar's interface to avoid bottleneck when accessing different banks from the same port. #209

Open enjoy-digital opened 4 years ago

enjoy-digital commented 4 years ago

Identified bottleneck:

The VexRiscv SMP cluster is directly connected to LiteDRAM through 2x 128-bit Instruction/Data native LiteDRAM ports.


During initial tests with VexRiscv SMP, a bottleneck in LiteDRAM's crossbar was identified:

The current BankMachine lock mechanism provides a simple way to avoid data buffering in the crossbar while also preserving transaction ordering, but it is now limiting performance and should be improved.
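
To make the cost concrete, below is a minimal Migen sketch of how such a lock behaves, assuming a simple per-port in-flight counter; this is an illustration of the concept only, not LiteDRAM's actual crossbar code, and all names (PortBankLock, cmd_issue, xfer_done, same_bank) are hypothetical:

from migen import *

class PortBankLock(Module):
    # Conceptual sketch (not LiteDRAM's actual code): while a port has
    # transfers in flight on one bank, it may not send commands to another
    # bank, so data naturally returns in order without crossbar buffering.
    def __init__(self):
        self.cmd_issue   = Signal()  # a command is accepted this cycle
        self.xfer_done   = Signal()  # the matching data transfer completes
        self.same_bank   = Signal()  # next command targets the locked bank
        self.cmd_allowed = Signal()

        # # #

        in_flight = Signal(4)
        self.sync += [
            If(self.cmd_issue & ~self.xfer_done,
                in_flight.eq(in_flight + 1)
            ).Elif(~self.cmd_issue & self.xfer_done,
                in_flight.eq(in_flight - 1)
            )
        ]
        # Switching banks must wait until in_flight reaches 0; alternating
        # accesses between two banks therefore stall on every switch.
        self.comb += self.cmd_allowed.eq((in_flight == 0) | self.same_bank)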

Reproducing the issue with VexRiscv SMP:

In https://github.com/enjoy-digital/litex_vexriscv_smp, apply the following patch to crt0.S:

boot_helper:
    li a0, 0x40000000   # base of main RAM
    li s0, 0x800        # stride chosen so a0 and a1 map to different banks
    add a1, a0, s0
    add a2, a1, s0      # a2..a4 are computed but unused by the loop
    add a3, a2, s0
    add a4, a3, s0
loop_me:
    sw x0, (a0)         # alternate stores between the two banks: with the
    sw x0, (a1)         # current lock, every bank switch stalls the port
    sw x0, (a0)
    sw x0, (a1)
    sw x0, (a0)
    sw x0, (a1)
    sw x0, (a0)
    sw x0, (a1)
    j loop_me           # loop forever; the stores below are never reached
    sw x10, smp_lottery_args  , x14
    sw x11, smp_lottery_args+4, x14

Then run the simulation with traces enabled (--trace); the bottleneck can be observed by looking at the native LiteDRAM port between the VexRiscv SMP cluster and LiteDRAM.

Proposed solution:

To remove this bottleneck, the lock mechanism should probably be removed and other mechanisms introduced for writes and reads:

Write path:

For the write path, each port could maintain cmd_idx and pending_xfers values (up to a configurable N) and update them on each write; one possible scheme is sketched below.
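
As a starting point, here is a minimal Migen sketch of such a write tracker, assuming the mechanism is a simple credit counter; the module and signal names (WritePendingTracker, cmd_issue, data_done, stall) are hypothetical and not part of LiteDRAM's API:

from migen import *

class WritePendingTracker(Module):
    # Hypothetical sketch: allow up to n outstanding writes per port
    # instead of locking the BankMachine (n assumed to be a power of two
    # so cmd_idx wraps naturally).
    def __init__(self, n=8):
        self.cmd_issue = Signal()       # a write command is accepted this cycle
        self.data_done = Signal()       # a write's data phase completes this cycle
        self.stall     = Signal()       # port must wait: n transfers already pending
        self.cmd_idx   = Signal(max=n)  # tag attached to each issued command

        # # #

        pending_xfers = Signal(max=n + 1)
        self.sync += [
            If(self.cmd_issue, self.cmd_idx.eq(self.cmd_idx + 1)),
            If(self.cmd_issue & ~self.data_done,
                pending_xfers.eq(pending_xfers + 1)
            ).Elif(~self.cmd_issue & self.data_done,
                pending_xfers.eq(pending_xfers - 1)
            )
        ]
        self.comb += self.stall.eq(pending_xfers == n)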

Read path:

For the read path, each port could maintain cmd_idx, return_idx and pending_xfers values (up to a configurable N) and update them on each read; one possible scheme is sketched below.
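
Similarly, a minimal sketch for the read side, assuming data must be returned to the port in issue order; again, all names are hypothetical. Data coming back out of order from different banks would additionally need a small reorder buffer indexed by the tag, and stall bounds its required depth to N:

from migen import *

class ReadPendingTracker(Module):
    # Hypothetical sketch: tag each read command with cmd_idx and release
    # data to the port only when its tag equals return_idx, so responses
    # from different banks are presented in issue order.
    def __init__(self, n=8):
        self.cmd_issue   = Signal()       # a read command is accepted this cycle
        self.data_return = Signal()       # in-order read data is returned this cycle
        self.stall       = Signal()       # n transfers already pending
        self.cmd_idx     = Signal(max=n)  # tag for the next command
        self.return_idx  = Signal(max=n)  # tag the port expects next

        # # #

        pending_xfers = Signal(max=n + 1)
        self.sync += [
            If(self.cmd_issue,   self.cmd_idx.eq(self.cmd_idx + 1)),
            If(self.data_return, self.return_idx.eq(self.return_idx + 1)),
            If(self.cmd_issue & ~self.data_return,
                pending_xfers.eq(pending_xfers + 1)
            ).Elif(~self.cmd_issue & self.data_return,
                pending_xfers.eq(pending_xfers - 1)
            )
        ]
        self.comb += self.stall.eq(pending_xfers == n)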

cc @jedrzejboczar, @dolu1990, @kgugala.

Dolu1990 commented 4 years ago

If I remember correctly, the lock was reducing bandwidth by 75%. In practice, when 4 CPUs were doing busy work, it really hurt performance.