Architecture: Scatter Gather implementation

Hi guys

For a 2-core Rocket-based Linux system, I need to implement a non-cached, non-paged Scatter-Gather (SG) DMA as part of a RoCC module. The DMA should have high-throughput, low-latency, word-aligned access to a multibank DRAM. The max burst size in SG mode is 8 kB, which maps to 1024 beats (i.e. 8 bytes per beat) and has to be split across multiple AXI4 transactions because AXI4 limits a burst to 256 beats.
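To make the splitting concrete, here is a rough software model of how one SG segment could be cut into AXI4 bursts. It is only a sketch, not RTL: the 8-byte beat width is inferred from 8 kB / 1024 beats, the function name `split_bursts` and the example addresses are made up for illustration, and I am assuming segments are beat-aligned. Besides the 256-beat limit, it also honors the AXI rule that a burst must not cross a 4 KB address boundary.

```python
from typing import List, Tuple

BEAT_BYTES = 8      # assumption: 64-bit data bus (8 kB / 1024 beats)
MAX_BEATS = 256     # AXI4 limit for one INCR burst
BOUNDARY = 4096     # an AXI burst must not cross a 4 KB boundary


def split_bursts(addr: int, length: int) -> List[Tuple[int, int]]:
    """Split one SG segment into (start_address, beats) AXI4 bursts.

    Hypothetical helper: addr and length are in bytes and are assumed
    to be multiples of BEAT_BYTES.
    """
    if addr % BEAT_BYTES or length % BEAT_BYTES:
        raise ValueError("segment must be beat-aligned")
    bursts = []
    while length > 0:
        # Bytes left before the next 4 KB boundary.
        to_boundary = BOUNDARY - (addr % BOUNDARY)
        # Take the smallest of: what is left, the burst limit, the boundary.
        chunk = min(length, MAX_BEATS * BEAT_BYTES, to_boundary)
        bursts.append((addr, chunk // BEAT_BYTES))
        addr += chunk
        length -= chunk
    return bursts


if __name__ == "__main__":
    # An aligned 8 kB segment becomes four 256-beat bursts.
    for start, beats in split_bursts(0x80000000, 8192):
        print(hex(start), beats)
```

With a 2 kB-aligned start address the 8 kB segment becomes exactly four bursts; an unaligned start costs one extra, shorter burst at each end because of the 4 KB rule.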

The result computed by the RoCC module should be shared among multiple Rocket cores. The RoCC output buffer is also at most 8 kB.

What is the best architectural way to implement such a system? There are going to be several long-latency operations (1000s of cycles), and they will stall the issuing core as per the RoCC implementation.

chipsalliance / rocket-chip

Architecture: Scatter Gather implementation #1607