SpinalHDL / VexRiscv

An FPGA-friendly 32-bit RISC-V CPU implementation

Question regarding APB MMIO cycles #128

Closed · wenwang3122 closed 4 years ago

wenwang3122 commented 4 years ago

Hi!

The following question is not about a bug in the code; it is more of an IO-overhead question regarding my software-hardware co-design experiments based on the Murax SoC. I am not sure whether it is suitable to post such a question here, so please feel free to let me know if it is not.

Here is the background: I am using the Murax SoC as a platform for software-hardware co-design, and the hardware accelerators are added as APB peripherals. I have attached a sample diagram below (murax-co-design-sample-diagram.png).

[Figure: murax-co-design-sample-diagram.png]

In this setup, the software-hardware interface overhead is critical to the overall performance. During my experiments, I found that:

  1. Typically, one APB write takes 3 clock cycles while one APB read takes 4 clock cycles. There does not seem to be a way to reduce these cycle counts further.

Suppose data_in is an MMIO register; the code is as follows: data_in[0] = variable_reg;

  2. When an APB write is combined with a memory read, or an APB read is combined with a memory write, as follows (suppose data_out is also an MMIO register):

data_in[0] = variable_array[i]; result_array[i] = data_out[0];

Then the cycle count for an APB write increases from 3 to 6/8/9 cycles (depending on the scenario), and for an APB read from 4 to 8/9/10 cycles (depending on the scenario). The increase (APB write/read alone vs. APB write/read plus memory access) exceeds the cycle count of the memory access alone (based on my benchmarks). A minimal sketch of the access pattern I am measuring is shown right after this list.
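
For reference, here is a minimal sketch of the access pattern I am measuring. The base address and register offsets below are placeholders; the real mapping comes from the APB3 decoder configuration of my accelerator.

```c
#include <stdint.h>

/* Placeholder base address and offsets; the real values depend on the
   APB3 decoder mapping of the accelerator in the Murax SoC. */
#define ACCEL_BASE 0xF00F0000u
#define data_in  ((volatile uint32_t *)(ACCEL_BASE + 0x00))
#define data_out ((volatile uint32_t *)(ACCEL_BASE + 0x04))

void run_accelerator_mmio(const uint32_t *variable_array,
                          uint32_t *result_array, int n)
{
    for (int i = 0; i < n; i++) {
        /* APB write combined with a memory read (6/8/9 cycles observed). */
        data_in[0] = variable_array[i];
        /* APB read combined with a memory write (8/9/10 cycles observed). */
        result_array[i] = data_out[0];
    }
}
```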

Here are my questions:

  1. Do you have any ideas about the cycle overhead?
  2. Or do you have any recommendations about how to do the software-hardware co-design based on VexRiscv in a different setup to reduce the IO overhead?
  3. Would switching to Briey SoC (with Dcache, Icache, MMU, AXI bus) help reduce the IO overhead (although the interface with the peripherals would still be APB)?

Thank you very much! Wen

Dolu1990 commented 4 years ago

I am not sure whether it is suitable to post such a question here

No worries, that's fine ^^

Do you have any ideas about the cycle overhead?

Hmm, I would need a waveform to figure out the timings. If you want, I can point out some key signals? It might be in part because the instruction bus is accessing the bus at the same time, which creates some conflicts. The data bus and the instruction bus share the same bus to access the peripherals and the RAM, see "val mainBusArbiter = new MuraxMasterArbiter(pipelinedMemoryBusConfig)"; your diagram isn't accurate on that point. The topology is (dBus + iBus) -> mainBusArbiter -> (ram + peripheral).

Or do you have any recommendations about how to do the software-hardware co-design based on VexRiscv in a different setup to reduce the IO overhead?

Yes. Basically, you could add a custom VexRiscv instruction to feed the accelerator directly with CPU data. Otherwise, less optimal but still better than nothing, you could avoid APB and directly define a SimpleBus peripheral. A rough sketch of what the custom-instruction path could look like from the software side follows.
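
To give a rough idea, here is a minimal, hypothetical sketch of the software side, assuming a custom R-type instruction on the custom-0 major opcode decoded by a VexRiscv plugin. The funct3/funct7 values and the accel_step name are made up; the real encoding is whatever your plugin implements.

```c
#include <stdint.h>

/* Hypothetical custom R-type instruction on the custom-0 major opcode (0x0b),
   funct3 = 0, funct7 = 0. Emitted with the GNU assembler's .insn directive,
   so no intrinsic or modified compiler is needed. */
static inline uint32_t accel_step(uint32_t a, uint32_t b)
{
    uint32_t result;
    __asm__ volatile (".insn r 0x0b, 0x0, 0x00, %0, %1, %2"
                      : "=r"(result)
                      : "r"(a), "r"(b));
    return result;
}

/* Same loop as the MMIO version, but the operands travel through CPU
   registers instead of APB transactions. */
void run_accelerator_insn(const uint32_t *variable_array,
                          uint32_t *result_array, int n)
{
    for (int i = 0; i < n; i++)
        result_array[i] = accel_step(variable_array[i], 0);
}
```

In this sketch a single instruction replaces both the APB write and the APB read of the MMIO loop.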

Would switching to Briey SoC (with Dcache, Icache, MMU, AXI bus) help reduce the IO overhead

Yes and no. Yes, because it would avoid the shared memory bus issue; no, because the way the d$ accesses MMIO is less direct and has additional penalties compared to the cache-less design.

I think the very best would really be to use a custom instruction to drive some stream of data, if that's an option for you :)

wenwang3122 commented 4 years ago

Thank you for the feedback! That helps a lot in understanding the experimental results, and I now have a pretty good idea of how to optimize the design.