esl-epfl / x-heep

eXtendable Heterogeneous Energy-Efficient Platform based on RISC-V

[Question] Status of Multichannel DMA feature #569

Closed cousteaulecommandant closed 3 months ago

cousteaulecommandant commented 3 months ago

I have seen that there was a recent commit on X-HEEP (#517) which added multichannel DMA capabilities. This is something that sounds very interesting to me, since one of the issues I have with X-HEEP is its memory bandwidth limitation, and having multiple DMAs operating in parallel would solve this issue, since I could use multiple DMAs to move data from multiple memory blocks to multiple peripherals in parallel (provided that I use an NtoM bus configuration). However, I have noticed that the new dma_subsystem contains only one "channel" on its bus interface (ch0), and internally multiplexes all the DMA channels (dma instances) into a single bus port. Therefore, this multichannel DMA can only move one 32-bit word at a time, and if multiple channels are active at the same time, they must take turns to access the bus and the memory blocks / peripherals (even if the bus is configured as NtoM and there are multiple memory blocks).

My questions:

davideschiavone commented 3 months ago

Dear @cousteaulecommandant - for us it was important to keep the bus complexity small, as we plan to use the two channels for FLASH-MEM and MEM-MEM copies. The first kind will be less frequent than the second, so performance won't be too bad.

However, your scenario is also nice, so you can make a PR where by default the dma_subsystem shares its ports (and the number of DMA masters is 3), but if the hjson asks for another configuration, e.g. dma_memory_ports_sharing: false (true by default), then the dma_subsystem could have as many ports as num_channels*3. Of course, you would have to increase N_MASTERS in the core_v_mini_mcu_pkg file.
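A minimal sketch of the port count implied by this proposal, assuming each DMA channel would expose 3 bus ports when sharing is disabled. The function name and its arguments are hypothetical and only mirror the dma_memory_ports_sharing option suggested above; this is not code from the X-HEEP repository.

```python
def dma_master_ports(num_channels: int, ports_sharing: bool = True,
                     ports_per_channel: int = 3) -> int:
    """Return how many bus master ports the DMA subsystem would need.

    Hypothetical helper: with sharing enabled (the proposed default) all
    channels are multiplexed onto one set of ports; with sharing disabled
    every channel gets its own set.
    """
    if num_channels < 1:
        raise ValueError("at least one DMA channel is required")
    if ports_sharing:
        return ports_per_channel
    return num_channels * ports_per_channel


# Two channels: shared ports versus dedicated ports per channel.
print(dma_master_ports(2))                       # shared configuration
print(dma_master_ports(2, ports_sharing=False))  # dedicated configuration
```

Under these assumptions the value returned for the non-shared case is what N_MASTERS in core_v_mini_mcu_pkg would have to grow to accommodate.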

Feel free to make the PR

cousteaulecommandant commented 3 months ago

I see.

I'm not quite sure in which scenarios DMA "port sharing" would make sense.

Overall, I was failing to see in which situation we would want to have multiple DMA channels that can handle multiple concurrent DMA transactions but not parallel transactions; I was wondering which use case would benefit from this. Am I correct to understand that this was meant primarily for DMA transactions with a very low throughput, which may stall for several clock cycles to transmit each word?

davideschiavone commented 3 months ago

Correct, but suppose one DMA channel writes every (let's say) 100 cycles because it is reading/writing from the SPI, and another DMA channel writes every cycle. The question is: can I accept a single stall cycle every 100 in exchange for keeping the number of masters at 3 instead of 6? My guess was yes, so we set it that way. But again, if your scenario has multiple channels writing in parallel to memory, then let's implement it, as it is indeed a useful scenario - so I am waiting for the PR :)
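The trade-off described here can be illustrated with a toy cycle-level model. It is only a sketch under simplifying assumptions that are not taken from the X-HEEP RTL: the shared port moves one word per cycle, the slow channel requests exactly once every slow_period cycles, and it wins arbitration whenever both channels request in the same cycle.

```python
def simulate_shared_port(total_cycles: int, slow_period: int = 100) -> dict:
    """Count transfers and stalls for two channels sharing one bus port.

    Assumptions (illustrative only): the fast channel always has a word
    ready, the slow channel has one ready every `slow_period` cycles, and
    the slow channel is granted the port on a conflict.
    """
    fast_words = 0
    slow_words = 0
    fast_stalls = 0
    for cycle in range(total_cycles):
        if cycle % slow_period == 0:
            # Conflict: the slow channel takes the port, the fast one waits.
            slow_words += 1
            fast_stalls += 1
        else:
            fast_words += 1
    return {
        "fast_words": fast_words,
        "slow_words": slow_words,
        "fast_stalls": fast_stalls,
    }


stats = simulate_shared_port(total_cycles=10_000, slow_period=100)
print(stats)
print("fast channel stall fraction:", stats["fast_stalls"] / 10_000)
```

In this model the fast channel loses one cycle in every hundred, which is the cost being weighed against doubling the number of bus masters. Two fast channels would instead halve each other's throughput, which is the case raised at the start of the thread.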

TommiTerza commented 3 months ago

Dear @cousteaulecommandant, I'm glad to hear that you find the new DMA interesting! One of the goals of this project was to tackle the exact limitation you pointed out, i.e. the memory bandwidth. I can confirm that there will be a commit coming soon that will introduce, among other features, a multi-master system that solves the issues you pointed out. It will be possible to tune the DMA subsystem using parameters in the DMA field in mcu_cfg.hjson. They will enable you to choose:

This last parameter seems odd, but it has been introduced to add flexibility to the configuration.

e.g. With 4 channels and 2 master ports, two arrangements are possible:

1) Allocate 1 port for 2 channels -> 2 channels per port max
2) Allocate 1 port for 3 channels and 1 port for the remaining one -> 3 channels per port

According to these parameters, suitable crossbars will be instantiated to manage the N-to-M flow. It is of course possible to have N channels and N master ports, which will not instantiate any crossbars. This amount of flexibility will be leveraged to evaluate area/performance tradeoffs.
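The 4-channel, 2-port example above can be reproduced with a small allocation sketch. The packing strategy here (fill each port up to the limit while leaving at least one channel for every remaining port) is an assumption chosen to match the two arrangements listed; the function and parameter names are hypothetical and do not come from mcu_cfg.hjson.

```python
def allocate_channels(num_channels: int, num_ports: int,
                      max_channels_per_port: int) -> list[int]:
    """Return the number of DMA channels attached to each master port."""
    if max_channels_per_port < 1:
        raise ValueError("max_channels_per_port must be at least 1")
    if not 1 <= num_ports <= num_channels:
        raise ValueError("need between 1 and num_channels master ports")
    if num_ports * max_channels_per_port < num_channels:
        raise ValueError("not enough port capacity for all channels")

    allocation = []
    remaining = num_channels
    for port in range(num_ports):
        ports_left = num_ports - port - 1
        # Fill this port, but keep one channel for each port still to come.
        take = min(max_channels_per_port, remaining - ports_left)
        allocation.append(take)
        remaining -= take
    return allocation


def needs_crossbar(allocation: list[int]) -> list[bool]:
    """A port serving more than one channel needs a crossbar in front of it."""
    return [channels > 1 for channels in allocation]


for limit in (2, 3):
    alloc = allocate_channels(num_channels=4, num_ports=2,
                              max_channels_per_port=limit)
    print(limit, alloc, needs_crossbar(alloc))
```

With as many ports as channels, every port serves a single channel and needs_crossbar reports no crossbars, matching the N-channels, N-ports case described above.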

Without the multi-master capabilities, there are two cases, in my opinion, in which a multi channel system can still bring advantages:

Finally, with this new commit there will be an updated and improved documentation that will explain in detail the features introduced with the new DMA subsystem.

I'm happy to expand on any of the points I have made if needed!

cousteaulecommandant commented 3 months ago

Dear @TommiTerza, that is great news! It sounds like a great improvement to the multichannel DMA that could boost the performance of X-HEEP-based systems.

I was wondering, is a similar feature planned to be implemented on the peripheral subsystem (which I just realized has a single port to connect to the OBI bus)? Or does this focus on externally connected peripherals, with the "internal" peripheral subsystem meant only for simple, low-bandwidth peripherals?

TommiTerza commented 3 months ago

Dear @cousteaulecommandant, for now we have only implemented multi-master support for the DMA, because of its crucial role in memory-intensive applications. At the moment there are no plans that I know of to extend multi-master capabilities to other X-HEEP domains.

cousteaulecommandant commented 3 months ago

Just to clarify, I meant multi-slave rather than multi-master for the peripheral subsystem (or "multi-port" in general), so that each DMA channel could drive an individual peripheral. But I suppose that's beyond the scope of the peripheral subsystem.

For now I'll close this since it's already been answered. Thanks!