Dear @cousteaulecommandant - for us it was important to keep the bus complexity small, as we plan to use the two channels for FLASH-to-MEM and MEM-to-MEM copies; the former will be much less frequent than the latter, so performance won't suffer too much.
However, your scenario is also nice, so you could make a PR where the `dma_subsystem` keeps the current behavior by default (with the DMA number of masters at 3), but if the hjson requests another configuration, e.g. `dma_memory_ports_sharing: false` (`true` by default), then the `dma_subsystem` could expose as many ports as `num_channels*3` - of course you would also have to increase `N_MASTERS` in the `core_v_mini_mcu_pkg` file (see the sketch below).
Feel free to make the PR
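For illustration, a minimal hjson sketch of that override (the key name `dma_memory_ports_sharing` is the one proposed above; the surrounding structure and the other keys are assumptions, not the actual mcu_cfg.hjson layout):

```hjson
// hypothetical excerpt of mcu_cfg.hjson -- structure assumed for illustration
dma: {
    num_channels: 0x2
    // true (default): all channels share the existing 3 master ports
    // false: the dma_subsystem exposes num_channels*3 master ports,
    //        and N_MASTERS in core_v_mini_mcu_pkg must grow to match
    dma_memory_ports_sharing: false
}
```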
I see.
I'm not quite sure in which scenarios DMA "port sharing" would make sense.

If the bus is configured as `onetoM`, it is clear that a multi-port DMA isn't going to do anything useful, since only one transaction can happen at a time. But then again, a multi-port DMA connected to a `onetoM` bus would be multiplexing all the ports of the DMA inside the bus, as opposed to multiplexing them inside the `dma_subsystem` module, so from my understanding it would make no difference -- we're either making the bus mux more complex or moving that complexity to another module. (Although maybe the way transactions are interleaved changes depending on where the muxing happens; I'm not sure what the scheduling method for the system bus is.)

If the bus is configured as `NtoM`, it's because we want higher bus performance at the cost of higher complexity; and if we have instantiated multiple DMA channels, it's because we want higher DMA capabilities, isn't it? (Then again, perhaps one may want "high performance, but not that high; it may be OK to have N=5 but not N=14 masters".)

Overall, I was failing to see a situation in which we would want multiple DMA channels that can handle multiple concurrent DMA transactions but not parallel ones, so I was wondering which use case would benefit from this. Am I correct to understand that this was meant primarily for DMA transactions with a very low throughput, which may stall for several clock cycles to transmit each word?
Correct, but if one DMA channel writes every (let's say) 100 cycles because it is reading/writing through the SPI, and another DMA channel writes every cycle, the question is: can I accept a single cycle of stall every 100 in order to keep the number of masters at 3 instead of 6? My guess was yes, so we set it that way. But again, if your scenario has multiple channels writing to the memory in parallel, then let's implement it, as it is indeed a useful scenario - so I am waiting for the PR :)
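To put that tradeoff in numbers (a rough sketch assuming the SPI-bound channel claims the shared port for exactly one cycle out of every 100):

$$\text{fast-channel throughput} \ge \frac{100 - 1}{100} = 0.99\ \text{words/cycle}$$

i.e. about a 1% slowdown on the fast channel, in exchange for keeping 3 bus masters instead of 6.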
Dear @cousteaulecommandant, I'm glad to hear that you find the new DMA interesting! One of the goals of this project was to tackle the exact limitation you pointed out, i.e. the memory bandwidth. I can confirm that there will be a commit coming soon that will introduce, among other features, a multi-master system that solves the issues you pointed out. It will be possible to tune the DMA subsystem using parameters in the DMA field in mcu_cfg.hjson. They will enable you to choose:

- the number of DMA channels;
- the number of master ports;
- the number of channels assigned to each master port.
This last parameter seems odd, but it has been introduced to add flexibility to the configuration.
e.g. with 4 channels and 2 master ports, two arrangements are possible:

1) Allocate 1 port for every 2 channels -> at most 2 channels per port
2) Allocate 1 port for 3 channels and 1 port for the remaining one -> 3 channels per port
According to these parameters, suitable crossbars will be instantiated to manage the N-to-M flow. It is of course possible to have N channels and N master ports, in which case no crossbars are instantiated. This amount of flexibility will be leveraged to evaluate area/performance tradeoffs.
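A minimal sketch of how those parameters might look in the DMA field of mcu_cfg.hjson (the key names below are illustrative guesses; the actual names may differ in the committed version):

```hjson
// hypothetical DMA field of mcu_cfg.hjson -- key names assumed
dma: {
    num_channels: 0x4                  // 4 DMA channels
    num_master_ports: 0x2              // 2 master ports toward the bus
    num_channels_per_master_port: 0x2  // arrangement 1: 2 channels per port
    // setting this to 0x3 instead would give arrangement 2: 3 channels
    // on one port and the remaining channel on the other
}
```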
Without the multi-master capabilities, there are two cases, in my opinion, in which a multi-channel system can still bring advantages:
Finally, with this new commit there will be an updated and improved documentation that will explain in detail the features introduced with the new DMA subsystem.
I'm happy to expand on any of the points I have made if needed!
Dear @TommiTerza, that is great news! It sounds like a great improvement to the multichannel DMA that could boost the performance of X-HEEP based systems.
I was wondering: is a similar feature planned for the peripheral subsystem (which, I just realized, has a single port connecting it to the OBI bus)? Or does this effort focus on externally connected peripherals, with the "internal" peripheral subsystem meant only for simple, low-bandwidth peripherals?
Dear @cousteaulecommandant, for now we have only implemented the multi-master feature for the DMA, because of its crucial role in memory-intensive applications. At the moment there are no plans that I know of to extend multi-master capabilities to other X-HEEP domains.
Just to clarify, I meant multi-slave rather than multi-master for the peripheral subsystem (or "multi-port" in general), so that each DMA channel could drive an individual peripheral. But I suppose that's beyond the scope of the peripheral subsystem.
For now I'll close this since it's already been answered. Thanks!
I have seen that there was a recent commit on X-HEEP (#517) which added multichannel DMA capabilities. This sounds very interesting to me, since one of the issues I have with X-HEEP is its memory bandwidth limitation, and having multiple DMAs operating in parallel would solve it: I could use multiple DMAs to move data from multiple memory blocks to multiple peripherals in parallel (provided that I use an `NtoM` bus configuration).

However, I have noticed that the new `dma_subsystem` exposes only one "channel" on its bus interface (ch0), and internally multiplexes all the DMA channels (`dma` instances) into a single bus port. Therefore, this multichannel DMA can only move one 32-bit word at a time, and if multiple channels are active at the same time, they must take turns to access the bus and the memory blocks / peripherals (even if the bus is configured as `NtoM` and there are multiple memory blocks).

My questions: