Dear @cousteaulecommandant - for us it was important to keep the bus complexity small, as we plan to use the two channels for FLASH-to-MEM and MEM-to-MEM copies; the former will be much less frequent than the latter, so performance won't suffer too much.
However, your scenario is also nice, so you could make a PR where the `dma_subsystem` keeps the current behavior by default (with the DMA number of masters at 3), but if the hjson requests another configuration, e.g. `dma_memory_ports_sharing: false` (`true` by default), then the `dma_subsystem` could expose as many ports as `num_channels*3` - of course you would also have to increase `N_MASTERS` in the `core_v_mini_mcu_pkg` file (see the sketch below).
Feel free to make the PR
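For illustration, a minimal hjson sketch of that override (the key name `dma_memory_ports_sharing` is the one proposed above; the surrounding structure and the other keys are assumptions, not the actual mcu_cfg.hjson layout):

```hjson
// hypothetical excerpt of mcu_cfg.hjson -- structure assumed for illustration
dma: {
    num_channels: 0x2
    // true (default): all channels share the existing 3 master ports
    // false: the dma_subsystem exposes num_channels*3 master ports,
    //        and N_MASTERS in core_v_mini_mcu_pkg must grow to match
    dma_memory_ports_sharing: false
}
```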
I see.
I'm not quite sure in which scenarios DMA "port sharing" would make sense.

If the bus is configured as `onetoM`, it is clear that a multi-port DMA isn't going to do anything useful, since only one transaction can happen at a time. But then again, a multi-port DMA connected to a `onetoM` bus would be multiplexing all the ports of the DMA inside the bus, as opposed to multiplexing them inside the `dma_subsystem` module, so from my understanding it would make no difference -- we're either making the bus mux more complex or moving that complexity to another module. (Although maybe the way transactions are interleaved changes depending on where the muxing happens; I'm not sure what the scheduling method for the system bus is.)

If the bus is configured as `NtoM`, it's because we want higher bus performance at the cost of higher complexity; and if we have instantiated multiple DMA channels, it's because we want higher DMA capabilities, isn't it? (Then again, perhaps one may want "high performance, but not that high; it may be OK to have N=5 but not N=14 masters".)

Overall, I was failing to see a situation in which we would want multiple DMA channels that can handle multiple concurrent DMA transactions but not parallel ones, so I was wondering which use case would benefit from this. Am I correct to understand that this was meant primarily for DMA transactions with a very low throughput, which may stall for several clock cycles to transmit each word?
Correct, but if one DMA channel writes every (let's say) 100 cycles because it is reading/writing through the SPI, and another DMA channel writes every cycle, the question is: can I accept a single cycle of stall every 100 in order to keep the number of masters at 3 instead of 6? My guess was yes, so we set it that way. But again, if your scenario has multiple channels writing to the memory in parallel, then let's implement it, as it is indeed a useful scenario - so I am waiting for the PR :)
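To put that tradeoff in numbers (a rough sketch assuming the SPI-bound channel claims the shared port for exactly one cycle out of every 100):

$$\text{fast-channel throughput} \ge \frac{100 - 1}{100} = 0.99\ \text{words/cycle}$$

i.e. about a 1% slowdown on the fast channel, in exchange for keeping 3 bus masters instead of 6.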
Dear @cousteaulecommandant, I'm glad to hear that you find the new DMA interesting! One of the goals of this project was to tackle the exact limitation you pointed out, i.e. the memory bandwidth. I can confirm that there will be a commit coming soon that will introduce, among other features, a multi-master system that solves the issues you pointed out. It will be possible to tune the DMA subsystem using parameters in the DMA field in mcu_cfg.hjson. They will enable you to choose:

- the number of DMA channels;
- the number of master ports;
- the number of channels assigned to each master port.
This last parameter seems odd, but it has been introduced to add flexibility to the configuration.
e.g. with 4 channels and 2 master ports, two arrangements are possible:

1) Allocate 1 port for every 2 channels -> at most 2 channels per port
2) Allocate 1 port for 3 channels and 1 port for the remaining one -> 3 channels per port
According to these parameters, suitable crossbars will be instantiated to manage the N-to-M flow. It is of course possible to have N channels and N master ports, in which case no crossbars are instantiated. This amount of flexibility will be leveraged to evaluate area/performance tradeoffs.
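A minimal sketch of how those parameters might look in the DMA field of mcu_cfg.hjson (the key names below are illustrative guesses; the actual names may differ in the committed version):

```hjson
// hypothetical DMA field of mcu_cfg.hjson -- key names assumed
dma: {
    num_channels: 0x4                  // 4 DMA channels
    num_master_ports: 0x2              // 2 master ports toward the bus
    num_channels_per_master_port: 0x2  // arrangement 1: 2 channels per port
    // setting this to 0x3 instead would give arrangement 2: 3 channels
    // on one port and the remaining channel on the other
}
```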
Without the multi-master capabilities, there are two cases, in my opinion, in which a multi-channel system can still bring advantages:
Finally, with this new commit there will be an updated and improved documentation that will explain in detail the features introduced with the new DMA subsystem.
I'm happy to expand on any of the points I have made if needed!
Dear @TommiTerza, that is great news! It sounds like a great improvement to the multichannel DMA that could boost the performance of X-HEEP based systems.
I was wondering: is a similar feature planned for the peripheral subsystem (which, I just realized, has a single port connecting it to the OBI bus)? Or does this effort focus on externally connected peripherals, with the "internal" peripheral subsystem meant only for simple, low-bandwidth peripherals?
Dear @cousteaulecommandant, for now we have only implemented the multi-master feature for the DMA, because of its crucial role in memory-intensive applications. At the moment there are no plans that I know of to extend multi-master capabilities to other X-HEEP domains.
Just to clarify, I meant multi-slave rather than multi-master for the peripheral subsystem (or "multi-port" in general), so that each DMA channel could drive an individual peripheral. But I suppose that's beyond the scope of the peripheral subsystem.
For now I'll close this since it's already been answered. Thanks!
I have seen that there was a recent commit on X-HEEP (#517) which added multichannel DMA capabilities. This sounds very interesting to me, since one of the issues I have with X-HEEP is its memory bandwidth limitation, and having multiple DMAs operating in parallel would solve it: I could use multiple DMAs to move data from multiple memory blocks to multiple peripherals in parallel (provided that I use an `NtoM` bus configuration).

However, I have noticed that the new `dma_subsystem` exposes only one "channel" on its bus interface (ch0), and internally multiplexes all the DMA channels (`dma` instances) into a single bus port. Therefore, this multichannel DMA can only move one 32-bit word at a time, and if multiple channels are active at the same time, they must take turns to access the bus and the memory blocks / peripherals (even if the bus is configured as `NtoM` and there are multiple memory blocks).

My questions: