STM32 DMA double-buffering

Dirbaio commented 2 years ago

We want to add some form of first-class double-buffering support, to allow endless streaming of data.

Example use cases

Streaming samples from ADC
Streaming samples to DAC
Streaming to/from I2S/SAI
DMA-powered buffered UART

Requirements

Support double-buffering.
There must be no gap between transfers (the latency of an IRQ or of a task wake is too much).
There must not be UB even if irqs are delayed arbitrarily long (DMA must not wrap around and start overwriting the slice the user code is touching)

How to do this?

Satisfying the requirements is tricky. The third requirement (no UB even if IRQs are delayed) essentially means we can't use DMA modes that "wrap around by default". For example, with a circular buffer you might do this:

Start read onto a buffer in circular mode
Loop {
    Wait for the half-transfer interrupt (HTIF), this means the 1st half is filled
    Hand the 1st half to the user, they process it
    Wait for the transfer-complete interrupt (TCIF), this means the 2nd half is filled
    Hand the 2nd half to the user, they process it
}

However, if the user takes too long to process the 1st half, DMA might wrap around and overwrite it from under them -> UB.

Unfortunately I believe it's "fundamentally impossible" to wrap DMA circular mode in a safe Rust API :'(

The way we use DMA has to be something like "start writing to buf1, queue a write to buf2. When you're done with buf1 or buf2, tell me. But DO NOT wrap around back to buf1 until I tell you to do so". That way, if user code takes too long, DMA just stops (and maybe loses data) but there's no UB.
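
The ownership rule above can be sketched as a small host-side state machine. This is only a model under stated assumptions: it touches no hardware, and `DoubleBuffer`, `Owner`, `on_transfer_complete` and `release` are hypothetical names for illustration, not embassy APIs.

```rust
/// Who currently owns a buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Owner {
    Dma,  // DMA is allowed to write into it
    User, // user code holds it; DMA must not touch it
}

/// Host-side model of a two-buffer stream (hypothetical, no real DMA access).
#[derive(Debug)]
struct DoubleBuffer {
    owner: [Owner; 2],
    active: Option<usize>, // buffer DMA is filling, None when DMA is stopped
    lost: u32,             // buffer periods dropped while DMA was stopped
}

impl DoubleBuffer {
    fn new() -> Self {
        Self { owner: [Owner::Dma, Owner::Dma], active: Some(0), lost: 0 }
    }

    /// Models "one buffer period has elapsed".
    /// Returns the index of the buffer handed to the user, if any.
    fn on_transfer_complete(&mut self) -> Option<usize> {
        let done = match self.active {
            Some(i) => i,
            None => {
                self.lost += 1; // DMA is stopped: data is lost, but nothing is overwritten
                return None;
            }
        };
        self.owner[done] = Owner::User;
        let next = 1 - done;
        // Only continue into the other buffer if the user has given it back.
        self.active = if self.owner[next] == Owner::Dma { Some(next) } else { None };
        Some(done)
    }

    /// User is done with buffer `idx`; DMA may use it again.
    fn release(&mut self, idx: usize) {
        self.owner[idx] = Owner::Dma;
        if self.active.is_none() {
            self.active = Some(idx); // restart the stopped stream
        }
    }
}
```

If the user never calls `release`, the model stops after both buffers are handed out and only counts lost periods; it never lets DMA back into a buffer the user still holds.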

Idea 1: use M0AR/M1AR

There are some interesting ideas around on how to use M0AR/M1AR for this: writing a "poison" address to the next buffer (like 0xFFFF_FFFF) to get DMA to error and stop, then overwriting the poison with the real address when it's safe to continue.

~~I'm not sure if this actually works in practice, or if it does it avoid UB in all cases.~~ yes it does.

Disadvantages:

Only works when the two bufs have the same length. Hardware has 2 addr regs but only 1 len reg :(
Only works on chips with M0AR/M1AR (F2, F4, F7, H7, L5)
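
A rough host-side model of the poison-address trick, to make the sequence concrete. Everything here is an assumption for illustration: `DmaModel`, `hw_buffer_full`, `take_idle` and `give_back` are made-up names, the `ct` field stands in for the hardware's current-target selection, and real code would write the actual M0AR/M1AR registers rather than an array.

```rust
/// Address that is assumed to make the DMA fault and stop (value from the discussion above).
const POISON: u32 = 0xFFFF_FFFF;

/// Hypothetical model of a double-buffered DMA stream. No real registers are touched.
#[derive(Debug)]
struct DmaModel {
    mar: [u32; 2], // stands in for M0AR and M1AR
    ct: usize,     // which of the two addresses DMA is currently using
    error: bool,   // set when DMA hit the poison address and stopped
}

impl DmaModel {
    fn new(buf0: u32, buf1: u32) -> Self {
        Self { mar: [buf0, buf1], ct: 0, error: false }
    }

    /// Hardware side: the current buffer is full, so DMA switches to the other address.
    fn hw_buffer_full(&mut self) {
        if self.error {
            return; // already stopped
        }
        self.ct = 1 - self.ct;
        if self.mar[self.ct] == POISON {
            self.error = true; // DMA faults instead of overwriting user-held memory
        }
    }

    /// Software side: poison the idle buffer's address BEFORE handing that buffer to the user.
    fn take_idle(&mut self) -> usize {
        let idle = 1 - self.ct;
        self.mar[idle] = POISON;
        idle
    }

    /// Software side: the user is done, restore the real address so DMA may continue.
    fn give_back(&mut self, idx: usize, addr: u32) {
        self.mar[idx] = addr;
    }
}
```

The ordering is the point of the sketch: the user only gets a buffer after its address register has been poisoned, so a late `give_back` turns into a DMA error rather than a silent overwrite.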

Idea 2: transfer queuing

Note: this is not fast enough for some use cases (like DCMI).

Add a way in `trait Channel` to queue transfers. You start one transfer, queue the next. When a transfer finishes, the IRQ handler starts the next transfer if queued. DMA stops if there's no queued transfer.

This allows code (e.g. the ADC HAL) to:

- Start transfer to buf1
- Queue transfer to buf2
- When buf1 is filled, hand it to user code, then queue it again
- When buf2 is filled, hand it to user code, then queue it again
- Repeat

If user code is slow or IRQs are delayed, DMA loses data but there's no UB.

Disadvantages:

- Time gap is the IRQ latency, it's not zero.
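
A minimal sketch of the queuing behaviour, as a host-side model only. `QueuedChannel`, `submit` and `on_irq` are hypothetical names; this is not the actual `Channel` trait, and buffers are represented by plain ids instead of real slices.

```rust
/// Hypothetical model of a DMA channel with a one-deep transfer queue.
#[derive(Debug, Default)]
struct QueuedChannel {
    current: Option<usize>, // id of the buffer being transferred right now
    queued: Option<usize>,  // id of the buffer to start when `current` finishes
}

impl QueuedChannel {
    /// Start the transfer immediately if the channel is idle, otherwise queue it.
    /// Returns the buffer id back as an error if the queue slot is already taken.
    fn submit(&mut self, buf: usize) -> Result<(), usize> {
        if self.current.is_none() {
            self.current = Some(buf);
            Ok(())
        } else if self.queued.is_none() {
            self.queued = Some(buf);
            Ok(())
        } else {
            Err(buf)
        }
    }

    /// Models the transfer-complete IRQ handler: returns the finished buffer
    /// and starts the queued one. With nothing queued, the channel goes idle.
    fn on_irq(&mut self) -> Option<usize> {
        let done = self.current.take();
        self.current = self.queued.take();
        done
    }

    fn is_running(&self) -> bool {
        self.current.is_some()
    }
}
```

In this model the gap mentioned under "Disadvantages" is the time between the hardware finishing `current` and `on_irq` actually running.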

Original discussion in Matrix

matoushybl commented 2 years ago

Further discussion revealed more information on double buffering with DMA/BDMA on different families and peripheral versions:

Families with BDMA v1/v2 (often called DMA in the reference manuals) cannot support Double Buffering as they lack hardware support. This means that it cannot be supported in hardware for families: F1, L1, F0, F3, L0, L4, G0, G4, WB, WL.
Families that have BDMA v3 support Double Buffered Mode - H7 and L5, where H7 supports it on both DMA and BDMA.
F2, F4 and F7 support Double Buffering in their DMA peripheral.

AntoineMugnier commented 2 years ago

From the discussion on Matrix (formatted):

Idea 1 - fast, sound, only F2, F4, F7, H7, L5: Preferred option if the hardware permits it.

Idea 2 - slow, sound, all chips: Would be easier to implement/understand/maintain than idea 1, but the IRQ latency is not negligible.
It should be fine for audio on I2S/DAC: e.g. with a 180 MHz core and 48 kHz sampling, you would have ~4000 CPU cycles (180 MHz / 48 kHz = 3750) for the IRQ, which should be enough, assuming that there are no long critical sections and the IRQ priority is high. But in a high-frequency ADC sampling application the gap will be noticeable. There's at least one use case where that doesn't work at all: transferring pictures from DCMI.
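
The cycle budget in that estimate is just the core clock divided by the sample rate. A tiny helper (the name `cycles_per_sample` is made up for this sketch) makes it easy to plug in other numbers:

```rust
/// CPU cycles available between two consecutive samples, i.e. the budget
/// the IRQ has to start the next transfer without dropping a sample.
/// Uses integer division, so the result is rounded down.
fn cycles_per_sample(core_hz: u64, sample_rate_hz: u64) -> u64 {
    core_hz / sample_rate_hz
}
```

For the audio case, `cycles_per_sample(180_000_000, 48_000)` gives 3750. At a 1 MHz sample rate on the same core the budget drops to 180 cycles, which shows why fast ADC sampling is a problem for this approach.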

Idea 3 - fast, unsound on overrun, all chips: Use DMA circular mode with a single buffer. On overrun, either panic, or stop DMA from the IRQ and make the task return an "OverrunError" to the user (a bit more risky). The second option is technically still unsound, because by the time the IRQ fires the overrun (and therefore UB) has already happened. This would allow us to get streaming DMA for ADC and other peripherals working on ALL chips, and we can later apply idea 1 on the chips that support it.
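
To illustrate what "detect the overrun and return an error" could look like, here is a host-side sketch. It assumes the driver can somehow obtain a running total of elements written by DMA (how to derive that from the hardware counters is left out), and `RingReader` and `readable` are hypothetical names, not the embassy ring buffer API.

```rust
#[derive(Debug, PartialEq, Eq)]
struct OverrunError;

/// Hypothetical reader-side bookkeeping for a circular DMA buffer.
struct RingReader {
    capacity: u64,   // number of elements in the circular buffer
    read_total: u64, // total elements consumed by the user so far
}

impl RingReader {
    /// `written_total` is the total number of elements DMA has written since start
    /// (assumed to be available and never smaller than `read_total`).
    /// Returns `(start_index, len)` of the unread region, or `OverrunError`
    /// if DMA has lapped the reader.
    fn readable(&mut self, written_total: u64) -> Result<(usize, usize), OverrunError> {
        let pending = written_total - self.read_total;
        if pending > self.capacity {
            // Unread data was already overwritten: report it and resynchronise.
            self.read_total = written_total;
            return Err(OverrunError);
        }
        let start = (self.read_total % self.capacity) as usize;
        self.read_total = written_total;
        Ok((start, pending as usize))
    }
}
```

Note that this check only notices the overrun after the fact, which is exactly the soundness caveat raised above: the model can report the error, but it cannot prevent DMA from having written over data in the meantime.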

AntoineMugnier commented 2 years ago

After the previous discussion, we plan to implement at least ideas 3 and 1, and maybe 2. Suggested ordering of the tasks for the development: Idea 3 => Idea 1 => Idea 2

I'm starting work on Idea 3

hrouault commented 5 days ago

Any update on this?

I would like to switch my project to embassy but I can't yet because of this. Double buffering is implemented along the lines of idea 1, I believe, in stm32h7xx-hal

matoushybl commented 5 days ago

I believe double buffered DMA is supported now: see for example https://github.com/embassy-rs/embassy/blob/main/examples/stm32f4/src/bin/adc_dma.rs . The thing that isn't supported afaik is support for large transfers (>65535).

hrouault commented 5 days ago

I got confused by the note there and thought the adc had to be restarted after each dma transfer completion. I was wrong.

If I understand correctly, idea 3 was implemented but not idea 1. It means that after each transfer, the DMA buffer needs to be copied into another buffer. Is that correct?

Should this issue be closed?

hrouault commented 3 days ago

Actually, I tried to use the ringbuffer with my board, but realized it is only supported for adc_v2, and stm32h7 is adc_v4.
