electro-smith / libDaisy

Hardware Library for the Daisy Audio Platform
https://www.electro-smith.com/daisy
MIT License

DMA buffers - memory location & usability concerns #335

Open TheSlowGrowth opened 3 years ago

TheSlowGrowth commented 3 years ago

In #326 we started discussing how to avoid having to pass in DMA buffers to low level drivers.

The Problem

When transferring data to a peripheral via DMA, the data moves directly from memory to the peripheral. If that memory is covered by the D-cache, recent writes may still sit in the cache rather than in RAM, so the DMA will transfer stale/outdated data to the peripheral.

The solutions:

  1. (This is the current solution) The buffers that are used by the DMA transfer are placed in a memory region which has the D-cache disabled.
    • If the DMA buffers are accessed multiple times prior to the DMA transfer (e.g. when rendering images for the display) then all those accesses are un-cached and potentially slower
    • The DMA buffers can't be kept as a member variable in a low level driver class. Instead they have to be allocated in the correct memory region by the user and handed to the low level driver via the Init() function.
  2. The buffers that are used by the DMA transfer sit in regular, cached memory. Before each DMA transfer, the cache is flushed with dsy_dma_clear_cache_for_buffer().
    • The user doesn't have to pass in buffers; the buffers can simply sit as member variables in the driver. This makes USING a driver much simpler and keeps complexity away from the user
    • Access to the buffer prior to the DMA transfer (e.g. when rendering images for the display) is cached and potentially faster.
    • We can change the buffer structure (e.g. introduce double buffering, etc.) without breaking the driver's API.
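As a rough illustration of option 2, a driver could own its buffer as an ordinary member and clean the D-cache right before kicking off the transfer. This is only a hedged sketch: the `DisplayDriver` class, the buffer size, and the `ClearCacheForBuffer()` stub are hypothetical stand-ins (on the Daisy the cache op would be `dsy_dma_clear_cache_for_buffer()` or CMSIS's `SCB_CleanDCache_by_Addr()`).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the real cache-maintenance call; stubbed so the sketch
// compiles off-target. On the Daisy this would be
// dsy_dma_clear_cache_for_buffer(buf, size).
static void ClearCacheForBuffer(uint8_t* buf, size_t size)
{
    (void)buf;
    (void)size;
}

class DisplayDriver
{
  public:
    // Returns the number of bytes queued for transfer.
    size_t Draw(const uint8_t* pixels, size_t size)
    {
        if(size > sizeof(txBuffer_))
            size = sizeof(txBuffer_);
        memcpy(txBuffer_, pixels, size);      // cached accesses: fast
        ClearCacheForBuffer(txBuffer_, size); // write cache contents back to RAM
        StartDmaTransfer(txBuffer_, size);    // DMA now reads up-to-date data
        return size;
    }

  private:
    void StartDmaTransfer(uint8_t*, size_t) { /* peripheral-specific */ }
    uint8_t txBuffer_[128]; // plain member, no special memory region needed
};
```

The point is that `txBuffer_` lives wherever the driver instance lives, so the user never has to allocate or pass in a buffer.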

Performance considerations:

@recursinging wondered how the performance difference between the two approaches could be measured. Here's an idea: Execute this:

Set pin high
loop N times:
    write M bytes to buffer in cached memory region
    flush cache
set pin low

Then execute this:

Set pin high
loop N times:
    write M bytes to buffer in non-cached memory region
set pin low

And compare the execution times by observing the pin with an oscilloscope. I bet if M = bufferSize, the performance would be very comparable. I would assume that if M >> bufferSize (== typical display drawing scenario) the cached version would be faster.
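The two loops above could be skeletonized like this. It is a host-compilable sketch: `SetPin()` and `FlushCache()` are hypothetical stand-ins for the real GPIO and cache calls (on the Daisy, something like `dsy_gpio_write()` and `dsy_dma_clear_cache_for_buffer()`), and on target you would time the pin with a scope rather than count calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-ins so the skeleton compiles anywhere; on target these would be
// the real GPIO write and D-cache clean operations.
static int g_pin_state   = 0;
static int g_flush_calls = 0;

static void SetPin(int level) { g_pin_state = level; }
static void FlushCache(uint8_t*, size_t) { ++g_flush_calls; }

static uint8_t buffer[256]; // cached variant: ordinary RAM

void BenchmarkCached(int n, size_t m)
{
    SetPin(1); // scope trigger: start
    for(int i = 0; i < n; ++i)
    {
        memset(buffer, i & 0xFF, m); // write M bytes through the cache
        FlushCache(buffer, m);       // clean the D-cache, as before a DMA start
    }
    SetPin(0); // scope trigger: stop
}

// The non-cached variant is identical minus the FlushCache() call, with
// `buffer` placed in the cache-disabled region instead.
```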

I would argue that it's simpler and probably even faster to just flush the cache before the DMA transfer. At least it would greatly reduce complexity for users.

TheSlowGrowth commented 3 years ago

Answering some questions from @recursinging in #326 :

Alternatively, I was wondering if it might be possible to have template peripheral handle classes which allocate their own transmit and receive buffers sized by the template type(s).

You can't have member variables in a specific memory region unless the entire parent object is in that memory region.
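To illustrate why: with GCC, memory-region placement is done via a section attribute on a global or static object, so it cannot apply to a lone member variable. A hedged sketch (the `.sram1_bss` section name follows the discussion here; the macro and function names are illustrative assumptions):

```cpp
#include <cstddef>
#include <cstdint>

// Section placement only works on globals/statics; a member variable lives
// wherever its parent object lives and cannot be relocated on its own.
#if defined(__arm__)
#define NONCACHED_BSS __attribute__((section(".sram1_bss")))
#else
#define NONCACHED_BSS // no-op when building off-target
#endif

NONCACHED_BSS static uint8_t dma_rx_buffer[64];

// A driver can therefore hold only a *pointer* to such a buffer, which is
// why the current API asks the user to allocate it and pass it to Init().
uint8_t* GetDmaRxBuffer(size_t* size)
{
    *size = sizeof(dma_rx_buffer);
    return dma_rx_buffer;
}
```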

The entire peripheral handle class instance could be placed in .sram1_bss without incurring a performance penalty (assuming these classes don't make use of the cache).

The D-cache is enabled for memory regions and is completely transparent to the code that's running. Classes don't use the cache of their own accord; whether an access is cached depends only on which memory region the data lives in.

It seems to be a more logical location, but I'm unsure how this would work with multiple devices on a single bus, so I dismissed the idea.

Each low level driver will need its own DMA buffers even if they share a DMA. Also, I strongly disagree with placing an entire low level driver in the cache-disabled memory region: only the actual DMA buffer needs to be excluded from the cache (or the cache needs to be flushed before the DMA transfer). All the other member variables that a low level driver may have should still be cached to benefit from the cache performance boost.

recursinging commented 3 years ago

I was thinking more along the lines of a template I2C or SPI handle that is constructed outside the driver class and passed in via Init(), similar to dev/leddriver.h, only that the DMA buffers need not be explicitly constructed alongside the peripheral handle itself... a slight improvement to the current implementation, perhaps. Should flushing the cache prove viable, it would certainly be the optimal solution.

GregBurns commented 3 years ago

In the past I have implemented "dynamic" memory allocators just for the convenience of allocating buffers from specific memory regions. A DMA allocator dma_malloc() lets you allocate the buffer from non-cacheable memory in the constructor and just hold a pointer to it in the class. Since drivers are expected to be available for the lifetime of the application, there is no need for a dma_free().
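A minimal version of such an allocator could be a bump allocator over a dedicated pool. This is a hypothetical sketch: on target the pool would carry a section attribute placing it in non-cacheable RAM, and the 32-byte rounding matches the Cortex-M7 D-cache line size so adjacent allocations never share a line.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical dma_malloc(): a bump allocator carving buffers out of a
// dedicated pool. On target, the pool would carry
// __attribute__((section(".sram1_bss"))) to land in non-cacheable RAM;
// here it is an ordinary static array so the sketch runs anywhere.
// Deliberately no dma_free(): drivers live for the application's lifetime.
alignas(32) static uint8_t dma_pool[1024];
static size_t dma_pool_used = 0;

void* dma_malloc(size_t size)
{
    // Round up to a 32-byte boundary (Cortex-M7 D-cache line size).
    const size_t aligned = (size + 31u) & ~static_cast<size_t>(31);
    if(dma_pool_used + aligned > sizeof(dma_pool))
        return nullptr; // pool exhausted
    void* result = &dma_pool[dma_pool_used];
    dma_pool_used += aligned;
    return result;
}
```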


stephenhensley commented 3 years ago

As mentioned above, I think doing some performance analysis of cache flushing compared to non-cached buffers would be helpful.

However, I think you're right that they may perform very similarly. It would greatly reduce complexity to be able to store the buffer locally in the class instance and perform the cache maintenance internally. What should also be considered when testing this is other bus activity that could affect cache performance (DSP activity in the callback, etc.).

A suitable test could probably be put together relatively quickly on the PCA9685 LED driver since we could just declare a few static buffers in the SRAM and compare the performance there. That way we can easily drop that into existing project code to profile any differences in a "real" environment.

This could also be tested for the Audio buffers, and ADC buffers. Though depending on performance we may want to be a bit pickier in the audio pipeline.

A "dynamic" allocator could also work pretty well for this, and I think the concept was discussed a while back in #242, and would still have the benefit of reducing user-facing complexity, but would add quite a bit of internal complexity (not that that's unwelcome).

recursinging commented 3 years ago

I was curious about this, so I wrote a little test program to check:

https://gist.github.com/recursinging/15b93588adc895d0a943864f5f27c3fc

Assuming my implementation is valid, it seems @TheSlowGrowth is correct in his assumption that the D-cache performance increase far outweighs the overhead of the cache clean/invalidate ops. In the above test, it is only when BUFFER_SIZE is lower than 8 that the cache op overhead becomes dominant.

That being said, this is a naive implementation intended only to visualize the performance relationship of the memory access patterns. As @stephenhensley suggested, it would be prudent to verify this with "real" implementations.

stephenhensley commented 3 years ago

Thanks for going ahead with a benchmark on this. Definitely very helpful in coming to a decision moving forward.

A "real"-enough test might be to just run a reverb algorithm (since it'll be using quite a bit of non-contiguous memory) in the audio callback while the test repeats a few times.

That said, if the metrics hold, and I suspect they might, we could definitely move forward with a plan to move toward using cache-maintenance for DMA buffers.

Currently, I believe the I2C (LED Driver), ADC, UART, and Audio are the only things that use DMA so far.

We could always have a much smaller buffer (say, 8 words) in the non-cached memory for Audio for when the blocksize is set really small, and use cache maintenance otherwise.
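That hybrid could reduce to a simple size-based dispatch. A hypothetical sketch, where the 8-word threshold is just the figure suggested above:

```cpp
#include <cstddef>

// Hypothetical dispatch for the hybrid scheme: tiny audio blocks use a fixed
// buffer in non-cached memory (no maintenance needed); larger blocks use the
// cached buffer plus a D-cache clean before the DMA starts.
constexpr size_t kNonCachedWords = 8; // threshold suggested above (assumption)

enum class DmaPath { NonCachedBuffer, CachedWithMaintenance };

DmaPath ChooseDmaPath(size_t blockSizeWords)
{
    return (blockSizeWords <= kNonCachedWords) ? DmaPath::NonCachedBuffer
                                               : DmaPath::CachedWithMaintenance;
}
```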