TheSlowGrowth opened 3 years ago
Answering some questions from @recursinging in #326 :
Alternatively, I was wondering if it might be possible to have template peripheral handle classes which allocate their own transmit and receive buffers sized by the template type(s).
You can't have member variables in a specific memory region unless the entire parent object is in that memory region.
The entire peripheral handle class instance could be placed in .sram1_bss without incurring a performance penalty (assuming these classes don't make use of the cache).
The D-cache is enabled per memory region and is completely transparent to the running code. Classes don't make use of the cache of their own accord; caching is a property of the memory region they are placed in.
It seems to be a more logical location, but I'm unsure how this would work with multiple devices on a single bus, so I dismissed the idea.
Each low level driver will need its own DMA buffers even if they share a DMA. Also, I strongly disagree with placing an entire low level driver in the cache-disabled memory region: only the actual DMA buffer needs to be excluded from the cache (or the cache needs to be flushed before the DMA transfer). All the other member variables of a low level driver should still be cached to benefit from the cache's performance boost.
I was thinking more along the lines of a template I2C or SPI handle that is constructed outside the driver class and passed in an Init
similar to dev/leddriver.h
only that the DMA Buffers need not be explicitly constructed alongside the peripheral handle itself... a slight improvement to the current implementation perhaps. Should flushing the cache prove viable, it would certainly be the optimal solution.
In the past I have implemented "dynamic" memory allocators just for the convenience of allocating buffers from specific memory regions. A DMA allocator dma_malloc() lets you allocate the buffer from non-cacheable memory in the constructor and just hold a pointer to it in the class. Since drivers are expected to be available for the lifetime of the application, there is no need for a dma_free().
As mentioned above, I think doing some performance analysis of cache flushing compared to non-cached buffers would be helpful.
However, I think you're right that they may perform very similarly. It would greatly reduce complexity to be able to store the buffer locally to the class instance, and perform the maintenance internally. What should also be considered when testing this is other bus activity that could have an effect on cache performance (dsp activity in callback, etc.).
A suitable test could probably be put together relatively quickly on the PCA9685 LED driver since we could just declare a few static buffers in the SRAM and compare the performance there. That way we can easily drop that into existing project code to profile any differences in a "real" environment.
This could also be tested for the Audio buffers, and ADC buffers. Though depending on performance we may want to be a bit pickier in the audio pipeline.
A "dynamic" allocator could also work pretty well for this, and I think the concept was discussed a while back in #242, and would still have the benefit of reducing user-facing complexity, but would add quite a bit of internal complexity (not that that's unwelcome).
I was curious about this, so I wrote a little test program to check:
https://gist.github.com/recursinging/15b93588adc895d0a943864f5f27c3fc
Assuming my implementation is valid, it seems @TheSlowGrowth is correct in his assumption that the D-cache performance increase far outweighs the overhead of the cache clean/invalidate ops. In the above test, it is only when BUFFER_SIZE is lower than 8 that the cache op overhead becomes dominant.
That being said, this is a naive implementation intended only to visualize the performance relationship of the memory access patterns. As @stephenhensley suggested, it would be prudent to verify this with "real" implementations.
Thanks for going ahead with a benchmark on this. Definitely very helpful in coming to a decision moving forward.
A "real"-enough test might be to just run a reverb algorithm (since it'll be using quite a bit of non-contiguous memory) in the audio callback while the test repeats a few times.
That said, if the metrics hold, and I suspect they might, we could definitely move forward with a plan to use cache maintenance for DMA buffers.
Currently, I believe the I2C (LED Driver), ADC, UART, and Audio are the only things that use DMA so far.
We could always have a much smaller (say, 8 word buffer) in the noncache memory for Audio for when blocksize is set really small, and use cache-maintenance otherwise.
In #326 we started discussing how to avoid having to pass in DMA buffers to low level drivers.
The Problem
When transferring data to a peripheral with the DMA, the data is transferred directly from the memory to the peripheral. If the memory is cached with the D-cache, then the contents of the RAM will not be up to date (recent CPU writes may still sit in the cache) and the DMA will transfer wrong/outdated data to the peripheral.
The solutions:
1. Allocate the DMA buffers in a non-cacheable memory region and pass them into the driver's Init() function.
2. Keep the buffers in regular (cached) memory and clean the cache before each transfer, e.g. with dsy_dma_clear_cache_for_buffer().
Performance considerations:
@recursinging wondered how the performance difference between the two approaches could be measured. Here's an idea: execute a loop that writes M words into a cached buffer and cleans the cache afterwards, then execute the same loop writing into a non-cached buffer, toggling a GPIO pin before and after each run. Then compare the execution times by observing the pin with an oscilloscope. I bet if M = bufferSize, the performance would be very comparable. I would assume that if M >> bufferSize (== the typical display drawing scenario), the cached version would be faster. I would argue that it's simpler and probably even faster to just flush the cache before the DMA transfer. At least it would greatly reduce complexity for users.