Open coryking opened 8 months ago
Here is some results when I used a completely different APA102 driver I found online: DMA Findings.md
As you can see from the oscilloscope readings in the linked document, when you use DMA it looks nothing like the ones in this ticket. Plus the speeds are significantly faster (though they appear to be "rounded" for reasons I don't fully understand.
So I need everybody's opinion on how to provide the "SPIOutput" class the number of LEDs / bytes required for a DMA buffer.
In order to get DMA working we need to allocate a buffer on the heap but to do so we need the size of the buffer. I think the best place for this is in the init() function. The problem is these "SPIOutput" classes do not know how many LEDs they are supposed to be driving, thus I have no clue how big of a buffer to allocate. So I need to get the number of LED's plumbed in and probably the number of "extra" bytes needed for start and end frames.
The problem is these SPI handlers are all supposed to look like "SPIOutput" even though they are all completely different classes. So I suspect if I make a change to init() to pass in the number of leds (or perhaps just bytes) I'll have to change all the other SPIOutput classes as well (which may not be a bad idea because any of the other SPI classes that do DMA need the same info).
So, in short:
Pinging @samguyer and @kriegsman for their thoughts as well as @Jueff & @sheaivey who looks to have done some DMA stuff for a different platform.
To avoid getting blocked, if I don't hear anything in the next few days I'm gonna proceed with doing this in the ESP32SPIOutput class and passing the number of bytes / leds through the init() function.
Also I should add I'm going to borrow quite liberally from the led_strip_spi component from the UncleRus esp-if-lib component library. I'll make sure to adhere to its MIT License when pushing my changes.
Any update on this?
Make sure that you have #define FASTLED_ALL_PINS_HARDWARE_SPI
before the #include "FastLED.h"
part to make sure you are not doing software bit banging.
Okay so i took at look at src/platforms/esp/32/fastspi_esp32.h and indeed we are using the 8-bit spi write and just iterating through the pixels. From the documentation of ESP32, these spi transactions need to grab a lock once per byte. So if true, that would explain the pattern of inefficient transmission you are seeing - it's basically blocked on a lock.
Say you have 1k leds, that's going to translate into 3k lock/unlocks to write all that data out.
Looking through the docs I see that esp32 now provides void spiWritePixelsNL(spi_t *spi, const void *data_in, uint32_t len);
as a new experimental API. Apparently, an entire buffer can now be sent to the SPI device in one shot and the lock will hold until the pixels have been flushed out through the spi device. I'd have to look more closely to see if 2 or more SPI devices can operate in parallel in FastLED.
Looking at src/platforms/esp/32/fastspi_esp32.h at the function template <uint8_t FLAGS, class D, EOrder RGB_ORDER> __attribute__((noinline)) void writePixels(PixelController<RGB_ORDER> pixels, void* context);
I see that we are writing each r,g,b byte out, then we are doing our advancedData() and stepDithering() operations after each pixel write. So sending an undefined block of data out to the spi in it's entirely won't work unless we supply a side buffer. However, we do do this for the WS2812 and other clockless data types exactly for this reason.
So the solution here is either create a side buffer that lives for the duration of the esp32 spi implementation... OR allocate a giant block of leds and send K batches for N leds. For example the batch size could be 256 pixels. This would mean that the inefficiency that you see would be cut to 1/256 of what it is now, which would almost be identical to full performance mode where everything is sent all at once.
I'm adding a new feature for bulk transfer through the SPI device on ESP32. You can enable it via #define FASTLED_ESP32_SPI_BULK_TRANSFER 1
before the inclusion of FastLED.
It will go out with 3.9.4. Right now it's just in master. I've compile tested, but haven't run it yet. If anyone tests it, let me know the results.
Thanks! I haven't had time to look into this for quite some time, unfortunately life events got in the way!
On Wed, Nov 13, 2024 at 6:36 PM Zachary Vorhies @.***> wrote:
I'm adding a new feature for bulk transfer through the SPI device on ESP32. You can enable it via #define FASTLED_ESP32_SPI_BULK_TRANSFER 1 before the inclusion of FastLED.
It will go out with 3.9.4. Right now it's just in master. I've compile tested, but haven't run it yet.
— Reply to this email directly, view it on GitHub https://github.com/FastLED/FastLED/issues/1609#issuecomment-2475290886, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEAPKWXIT5F6SSLI52YZI32AQEBPAVCNFSM6AAAAABEPFR7GCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINZVGI4TAOBYGY . You are receiving this because you authored the thread.Message ID: @.***>
Summary
When using FastLED with ESP32 to drive APA102 and SK9822 LED strips, significant inefficiencies have been observed in SPI data transmission. The implementation appears to conduct separate DMA transactions for each byte of data, introducing noticeable gaps between 8-bit bursts. This behavior considerably diminishes the effective data transmission rate, closely mirroring the performance limitations characteristic of older one-wire LED models like the WS2812b.
Environment
Observations
Expected vs. Actual Behavior
Expected: For high-speed SPI communication, especially at higher data rates, FastLED should utilize continuous DMA transfers for the entire data packet of LED strip data, minimizing CPU overhead and maximizing throughput for quick and efficient updates.
This is what I'm calling the "inner byte gap". It's roughly 1.7 or so microseconds. It transmits 8 bits, takes a break, transmits 8 more bits... etc.
Actual: The implementation seems to trigger individual DMA transactions for every byte sent to the APA102 and SK9822 LEDs, leading to unnecessary overhead, reduced transmission rate, and increased update times, particularly for longer LED strips.
Investigation and Documentation
This issue aims to document and investigate the SPI data transmission inefficiency for APA102 and SK9822 LEDs on the ESP32 using FastLED. The goal is to explore this issue further, propose optimizations, and prevent duplication of efforts across multiple issues or pull requests.
The above image shows how long it took to transmit a 30 LED strip. Roughly 230uS at a "20MHz" setting in fastLED.
Additionally, this concern echoes a comment from @SCarter858 on Dec 27, 2023, which detailed similar inefficiencies and questioned the actual utilization of DMA for SPI communication within FastLED's ESP32 HW SPI support.
Additional Context
Note that this being caused by separate DMA transactions is only hypothetical. It might not be using DMA at all!
This inefficiency not only affects APA102 LEDs but also their SK9822 counterparts, which are often used interchangeably in projects. The potential lack of full DMA utilization for SPI transmission significantly throttles the performance capabilities of these chips, making high-density LED installations as slow as older one-wire LED protocols like the WS2801. An optimized implementation would greatly enhance performance, leveraging the ESP32's capabilities for LED projects using FastLED.