Remove need to allocate vector on each write in `LedPixelEsp32Rmt`

faern commented 11 months ago

Hi,

Not that I'm having a huge amount of performance trouble with this crate to begin with. But this code always stood out to me as not ideal: https://github.com/cat-in-136/ws2812-esp32-rmt-driver/blob/c41e6bdf7478ae336c388bf9daa12fd9c721b16e/src/lib_smart_leds.rs#L74-L79

In general in my LED projects I allocate a single LED buffer vector/array at the start of the program. Then I render into it in a loop, calling LedPixelEsp32Rmt::write after each time I'm done "rendering". But this code allocates a new vector on the heap for each frame and fills it up. Given the use of Vec::new() it probably also re-allocates a bunch of times during the loop here.

Have you tried/considered moving the code translating from the source RGB[W] type (CDev in the code) into the raw format currently consumed by Ws2812Esp32RmtItemEncoder directly inside the Ws2812Esp32RmtItemEncoder? Is the ws2812_rmt_adapter function extremely time sensitive, or do you think it could handle a few more bit shifts?

I think the most beautiful implementation would be if the raw Ws2812Esp32RmtDriver was channel-aware and had a generic type for the LED strip type/channel order. And then it initialized the rmt_translator_init with a function able to do the channel shuffling.

This would remove one "rendering pass" from the pipeline from raw RGB data to the IO pin. I will experiment a bit with this myself. But I wanted to create this issue as a place to ask if you have any insight into this.

cat-in-136 commented 11 months ago

@faern Oh, you are right about this. I think it is preferable that the user of the library allocates it, rather than allocating it in rmt_translator_init. Because it provides the user with a choice between using heap or stack.

Actually, I was going to write an implementation that allocates memory every time with Vec (to achieve viable library at first) and rewrite it later, but I simply forgot to set aside the time to do that and left it as is.

I do not consider the removal of the rendering path to be critical. Maybe is it removed by build optimizations? I thought so, but I have not checked llvm ir or assembly, so I can't be sure.

faern commented 11 months ago

I'm pretty sure the compiler can't remove that extra allocation. The compiler is not allowed to touch the original pixel buffer sent in by the user, and the RMT encoder runs in an ISR context. So this little conversion+allocation loop is stuck in the middle with no where to go.

In my own experiments with doing the channel shuffeling in the RMT Encoder I have gotten vary bad results so far. Not sure if the code is too expensive to run in that time sensitive context, or if I have some other bug here. But if we can't run that in the encoder then at least we can let the user supply a pre-allocated buffer so it does not need to be re-allocated for every frame.

The way I use this library is to render a constant amount of LEDs (known at compile time) at about 60 FPS. So I can allocate all buffers needed up front.

cat-in-136 commented 11 months ago

I see what I misunderstood about the rendering path. The place you are referring to is indeed not removable by the compiler.

There are many aspects to this:

Data copy in encode of Ws2812Esp32RmtItemEncoder : The esp32 RMT requires copying to dest: *mut rmt_item32_t provided as an argument of ws2812_rmt_adapter(). Therefore, this copying cannot be avoided.
Memory allocation for each write() function call of LedPixelEsp32Rmt<CSmart, CDev> trait : just ugly ad-hoc implementation as I already mentioned. (Unlike this, LedPixelDrawTarget allocate memory once at new())
RGB(W)-channel-cared Ws2812Esp32RmtItemEncoder (Integration of Ws2812Esp32RmtItemEncoder with CDev) : The functions in ws2812_rmt_adapter are time sensitive. Therefore, I simplified the ws2812_rmt_adapter process by pre-processing the caller of Ws2812Esp32RmtItemEncoder to take into account the order in which the RGB(W) are transferred. This is why Ws2812Esp32RmtDriver does not care RGB channel.
Complexly of ws2812_rmt_adapter() : In a nutshell, it's the same process as https://github.com/adafruit/Adafruit_NeoPixel/blob/fe882b84951bed066764f9350e600a2ec2aa5a9e/esp.c#L59 and not too complex. We believe that it is not too computationally intensive. I have considered making it less computationally intensive, but I found that the function arguments may be different each call time and so it is not possible to calculate all of them in advance and simply copy them.
Complexly of RGB(W) data conversion into ESP32 RMT native expression: Yes, it is indeed a complex process, and I believe it is difficult to eliminate it as long as the smart-led and embedded-graphics APIs are used.
Reducing complexly of RGB(W) data conversion: The smart-led API is not feasible because all LED RGB data is sent from user code each time. embedded-graphics API only calculates the portion that changes each frame. (However, the data structure is straightforward and processing speed is not a consideration.)

You seems to use the smart-leds API, but for your application it seems better to use the embedded-graphics API or Ws2812Esp32RmtDriver directly. (On top of that, the implementation of Ws2812Esp32RmtDriver would need to be changed/improved to take into account processing time and #31 aspects.)

faern commented 11 months ago

Complexly of RGB(W) data conversion into ESP32 RMT native expression: Yes, it is indeed a complex process, and I believe it is difficult to eliminate it as long as the smart-led and embedded-graphics APIs are used.

Personally I don't want the embedded graphics. So if that part is stopping improvements in the more basic driver, that's only a negative for my personal use case :)

You seems to use the smart-leds API, but for your application it seems better to use the embedded-graphics API or Ws2812Esp32RmtDriver directly

I just migrated my code from smart-leds to the raw driver, and do the channel shuffling myself. I do not want the LED matrix support of embedded graphics. Personally I build unique light installations where the shape and form of each "lamp" is different. And so far never in a matrix.

The only part I do want from this driver is a stable, fast and idiomatic/ergonomic way to write my pixel data to the IO pins. If channel ordering is built in in a nice way (with as little overhead as possible, thus this thread) that would be awesome. But anything on top of that has to be bespoke solutions for me for every project anyway. I "render" in very different ways for every project.

However, the LED type (channel order), number of LEDs, desired FPS and all of that is usually known at compile time for me. So if this knowledge can make the code more optimized, that would be cool. And this is what I'm experimenting with right now. Drawing LEDs with an Iterator (or &[u8] as in the current raw drivers case) is erasing some knowledge I have at compile time, and might cost some performance (for example the extra vector this issue is about).

I have an almost working experimental driver I wrote from scratch using ESP-IDF 5.1 and esp-idf-hal. But currently it renders too many pixels! The colors are correct but when I try to render 3 pixels it renders ~9-11, when I render 5 it renders ~20 pixels. I don't know if I have a bug somewhere or if my ws2812_rmt_adapter is so slow that it currently causes some timing issues on the IO pin. I should get a logic level analyzer I guess.

faern commented 11 months ago

I should add that I fully respect if this is not the direction you want to take this crate/driver! I just raise issues for me and my use cases. You maintain your crate in the way you see fit of course. I'll continue tinkering with my experiments. When I have something working I can try to formulate a PR. Or I'll publish my own driver.

faern commented 11 months ago

Regarding performance and timing issues. Have you experimented with putting ws2812_rmt_adapter in IRAM? I don't know how to do this, but the official ESP docs highly recommend you put your RMT encoder function in it. I created an issue asking of the esp-rs folks had any guidance on how to best do this with Rust functions. https://github.com/esp-rs/esp-idf-hal/issues/280

cat-in-136 commented 6 months ago

@faern I'm working on this issue in the support in esp-idf 5.1 of PR #34. Please refer PR #34.

As it turns out, it is difficult to make zero-copy if the pixel information iterator does not implement Send traits, and the SmartLED API (function definition) does not require Send traits. So I cannot introduce zero-copy API to the SmartLED. Instead, I have introduced the dedicated API LedPixelEsp32Rmt#write_nocopy. Note that write_nocopy required some generics specialization (CDev -> LedPixelColorImpl this could be solved with some ingenuity, but I don't think it would be a major issue since code optimization is important for the zero-copy API).

I changed Ws2812Esp32RmtDriver to handle pixel sequences (iterators) instead of pointers to slices. So if the SmartLED API is not needed and a simple thin driver is sufficient, the driver functions can be used directly.

cat-in-136 / ws2812-esp32-rmt-driver

Remove need to allocate vector on each write in `LedPixelEsp32Rmt` #30