Aircoookie / WLED

Control WS2812B and many more types of digital RGB LEDs with an ESP8266 or ESP32 over WiFi!
https://kno.wled.ge
MIT License
14.56k stars · 3.12k forks

Add Fadecandy-style dithering, gamma and color correction, maybe interpolation #2416

Open embedded-creations opened 2 years ago

embedded-creations commented 2 years ago

Copying @swifty99's feature request from NeoPixelBus:

Is your feature request related to a problem? Please describe. For 8-bit PWM LEDs, which include every addressable LED strip I know of, it is very hard or impossible to do accurate color mixing at low brightness levels. The PWM chips also do not incorporate any gamma conversion: the PWM controllers in addressable chips operate linearly, meaning that doubling the control value doubles the absolute light output. However, this is not the way our eyes work. Many solutions have been developed to address this; gamma conversion in computer displays is the most common one, and perceptual quantizers like in Dolby Vision displays are a fancier one.
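As context for the linear-PWM-vs-perception point above, here is a minimal sketch of how a gamma lookup table is typically built (the 2.2 exponent and the function name are illustrative, not WLED's actual implementation):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Build an 8-bit -> 8-bit gamma table: the index is the linear control
// value, the entry is the perceptually corrected value sent to the LED.
void buildGammaTable(uint8_t table[256], float gamma = 2.2f) {
    for (int i = 0; i < 256; i++) {
        table[i] = (uint8_t)lroundf(powf(i / 255.0f, gamma) * 255.0f);
    }
}
```

Because the curve compresses the low end, many distinct input values map to the same small output value — which is exactly the low-brightness resolution loss that dithering tries to win back.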

Describe the solution you'd like A project called Fadecandy solved many of these problems. Internally it increased the resolution by dithering: the LED is turned off and on, overlaying a PWM with a larger timescale on the whole LED. With this blinking the resolution can be increased by 2-4 bits. I have used Fadecandy in the past and adapted this to Arduino libraries. It looks super smooth and nice; even low-brightness color temperatures can be mixed. Unfortunately the project seems to be abandoned.
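The temporal-dithering idea described above can be sketched in a few lines (the 10-bit depth and names are illustrative, not Fadecandy's actual code): a per-channel accumulator carries the fractional remainder from frame to frame, so the time-average of the 8-bit outputs converges on the higher-resolution target.

```cpp
#include <cassert>
#include <cstdint>

// One dithering step: take a 10-bit target value plus the 2-bit residual
// carried over from the previous frame, emit this frame's 8-bit value,
// and save the new residual. Averaged over frames, the output gains
// roughly 2 extra bits of effective resolution.
uint8_t ditherStep(uint16_t target10, uint8_t &residual2) {
    uint16_t t = target10 + residual2;
    residual2 = t & 0x03;        // keep the low 2 bits for the next frame
    return (uint8_t)(t >> 2);    // 8-bit value actually sent this frame
}
```

For example, a 10-bit target of 513 emits 128, 128, 128, 129 over four frames, summing to exactly 513 — the eye integrates this into a brightness between the two 8-bit steps.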

What is the downfall: Due to the need for quite high refresh rates, a maximum of 64 LEDs per line should be used. To drive more than 64 LEDs, multiple lines need to be addressed in parallel, which DMA access makes possible.

Describe alternatives you've considered So far I use my own code, but it is missing all the add-on features that WLED has. So I will definitely go with NeoPixelBus, because it is great. Improving the color range would be awesome.


@makuna clarified that dithering is the only feature he'd want to add to NeoPixelBus itself: not the platform-specific code to refresh the LEDs with the consistent timing needed to get good results from dithering, and not custom color correction tables. He recommended adding those things in WLED, so I'm moving the discussion here.

Interpolation between keyframes is another feature of Fadecandy that would be useful to support, as it complements the dithering and color/gamma correction.

embedded-creations commented 2 years ago

There's some movement on a few related features:

ESP32 I2S Parallel driver for driving 16-120(!) WS2812 strips in parallel

Implementing WS2812 dithering to get more color depth (or decrease the loss from running LEDs at lower brightness levels) can take advantage of the above driver that's able to refresh the LEDs continuously using DMA.

With more color depth you can get more benefit from applying color/gamma correction.

embedded-creations commented 2 years ago

As all these features including dithering now need to be handled outside of NeoPixelBus, we have more flexibility in how to implement all the features. I would like to keep the API similar to NeoPixelBus so it fits into WLED like a new NeoPixelBus method.

Some Background Info (for @swifty99 and also to sort out my own thoughts):

Double Buffering

From my point of view, the best solution in the long run would be a toggleable double buffer that returns the original color to getPixelColor(), or if disabled, use the lossy (value << 8) / (brightness +1) color recovery method. - AC
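To illustrate why that recovery method is lossy, here is the round trip written out as a sketch (the function names are mine; the formulas are the ones quoted above, with `brightness` as the 0-255 global brightness):

```cpp
#include <cassert>
#include <cstdint>

// Scale a channel by global brightness the way the quoted recovery
// formula assumes the bus does internally.
uint8_t scaleByBrightness(uint8_t value, uint8_t brightness) {
    return (uint8_t)(((uint16_t)value * (brightness + 1)) >> 8);
}

// Recover the "original" value from the scaled one: (value << 8) / (brightness + 1).
uint8_t recoverColor(uint8_t scaled, uint8_t brightness) {
    uint16_t v = ((uint16_t)scaled << 8) / (brightness + 1);
    return (uint8_t)(v > 255 ? 255 : v);
}
```

At brightness 128, the value 200 scales down to 100 and recovers as 198, not 200 — the low bits discarded by the scaling shift are gone for good, which is what motivates keeping an uncorrected front buffer.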

We'll likely need double buffering for some of the features below, but it's a desirable optional feature in general, so let's discuss it first. Double buffering could be handled by a buffer outside of the NeoPixelBus class, by writing to a bitmap (could be NeoPixelBitmap) and then copying the bitmap into the NeoPixelBus object before calling Show(). Or, we could create a class derived from NeoPixelBus that handles the optional double buffering internally.

Dithering

Dithering will only be supported when using hardware peripherals that support automated transfer via DMA with consistent timing to produce consistent frame rates. At this point it's only going to be supported on the ESP32 when refreshing via I2S Parallel mode.

There are at least two ways of preparing the data for dithering:

  1. Preparing all the data in advance so refresh can happen completely asynchronously - interrupts could be disabled and the dithering will continue uninterrupted. There needs to be enough memory to support creating all the dithered sub-frames needed to refresh a keyframe, so it can be refreshed without involvement from the sketch. If there is a delay in drawing the next keyframe, the sub-frames can just repeat. In order to asynchronously create another frame while one is being refreshed, there will also need to be enough memory to create all the dithered sub-frames for another keyframe. (In other words we need double buffering for all the dithered sub-frames.) This is a significant amount of memory, likely 12x the amount of memory needed to refresh a frame without dithering.

  2. Preparing the data as needed, requiring code to be run between Show() calls. In this case there would be memory to hold a number of sub-frames in a circular queue, and as each sub-frame finishes refreshing, the next sub-frame can be calculated and stored into memory. A high-priority task or low-priority ISR needs to run frequently to keep the sub-frames updated. Disabling interrupts would break dithering, causing old sub-frames to repeat. This would save a lot of RAM but cost more CPU time and increase the complexity of the code.

I'm focusing on the memory heavy solution #1, as I'm not planning to drive a lot of strips using dithering directly from WLED, and this is a simpler and more robust solution as long as there's enough RAM. I plan to offload refreshing a large number of strips to a separate ESP32 on my Pixelvation Engine design.

Gamma/Color Correction (and Brightness)

If we're using double buffering for the pixel data, we'll have an uncorrected bitmap in addition to the potentially lossy pixel data, stored encoded and ready to send out to the pixels. SetPixelColor()/getPixelColor() will use the pixel data buffer for efficiency. When we call Show(), we can apply gamma/color correction and brightness to the pixel data and store it encoded for the LEDs.

If we're not using double buffering for the pixel data, then SetPixelColor()/getPixelColor() must inefficiently encode/decode the LED buffer, and decoding will likely be lossy. Show() will only shift the LED data, and not encode it first.

The simplest form of gamma correction would be to use WLED's existing 8bit->8bit conversion table, which could be passed as a pointer into the bus_wrapper class. When using LEDs with >8 bit color depth, the existing table won't be enough and an 8bit->16bit conversion table can be used instead, again passed as a pointer into bus_wrapper. For combined gamma and color correction, there can be three separate 8bit->16bit conversion tables passed in, one for each channel. I haven't thought about how RGBW LEDs might use color correction yet.
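A sketch of one such 8bit->16bit per-channel table (the gamma exponent and the per-channel scale are placeholders for whatever correction data would actually be passed into bus_wrapper):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Build one 8-bit -> 16-bit correction table for a single channel,
// combining a gamma curve with a per-channel white-point scale (0.0-1.0).
// Call three times with different scales to get R/G/B color correction.
void buildChannelTable(uint16_t table[256], float gamma, float scale) {
    for (int i = 0; i < 256; i++) {
        float v = powf(i / 255.0f, gamma) * scale;
        table[i] = (uint16_t)lroundf(v * 65535.0f);
    }
}
```

With 16-bit output there are enough codes at the low end that the gamma curve no longer collapses many inputs into one output, which is where the extra color depth from dithering pays off.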

Interpolation

Interpolation between keyframes requires triple buffering: you need the previous frame, next frame, and the sub-frame interpolated between the two. It also requires either enough memory to store all the sub-frames between previous/next, or periodic calculation of sub-frames between calls to Show(). Unless we ensured a high refresh rate minimizing the number of sub-frames, or implemented the more complex dithering solution #2, it doesn't seem easy to implement interpolation.
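For reference, the per-pixel interpolation step itself is simple fixed-point blending — the hard part discussed above is the buffering around it. A sketch (names are mine; t8 is the 0-255 position between the two keyframes):

```cpp
#include <cassert>
#include <cstdint>

// Blend one channel between the previous and next keyframe.
// t8 = 0 returns prev; t8 = 255 returns a value one step shy of next.
uint8_t lerpChannel(uint8_t prev, uint8_t next, uint8_t t8) {
    return (uint8_t)(prev + (((int16_t)next - prev) * t8 >> 8));
}
```

Cheap as this is per pixel, it still has to read both keyframe buffers and write a third, which is why it forces the triple buffering mentioned above.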

For triple buffering, the NeoPixelBus-derived class would need access to the two pixel buffers, so that's an argument to store the buffers inside the NeoPixelBus-derived class.

I don't want to completely ignore interpolation, but it's not a feature I'm going to be focusing on in the short term.

Makuna commented 2 years ago

NeoPixelBus supports both a DIB (device independent buffer) and a buffer with the same bit format as the destination bus. They don't need to match the NeoPixelBus in size, and they expose a similar API for accessing pixels. Both can be used to provide double-buffering techniques.

swifty99 commented 2 years ago

Great summary!

Here is a draft of a possible pixel pipeline with timing and RAM constraints. Feel free to adjust or change it; my understanding of some things might be wrong. Functions are green :-) PixelPipeline, edit here

[image: pixel pipeline diagram]

embedded-creations commented 2 years ago

@swifty99 Nice diagram! With the initial solution I'm proposing, I don't see a need for the renderedBuffers. colorMagic can be applied a pixel at a time when writing pixels inside NeoPixelMagicBus::SetPixelColor() (final name TBD), like it is in NeoPixelBrightnessBus. DMA Buffer will need to be much larger than 8 bit * 2, as it has to store RGB data (x3) and multiple sub-frames for dithering (x6?), and it's inefficient storage because of the peripheral (x3 - x24? depending on how you look at it).

swifty99 commented 2 years ago

Alright, all this stuff is per single LED; I will add a note. That means an RGB LED uses 3 times that amount of RAM, RGBW 4 times, and who knows who will come up with an RGBAWW version ;-) In the example, one RGBW LED needs 40 bytes of RAM. Quite a bit; however, the ESP32 has a fortune of RAM. If half the RAM is used for LEDs, that's 256k, meaning 6k+ LEDs supported. Timing-wise, with 20 IOs and about 64 LEDs per GPIO, the maximum is "only" 1280 LEDs. So I assume RAM will not be the problem.
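Writing that arithmetic out explicitly (using the comment's own assumptions: 40 bytes per RGBW LED, half of a notional 512 KB of ESP32 RAM, 20 GPIOs at ~64 LEDs each):

```cpp
#include <cassert>
#include <cstdint>

// RAM-side limit: how many LEDs fit in the budget at 40 bytes each.
constexpr uint32_t kBytesPerRgbwLed = 40;
constexpr uint32_t kRamBudget      = 256 * 1024;                   // half of 512 KB
constexpr uint32_t kMaxLedsByRam   = kRamBudget / kBytesPerRgbwLed; // 6553

// Timing-side limit: 20 GPIOs at ~64 LEDs each for dithering refresh rates.
constexpr uint32_t kMaxLedsByTiming = 20 * 64;                      // 1280
```

Since 1280 is well under 6553, the timing constraint binds first, which supports the conclusion that RAM is not the problem.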

About the buffers: I think they are needed. Of course you can apply color magic on the fly. From my understanding, if overlaying effects should be possible (and it should), and read-back of the output buffer is useful, they should access the unrendered data. A double buffer is also safer, as async read/write is hard to avoid completely. The rendered buffer will be needed to look up the data to dither or fade. Color magic could also be applied directly on the input buffer, but in that case I would assume the CPU load will increase a lot, and HSL conversions might be more complicated. Dither needs to be called at the refresh rate, so color magic would then be applied 2^x times more often (x = bit depth increase). RAM would be traded against CPU time. If my RAM calculations are right, I would invest in RAM.