Aircoookie / WLED

Control WS2812B and many more types of digital RGB LEDs with an ESP8266 or ESP32 over WiFi!
https://kno.wled.ge
MIT License
14.94k stars 3.23k forks source link

Palettes: Bad performance on 0.14.* and later #3978

Closed jw2013 closed 5 months ago

jw2013 commented 5 months ago

What happened?

I discovered, that previous WLED binaries on ESP8266 were running at 80 MHz instead of 160 MHz. Using it a lot, so I tested the newer _160 releases for my d1-mini's, but they were actually much slower! String with 600 LEDs (WS2815) was now running at 13 FPS, instead of the 20 FPS using e.g. the 0.13.3 release.

So I curiously compiled 0.13.3 using board_build.f_cpu = 160000000L, and performance went up to 25 FPS. Nice!

But why do the new releases perform so badly? I compiled basically all 0.14.* sources, and discovered that performance went downhill going from 0.14.1-b2 to 0.14.1-b3. Only real code change here was in json.cpp, and that one indeed causes the performance issue.

To Reproduce Bug

Install firmware 0.14.1-b2 and 0.14.1-b3 to compare. Setup LEDs with power limiter (WS2815), 600 LEDs (WS281x), Effect Glitter. Check FPS in info.

Expected Behavior

Expected to keep the nice FPS rate.

Install Method

Self-Compiled

What version of WLED?

WLED 0.14.1-b2 and WLED-0.14.1-b3

Which microcontroller/board are you seeing the problem on?

ESP8266

Relevant log/trace output

No response

Anything else?

No response

Code of Conduct

jw2013 commented 5 months ago

@blazoncek , why did you add the "cannot reproduce" label? It is super simple to reproduce this! For your convenience, please find attached example cfg.json and presets.json files.

wled_presets.json wled_cfg.json

600 LEDs, WS281X, using Glitter effect, on a regular ESP8266EX D1 mini module.

Using the release binaries provided here, I get the following results:

WLED_0.13.3_ESP8266.bin => 21 FPS WLED_0.14.1-b2_ESP8266.bin => 12 FPS WLED_0.14.1-b3_ESP8266.bin => 8 FPS

(self compiled WLED_0.13.3_ESP8266_160.bin => 25 FPS)

This is not just a minor performance issue, I hope you agree? And replacing all ESP8266 and ESP8285 with ESP32 modules is NOT the correct fix.

jw2013 commented 5 months ago

One further update, also checked the performance of the first release after 0.13.3, which would be 0.14.0-b1:

WLED_0.14.0-b1_ESP8266.bin => 16 FPS

But then I made a nice discovery: When changing to a different palette, FPS went up to 20 FPS, so similar to 0.13.3. Analyzed this further, turns out that the framerate drops tremendously once the "Power Limiter" kicks in! This was not the case in earlier versions. @blazoncek , any idea? Next I'm going to check the source code changes of that part.

jw2013 commented 5 months ago

The power limiter seems not be the reason, it just causes a drop of about 2 FPS, but that's the same for 0.13.3 and 0.14.0-b1. But the actual palette makes the big difference in 0.14.0-b1:

On 0.13.3, the selected palette has zero influence on the FPS, all run with 21 FPS. Any ideas?

blazoncek commented 5 months ago

I'm only considering current 0_15 branch for anything that would need investigation. So test on 0_15 branch, please.

jw2013 commented 5 months ago

Ok, I will do that! Btw, for 0.14.0-b1, I have found the issue:

The performance issue is caused by Segment::color_from_palette in FX_fcn.cpp. It calls loadPalette for each single call of color_from_palette, which is causing a high CPU load (except for palettes 6, 7, 8, 9, 10, 11 and 12, as those are always kept in memory)!

Quick fix, using a paletteCache, renders the same performance as the 0.13 series!

CRGBPalette16 curPalCache;
int curPalCacheIndex = -1;

uint32_t Segment::color_from_palette(uint16_t i, bool mapping, bool wrap, uint8_t mcol, uint8_t pbri)
{
  // default palette or no RGB support on segment
  if ((palette == 0 && mcol < NUM_COLORS) || !_isRGB) {
    uint32_t color = currentColor(mcol, colors[mcol]);
    color = gamma32(color);
    if (pbri == 255) return color;
    return RGBW32(scale8_video(R(color),pbri), scale8_video(G(color),pbri), scale8_video(B(color),pbri), scale8_video(W(color),pbri));
  }

  uint8_t paletteIndex = i;
  if (mapping && virtualLength() > 1) paletteIndex = (i*255)/(virtualLength() -1);
  if (!wrap) paletteIndex = scale8(paletteIndex, 240); //cut off blend at palette "end"
  CRGB fastled_col;

  if (transitional && _t) {
    fastled_col = ColorFromPalette(_t->_palT, paletteIndex, pbri, (strip.paletteBlend == 3)? NOBLEND:LINEARBLEND); // NOTE: paletteBlend should be global
  }
  else {
    if ( curPalCacheIndex != palette ) {
      loadPalette(curPalCache, palette);
      curPalCacheIndex = palette;
    }
    fastled_col = ColorFromPalette(curPalCache, paletteIndex, pbri, (strip.paletteBlend == 3)? NOBLEND:LINEARBLEND);
  }

  return RGBW32(fastled_col.r, fastled_col.g, fastled_col.b, 0);
}
jw2013 commented 5 months ago

Basically same problem in 0.15 and main branch. color_from_palette always calls currentPalette, and currentPalette first calls loadPalette.

To give you an idea, just using the cache made it go up from 12 FPS to 24 FPS. So the current implementation spends half of its time in loadPalette calls :-/

softhack007 commented 5 months ago

And replacing all ESP8266 and ESP8285 with ESP32 modules is NOT the correct fix.

Well for me that's a debateable (and acceptable) option 😉 I'm not saying everyone has to agree. But here's my personal reasoning:

So that's my personal perspective as a developer in a small team, and we devs usually do not receive anything in return for our commitment to WLED. At the same time, everybody wants to have a say about "what is the correct way". Sorry you are not the first (or the last) - but I really had to blow off some steam. tl;dr you've hit a nerve.

Feeels better now 🤓 So back to technical...

Technicially, the palettes problem you have discovered seems to affect all platforms. Maybe it's worse on 8266, but esp32 could also benefit from an improvement 👍 so thanks for your investigations.

jw2013 commented 5 months ago

I understand the reasoning from a personal pov, but don't necessarily agree to all points, as my personal requirements might differ a lot. I really appreciate all the work you put into WLED, and all the new features it got so far! Thanks a lot!

I've created a pull request for 0_15, but compared to earlier versions, that branch does not work nicely on ESP8266 (yet). Therefore I will also create a pull request for 0.14.3, the fix makes it run with >25 FPS again, on the 'ancient' hardware, with 600 LEDs.

One more performance issue with 0.14.3 (might be the same case for 0.15): When the power limiter forces a lower brightness, it loses about 50% of its FPS. I have found the reason already, and a quick hack resolves it, but still working on a clean solution.

Hope that helps a bit!

vPrapo commented 5 months ago

@softhack007 I want to thank you for all your efforts and I totally respect your opinion. I agree with you to an extent and yes I would certainly replace old ESP devices with newer ones for better performance and support since they're cheap and worth it. However, some devices that are not DIY (manufactured) are not easy to mess with or replace, flashing the firmware alone by overwriting the propriety firmware that has no option to OTA is a big hassle and feels a big achievement when done well and safely for e.g some E27 bulbs in the market are built with old ESP devices and they give no choice to easily upgrade the ESP chip or flash into WLED without soldering/de-soldering or so, despite how it's assembled like some are made to not be opened or closed tightly for protection with glue or whatever so there are risks and possibilities to break it or fry components while doing soldering jobs. So without background in electronics and technical knowledge it's not easy at all to achieve.

Replacing tens of bulbs can cost too much for the whole house and that isn't a practical fix in terms of both convenience and cost.. unless of course to replace only the dead ones, for my case I do have bulbs that are hard to reach and needs very high ladder size that I usually rent and tell the workers nearby for help since it's not safe distance.

I know that many things can be fixed in a hacky way and this isn't very practical or actually need to sacrifice some features to make something else works, I agree this takes time and doesn't help the software as a whole, just certain devices but.. in the same time It's kinda definitive that maxing the possibilities of an old device is challenging and most likely will open the eyes to few things that we are not aware of without facing problems such as new ways of better memory management and optimizations which normally might not come to mind with powerful devices such as esp32 because everything seems to work flawlessly.. unless we are facing issues like this one now that in return bring optimizations to the whole software.

@jw2013 Special thanks for your great find.

I do really appreciate all of your guys efforts.

robertlipe commented 5 months ago

@softhack007 : "programming for 8266 is like torturing yourself" - totally understood and agreed. Heroic efforts to keep bringing new features to old hardware just has to stop at some point. They're not rewarded. At some point, it just doesn't fit.

At the risk of being one of those that wants a say (I don't think I have any code in this project), I'll offer what's worked for many projects: Pick a point in the project and formally (and loudly) pull the plug. Version numbers are cheap. Make one that counts. Add #if defined esp8266 (or whatever) #error "Here's a nickel kid, get yourself a better computer" and link to the Dilbert cartoon. You can't stop anyone from removing that and trying to build it on their own or forking it, but after that, you can just aggressively close things as 'unsupported platform' and it's no longer YOUR problem. As devs, we've all had do to this for VAX, Win95, NT3, Qt3, SunOS 4, K&R C, i386, Python 2, and other 'OMG, this HAS to live forever' platforms. Even non-free software development doesn't have infinite resources to keep things alive forever.

I get the arguments about e-waste and $3 boards (do yourself a favor and choose replacements with PSRAM.) needing techs (that usually cost > $3) to install them, changes in assembly lines, SKUs, or whatever, but 2016 boards are happiest running 2016 code. It just can't fit forever - whether that's bytes of SRAM, Flash, microseconds in an interrupt handler, jitter for frame refreshes, or whatever. Staying on old code is an option.

Sometimes, we must trim the branches of the shrubs of the gardens we maintain so that overall growth and life can thrive.

Happy Hacking, Jack Kevorkian

P.S. That said, the analysis and premise of #3979 seems pretty sound...and since a message came in while I was typing, I'll clarify I'm not unsympathetic to that POV, either. You can make tiny changes to old branches (for a little while) that can improve those and hopefully have enough resources that everything isn't the proverbial straw on the camel's back. Need 32 bytes for an extra block? Hopefully you blew the whistle in time to still have it and BEFORE that branch picked up some new feature that they may not need anyway. My fundamental point is that there needs to be a reasonable feature/functionality freeze so new LED support and new JSON support doesn't take those last straws.

blazoncek commented 5 months ago

As someone with about 20 ESP8266's in my environment be sure I consider 8266 development seriously. But, there is a point in time (or lifecycle) when program can no longer run efficiently on an ancient hardware due to all the bells and whistles it has accumulated in that time.

Consider what features have been added to the code since 0.13.3. The features affecting this issue are mode (AKA effect) blending, palette blending and custom palettes which were inexistent in 0.13.3. If you do not need those features then there is no reason to upgrade, but if you need them you should also consider upgrading your hardware platform if you find performance lacking.

P.S. I am guessing that you are no longer using iPhone 6 (if you ever were; released at the same time as ESP8266) which cost several hundred times more and has long been made obsolete. That's for comparison.

jw2013 commented 5 months ago

@vPrapo , @softhack007 , @robertlipe , @blazoncek , thank you all for sharing your thoughts and comments!

First of all, none of the performance fixes that I work on are related to ESP8266 in any way. But slow hardware in general acts as an early warning system, in case performance gets unnoticedly sacrificed. The fact that @blazoncek added the 'cannot reproduce' label right away further proves that point.

A software upgrade reduces the performance by 50%? My first thought would not be: I need new hardware, NOW.

My personal intentions here are most likely similar to those of @vPrapo: Public art installation with tons of waterproof sealed controller boxes (custom hardware, not cheap). No way those get replaced soon.

That being said, I have been able to fix the performance issue caused by the power limiter, too. Those two updates would make 0.14.3 the best performing version for ESP8266. Will commit this fix in a minute.

jw2013 commented 5 months ago

This fix works by restoring the brightness in a 'smarter' way, similar to what 0.13.* and earlier did.

ORIG:

FIX:

ORIG renders each frame with _brightness, and changes all pixels to lower brightness in each call of show(). This creates a worst-case scenario, performance-wise.

FIX only renders the first frame with _brightness, and change all pixels to lower brightness. All subsequent frames would get rendered with the last calculated brightness. It reduces brightness of all pixels immediately if required, and increases the brightness if possible, delayed by one frame.

Let's assume effect Palette, with a color palette and LED setting that would consume more power than available in budget. The internal brightness for all frames remains constant, nearly no loss of performance. Effect breath would be the opposite, but even that one gains more FPS.

Furthermore, the change in BusDigital::setBrightness makes sure, that only decrements in brightness change pixels immediately.

Hope that helps!

jw2013 commented 5 months ago

Just did a final performance test with 0.14.3 (settings attached):

WLED_0.14.3_ESP8266.bin => 9 FPS WLED_0.14.3_ESP8266_160.bin => 10 FPS WLED_0.14.3_jwfix_ESP8266.bin => 22 FPS (includes the fixes above) WLED_0.14.3_jwfix_ESP8266_160.bin => 25 FPS (includes the fixes above + 160MHz setting)

wled_cfg(1).json wled_presets(1).json

jw2013 commented 5 months ago

For your convenience/testing, please find a compiled binary for ESP8266 (160 MHz) attached. WLED_0.14.3-jw_ESP8266_160

(maintainer edit: binary file removed, as we cannot verify its integrity)

softhack007 commented 5 months ago

ORIG renders each frame with _brightness, and changes all pixels to lower brightness in each call of show(). This creates a worst-case scenario, performance-wise.

FIX only renders the first frame with _brightness, and change all pixels to lower brightness. All subsequent frames would get rendered with the last calculated brightness. It reduces brightness of all pixels immediately if required, and increases the brightness if possible, delayed by one frame.

@jw2013 Your proposal is (unfortunately) in conflict with design decisions made earlier for 0.14 and 0.15. To achieve better color quality in getPixelColor(), it was decided to let effects render all full brightness, and then do the brightness reduction as a post-process before passing pixels to the LED driver. In 0.15 with per-output ABL, this post-process was moved entirely into the BusManager.

To keep the two topics (ABL and Palettes) separated, I'd propose you either create PR for 0_15 with your proposed changes, or you open a new "enhancement" ticket as a place to discuss and evaluate design options.

Actually the "brightness limiter" is a safety feature and the code is very tricky - so I would even ask if it's worth to touch the ABL code AT ALL - and risk safety of user's equipment - for just gaining 2fps on 8266. If you disable ABL completely (LED settings), would that also give you the 2fps win on your devices?

blazoncek commented 5 months ago

Slightly different approach taken to keep it simple.

jw2013 commented 5 months ago

@softhack007 , I haven't checked the ABL code in 0_15 yet, but focused on 0.14.3. Pretty selfish, I know ;-) Thank you for the hint regarding the design decision, makes absolutely sense imho!

Regarding my ABL patch for 0.14.3, I didn't really touch the ABL code itself, but optimized the logic how it gets applied. My fix might only delay the raise of the brightness, so in fact, it will never use more energy that the original one.

Created some new stats with ABL turned off (on):

Worst impact can been seen with the unpatched 160 MHz version.

jw2013 commented 5 months ago

Slightly different approach taken to keep it simple.

Elegant approach @blazoncek . I thought about something like that too (most of the 'old' strip stuff would actually belong into segments), my fear was that it would use way more memory (one palette object per segment). That was also the reason, why I 'abused' strip._currentPalette for caching.

valkoh commented 5 months ago
  • programming for 8266 is like torturing yourself: 8266 has a single tasking framework, it's always low on RAM, and you literally need to invent new hacks every few weeks to keep it working. Additionally, 8266 support sometimes hinders developing new cool stuff on esp32.

So, give 8266 binaries with limited features, many dont need all the bling that can be crammed there. Even make multiple different binaries where each has some of the heavy stuff but not all. Im agreeing that 8266 should not hinder esp32 possibilities. Im still running 0.14.0 maybe due to the issue related to this, after installing(haos addon) 0.14.2 on all of my d1 minis were acting weird, some started doing reset circle. Im gonna try .4 on one of the worst acting ones, it was a big hassle when most of the d1's needed to be manually flashed again, my setup uses direct wires soldered to esp pins secured under blobs of hot glue, kinda my own fault hehe.

jw2013 commented 5 months ago

@blazoncek , thank you for fixing the palette issue in 0.14.4! Could you please also use my fix for the power limiter, it gives another big boost on all platforms. https://github.com/jw2013/WLED/commit/8aa5c95daa36be7b6acfb20dd7a5ceea1a028d17

blazoncek commented 5 months ago

So, give 8266 binaries with limited features, many dont need all the bling that can be crammed there.

Many do not even know what version do they have let alone features they have unless you strip their feature away and then they come complaining.

Unfortunately limited features binaries are not happening with 0.14 and most likely not with 0.15 as well. But as the code size increases we may see specific builds at some point in the future.

If you need trimmed down 0.14 use custom compile on your own.

blazoncek commented 5 months ago

Could you please also use my fix for the power limiter

No. ABL as of 0.14 is EOL. It has been moved to bus (output) logic in 0.15 to allow per-output (multiple) limitation.

There is also work with @makuna to modify NeoPixelBus so that repainting pixels will not be necessary regardless of global buffer.

If you need performance disable ABL and provide adequate power to your LEDs or use global buffer to reduce number of pixel paints.

jw2013 commented 5 months ago

The patch was meant for 0.14 only. Even if you consider it EOL, it is still in widespread use, and I think many users will already appreciate the release of 0.14.4.

@blazoncek, just to be clear, I don't want to interfere with your 0.15 developments, and understand the design decisions made there. Actually I want you to be able to fully focus on 0.15!

I help fixing 0.14 for now, as it clearly has advantages over 0.13 for older devices, despite its performance issues (that can all be fixed as shown).

softhack007 commented 5 months ago

The patch was meant for 0.14 only. Even if you consider it EOL, it is still in widespread use, and I think many users will already appreciate the release of 0.14.4.

we had "fun" with ABL last summer after 0.14.0-b3, because the function was not working any more (side-effect after upgrading NPB) and we had to alert users for protecting their hardware against overcurrent. Lessons learned - don't touch anything related to ABL, unless you have an important reason and 6 month of beta testing afterwards.


Not sure if I'm repeating myself now: -> what is the difference in fps - on your hardware - between running 0.14.4 with ABL on, and 0.14.4 when ABL is off?

Doyle4 commented 5 months ago

@jw2013 Do you have a binary for 8266 with your custom settings at all? I Installed new 14.4 and tried 15.b3 and the 160 also makes my devices slower than the 80, its a newer model also, but 80 build is overall slow performance.

jw2013 commented 5 months ago

@Doyle4, could you please try the binary I uploaded above: https://github.com/Aircoookie/WLED/issues/3978#issuecomment-2112137496

(that binary already included my palette and ABL optimizations, 160 MHz)

jw2013 commented 5 months ago

@softhack007, I just ran the tests on your current 0.14.4 release on ESP8266, ABL turned off (on):

WLED_0.14.4_ESP8266.bin: 24 (16) FPS WLED_0.14.4_ESP8266_160.bin: 27 (19) FPS

The values without ABL match the ones of my versions, so @blazoncek's changes for the palette perform the same. https://github.com/Aircoookie/WLED/issues/3978#issuecomment-2112569858

Without my ABL optimization for 0.14, ABL reduces framerate by 8 FPS, with my patches just by 2 FPS, for both 80 and 160 MHz. Btw, all my benchmarks are easy to reproduce using my cfg.json and presets.json above, no need to connect an actual strip or PSU.

blazoncek commented 5 months ago

@jw2013 did you try with global buffer enabled?

jw2013 commented 5 months ago

For the 600 LEDs in my case, the global buffer cannot be enabled (LED memory (ERROR: Using over 4000B!))

When I reduce it to 570 LEDs, it works though, giving the same 27 FPS with or without ABL.

Understood, without the global buffer, it needs to call multiple functions per pixel (CPU caching, stack...), while with the global buffer, it's a simple operation on memory.

Doyle4 commented 5 months ago

@Doyle4, could you please try the binary I uploaded above: #3978 (comment)

(that binary already included my palette and ABL optimizations, 160 MHz)

Installed, 15b3 160 using Hiphotic effect was average 21fps, suing your build Im getting 40!

Screenshot 2024-05-20 at 01 47 55

softhack007 commented 5 months ago

15b3 160 using Hiphotic effect was average 21fps

Please also try with "use global buffer" (LED settings) enabled, using 0.14.4 160 mhz build.

blazoncek commented 5 months ago

0.15 has the same principle (speed optimisation in ABL) already implemented.

And BTW @jw2013 your patch is incorrect.

Doyle4 commented 5 months ago

15b3 160 using Hiphotic effect was average 21fps

Please also try with "use global buffer" (LED settings) enabled, using 0.14.4 160 mhz build.

Solid 41fps, set fps to 60 to see if it would increase, it didn't, but thats all good. Screenshot 2024-05-20 at 20 14 31

EDIT: Flashed 15b3 160 also, without UGB 21fps, with UGB 31fps.

jw2013 commented 5 months ago

@blazoncek, could you please spare some of your valuable time to elaborate, why it is incorrect in your opinion?

blazoncek commented 5 months ago

@jw2013 & @Doyle4 please test (if you can) #3669 and #3877 with various effect transition/blending styles applied and report (in each respective PR) how well that scales with your ESP8266 configuration as one of them will become mainstream.

valkoh commented 5 months ago

If you need trimmed down 0.14 use custom compile on your own.

How that works? Im used to doing compiles on ESPHome, so thats how far my knowledge goes. Assuming WLED doesnt have a interface similar to it, though would bring a different world to the whole WLED business. I'd drop everything else but the essentials i need.

anlosen2020 commented 1 month ago

I based on the 0.14.3 source code, using ESP32-S3 hardware. It took me 5 months to understand the code and migrate the entire architecture to ESP-IDF 5.2.1. Currently, the migration is almost complete. At the hardware level, tasks are scheduled and divided by the dual-core CPU, and LED signal port output uses hardware DMA acceleration. With 600 LEDs, in the mode enabled by dblBfr as true, it can achieve around 39 fps. After hardware acceleration, it can reach 50 fps. In the future, I will also try to improve with new approaches to see if the fps can be further increased.

anlosen2020 commented 1 month ago

CPU at 160Mhz With 600 LEDs, Enable (PSRAM Flash SPI speed at 80Mhz) in BusConfig defCfg = BusConfig() enabled by dblBfr as true, it can achieve around 45 fps. disable dblBfr it can achieve around 50 fps. 屏幕截图 2024-09-14 195135

CPU at 240Mhz With 600 LEDs, Enable (PSRAM Flash SPI speed at 80Mhz) in BusConfig defCfg = BusConfig() disable dblBfr , it can achieve around 52. I believe that the current fps measurement may not be accurate. I will continue to optimize the fps data in the future to achieve higher fps.

blazoncek commented 1 month ago

@anlosen2020 your comments are pointless as the underlying problem was resolved. S3 can do 120 FPS in regular build (depending on the particular effect used).