Aircoookie / WLED

Control WS2812B and many more types of digital RGB LEDs with an ESP8266 or ESP32 over WiFi!
https://kno.wled.ge
MIT License
14.69k stars 3.16k forks

8266: WLED keeps rebooting after 0.14.1 update. #3685

Open Trevo525 opened 8 months ago

Trevo525 commented 8 months ago

What happened?

I have two instances of WLED running on two separate ESP-12F (I believe they are 8266 based?) modules. To be specific, it's this module (not the esp32, obviously). They are wired with different types of LEDs. One is with a WS2812B LED Strip and the other is a more generic LED string that has R|G|B|12V as the inputs, as opposed to 5V|Data|Ground that the first has. I'm not sure that will make a difference. But, I included it as it might be important to note. I just got them both running a week or two ago with WLED 0.14.0 and added them to Home Assistant. Everything worked as expected, I have been using presets and playing with the effects and colors on both. I even have a

However, I updated to 0.14.1 today, and the ESP connected to the generic LED strip started turning off whenever I changed the color. It goes dark for a split second, and then the light switches back to the default orange color. I kept testing and it kept happening. Then I noticed that right after this happens, the web interface is unresponsive for a moment. This leads me to believe the device is restarting.

I have been able to fix this for now by going to the update section and flashing 0.14.0 again. If I can give any assistance in tracking this down, feel free to reach out; I will put 0.14.1 back on it if there are logs or anything else I can provide.

To Reproduce Bug

Update to 0.14.1. Press almost any button in the interface.

Expected Behavior

I would have expected it not to crash.

Install Method

Binary from WLED.me

What version of WLED?

WLED 0.14.1

Which microcontroller/board are you seeing the problem on?

ESP8266

Relevant log/trace output

No response

Anything else?

No response

AKHwyJunkie commented 8 months ago

In the FWIW department, I'm also seeing this same behavior in Athom bulbs as well. (I'm using the recommended ESP02 image, happens across all bulb models.) In case it helps, I noticed this issue started in 0.14.1-B3 and did not occur in 0.14.1-B2, at least in my case. I figured this might have been related to the JSON buffer lock issue, but it looks like not. I can trigger it by changing profiles, either via the web interface or via Home Assistant. I don't believe it's configuration related as I tried a full factory reset in B3.

chertvl commented 8 months ago

Same with 8266. Continuously goes to Unavailable

Screenshot_20240115-064150_Home Assistant

AngusMcT commented 8 months ago

Have the same problem. Just updated through Home Assistant, and have the same symptoms as OP.

blazoncek commented 8 months ago

Please remove Home Assistant integration and see if the problems persist. If they don't you may want to upgrade to ESP32 or get a special build without various features to get more free RAM on ESP8266.

BTW one way to see if WLED restarted is in Info dialog, Uptime field.

dosipod commented 8 months ago

I do not use ESP8266 (4MB, 2MB or 1MB) in a production setup, but I do have a lot of them around to replicate such issues. If cfg.json and presets.json are provided, we could do so.

I have flashed two ESP8266 4MB units since the first hour of the 0.14.1 release and kept them on debug bins. I did not notice anything strange, nor did I see a disconnection, reboot or crash in the log.

As of 1 hour ago I have added one of them to HA with a simple automation (which only sends an alert if the unit is on/off), and I can see the unit disconnecting from WiFi (ping is lost), but I could not get it to behave the same way consistently.

I blame the HA integration but cannot confirm.

blazoncek commented 8 months ago

@chertvl down-voting will not help resolving the issue.

Doyle4 commented 8 months ago

Running fine on ESP32 S2 mini; will test on an ESP8266 device later when I can.

chertvl commented 8 months ago

@chertvl down-voting will not help resolving the issue.

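Never mind. I already downgraded to 0.14.0 and that works perfectly.

About "not help resolving issue", it's:

I now have more time to describe the symptoms. After updating an 8266-based device using HA from version 0.14.0 to 0.14.1:

  • The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.
  • APIs do not work, including HA integration.
  • It can be seen that the device reboots every few minutes, and could not turn on normally. He's missing something, maybe memory.
  • The router reports that the device is connected, the uptime is stable, there are no reconnections.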

mxilievski commented 8 months ago

Same here, updated 3 8266-based devices. They can’t be accessed via Web.

Doyle4 commented 8 months ago

How many LEDs are you guys using? I flashed a couple of ESP8266s from B3 to the released 0.14.1, with no more than 100 LEDs, and they are working fine. BUT I don't use HA at all, so I can't help on that side, sorry.

photobix commented 8 months ago

Same problem on 4 instances, with between 80 and 278 LEDs on WEMOS D1 Mini (8266). Even OTA updates no longer work without problems; I had to flash 3 instances via USB. Apparently the update runs into a timeout.

WarC0zes commented 8 months ago

Same problem on Atom Matrix. I use Home Assistant and a RESTful command. Since updating to version 0.14.1, I receive this error.

Logger: homeassistant.components.rest_command
Source: components/rest_command/__init__.py:166
Integration: RESTful Command ([documentation](https://www.home-assistant.io/integrations/rest_command), [issues](https://github.com/home-assistant/core/issues?q=is%3Aissue+is%3Aopen+label%3A%22integration%3A+rest_command%22))
First occurred: 06:19:45 (13 occurrences)
Last logged: 10:33:48

Client error. Url: http://192.168.1.xx/json/state. Error: Server disconnected

I reverted to version 0.14.0 and I no longer have errors.

mxilievski commented 8 months ago

I reverted to version 0.14.0 and I no longer have errors.

How did you revert?

WarC0zes commented 8 months ago

I reverted to version 0.14.0 and I no longer have errors.

How did you revert?

I downloaded the 0.14.0 firmware (.bin). Then you connect to the ESP through the browser, go to Settings > Security & Updates, and click on Manual OTA Update. There you select the firmware file and update.

softhack007 commented 8 months ago

I now have more time to describe the symptoms. After updating an 8266-based device using HA from version 0.14.0 to 0.14.1:

  • The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.
  • APIs do not work, including HA integration.
  • It can be seen that the device reboots every few minutes, and could not turn on normally. He's missing something, maybe memory.
  • The router reports that the device is connected, the uptime is stable, there are no reconnections.

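@blazoncek a few thoughts on commonalities in user reports

  • the only real change for 0.14.1 is the modified locking mechanism for WebSocket API
  • some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS
  • some problems include WDT reset (watchdog = potential infinite loop)
  • also web responses are sometimes affected ("takes ages")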

We have to remember that WS responses are not running in the Arduino context; on ESP32 they run inside the async_tcp task, and I'm not sure how it's implemented on 8266.

I think there are a few dangerous lines in the code to lock the JSON buffer

https://github.com/Aircoookie/WLED/blob/a4a8e2614ea2b8479bb33fc53ac8ca2912f9df2c/wled00/util.cpp#L205


@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it with

    if (jsonBufferLock) return false;

It's a temporary hack and not a proper solution, but it should help us understand whether using delay() and millis() on 8266 is the problem. If this hack helps, then I'll take some time in the next few days to implement a proper solution for requestJSONBufferLock() without busy-waiting.
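
To make the difference concrete, here is a minimal host-side sketch contrasting a busy-waiting lock request with the fail-fast variant suggested above. Only the `jsonBufferLock` flag name comes from this thread; `fakeMillis()`, `fakeDelay()`, `requestLockBusyWait()`, `requestLockNoWait()` and `releaseLock()` are hypothetical stand-ins for illustration and are not WLED's actual implementation.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the Arduino millis()/delay() pair so the
// sketch compiles on a desktop. They are NOT the real Arduino calls.
static uint32_t fakeClockMs = 0;
uint32_t fakeMillis() { return fakeClockMs; }
void fakeDelay(uint32_t ms) { fakeClockMs += ms; }

// Flag name taken from the thread; a non-zero value means "buffer in use".
static volatile uint8_t jsonBufferLock = 0;

// Busy-waiting variant (assumed shape): spins for up to timeoutMs while
// another caller holds the lock. If the holder cannot run while we spin,
// the wait can only end by timing out. `module` must be non-zero.
bool requestLockBusyWait(uint8_t module, uint32_t timeoutMs) {
  const uint32_t start = fakeMillis();
  while (jsonBufferLock && fakeMillis() - start < timeoutMs) fakeDelay(1);
  if (jsonBufferLock) return false;  // still held after the timeout
  jsonBufferLock = module;
  return true;
}

// Fail-fast variant matching the one-line hack above: never waits.
bool requestLockNoWait(uint8_t module) {
  if (jsonBufferLock) return false;
  jsonBufferLock = module;
  return true;
}

void releaseLock() { jsonBufferLock = 0; }
```

With the lock already held, `requestLockBusyWait(2, 1000)` burns the full 1000 ms of simulated time before giving up, while `requestLockNoWait(2)` returns `false` immediately. The trade-off is that the fail-fast caller has to cope with a refused request instead of waiting for the buffer.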

softhack007 commented 8 months ago

🔺 On a different topic that goes to all who commented and contribute to this thread:

Please stop this thumbs-up thumbs-down BS. We are trying to analyse a problem and need you as users who must help us. It does not really help if you just express fuzzy feelings with thumbs.

image

image

We are trying to do engineering work here, not to entertain fans in the roman circus.

I'm really tired of playing guessing games with emoji.

Use words, instead of throwing tags onto the wall. please.

asolochek commented 8 months ago

I noticed this same behavior on my athom rgbw controller which is paired to home assistant.

After upgrading earlier in the afternoon everything seemed fine, but when I went to turn my lights off I noticed the wled controller wasn't responding. I tried a few times to turn them off via home assistant, and somehow got it stuck in a reboot loop that caused the leds to blink off every 30 seconds or so.

I was able to stop this by turning them off via the web UI and reverted to 0.14.0 and it's working again.

chertvl commented 8 months ago

@chertvl @WarC0zes @Doyle4 if my understanding is right, it could help if you comment out the line I quoted, and replace it

Thanks for the detailed explanation. I tried to compile the firmware for the first time using the instructions at https://kno.wled.ge/advanced/compiling-wled/

I followed your steps, commented out the required line, and added a new one. It seemed like I did everything right, but, unfortunately, it didn’t help. The web interface still cannot load properly, or does not load at all. Sometimes it’s possible to view the status via JSON. The physical button control on the board works. The behavior has not changed. PS: HA integration was disabled before all of this.

Below are some screenshots:

image image image image image image

chertvl commented 8 months ago

unfortunately, it didn’t help.

It may have gotten worse. Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused. Last time I miraculously succeeded, but now I don’t.

Unfortunately, my device doesn't have a UART, and I don't have one at home either. So continue the tests without me until I find a UART to restore the device... Thanks for understanding.

softhack007 commented 8 months ago

Now I do not have enough time to update the firmware via OTA, browser gives err_connection_refused. So continue the tests without me until I find a UART to restore the device... Thanks for understanding.

Thanks for helping as much as you could 🥇 and sorry about making it worse for you.

About the UART: if GPIO 1 and 3 are accessible on your board, then a standard "USB-to-TTL" adapter is all you need. Like this one that's using a CH340G: https://amzn.eu/d/fZChiyZ

... or this one that's specifically made for "ESP-01S": https://amzn.eu/d/2CEAFUb

You'll also find them for cheap on Ali.

blazoncek commented 8 months ago

* the only real change for 0.14.1 is the modified locking mechanism for WebSocket API

There were more changes than this. And it is not for websockets but for HTTP requests. Foremost we added PIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48 to circumvent full IRAM condition. This may cause slowness in non LED display functions. Mode blending was introduced in 0.14.1-a1. It can use a lot of memory and CPU on its own.

IMO, and my own testing showed this, the new locking mechanism only improved stability and reduced memory corruption.

* some people said that problems disappeared with -DWLED_DISABLE_WEBSOCKETS

WebSockets need plenty of heap. Constantly. Disabling them can only improve things, at the expense of a stale UI.

* some problems include WDT reset (watchdog = potential infinite loop)

I've seen WDT in non-WLED code. How to avoid it? Have no clue. Async* stuff (web server and TCP and UDP) are interrupt driven on ESP8266.

* also web responses are sometimes affected ("takes ages")

This may be attributed to a more susceptible WiFi code in newer Arduino core we use with 0.14 (I've posted my own experience in another issue detailing the resolution).

All in all, IMO if you want to run 0.14.x on ESP8266 you need to make a few compromises. Why? Because with only 16kB of RAM available (after boot) it can get crowded rather quickly in the heap.

I am going to post the ESP8266 configuration I use on my ESP01 devices, of which I have plenty in daily use. Unfortunately that configuration may not work for some people as it strips quite a few features out, but it produces a reliable, working ESP8266 environment.

[env:esp01_4m]
extends = env:esp01_1m_full
board_build.filesystem = littlefs
board_build.ldscript = ${common.ldscript_4m1m}
board_build.f_cpu = 160000000L
build_flags = ${common.build_flags_esp8266}
  -DPIO_FRAMEWORK_ARDUINO_MMU_CACHE16_IRAM48
  -D LED_BUILTIN=2
  -D WLED_DISABLE_ALEXA
  -D WLED_DISABLE_HUESYNC
  -D WLED_DISABLE_LOXONE
  -D WLED_DISABLE_ADALIGHT
  -D WLED_DISABLE_MQTT
  -D WLED_DISABLE_2D
  -D WLED_DISABLE_PXMAGIC
  -D WLED_USE_UNREAL_MATH
  -D WLED_MAX_BUSSES=2
  -D LEDPIN=2
  -D USERMOD_PIRSWITCH
  -D PIR_SENSOR_PIN=3
  -D PIR_SENSOR_OFF_SEC=60
  -UWLED_USE_MY_CONFIG

My ESP01s use 4MB flash so they can be updated OTA.

If we explore the possibility of swapping the ESP8266 (in Wemos D1 mini format) for an alternative (cheap) device (which I also did), I would recommend the Lolin ESP32-S2 D1 mini with 4MB flash and 2MB PSRAM. I've also posted build environments for that elsewhere, but the stock WLED doesn't differ much.

And for clarification: I will not pursue resolving this issue any more, since the ESP8266 just does not have enough resources to smoothly run everything 0.14 offers. If anybody insists on running a fully built 0.14 with an external system like Home Assistant, Alexa or Hue and MQTT, I would urge them to reconsider and build a special version with other features stripped away.

softhack007 commented 8 months ago

@blazoncek thanks for your thoughts, and I completely forgot about "Mode blending" and other additions that really increase RAM and CPU needs.

It seems my idea about requestJSONBufferLock() did not improve it. So agreed, it could be a general issue with low RAM. Even when users see free RAM, it might be heavily fragmented - I've seen examples where the largest available block was less than 10% of total free space.
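
The fragmentation point can be shown with a little arithmetic. The sketch below is a host-side illustration only: `largestBlockPercent` is a hypothetical helper (not a WLED or Arduino core function), and the block sizes in the usage note are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: given the sizes (in bytes) of all free heap blocks,
// return the largest block as a percentage of the total free space.
// A low value means free RAM exists but is split into small pieces, so a
// large allocation can fail even though "free heap" looks healthy.
double largestBlockPercent(const std::vector<std::size_t>& freeBlocks) {
  if (freeBlocks.empty()) return 0.0;
  const std::size_t total =
      std::accumulate(freeBlocks.begin(), freeBlocks.end(), std::size_t{0});
  if (total == 0) return 0.0;
  const std::size_t largest =
      *std::max_element(freeBlocks.begin(), freeBlocks.end());
  return 100.0 * static_cast<double>(largest) / static_cast<double>(total);
}
```

For example, twenty free blocks of 800 bytes each add up to 16000 bytes of free heap, yet the largest block is only 5% of that, so a single 1 kB allocation would fail.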

I guess we need serial monitor logs from debug builds to find out if something can be done to improve 8266 performance - or maybe nothing can be done, and we'll soon declare 8266 as "half-dead" 😉 aka deprecated....

Edit: a few more "disable" flags to try out:

.... and a simple one: go to LEDs settings, uncheck "Use global LED buffer"

blazoncek commented 8 months ago

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

If you are not using GPIO1, GPIO2 or GPIO3 for digital LED output, then the CPU has to keep feeding the LEDs. This in turn reduces performance for everything else.

If you use PWM LEDs, make sure you only use GPIO4, GPIO12, GPIO14 or GPIO15 (as specified by the Espressif technical documentation, https://www.espressif.com/sites/default/files/documentation/esp8266-technical_reference_en.pdf). Do not forget that the PWM signal requires NMI to be driven, hence it uses CPU.
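
As a quick reference, the pin guidance above can be written down as two checks. This is a minimal sketch that encodes only the pin numbers stated in this comment; the function names are hypothetical and do not exist in WLED or NeoPixelBus.

```cpp
// ESP8266 pins on which, per the comment above, digital LED output does
// not need the CPU to keep feeding the LEDs. Function name is hypothetical.
constexpr bool isOffloadedDigitalLedPin(int gpio) {
  return gpio == 1 || gpio == 2 || gpio == 3;
}

// ESP8266 pins recommended above for PWM (analog) LEDs.
// Function name is hypothetical.
constexpr bool isRecommendedPwmPin(int gpio) {
  return gpio == 4 || gpio == 12 || gpio == 14 || gpio == 15;
}
```

By this rule, a digital strip on GPIO4 is driven by the CPU, while the same strip on GPIO2 is not.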

willmmiles commented 8 months ago

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

My test case here is a single strip of 110 WS2812Bs, using a 0_15 branch derived build. Bit-banging for this many LEDs can take several milliseconds with interrupts disabled, which I believe can overflow some of the wifi hardware queues, depending on the amount of traffic on the network. I'm working on hacking some of the interrupt tolerance ideas from FastLED into NeoPixelBus to see if I can mitigate it.
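
The "several milliseconds" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes nominal WS2812B timing (800 kHz data rate, 24 bits per RGB pixel), which are datasheet figures rather than measurements from this thread; `frameTimeUs` is a hypothetical helper.

```cpp
#include <cstdint>

// Nominal WS2812B figures (datasheet values, treated here as assumptions):
// 800 kHz data rate, i.e. 1250 ns per bit, and 24 bits per RGB pixel.
constexpr uint32_t kNsPerBit = 1250;
constexpr uint32_t kBitsPerRgbPixel = 24;

// Time in microseconds needed to clock out one frame, excluding the
// reset/latch gap that follows it.
constexpr uint32_t frameTimeUs(uint32_t ledCount) {
  return ledCount * kBitsPerRgbPixel * kNsPerBit / 1000;
}
```

For the 110-LED strip this gives 3300 µs, so roughly 3.3 ms per frame during which a bit-banging driver keeps interrupts disabled.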

If a setup has more LEDs on a bit-banging pin, or a busier network, it might trip problems sooner. Sometimes this might manifest as hard reboots like I'm seeing; it's also possible it manifests as a wifi disconnect. (I'm actually rather surprised I haven't seen that in my testing, to be honest).

I will try a 0.14.1 build tonight and see if it behaves differently for me than the 0_15 development branch. It's quite possible this is a different issue than the one I've been chasing.

afflux commented 8 months ago

Regarding WDT resets: I have received a word from @willmmiles (whom I consider one of the most technically skilled developers that touched WLED code) that he has traced WDT resets into NeoPixelBus code consuming too much time bitbanging data out.

FWIW, I'm seeing occasional resets on 8266 with 0.14.1 and use LPD8806, so no bitbanging involved. (But it's way rarer than what people are reporting here, I have 48h uptime right now)

blazoncek commented 8 months ago

use LPD8806, so no bitbanging involved

how do you know it is not? If you are using GPIO13 & GPIO14 then yes, it uses HW to accelerate output; otherwise you are using SW (CPU) to drive clock and data.

afflux commented 8 months ago

how do you know it is not?

Because I explicitly checked the source when I set it up, and therefore assigned data to GPIO13 and clk to GPIO14.

blazoncek commented 8 months ago

Because I explicitly checked the source when I set it up, and therefore assigned data to GPIO13 and clk to GPIO14.

Good. Now please try to capture a crash dump on serial if you can.

afflux commented 8 months ago

I'll have a look. Do I need to compile a debug build for this, or will the normal esp8266 crashdump suffice?

blazoncek commented 8 months ago

That would be best. And please add exception decoder to build environment.

blazoncek commented 8 months ago

FYI no issues whatsoever. Lolin Wemos D1 mini. 150 LEDs.

Screenshot 2024-01-18 at 22 16 02

Nunak commented 8 months ago

I have the same issue: after upgrading to 14.1, my Wemos D1 minis (ESP8266) are rebooting and unstable. When I downgraded to 14.1 b1 the issue disappeared. I have this issue on all my strips, each of which has a different number of LEDs.

incarvr6 commented 8 months ago

Just my observations, similar issues as OP in my case on flashed DS Sidekicks using 8266 3 LED's PWM, dropping UDP connection from HyperHDR and rebooting. Flashing back to 14.0 resolves issues.

blazoncek commented 8 months ago

Anyone interested in trying to fix this issue is welcome to join Discord and ping me in the beta-testing channel. I will provide a new release binary (0.14.1; compiled on a different system) for you to try out. Apparently there are differences in output binaries depending on what system they were built on. This is a temporary measure until we figure out what is happening during build time.

TobiVanHelsinki commented 8 months ago

* The WLED web page takes forever to load, sometimes some elements will be drawn, but very rarely, most often the error is err_connection_refused.

* It can be seen that the device reboots every few minutes, and could not turn on normally. He's missing something, maybe memory.

I had the same problem and found a workaround for me by disabling the white channel(s). I can reproduce the error.

I have Athom lightbulbs (GU10 and 15W Color Bulb) with ESP8285; they are PWM-controlled RGBWW LEDs. Whenever I activate the white channel, they reset. If I just use RGB, they run 12h and more. I have tried various options like "calculate CCT from RGB", but that never changed the behaviour. My free heap is 19 kB.

blazoncek commented 8 months ago

@TobiVanHelsinki that could mean poorly designed power supply on the device.

TobiVanHelsinki commented 8 months ago

@blazoncek I agree but I reduced the power limiter from 850 mA (built in and worked in 13.3) to 250mA. Still when I activate W channels, it will crash. Also I can set RGB=255 && W=0 and it will run long. As soon as I do RGB=0 && W=1 it will crash. And with 13.3 this never happened.

I mean, doesn't this sound more like a software bug?

Edit: fix grammar.

blazoncek commented 8 months ago

Please provide crash dump. Otherwise we cannot be sure.

TobiVanHelsinki commented 8 months ago

I'd love to do that. But those bulbs have no USB or anything like it. Is there any other way to get the dump?

BTW I thought that v0.13.3 showed the reboot cause. I can't find this in v0.14.1 anymore. Is it gone?

jeeftor commented 8 months ago

How can I pull crash logs? Seeing same issue

image

Running on gpio4

lelemm commented 8 months ago

Just to add information about this topic:

I have an Athom bulb (15W RGB). With this latest version, 0.14.1, it sometimes stays stable enough to keep the lights on. Whenever someone asks Alexa to change the light color, it changes the color and then reboots. Interestingly enough, if I ask Alexa to change color BEFORE turning the lights on, it works most of the time.

I reverted the firmware to 0.14.0, the problem is gone.

iNaiks commented 8 months ago

Hi, same here. I have an Athom.tech "RGBCCT Analog and Addressable Digital Strip Controller" with PWM RGBW LEDs. After updating to 14.1 (with the .bin.gz, because the .bin didn't work), the light reboots nearly every time I change the color. I reverted to 14.0 and it works perfectly.

jeeftor commented 7 months ago

I downgraded to 14.0 and I think I'm still seeing issues unfortunately

(Home assistant) image

image

kenni commented 7 months ago

I've just tested the new v0.14.2-b1 with the JSON buffer guard fix, and it does NOT fix this issue.

I do not have any integrations yet (Home Assistant, etc.). This is a new WS2815 LED strip on an Athom LS-4P controller; the only thing I have done is upgrade first to v0.14.1 and afterwards to v0.14.2-b1.

I can consistently make it crash within seconds by just loading WLED in a browser on my phone and changing solid colors by clicking in the color palette 3-10 times.

I downgraded to v0.14.0, did not power cycle the device (I just let it reboot itself or whatever it does during the upgrade/downgrade process) and I then experienced one crash/reboot after ~5 seconds, after which it got completely stable. I played around with it for 8-10 minutes with no crashes. On v0.14.1 I do not think I was able to keep it alive for more than 30 seconds maximum.

When I have some time available, I might try to do a Git bisect between v0.14.0 and v0.14.1. I really do not want to take the controller apart if I can avoid it.

willmmiles commented 7 months ago

@kenni I believe that may be a different underlying issue than the JSON fix in 0.14.2. I've reproduced a problem triggered by changing palette colors over websockets myself tonight, I'm debugging it now.

blazoncek commented 7 months ago

I did a comparison (for science). Pay attention to free heap. Screenshot 2024-02-21 at 19 32 46 Screenshot 2024-02-21 at 19 50 41 Screenshot 2024-02-21 at 19 41 59 Screenshot 2024-02-21 at 19 36 23 Screenshot 2024-02-21 at 19 34 29

rohrsh commented 7 months ago

Same-sies. I have 2x Athom 8266 controllers with WS2812x strips. I also had them connected to Home Assistant. Web interface choppy and many reboots. Happy to help test fixes.

jeeftor commented 7 months ago

Anybody know how to pull logs w/out being plugged into the board? I'm assuming its not possible

softhack007 commented 7 months ago

Anybody know how to pull logs w/out being plugged into the board? I'm assuming its not possible

You're right, that's not possible. You need a USB connection and a serial terminal program (like platformIO serial monitor) to see logs. It's best to use a debug build. You cannot retrieve "logs" from the past, you will only see errors while they happen.

You can find a few generic debug builds here: https://github.com/srg74/WLED-wemos-shield/tree/master/resources/Firmware/%40Aircoookie/Latest/Debug_builds

blazoncek commented 7 months ago

For anyone interested: I posted binaries of 0.14.2-b2, which is intended to solve this issue, in the #beta-testing channel on Discord. It includes AWS modifications from @willmmiles and a faster CPU clock.