forkineye / ESPixelStick

Firmware for the ESPixelStick
http://forkineye.com/

Crash after 4-6 Minutes - memory leak? #682

Closed XMP closed 8 months ago

XMP commented 8 months ago

ESPixelStick Firmware Version: ESPixelStick CI #529

Hardware Version: D1 Mini ESP 32, Generic ESP 32, Generic ESP 32 with SD, D1 Mini (8266)

Binary release or compiled yourself? Release Archive from CI

Operating System (and version): W11P 22H2

Web Browser (and version): FF119

Access Point: AVM 1750E

Describe the bug: Flash the downloaded binary to the ESP and just wait 4-6 minutes. After that, nothing responds.

You do not have to configure any inputs or outputs, but you can speed up the process if you press reload in your browser.

After start up the free heap is 165-175k - every second there is more and more memory allocated. If the free heap hits a value at around 74k the ESP completely freezes without any message at the serial console. Sometimes it reaches 54k free heap if you do not interact with the ESP.
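As a rough sanity check on these numbers, the average leak rate can be estimated from two heap readings and the elapsed time. This is only a back-of-the-envelope sketch in Python. It assumes "k" means 1000 bytes, takes 170k as the midpoint of the reported 165-175k starting heap, and uses 5 minutes as the midpoint of the 4-6 minute window. The function names are hypothetical and are not part of ESPixelStick.

```python
def leak_rate(start_free: int, end_free: int, seconds: float) -> float:
    """Average rate at which free heap shrinks, in bytes per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (start_free - end_free) / seconds


def seconds_until(free_now: int, floor: int, rate: float) -> float:
    """Time until free heap reaches `floor` at a constant leak `rate`."""
    if rate <= 0:
        return float("inf")  # heap is stable or growing: no projected freeze
    return max(free_now - floor, 0) / rate


# Assumed readings based on the report: ~170k free at boot, freeze near 74k
# after roughly 5 minutes.
rate = leak_rate(170_000, 74_000, 5 * 60)
print(f"{rate:.0f} bytes/s")                                      # 320 bytes/s
print(f"{seconds_until(170_000, 74_000, rate):.0f} s to freeze")  # 300 s
```

Under these assumptions the heap shrinks by about 320 bytes per second, which would be consistent with a small allocation leaking on a periodic event rather than one large allocation.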

I tried different builds, with and without Ethernet, and hardware with and without SD. Nothing changes: the ESP crashes after a few minutes every time. It's the same with ESPixelStick CI #527.

MartinMueller2003 commented 8 months ago

Please add a screenshot of the admin page. I just ran for multiple hours with no issue.

XMP commented 8 months ago

Hey Martin, thank you for your quick reply.

[two screenshots]

If I can do anything else, please let me know.

XMP commented 8 months ago

The D1 Mini ESP32 ran for 7 minutes before it froze: [screenshot]

The Generic ESP32 with SD crashed after 5 minutes: [screenshot]

I can reproduce this all the time.

XMP commented 8 months ago

Okay, very strange - after trying everything else, I changed the WiFi connection. Instead of connecting to the show network via the AVM 1750E, I connected the ESP to my home network, which is run by an AVM 7590 and a 7530 AX.

The ESP starts with 166k free heap ... after a page reload, it was only 153k, and after 1-2 seconds it was back on 166k!? No free memory below 150k. Everything is fine, still 166k after 20 Minutes.

Any ideas what I should try or change on the show network? Both networks are 2.4 GHz with WPA+WPA2 🤷‍♂️

XMP commented 8 months ago

I changed the Access Point from AVM 1750E to TP-Link TL-WR841N. Same issue - 3x D1 Mini 32, 1x ESP32 - 2x v4b4, 2x dev-CI from yesterday:

[three screenshots]

No clue what's going on here :-(

MartinMueller2003 commented 8 months ago

No idea. That is bizarre. Not sure how I would debug that. Can you leave the crashing device connected to the USB cable with the flash tool active? After a crash, save the log from the flash tool and attach it to this conversation.

XMP commented 8 months ago

Thanks for your help, Martin!

No message after the ESP crashes. Brand new D1 Mini 32 - flash erased twice with esptool and then flashed with the dev binary. crashlog1.txt

Crashed ~7m20 after startup: [ten screenshots]

I also ran another Test on my normal home network - it's the same issue, so I think it is Router/AP independent.

MartinMueller2003 commented 8 months ago

There is a minor error here, but it should not cause what you are seeing. You are using an image that expects PSRAM and the device you are using does not have PSRAM installed. This can cause some WiFi issues. It seems not all D1 Mini implementations are the same.

XMP commented 8 months ago

Okay, I will try another Image.

I flashed the binary to a D1 Mini (ESP8266) and got some messages before the device rebooted:

crashlog2_esp8266.txt

MartinMueller2003 commented 8 months ago

That indicates a memory allocation failed and the system could not recover.

XMP commented 8 months ago

Okay, now using the Image for D1 Mini32 Twilight Lord - which is not using PSRAM:

crashlog3_Mini32_TwilightLord_noPSRAM.txt

MartinMueller2003 commented 8 months ago

This gives a little more info.

14:48:32: [WiFiDrv] WiFi Entering State: Connected To AP
14:48:33: [WiFiDrv] Connected with IP: 192.168.2.123
14:49:02: [WiFiDrv] WiFi Lost the connection to the AP
14:55:58: [WiFiDrv] WiFi Entering State: Connection Failed
14:55:58: Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

We see that after 30s the application got notified that there is a problem at the WiFi level and it took 6 minutes to get the final notification that the connection could not be reestablished. At that point the system was in a bad way and died. I will review that code to see if there is anything odd going on.
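The gaps between those log lines can be computed rather than read off by eye. The following Python sketch parses the five lines quoted in this comment and prints the delay before each message. It assumes the `HH:MM:SS:` prefixes are accurate and that the log does not cross midnight. Neither is guaranteed, since the timestamps may be added by the tool that captures the serial output. `parse` and `gaps` are hypothetical helper names.

```python
import re
from datetime import datetime

LOG = """\
14:48:32: [WiFiDrv] WiFi Entering State: Connected To AP
14:48:33: [WiFiDrv] Connected with IP: 192.168.2.123
14:49:02: [WiFiDrv] WiFi Lost the connection to the AP
14:55:58: [WiFiDrv] WiFi Entering State: Connection Failed
14:55:58: Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled."""

# Timestamp prefix, then the rest of the line as the message.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}): (.*)$")


def parse(log):
    """Return (time, message) pairs for every line with a timestamp prefix."""
    entries = []
    for line in log.splitlines():
        match = LINE.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            entries.append((when, match.group(2)))
    return entries


def gaps(entries):
    """Return (seconds since previous entry, message) for entries 2..n."""
    return [
        (int((later[0] - earlier[0]).total_seconds()), later[1])
        for earlier, later in zip(entries, entries[1:])
    ]


for seconds, message in gaps(parse(LOG)):
    print(f"+{seconds:>3}s  {message}")
```

With these five lines the connection loss is reported 29 s after the IP was assigned, and the "Connection Failed" state follows 416 s (6 min 56 s) later, immediately before the panic.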

XMP commented 8 months ago

But this might be an inaccurate timestamp - I'm pretty sure the two WiFiDrv messages popped up at the same time.

Here is another log from a D1 Mini 32 with the D1 DevkitC firmware (again no PSRAM). The first crash occurs after 7m30, but the log is clean. Then I did a reset (lines 15:08:19 and 15:16:16). I am 100% sure I pressed the reset button at 15:16:16 and not at 15:08:19, as the log suggests.

After startup, the ESP rebooted again (!?), and after that it took the usual ~7 minutes until it froze completely.

crashlog4_D1DevkitC.txt

MartinMueller2003 commented 8 months ago

So this is a crash I have seen:

15:16:24: E (19938) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
15:16:36: E (19938) task_wdt: - async_tcp (CPU 0/1)
15:16:36: E (19938) task_wdt: Tasks currently running:
15:16:36: E (19938) task_wdt: CPU 0: IDLE
15:16:36: E (19938) task_wdt: CPU 1: loopTask
15:16:36: E (19938) task_wdt: Aborting.
15:16:36:

It is due to an issue in the ESPAsync handlers. I have not gotten it to happen reliably. One side effect that is not being logged is an endless loop of HTTP messages being fired at the application that don't do anything. Then, after a while, the watchdog fires and resets the system. It was most often triggered by receiving a UDP message from the FPP.

XMP commented 8 months ago

After shutting down the FPP, the D1DevkitC Image on D1 Mini 32 is running for more than 20 minutes 🥳 - free heap always above 180k.

I'm now starting the FPP on the RPi3 ... at first nothing happens.

Switching to the multisync page (and keeping it open) - free heap is getting lower and lower.

After switching back to the status page in FPP, the free heap stays at that level (for example 120k): it does not get any lower, but it also does not recover back to 180k.

The problem is caused by this option: [screenshot]
If this is enabled, the ESP runs out of memory after ~7 minutes.

FPP Version: 7.2-2-ge35c0df7
Platform: Raspberry Pi (Pi 3 Model B)
FPP OS Build: v2023-08
OS Version: Raspbian GNU/Linux 11 (bullseye)

EDIT: Downgraded to FPP 7.1 - it's the same problem.

So, is this an FPP or an ESPixelStick issue?

MartinMueller2003 commented 8 months ago

This is an ESP Issue. FYI: Using my FPP 5.3 Master, I am not seeing this. I have an FPP 7 that I use for testing strings of lights. I think it is going to be a lab master soon. :)

MartinMueller2003 commented 8 months ago

Reproduced. FPP 7 is giving us heartburn. Don't know what it is doing exactly, but it most certainly is leaking memory. A visual inspection of the code does not show where. More debugging to be done.

XMP commented 8 months ago

I tested some FPP releases:
6.3 - okay
7.0 - nope
7.1 - nope
7.2 - nope

So something must have been changed with FPP 7.0. 🤷‍♂️
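The reasoning behind that conclusion can be written down as a small check: sort the tested releases, find the oldest failing one, and confirm that nothing newer passes. The Python sketch below uses only the four results reported in this comment. `version_key` and `first_bad` are hypothetical helper names, and the version parsing assumes plain dotted numbers such as `7.0`.

```python
from typing import Dict, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '7.1' into (7, 1) so releases sort numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def first_bad(results: Dict[str, bool]) -> Optional[str]:
    """Return the oldest failing release if every newer release also fails.

    Returns None when nothing fails or when a newer release passes again,
    because then the results do not point at a single regression.
    """
    ordered = sorted(results, key=version_key)
    for index, version in enumerate(ordered):
        if not results[version]:
            if all(not results[later] for later in ordered[index:]):
                return version
            return None
    return None


# True means the ESP stayed up, False means it ran out of memory.
RESULTS = {"7.1": False, "6.3": True, "7.2": False, "7.0": False}
print(first_bad(RESULTS))  # 7.0
```

This only narrows the change down to somewhere between 6.3 and 7.0. Releases between those two were not tested in this thread.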

If I can help in any way or test anything, please give me a shout.

MartinMueller2003 commented 8 months ago

Fixed the FPP-related memory leak in PR #683.

XMP commented 8 months ago

Tested with a bunch of ESP32+8266 - all are still running after more than 48 hours. 🥳 ESP32 with D1 DevKitC now with a minimum of 180k memory - sometimes 10k lower, but it gets back to >=180k after a few seconds if idle.

Thanks for your great work, I appreciate it so much, because I love to have the multisync page open all day & night to check what's going on :-)