Blueforcer / awtrix3

Custom firmware for the Ulanzi Smart Pixel clock or self made awtrix. Getting started is easy as 1-2-3
https://blueforcer.github.io/awtrix3/

Flashing yellow and red pixels after some period of working hours #519

Open feitzi opened 5 months ago

feitzi commented 5 months ago

Bug report

Describe the bug

My Ulanzi with Awtrix works fine for a few hours with HA over MQTT. After 3-6 hours the Wi-Fi disconnects and two pixels on the left side (one yellow and one red) start flashing. After a restart (only possible via the hardware buttons) everything works fine for the next few hours.

Wireless strength is really good and the Awtrix has a fixed IP address on my DHCP Server.

Additional information

To Reproduce

Running the Awtrix for some hours.

Screenshots

20240326_195652.jpg

haterakathegrinch1 commented 5 months ago

Can confirm this. I thought it was a hardware issue, but since my error shows exactly the same pixels and the same behaviour, it seems more like a software issue. Sadly, in terms of bug tracking, the issue is not reproducible at the moment.

Blueforcer commented 5 months ago

That's not a software issue. A red pixel means no Wi-Fi connection; a yellow pixel means no MQTT connection.

feitzi commented 5 months ago

If it's not a software problem, what else could it be? Everything works perfectly after a restart. And mqtt via home assistant works seamlessly with other devices. The same applies to wifi. Another point I've noticed in the last few days: when the error occurs, sometimes the screen seems to freeze for a few minutes.

To me it seems to be a software error. I will reflash my Awtrix. Is there any way to retrieve logs and publish them here?

Blueforcer commented 5 months ago

MQTT tries to reconnect, which isn't asynchronous, and Awtrix freezes during that time.

You can set debug_mode in dev.json and listen with serial terminal.
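A minimal sketch of preparing such a file, assuming `debug_mode` is a boolean key in `dev.json` (the key name comes from the comment above; the boolean value and the hypothetical `build_dev_json` helper are assumptions). The resulting file would still have to be uploaded to the device, and the log read with a serial terminal of your choice.

```python
import json

def build_dev_json(existing=None):
    """Return dev.json content with debug_mode switched on.

    `existing` is the text of a dev.json you already have, if any;
    all other keys in it are preserved.
    """
    settings = json.loads(existing) if existing else {}
    settings["debug_mode"] = True  # assumed to be a boolean flag
    return json.dumps(settings, indent=2)

print(build_dev_json())
```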

If you think that's a software bug, then please give me instructions on how to reproduce this error. You must have configured something special, otherwise there would be more such reports with over 5000 users.

Another possibility would of course be that the ESP is damaged.

luebbe commented 5 months ago

I have seen the yellow pixel on Awtrix too on a few rare occasions. Is it possible that your router/repeater sometimes switches the wifi channel? I have the impression that some ESP devices have problems reconnecting in that case, depending on the firmware (not Awtrix AFAICT). At least some of my ESP8266 behave this way and I have implemented a watchdog to reboot them if it takes too long. But I'm using an async wifi library (asyncmqtt in older projects, espmqtt in newer projects)
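The watchdog idea can be sketched as follows. This is a Python model of the logic only, not luebbe's ESP8266 code: the class name `ReconnectWatchdog`, the 60-second limit and the injected clock are all assumptions made for illustration.

```python
import time

class ReconnectWatchdog:
    """Signals that a reboot is due when a connection stays down too long."""

    def __init__(self, max_offline_s=60.0, clock=time.monotonic):
        self.max_offline_s = max_offline_s  # hypothetical limit
        self.clock = clock
        self.offline_since = None  # None while connected

    def update(self, connected):
        """Feed the current link state; return True if a reboot is due."""
        if connected:
            self.offline_since = None
            return False
        if self.offline_since is None:
            self.offline_since = self.clock()
        return self.clock() - self.offline_since >= self.max_offline_s

# Demo with a fake clock so the example runs instantly.
now = [0.0]
wd = ReconnectWatchdog(max_offline_s=60, clock=lambda: now[0])
wd.update(False)          # link drops at t = 0 s
now[0] = 30.0
print(wd.update(False))   # False: only 30 s offline
now[0] = 61.0
print(wd.update(False))   # True: offline longer than the limit
```

On a real device the `True` result would trigger a restart; a successful reconnect resets the timer.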

Blueforcer commented 5 months ago

There are also problems with ESP32 together with Unifi routers.

vitaha83 commented 5 months ago

I have the same problem!

vitaha83 commented 5 months ago

How do I get back to the awtrix3 firmware version 0.95? Perhaps there will be no problems with this firmware!

Blueforcer commented 5 months ago

How do I get back to the awtrix3 firmware version 0.95? Perhaps there will be no problems with this firmware!

https://github.com/Blueforcer/awtrix3/releases

but there was no change to wifi in 0.96

Ysbrand commented 5 months ago

Can I join? I see the same issue: after a few hours, sometimes a day, I lose the connection to Awtrix. This issue has existed for a longer time and is, in my opinion, not related to the most recent Awtrix firmware versions.

I have Omada APs, but I'm using UniFi switches and routers (not sure if it is related, but I see a mention of ESP32 issues with Unifi).

vitaha83 commented 5 months ago

Perhaps the cause of the hang is RAM? The breaks are a loss of connection and a reboot! Please watch your "Free ram" parameter in Home Assistant. 2024-04-05_23-09-42

feitzi commented 5 months ago

I also monitored the ram usage and I confirm that it seems to be a memory leak problem. Here is the free ram consumption: Screenshot_20240407_134746_Home Assistant.jpg

At 13:44 the yellow LED starts to flash. It's definitely an issue with the software and not with the hardware.

vitaha83 commented 5 months ago

This shows how free RAM depends on the clock's uptime. If the value of "Free ram" drops below 40,000 B, the clock stops working! 2024-04-07_18-34-57
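One way to turn such a graph into a number is to fit a straight line to the "Free ram" readings and extrapolate to the 40,000 B mark. The sketch below assumes a roughly linear leak; the function name and the sample values are hypothetical and are not taken from the screenshots in this thread.

```python
def hours_until_threshold(samples, threshold=40_000):
    """Fit a least-squares line to (uptime_hours, free_bytes) samples and
    return the uptime in hours at which the line crosses `threshold`.

    Needs at least two samples taken at different times.
    Returns None if free RAM is not falling.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    slope = cov / var_t  # bytes of free RAM gained per hour (negative = leak)
    if slope >= 0:
        return None
    intercept = mean_r - slope * mean_t
    return (threshold - intercept) / slope

# Hypothetical readings: 20,000 B lost per hour, starting at 130,000 B.
samples = [(0, 130_000), (1, 110_000), (2, 90_000), (3, 70_000)]
print(hours_until_threshold(samples))  # 4.5
```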

luebbe commented 5 months ago

Interesting find. I'll take a look at uptime vs ram usage as well.

Blueforcer commented 5 months ago

Make sure not to trigger the HA discovery entities very often. The library used for them has a memory leak. It's better to work with raw MQTT or HTTP API commands.
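As an illustration of the HTTP route, here is a sketch that builds a request to set an indicator colour. The `{"color": [...]}` payload is the one used with MQTT elsewhere in this thread; the `/api/indicator1` path, the helper name and the IP address are assumptions that should be checked against the AWTRIX 3 API documentation for your firmware.

```python
import json
import urllib.request

def indicator_request(host, color, indicator=1):
    """Build (but do not send) a POST that sets an indicator colour.

    The /api/indicator<N> path is an assumption, not confirmed in this thread.
    """
    body = json.dumps({"color": list(color)}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}/api/indicator{indicator}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = indicator_request("192.168.1.50", (255, 0, 0))  # hypothetical IP
print(req.get_method(), req.full_url, req.data)
# To actually send it: urllib.request.urlopen(req, timeout=5)
```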

Ysbrand commented 5 months ago

Hi,

I'm not doing any automatic discoveries in HA, everything is manual (MQTT wise). Is there any chance that we can reboot the AWTRIX3 automatically when running out of resources (and obviously a mechanism that prevents more than 1 automatic reboot every .. hours)?
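The requested behaviour, a reboot on low resources that is also rate limited, could look roughly like this. It is a sketch of the decision logic only and not part of the firmware: the 40,000 B threshold is the value reported earlier in this thread, while the six-hour cooldown and all names are placeholders chosen for the example. How the reboot itself is triggered is left out.

```python
class LowRamRebootGuard:
    """Decide when to reboot on low free RAM, at most once per cooldown."""

    def __init__(self, min_free_bytes=40_000, cooldown_s=6 * 3600):
        self.min_free_bytes = min_free_bytes
        self.cooldown_s = cooldown_s  # placeholder value
        self.last_reboot_at = None

    def should_reboot(self, free_bytes, now_s):
        if free_bytes >= self.min_free_bytes:
            return False
        if (self.last_reboot_at is not None
                and now_s - self.last_reboot_at < self.cooldown_s):
            return False  # rebooted too recently
        self.last_reboot_at = now_s
        return True

guard = LowRamRebootGuard()
print(guard.should_reboot(120_000, now_s=0))    # False: plenty of RAM
print(guard.should_reboot(35_000, now_s=100))   # True: below the threshold
print(guard.should_reboot(35_000, now_s=200))   # False: inside the cooldown
```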

luebbe commented 5 months ago

I've been running Awtrix 0.96 for a few weeks and haven't had any problems so far.

I checked my HA logs for the Awtrix free ram and apart from one peak, where it goes down to 60K about a week ago, it is consistently between 120-130K free ram.

Are you running custom applications? Maybe turn them off for a day and check if the problem goes away? Other things that come to mind:

vitaha83 commented 5 months ago

Yes, after disabling the automations in Home Assistant, the clock stopped consuming RAM, so it will run for a long time without restarts... But that's not the point! I bought the clock specifically for the AWTRIX 3 firmware, to use it with Home Assistant! Will the library be improved with the operation of RAM in the firmware? photo_2024-04-09_12-56-17

luebbe commented 5 months ago

This is the free ram of my Awtrix 0.96 over the course of a week:

grafik

Apart from the few occasional peaks, it's pretty stable between 120K and 130K free ram.

I have three automations running on HA that continuously push data to Awtrix (outside temperature, solar power and OctoPrint status). They all have text, maybe a progress bar, and static icons; no GIFs.

How many automations are you running in HA? If you enable them one after the other and check if there is one specific automation that causes the memory leak, we could try to investigate.

Otherwise your question:

Will the library be improved with the operation of RAM in the firmware?

is just straining the capabilities of our crystal ball... ;-)

Blueforcer commented 5 months ago

I don't think there will be any update from the creator of the lib. The last update was 2 years ago. The goal is to completely remove the HA discovery shit and make an Awtrix integration in HA (HACS). But I don't have enough skills for that.

vitaha83 commented 5 months ago

Hello, friends! I managed to get the clock running stably by editing my automations in Home Assistant: 2024-04-11_18-05-40

  1. I added these parameters to all automations:

    data:
      qos: 0
      retain: false
  2. To work with the indicators, I changed the automation: 2024-04-11_18-06-04

it was:

- service: light.turn_on
  entity_id: light.awtrix_..._indicator_1

changed:

- service: mqtt.publish
  data:
     qos: 0
     retain: false
     topic: awtrix_.../indicator1
     payload: >-
       {"color":[255,0,0]}

it was:

- service: light.turn_off
  entity_id: light.awtrix_..._indicator_1

changed:

- service: mqtt.publish
  data:
     qos: 0
     retain: false
     topic: awtrix_.../indicator1
     payload: >-
       {"color":[0,0,0]}

Thanks Lübbe Onken! ;)

luebbe commented 5 months ago

@vitaha83 that is a very interesting solution. I wonder how the change to qos: 0, retain: false can have such a big impact on Awtrix, since Awtrix is just a consumer of the message.

This sounds a bit like a problem in the MQTT library used by Awtrix. QoS 0 means "fire and forget" (from the view of Home Assistant). QoS 1 means that the receiver confirms to the sender, with a PUBACK, that it has received the message. Is it possible that Awtrix builds up a stack of PUBACK messages that it never gets rid of when qos > 0?

I assume that retain: false/true doesn't affect the memory consumption on awtrix.

Ysbrand commented 5 months ago

Changing the automations solved my issue as well. Apparently the light.turn_on and light.turn_off services do things in a different way, while calling the mqtt.publish service is handled better by the Awtrix logic.

luebbe commented 4 months ago

I don't think there will be any update from the creator of the lib. The last update was 2 years ago. The goal is to completely remove the HA discovery shit and make an Awtrix integration in HA (HACS). But I don't have enough skills for that.

I don't know how to implement a HA integration for Awtrix, but I took a look at the HA autodiscovery library that you are using, found it too big and clumsy and rolled my own. If you want, I can take a look at the HA autodiscovery. Maybe we should also take a look at the way mqtt is handled in Awtrix, since this thread indicates that memory is leaking.

Blueforcer commented 4 months ago

MQTT itself doesn't have a memory leak; it only leaks when you access the entities from HA discovery. That's already been tested.

As far as I can see, the library builds the MQTT payload string at runtime; maybe the data is never released.

We have some experienced users in Discord. I'm pretty sure that together we can build an integration which just uses MQTT or, even better, HTTP. This would have the advantage that we can free Awtrix from some code and have more RAM free.

DrRSatzteil commented 4 months ago

I do experience similar problems though I only get the blinking yellow indicator (lost mqtt connection I guess). I can still see the device from my router but I cannot ping the device anymore.

I think that it did not start right after the upgrade to 0.96 but after I started using the lifetime property of apps. Could this be related? Might just be coincidence of course... I'm not using Home Assistant and no HA Autodiscovery.

michapixel commented 2 months ago

I think it's a power-related problem when you power the ESP and the LEDs from the onboard USB. I also experienced strange hangs, the red and yellow LEDs, extremely low Wi-Fi signal, reconnect errors, etc.

But since I was redesigning my case anyway, I added an extra USB plug, as shown here: hardware_basis

https://pixelit-project.github.io/hardware.html#wiring-guide

And boom: WiFi good, no system hangs, no quirky leds etc. But rebooting always reboots into the "system-menu" :) but i can live with that for now.

Blueforcer commented 2 months ago

But rebooting always reboots into the "system-menu" :) but i can live with that for now.

Then you have connected your middle button wrongly. It needs to be active low.

michapixel commented 2 months ago

how would i do that?

heimchemiker commented 1 month ago

I had the same problem. I had one custom app with a "lifetime" of one hour (3600 s). I have removed this lifetime and switched to deleting the app through my home automation after the hour, and so far this seems to have solved the problem. (The "lifetime" didn't work anyway in my case, see Issue 335.)
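For reference, deleting a custom app from a script could be sketched like this. It assumes that an empty body posted to `/api/custom?name=<app>` removes the app; that endpoint, the helper name, the app name and the IP address are assumptions and not details given by heimchemiker.

```python
import urllib.parse
import urllib.request

def delete_custom_app_request(host, app_name):
    """Build (but do not send) a POST with an empty body for a custom app.

    Assumption: an empty body on /api/custom?name=<app> deletes the app.
    """
    query = urllib.parse.urlencode({"name": app_name})
    return urllib.request.Request(
        f"http://{host}/api/custom?{query}",
        data=b"",
        method="POST",
    )

req = delete_custom_app_request("192.168.1.50", "my_app")  # hypothetical values
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=5)
```

A home automation would call this one hour after creating the app, instead of setting "lifetime" on the app itself.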