MallocArray / airgradient_esphome

ESPHome definition for an AirGradient DIY device to send data to HomeAssistant and AirGradient servers
GNU General Public License v3.0

Erratic uptime #36

Open WillZad opened 3 months ago

WillZad commented 3 months ago

I'm experiencing erratic uptime. The latest version, 2.0.1, has been installed from day one; I never used the AG with its factory firmware. I need a more stable connection in order to activate the ventilation fans in my shop based on PM and VOC levels. Can anyone comment on the attached screen grabs? Thanks for your help, as I'm not sure if this is a software, WiFi, placement or hardware issue. AGOne AGOne-wifi

MallocArray commented 3 months ago

I also see similar reboot cycles. I've disabled nearly all of the sensors and still get these reboots, so I'm not sure if it is just ESPHome acting up, or hardware related. What I haven't done yet is unplug the physical sensors and disable the sensor configs.

On the ESP32-based devices, if I turn off the 5 GHz radio on my Unifi access points, they become very stable, but the ESP8266 (D1 mini) based devices are still erratic.

Since your Wifi signal has 2 distinct levels, I'm guessing you have multiple access points as well? The higher numbers are when it is connected to one AP and the lower when it chooses the other?

For me, disabling various sensors didn't make a difference, but I do need to try disabling things like the captive_portal and maybe the ap: section of the wifi config to see if it makes any difference.
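
One way to run that test without editing the shared package files is ESPHome's `!remove` tag for packaged configuration. This is a minimal sketch, assuming that one of the imported packages is what defines `captive_portal:` (which package file it lives in has not been checked here):

```yaml
# Goes in the device's main YAML, next to the existing packages: block.
# Assumption: an imported package defines captive_portal:, so this
# removes it for this one device only.
captive_portal: !remove
```

The fallback `ap:` block under `wifi:` would need similar treatment; whether `!remove` applies to a nested key like that is worth checking against the ESPHome packages documentation before relying on it.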

MallocArray commented 3 months ago

Also, it should be noted that the latest Beta firmware from AirGradient for their Arduino-based devices now has MQTT and a REST endpoint available, which can also be used to connect to HomeAssistant. It seems like more work to get the data into HA compared to ESPHome, but it may be another valid way to get the data you are looking for. It may also be more stable, but I haven't watched the actual uptime on it.

WillZad commented 3 months ago

The two different WiFi levels are due to me moving the unit from my shop to my office. I may try to go back to the AG firmware to see if anything changes.

depasseg commented 3 months ago

I'm experiencing this as well. Also using Unifi, but mine is on an IOT SSID and vlan that has only ever been configured for 2.4Ghz.

I only got the unit last week, and you can see that it was fairly decent until March 12. No network changes (I manually update Unifi). image image

I also moved mine closer to my AP and put it on a very steady USB adapter plugged into a UPS. I did that just before the 9 a.m. mark on this graph. image

I notice this line in the logs after it finally starts back up "Last Boot was an unhandled reset, will proceed to safe mode in 4 restarts"

BTW, it seems to take nearly 20 minutes to move off the boot screen after it restarts. That seems odd.

depasseg commented 3 months ago

I disabled the API uploads to AirGradient (commented out the airgradient_api package), and since then my uptime is now 22 hours! (I recall seeing something in the logs about the API uploads taking seconds, which is what made me think of that.)

Can someone else test?

MallocArray commented 3 months ago

Likely related to this open issue that hasn't been resolved: https://github.com/esphome/issues/issues/2853

ex-nerd commented 1 month ago

Has anyone tried the workaround mentioned in that bug report?

Update: So far, easy enough to drop the .h file into my ESPHome config and paste the other values into the main yaml file. My Open Air was crashing every 10-15 minutes before and so far has been stable for about twice that. 🤞 this will hold up.

danielnitz commented 1 month ago

Disabling the API uploads to AirGradient also solved the issue for me.

MallocArray commented 1 month ago

Agreed. I've seen a device that is running the official Arduino firmware start to have trouble connecting to the Dashboard, and right after that my ESPHome-based device rebooted. I've observed this multiple times. I think when the AirGradient API takes a long time to process, it can cause a condition that ends up with ESPHome resetting the device. I had diagnostics running on my ONE and saw the reboot reason listed as "Timer Group 0 Watch Dog Reset Digital Core".
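
For anyone wanting to capture the same reboot reason, ESPHome's `debug` component can expose it as a text sensor. The fragment below is a sketch only: the entity name and update interval are illustrative and are not taken from this repo's diagnostic packages.

```yaml
# Sketch: expose the last reset reason in HomeAssistant.
# The name and interval are placeholders, not the repo's actual values.
debug:
  update_interval: 30s

text_sensor:
  - platform: debug
    reset_reason:
      name: "Reset Reason"
```

The value is only refreshed after the device boots again, so it is most useful next to the Uptime graph: each drop in uptime can be matched to the reason reported after it.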

There is a way to disable the software watchdog on ESP32 devices, but it requires an extra file outside of the main yaml and I'm trying to avoid that for simplicity's sake. It is an option, though, as mentioned above.

I do still see ESP8266-based devices reboot even when I don't include the API package at all, so this isn't the only cause, but it is certainly a contributing factor.

WillZad commented 1 month ago

I've since gone back to the factory firmware and connected to HA via MQTT. It's better but still seeing the AirGradient go offline, just no way to track it as they don't have an uptime component.

MallocArray commented 1 month ago

Check if the MQTT data reports a value of boot or bootCount, as this is incremented whenever the device uploads to the API. It's not an actual uptime count in minutes, but it gives an indication of how long the device was up.

917huB commented 1 month ago

Similar results here on two devices. Disabling API uploads has reduced the frequency of restarts from every 300 seconds to closer to every 800 seconds, but ultimately both still reboot frequently. I bought a newer C3 version that just arrived, and I'll try it this weekend to see if the results are similar.

MallocArray commented 1 month ago

@917huB What device do you have and can you check the Uptime graph in HomeAssistant?

If it is regular at 300 seconds, that sounds like the hardware watchdog in the ONE model not being refreshed, so it is rebooting. If it is more erratic than that, then it is likely running out of memory and crashing, but I haven't been able to pinpoint a single cause.

This is my original AirGradient DIY which has the same ESP8266 as the Pro models, but very erratic uptimes image

I've been able to get more stable with the later boards, but still not reliable. Other than the TVOC sensor, it doesn't seem to impact much though.

917huB commented 1 month ago

These are both v3.7 boards with the D1 mini, I believe. They are both configured identically but are in different areas of my home and demonstrate disparate behavior.

agmaster

The improvement demonstrated below on the 25th is due to commenting out the API call.

agproguest

I've added the config below, which, other than the IP address, is identical on both devices.

# AirGradient Pro V3.3 - V4.2
# https://www.airgradient.com/open-airgradient/instructions/overview/

substitutions:
  name: "ag-guestbed"
  friendly_name: "AG Pro Guest Bed"
  config_version: 2.0.5
  name_add_mac_suffix: "false"  # Must have quotes around value

# Enable logging
logger:

# Enable Home Assistant API
api:
  encryption:
    key: "snip"

ota:
  password: "snip"

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  use_address: ag-guestbed.local.lan
  manual_ip:
    static_ip: 192.168.12.29
    gateway: 192.168.12.1
    subnet: 255.255.255.0
    dns1: 192.168.12.1

time:
  - platform: sntp
    id: my_time
    timezone: America/Los_Angeles
    servers: 192.168.12.1

dashboard_import:
  package_import_url: github://MallocArray/airgradient_esphome/airgradient-pro.yaml
  import_full_config: false

packages:
  board: github://MallocArray/airgradient_esphome/packages/airgradient_d1_mini_board.yaml
  pm_2.5: github://MallocArray/airgradient_esphome/packages/sensor_pms5003.yaml
  co2: github://MallocArray/airgradient_esphome/packages/sensor_s8.yaml
  temp_humidity: github://MallocArray/airgradient_esphome/packages/sensor_sht30.yaml
  tvoc: github://MallocArray/airgradient_esphome/packages/sensor_sgp41.yaml
  display: github://MallocArray/airgradient_esphome/packages/display_sh1106_single_page.yaml
  #airgradient_api: github://MallocArray/airgradient_esphome/packages/airgradient_api_d1_mini.yaml
  config_button: github://MallocArray/airgradient_esphome/packages/config_button.yaml
  wifi: github://MallocArray/airgradient_esphome/packages/sensor_wifi.yaml
  uptime: github://MallocArray/airgradient_esphome/packages/sensor_uptime.yaml
  safe_mode: github://MallocArray/airgradient_esphome/packages/switch_safe_mode.yaml

binary_sensor:
  - id: !extend config_button
    pin:
      number: D7

MallocArray commented 1 month ago

Do you happen to have multiple wireless access points in your location and/or Unifi products? I have a sneaking suspicion that sometimes mine is related to jumping to another AP and that causes a software crash/reset.

I haven't used the safe_mode button in over a year and am considering commenting it out.

If you want to dig deeper, you could add another package I have for showing some diagnostic info that might give some insight into what is happening behind the scenes.

  diagnostic: github://MallocArray/airgradient_esphome/packages/diagnostic_esp8266.yaml

This will enable some additional sensors, such as Heap Free. If this gets below about 8000, I've seen regular resets. The change visible in my graph is from when I modified the behavior of the Blank Page switch in the single page display, which now consumes a bit more RAM, but not so much that it drops to the level where I see resets. If yours is trending at 8000 or lower, that could help track things down. image
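
For readers who are not pulling in the diagnostic package, roughly comparable heap sensors can be declared directly with ESPHome's `debug` component. This is a sketch under that assumption; it is not the contents of `diagnostic_esp8266.yaml`, and the entity names are placeholders.

```yaml
# Sketch: heap diagnostics via the debug component (names are placeholders).
debug:
  update_interval: 30s

sensor:
  - platform: debug
    free:
      name: "Heap Free"            # the value compared against ~8000 above
    fragmentation:
      name: "Heap Fragmentation"   # ESP8266 only
    block:
      name: "Heap Max Block"       # largest contiguous free block
```

Graphing Heap Free alongside Uptime makes it easier to see whether resets line up with the heap dipping toward the 8000 mark mentioned above.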

I don't have encryption set up on this device, and that could use more memory as well. I'm just thinking of what might be different, as your config looks very close to mine. You also have the time: section, and I don't know how much memory it consumes. If you go down the road of adding the diagnostic sensors, you could try commenting out the time section to see if that makes any significant change.

917huB commented 1 month ago

I do have multiple access points (Ruckus) with a dedicated 2.4Ghz IoT network. I'll check to see if the device is jumping between them and also try your diagnostic module. I'm pretty sure that turning logging up to DEBUG resulted in more frequent reboots but nothing scientific.
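
If log verbosity does turn out to matter, the opposite experiment is cheap to run: lower the level and watch whether uptime improves. A minimal sketch, assuming the `logger:` entry in the main YAML is the one in effect:

```yaml
# Sketch: reduce log volume to test whether logging load affects stability.
logger:
  level: WARN   # ESPHome's default level is DEBUG
```

This trades away most diagnostic output, so it is better suited to an A/B comparison over a day or two than to a permanent setting while the reboots are still being investigated.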

wozz commented 1 month ago

I'm having issues with the board connecting at all when I use the standard display component. I then switched to the single page display and it was working fine until this commit: https://github.com/MallocArray/airgradient_esphome/commit/a6546cb44d0813e1853f7ba4e4ed909700663f9f

After that commit, it failed to even start up past the boot screen. When I revert the commit, it boots right up again.

So I think the memory usage is a much more likely culprit than any of the wifi parameters.

ex-nerd commented 1 month ago

WiFi has been rock solid on my outdoor model since applying the WiFi workaround shared on the ESPHome bug (https://github.com/esphome/issues/issues/2853#issuecomment-1949349868), with an uptime of about a week now. However, that fix only works for ESP32 devices, not the ESP8266 in the other models (which all seem to reboot every few hours even with AirGradient uploads disabled).

romines-dev commented 2 weeks ago

Similar experience to @ex-nerd: after applying that patch I've had no issues. In my case it appears that the trigger was the AirGradient dashboard going unresponsive around 1 AM, causing the HTTP calls to trigger the watchdog and panic, as per that ESPHome issue. Is there any way the HTTP requests made with these scripts can be given timeouts that are shorter than the standard watchdog timeout? Or is that something baked into ESPHome's libraries?

MallocArray commented 2 weeks ago

There is a timeout option that defaults to 5 seconds, but I've lowered it to 1 second without significant improvement. The ESPHome release that came out this week did a major rewrite of the http_request module and now has an option to disable the watchdog timeout for ESP32-based chips, so I'm going to try that. It won't work with the D1 Mini, but it's still an improvement, and maybe the rewrite in general will help. Lots to try out in this latest release, along with breaking changes.
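
As a rough illustration of the two settings discussed here, the fragment below sets both on the `http_request` component. It is a sketch, not the configuration used in this repo's packages: the 15s value is arbitrary, `watchdog_timeout` is documented as adjusting the watchdog timeout during a request on ESP32 and RP2040 only, and any other options a given ESPHome version or framework requires are left out.

```yaml
# Sketch only: values are illustrative, other http_request options omitted.
http_request:
  timeout: 1s             # per-request timeout (the value tried above)
  watchdog_timeout: 15s   # ESP32 / RP2040 only; arbitrary example value
```

On ESP8266 boards such as the D1 Mini only `timeout` applies, which matches the observation above that the new option will not help those devices.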

MallocArray commented 6 days ago

Looking much improved with the 2024.6.x releases and all of the changes. I've had uptimes of over 100 hours before I had more changes to test and needed to reboot.

New release coming soon