geerlingguy / airgradient-prometheus

AirGradient Prometheus exporter.
MIT License

Device becomes unreachable after a while #13

Open pheetr opened 2 years ago

pheetr commented 2 years ago

First of all, thanks for everyone's efforts around this sketch.

The issue I've encountered is that while things seem to be working fine for a while, at some point the device stops responding to network queries, such as a ping or the curl directed to get the metrics. Sometimes this happens after an hour or two of normal operations, but other times as soon as 10-15min after the device was powered on. At the same time, sensor values are being updated on the screen, so this only seems to affect the networking side of things. One additional interesting aspect is that when the device is in this unreachable state and I look at the router's client list, it is still listed as connected.

This is my first Arduino project so I'm not quite sure how I would go about troubleshooting this issue.

geerlingguy commented 2 years ago

@pheetr - I've noticed this once or twice too, on my office sensor. I usually unplug and replug it, and it's back to being happy again. My guess is something in the web/http stack just kinda locks up, though I haven't had it plugged into a computer so I don't have any debug data to look at.

pheetr commented 2 years ago

Thanks, Jeff. I'll see if I can figure it out once I have a bit of time. In its current state the device's usefulness is quite limited: in the few days since I built it, the longest it has remained reachable is about two hours. This pretty much rules out any automated alerts when CO2 levels rise.

In case I can't figure it out I guess I could plug it into a smart plug and set up an automation to cycle the plug when the device becomes unreachable, but that wouldn't feel like a very satisfying solution. :)

geerlingguy commented 2 years ago

@pheetr - Another quick question, since this was a problem on one of mine: what power supply are you using? I had to make sure I had a good power supply rated for at least 1A (my little Apple iPhone USB power adapter didn't cut it).

pheetr commented 2 years ago

Hmm, interesting. I have it connected to my PC (for troubleshooting) next to the router during the day, while at night it's on a 0.7A Samsung phone adapter. I picked this adapter based on the 200mA max consumption suggested by this post, but I'll go ahead and try it with a 1A adapter tonight. Fingers crossed it's as simple as that. :)

Benargee commented 2 years ago

I wonder if a watchdog timer would be a good option to reset the device if it is fully freezing or consistently unable to send data over wifi.

geerlingguy commented 2 years ago

I still get this every now and then (seemingly at random). Just putting that out there :D

potokslow commented 2 years ago

Thank you very much for your work on this sketch and the whole Internet Pi project.

I think I'm experiencing the same or a similar issue to the one described by @pheetr. The device frequently becomes unreachable or gets disconnected. Sometimes it takes minutes; in other cases the connection stays up for a day or more. The OLED updates just fine and the device seems to be working, except for the wireless network portion: pings and connections to the device are no longer possible.

I did quite a bit of troubleshooting over a few months: moving the device close to the access point, changing power supplies and cables, and even taking everything apart, resoldering all connections and replacing the ESP module. Nothing helped. I suspect this may be due to high traffic or interference on 2.4 GHz. My network is stable and no other devices are affected, but most of them are using the 5 GHz band.

I have two AirGradient devices. One is located in a rural area with few wireless networks around, and I don't have any issues with it. The other one, the problematic unit, is set up in a high-rise apartment building with many 2.4 GHz access points nearby, which I think may be causing the connectivity issues.

A watchdog periodically checking network connection status or connection to another host and restarting the device would be great in this situation.
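
A minimal sketch of such a watchdog, written as board-independent C++ so it can be compiled and tested off-device. The class name `NetworkWatchdog`, the `probe` callback and the failure threshold are all hypothetical and not part of the sketch in this repo. On an ESP8266 the probe might be a connection attempt to another host, and the caller might invoke `ESP.restart()` when `check()` returns true.

```cpp
#include <functional>
#include <utility>

// Hypothetical helper: asks for a restart after `maxFailures`
// consecutive failed connectivity probes.
class NetworkWatchdog {
public:
    NetworkWatchdog(std::function<bool()> probe, unsigned maxFailures)
        : probe_(std::move(probe)), maxFailures_(maxFailures) {}

    // Call periodically. Returns true when the caller should restart
    // the device (assumed to be ESP.restart() on an ESP8266).
    bool check() {
        if (probe_()) {
            failures_ = 0;  // any successful probe clears the streak
            return false;
        }
        ++failures_;
        return failures_ >= maxFailures_;
    }

    // Number of consecutive failed probes so far.
    unsigned failures() const { return failures_; }

private:
    std::function<bool()> probe_;
    unsigned maxFailures_;
    unsigned failures_ = 0;
};
```

Requiring several consecutive failures before restarting is a design choice that avoids rebooting on a single lost packet. The right threshold would depend on how noisy the network is.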

geerlingguy commented 2 years ago

@potokslow - I'm wondering the same... the weird thing is I have also isolated one of my sensors in a basement area where only a couple 2.4 GHz networks can be reached, and mine is very strong with an AP only a few feet away... and it still drops out frequently now :/

potokslow commented 2 years ago

I tried using WiFi.status() to display the connection status on the OLED, but strangely enough the device keeps reporting as connected even after it becomes unresponsive. I waited a few hours to be sure that there was no delay in status reporting, but the status didn't change. It seems that this form of status checking is either not reliable, or there's something else going on. I'll try with WiFi events next; maybe those will give me better results.

pheetr commented 2 years ago

After quite a bit of time spent reading issue comments across a wide range of repos in January, I've concluded that WiFi on these Wemos D1 Mini modules is somewhat unstable. I couldn't really find a solution to it anywhere, so I ended up settling on a workaround that has been working very well for me over the past few months, with virtually no gaps in the data reported.

As noted above, WiFi.status() doesn't accurately represent the situation when WiFi seems to be locked up and unresponsive, so it unfortunately can't be relied on. Luckily, to resolve this unresponsive communication state I've found that one doesn't have to restart the whole device, it is enough to reconnect WiFi.

In my specific case, one of the destinations I'm sending data to is InfluxDB, where the sending function has decent error handling. Thus, at every send event I check if an error has occurred and if so I invoke WiFi.reconnect().

As a side note, just to see how often this is happening I've also added a counter for the wifi reconnects. I've just checked the current status in Grafana and it is showing 8048 reconnects over the past 10.6 days since the device was last restarted (I have it set up to automatically restart every 30 days).
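
The workaround described above could look roughly like the following. This is a board-independent sketch, not the code actually running on the device: `ReportingLink`, `send` and `reconnect` are hypothetical names, where `send` stands in for the InfluxDB write (returning false on error) and `reconnect` for a call to `WiFi.reconnect()`.

```cpp
#include <functional>
#include <utility>

// Hypothetical wrapper around a "send metrics" call that reconnects
// WiFi whenever a send fails and counts how often that happens.
class ReportingLink {
public:
    ReportingLink(std::function<bool()> send, std::function<void()> reconnect)
        : send_(std::move(send)), reconnect_(std::move(reconnect)) {}

    // Attempts one send. On failure, triggers a reconnect and bumps
    // the counter; the data point for this attempt is not retried.
    bool sendOnce() {
        if (send_()) {
            return true;
        }
        reconnect_();
        ++reconnects_;
        return false;
    }

    // Reconnects since startup, suitable for exporting as a metric.
    unsigned long reconnectCount() const { return reconnects_; }

private:
    std::function<bool()> send_;
    std::function<void()> reconnect_;
    unsigned long reconnects_ = 0;
};
```

Because the send result is the only health signal used, this does not rely on WiFi.status(), which the thread found to keep reporting "connected" while the link was unresponsive.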

geerlingguy commented 2 years ago

@pheetr - I wonder if as part of the main event loop, we could have it just do a ping on google.com (or a URL or IP of your choosing via variable) every 10 minutes or something. If it fails, reconnect.

pheetr commented 2 years ago

Yes, I think that could be a good approach here. Setting the time interval via a variable in the config section may be desirable as well, given varying degrees of WiFi instability across devices/environments. (In my case the numbers average out to a reconnect almost every two minutes.)
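
One way the interval-based check could be structured is sketched below in board-independent C++. Everything here is an assumption for illustration: `kCheckIntervalMs` is a hypothetical config value, `probe` stands in for whatever reachability test is chosen (a ping or a short TCP connection), and `reconnect` stands in for `WiFi.reconnect()`. In a sketch, `tick()` would be called from `loop()` with the value of `millis()`.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical config value: check connectivity every 10 minutes.
constexpr std::uint32_t kCheckIntervalMs = 10UL * 60UL * 1000UL;

class PeriodicCheck {
public:
    PeriodicCheck(std::uint32_t intervalMs,
                  std::function<bool()> probe,
                  std::function<void()> reconnect)
        : intervalMs_(intervalMs),
          probe_(std::move(probe)),
          reconnect_(std::move(reconnect)) {}

    // Call from the main loop with the current time in milliseconds.
    // Returns true if a reconnect was triggered on this call.
    bool tick(std::uint32_t nowMs) {
        // Unsigned subtraction stays correct when the counter wraps.
        if (nowMs - lastCheckMs_ < intervalMs_) {
            return false;  // interval has not elapsed yet
        }
        lastCheckMs_ = nowMs;
        if (probe_()) {
            return false;  // link is healthy, nothing to do
        }
        reconnect_();
        return true;
    }

private:
    std::uint32_t intervalMs_;
    std::function<bool()> probe_;
    std::function<void()> reconnect_;
    std::uint32_t lastCheckMs_ = 0;
};
```

For scale, 8048 reconnects over 10.6 days (about 15,264 minutes) works out to one roughly every 1.9 minutes, so on a device that unstable a 10-minute interval would still leave noticeable gaps and a shorter value may be preferable.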

potokslow commented 2 years ago

I tried using the WiFi event handler to reconnect the device once it's disconnected, but either my implementation was faulty, or it just didn't work as expected (or maybe both). The device sometimes reconnected, but only for a few minutes at most. Also, the reconnection happened after a completely random period of time: sometimes it took minutes, sometimes hours.

I managed to get some of the D1 Mini ESP32 variants, do an ugly code port to ESP32 and I'm currently testing to see if it behaves differently. If this works, I'd like to do a proper port.

Did you manage to get the ping test and WiFi reconnect running reliably? If so, please share the code.

potokslow commented 2 years ago

Quick update regarding running with ESP32: my setup ran for about 12 days without a connection drop, but finally got disconnected today. For comparison, I never had such a long period without a disconnection with the original ESP8266-based D1 Mini; it used to drop a few times per day. The disconnection with ESP32 looked just like with the ESP8266, but I'm not sure what the exact reason was.

It seems that the WiFi on ESP32 is significantly more stable, but I'll keep monitoring it over a longer period.

stale[bot] commented 2 years ago

This issue has been marked 'stale' due to lack of recent activity. If there is no further activity, the issue will be closed in another 30 days. Thank you for your contribution!

Please read this blog post to see the reasons why I mark issues as stale.

sbrodehl commented 2 years ago

> @pheetr - I wonder if as part of the main event loop, we could have it just do a ping on google.com (or a URL or IP of your choosing via variable) every 10 minutes or something. If it fails, reconnect.

I have done exactly that and it works fine so far. Previously, I've had connectivity issues every now and then :shrug:

stale[bot] commented 2 years ago

This issue has been closed due to inactivity. If you feel this is in error, please reopen the issue or file a new issue with the relevant details.

geerlingguy commented 2 years ago

Leaving open as a 'bug', since I still see the dropouts from time to time on one of my two setups. Though I may be moving to ESPHome soon... we'll see!

EIndriksons commented 1 year ago

Just had my first dropout. Can someone give me a code snippet on how to ping google.com and reconnect? I've tried this:

// Ping google.com
const int pingResult = WiFi.ping("google.com");

// If the ping failed, try to reconnect to the WiFi network
if (pingResult != 0) {
    WiFi.reconnect();
}

But it displayed an error: Compilation error: 'class ESP8266WiFiClass' has no member named 'ping'

sbrodehl commented 1 year ago

I have used the WiFiClient class to connect to some server. If that fails, you can use WiFi.disconnect() and WiFi.begin(ssid, password) to reconnect to the network.

Benargee commented 1 year ago

@EIndriksons I don't think there is an official ping library and it certainly isn't part of ESP8266WiFi. Try this: https://github.com/dancol90/ESP8266Ping

Also, if your concern is only connecting your local Prometheus server to a local ESP, just ping the router or the Prometheus server rather than a public internet server, since a public server is not indicative of the health of the local network. Your internet may go down but your system should still work. There's no need to handle an error that covers a wider scope than the one you care about.

bartstar commented 1 year ago

Would appreciate any help. I have new AirGradient modules, both inside and outside. I've modified the code to allow Prometheus to scrape data from both (curl shows data is available) but I get nothing (no data) from Prometheus. I've set up the data points as Jeff instructed, checked port numbers, etc. Do I need to uninstall and reinstall Prometheus?

Encryptic commented 1 year ago

I run my own firmware, which actually pushes data instead of Prometheus-style pulls, and I also found I have issues with this.

I have been experimenting for a few months with assigning these devices fixed DHCP IPs, to ensure that they don't get changed and so far I haven't had any issues with a device dropping off. It makes me wonder if there's a bug with the networking stack on the ESP which doesn't handle this well.

Moorviper commented 5 months ago

It looks like it is the webserver. I commented out the Prometheus webserver part and now it runs stably. From what I've read, ESPHome doesn't recommend using the webserver on 8266 chips. I only use the sensors with Home Assistant. MQTT is still activated, so it is still possible to push the metrics to Prometheus with an external Prometheus exporter.