espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.62k stars 7.28k forks source link

WIFI unreliable intermittently (IDFGH-6124) #7801

Closed svenfuchs closed 2 years ago

svenfuchs commented 2 years ago

Apologies if this would be supposed to go to a forum. It seems as much a bug report as a question to me, but let me know if it should be moved.

An application I am working on is supposed to act as a simple thermostat/heating control. It consists of an esp32, DHT22 thermometer, OLED display, and a relay to control a heating device. It performs network requests for 3 functions:

The issues I am seeing are:

The HTTP related tasks will intermittently not be able to select the socket and establish a new connection, and show the following error log:

Or for sending metrics:

E (11:32:28.256) esp-tls: couldn't get hostname for :*******.herokuapp.com: getaddrinfo() returns 202, addrinfo=0x0
E (11:32:28.264) esp-tls: Failed to open new connection
E (11:32:28.274) TRANSPORT_BASE: Failed to open a new connection
E (11:32:28.286) HTTP_CLIENT: Connection failed, sock < 0

Sending the UDP message sometimes results in Not enough space (12) errors when returning from sendto.

Turning off all but one of the three network related tasks (e.g. only keeping the OTA task) does not make the remaining one function reliably.

The code for running the OTA update is taken from the advanced HTTP OTA example, which when run locally fails with:

E (25021) esp-tls: [sock=54] select() timeout
E (25021) TRANSPORT_BASE: Failed to open a new connection: 32774
E (25021) HTTP_CLIENT: Connection failed, sock < 0
E (25021) esp_https_ota: Failed to open HTTP connection: ESP_ERR_HTTP_CONNECT
E (25031) esp_https_ota: Failed to establish HTTP connection
E (25041) advanced_https_ota_example: ESP HTTPS OTA Begin failed

... where, as far as I can tell, 32774 is ESP_ERR_HTTPD_RESP_SEND, which seems to happen when something goes wrong during sending either HTTP headers or the request body?

The trouble is that "intermittent" above means something like:

Sometimes, but not always, the OTA update GET request starts failing and at the same time UDP messages fail to get sent (with the mentioned Not enough space (12) error message. Sometimes, but most often not, either one of these issues or both will resolve after a couple minutes (usually to come back after a short while again).

Sometimes, but not always, the request alternates failing and succeeding. I.e. it fails once or twice, and then succeeds, only to start failing again a couple times.

Thinking that Not enough space indicates a lack of free heap or some task's stack I have added debug output listing those (this has required tinkering with sdkconfig for a while in order to resolve various stack overflows in order to get debug log output working):

I (13:56:47.739) [mem]: Free heap: 157996 (min: 150792)
I (13:57:23.160) [mem]: info         6448
I (13:57:23.168) [mem]: ota          5924
I (13:57:23.172) [mem]: heating      6408
I (13:57:23.177) [mem]: metrics      9460
I (13:57:23.182) [mem]: sensor       6272
I (13:57:23.187) [mem]: tiT          7068
I (13:57:23.191) [mem]: ipc1         7712
I (13:57:23.196) [mem]: sys_evt      6800
I (13:57:23.201) [mem]: events       14216
I (13:57:23.206) [mem]: esp_timer    7992
I (13:57:23.211) [mem]: wifi         4220
I (13:57:23.215) [mem]: ipc0         7700
I (13:57:23.220) [mem]: Tmr Svc      7736

It would seem to me these numbers should be high enough (but maybe I'm wrong?), and I am now thinking they are probably unrelated to the "space" mentioned in the error message. I am also not sure these two errors even are related (socket select timeouts for HTTP requests vs failures to send UDP messages).

I am poking at this on two identical esp32-devkit-v4 boards (one of them soldered onto a diy pcb, the other one wired on a breadboard; one of them alternating on two different power supplies), not seeing any difference in realiability of the network or reproducibility of the issue(s).

The code can be found at https://github.com/svenfuchs/box, sans sdkconfig as I haven't found any other way to keep credentials out of version control.

My sdkconfig config can be found here.

svenfuchs commented 2 years ago

I have now moved sending out log messages via UDP to an async setup using a queue and a worker task. Apparently the whole thing is now a lot more stable.

However, after a successful OTA update (which btw also downloads the binary a lot quicker now) and restart it still sometimes gets stuck in the state where the OTA update HTTP GET request won't be able to get the hostname and getaddrinfo fails with 202.

laukik-hase commented 2 years ago

Hello, @svenfuchs

Sorry for the late response. I was investigating this issue, but unfortunately could not reproduce it (with esp_http_client and advanced_https_ota examples). Here is a similar issue - https://github.com/espressif/esp-azure/issues/55, suggesting that esp-tls: couldn't get hostname for **** could be a memory issue.

  1. Could you post logs when using esp_get_free_heap_size() and esp_get_minimum_free_heap_size() just before performing the HTTP requests (and/or wherever you are observing the error)?
  2. As you have separate tasks for each of your features, could you also post logs when using uxTaskGetStackHighWaterMark()?
  3. TRANSPORT_BASE: Failed to open a new connection: 32774 points to ESP_ERR_ESP_TLS_CONNECTION_TIMEOUT. I couldn't reproduce it with the advanced_https_ota example, so could you please check your server-side config for that? You can also increase the network timeout as given in here.
svenfuchs commented 2 years ago

Thanks for the response @laukik-hase!

I had to move on to other things in the meanwhile, but before I did I have managed to reduce the instabilities quite a bit. (I am not sure what exactly it was that helped with that in the end, but I remember tweaking memory a lot.)

I am going to close this issue. I might get back to this if I am still seeing these issues once I can pick up the project again, but in the meantime it makes no sense to keep it open.

Thanks again!