espressif / esp-protocols

Collection of ESP-IDF components related to networking protocols
181 stars 126 forks source link

esp_modem: silent disconnection leads to endless OTA (IDFGH-10572) #322

Closed bbinet closed 1 year ago

bbinet commented 1 year ago

Answers checklist.

General issue report

Context: We have some battery powered esp32c3 systems which use Quectel BG95-M3 modem to publish sensor data though MQTT.

I've noticed that when pushing OTA updates (using esp_https_ota component), the OTA update often fails and drain the battery because BG95 modem remain powered until the OTA ends, but it never ends... Looking at the DEBUG logs of esp_https_ota component, the OTA seems blocked in an endless loop saying "ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready". This is probably caused by a silent PPP disconnection.

So I think I have multiple issues here:

I've come across issue #287: may it be related to this one?

bbinet commented 1 year ago

Here are the related esp logs:

I (2371) uart: ESP_INTR_FLAG_IRAM flag not set while CONFIG_UART_ISR_IN_IRAM is enabled, flag updated
I (2381) uart: queue free spaces: 30
W (2421) uart_terminal: Rx Break
I (13121) esp-netif_lwip-ppp: Connected
W (13141) esp_rmaker_time: Waiting for time to be synchronized. This may take time.
I (13151) esp_rmaker_time: The current time is: Wed Jul  5 14:13:13 2023 +0000[GMT], DST: No.
I (13161) esp_mqtt_glue: Connecting to mq.helioslite.net
I (13171) esp_rmaker_core: Waiting for MQTT connection
I (13171) main_task: Returned from app_main()
I (15491) esp_mqtt_glue: MQTT Connected
I (15491) esp_rmaker_node_config: Reporting Node Configuration of length 904 bytes.
I (15571) esp_rmaker_param: Params MQTT Init done.
I (15571) esp_rmaker_param: Reporting params (init): {"HL":{"SPM":{}},"OTA":{"Status":"","Info":"","URL":""},"Schedule":{"Schedules":[]}}
I (15581) esp_rmaker_cmd_resp: Enabling Command-Response Module.
I (15581) esp_rmaker_core: End of wifi_wait task
I (16091) esp_rmaker_param: Received params: {"OTA":{"URL":"20230705_141204362780.bin"}}
I (16091) esp_rmaker_ota_using_params: Received value = 20230705_141204362780.bin for OTA - URL
I (17151) NimBLE: GAP procedure initiated: stop advertising.
I (17151) NimBLE: GAP procedure initiated: stop advertising.
I (17151) wifi_prov_mgr: Provisioning stopped
I (17191) wifi_prov_scheme_ble: BTDM memory released
I (17201) esp_rmaker_ota: Reporting in-progress: Starting OTA Upgrade
I (17211) esp_rmaker_param: Reporting params: {"OTA":{"Info":"Starting OTA Upgrade"}}
I (17221) esp_rmaker_param: Reporting params: {"OTA":{"Status":"in-progress"}}
W (17221) esp_rmaker_ota: Starting OTA. This may take time.
I (19581) esp_https_ota: Starting OTA...
I (19581) esp_https_ota: Writing to partition subtype 17 at offset 0x1a0000
W (19581) wifi: [pm.c,3517] Set ps type: 0
I (19591) esp_rmaker_ota: Reporting in-progress: Downloading Firmware Image
I (19601) esp_rmaker_param: Reporting params: {"OTA":{"Info":"Downloading Firmware Image"}}
I (19611) esp_rmaker_param: Reporting params: {"OTA":{"Status":"in-progress"}}
D (19691) esp_https_ota: Written image length 1024
D (19691) esp_https_ota: Written image length 2048
D (19761) esp_https_ota: Written image length 3072
D (20351) esp_https_ota: Written image length 4096
D (20431) esp_https_ota: Written image length 5120
D (20431) esp_https_ota: Written image length 6144
D (20481) esp_https_ota: Written image length 7168
D (20571) esp_https_ota: Written image length 8192
D (20981) esp_https_ota: Written image length 9216
D (20981) esp_https_ota: Written image length 10240
D (20981) esp_https_ota: Written image length 11264
D (21061) esp_https_ota: Written image length 12288
D (21131) esp_https_ota: Written image length 13312
D (21131) esp_https_ota: Written image length 14336
D (22271) esp_https_ota: Written image length 15360
D (22351) esp_https_ota: Written image length 16384
D (22411) esp_https_ota: Written image length 17408
D (22411) esp_https_ota: Written image length 18432
D (22411) esp_https_ota: Written image length 19456
D (22451) esp_https_ota: Written image length 20480
D (22521) esp_https_ota: Written image length 21504
D (22521) esp_https_ota: Written image length 22528
E (27521) transport_base: esp_tls_conn_read error, errno=No more processes
D (27521) esp_https_ota: Written image length 22733
D (32531) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (37541) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (42551) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (47561) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (52571) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (57581) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (62591) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (67601) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (72611) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (77621) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (82631) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (87641) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (92651) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (97661) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (102671) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (107681) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (112691) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (117701) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (122711) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (127721) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (132731) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
E (135521) mqtt_client: No PING_RESP, disconnected
W (135521) esp_mqtt_glue: MQTT Disconnected. Will try reconnecting in a while...
D (137741) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (142751) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (147761) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (152771) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (157781) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
E (160531) esp-tls: [sock=54] select() timeout
E (160531) esp-tls: Failed to open new connection
E (160531) transport_base: Failed to open a new connection
E (160531) mqtt_client: Error transport connect
E (160551) esp_mqtt_glue: MQTT_EVENT_ERROR
W (160551) esp_mqtt_glue: MQTT Disconnected. Will try reconnecting in a while...
D (162791) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
I (162791) esp_rmaker_ota: Image bytes read: 22733
D (167801) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
D (418301) esp_https_ota: ESP_ERR_HTTP_EAGAIN invoked: Call timed out before data was ready
...
bbinet commented 1 year ago

I've activated DEBUG log level for more components in the following logs: https://gist.github.com/bbinet/26720eac93f0fd4280ca1d195840add9

I have the impression that the more component with DEBUG log level are activated, the earlier this silent disconnection issue happens. Could the PPP silent disconnection be due to cpu being busy doing other things (like printing debug logs to uart) and then HTTP connection timeouts?

bbinet commented 1 year ago

Also note that when I change my code to connect through WiFi instead of LTE with BG95 (esp_modem), the OTA always succeeds and is much faster to complete.

david-cermak commented 1 year ago

@bbinet Which IDF version are you using? Are you also seeing this issue on IDF v5.0 or v5.1 (if it's easy for you to switch and check)?

I could only reproduce the problem when I physically disconnect the UART wires from my modem. When connected normally, it shouldn't happen and and the disconnection should be reported to the http client and OTA layers.

By silent disconnection, you mean that the PPP interface is still considered up, so you receive no event about the disconnection? You can enable LCP keepalive to prevent this issue, but I'd have to investigate why it's happening during normal operation. Is it possible that the modem restarts during connection (that would be one reason)?

bbinet commented 1 year ago

@david-cermak I was using a specific commit between IDF v5.0 and v5.1, but I've just updated to v5.1, and I still face the same issue.

By silent disconnection, I mean that I don't receive any connection error, and OTA is still trying to continue indefinitly, so I suppose that there is an issue with the PPP connection but it is still considered up.

I will have a look to LCP keepalive, and report here when ready.

No, the modem does not restart during connection: I have checked its rx/tx line of the modem uart, and the modem just stops sending data to the esp, and I can't see any visible error on uart lines.

I also tried to increase baudrate speed to 460800 for the modem uart just to see if the OTA could last shorter, hoping to reduce the risk of OTA failure, but same issue again.

Also, I noticed that when trying to connect to MQTT server, it regularly fails in the TLS handshake phase:

I (7479) app_connect: Waiting for LTE modem to be registered on network
I (13819) app_connect: Modem Connect to PPP Server
I (13819) app_connect: ~~~~~~~~~~~~~~
I (13819) app_connect: IP          : 10.181.144.244
I (13829) app_connect: Netmask     : 255.255.255.255
I (13829) app_connect: Gateway     : 10.64.64.64
I (13839) esp-netif_lwip-ppp: Connected
I (13839) app_connect: Name Server1: 8.8.8.8
I (13849) app_connect: Name Server2: 8.8.4.4
I (13849) app_connect: ~~~~~~~~~~~~~~
I (13859) app_connect: GOT ip event!!!
I (13859) sensornode: Connected: got ip
W (13869) esp_rmaker_time: Waiting for time to be synchronized. This may take time.
I (13869) main_task: Returned from app_main()
I (15819) hl_spsn_service: SNTP Synchronised.
I (15869) esp_rmaker_time: The current time is: Fri Jul 21 14:49:56 2023 +0000[GMT], DST: No.
I (15869) esp_mqtt_glue: Connecting to mq.helioslite.net
I (15879) esp_rmaker_core: Waiting for MQTT connection
I (16649) wifi:flush txq
I (16649) wifi:stop sw txq
I (16649) wifi:lmac stop hw txq
E (17659) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x004C
I (17659) esp-tls-mbedtls: Certificate verified.
E (17659) esp-tls: Failed to open new connection
E (17669) transport_base: Failed to open a new connection
E (17669) mqtt_client: Error transport connect
E (17679) esp_mqtt_glue: MQTT_EVENT_ERROR
W (17679) esp_mqtt_glue: MQTT Disconnected. Will try reconnecting in a while...

I think this is a different issue, but this is still an issue with the modem only (I don't see such handshake issues while connecting through wifi).

bbinet commented 1 year ago

@david-cermak I've just implemented LCP keepalive, and it seems the PPP session is still up (I don't reveive any PPP STATUS about ppp termination...)

Here are the full logs: https://gist.githubusercontent.com/bbinet/b8e7621aed9bdc1183a7b92a05c5dcf9/raw/1dc42a316193717f33653ac869bc5e8d0d4714d9/uart.log

If PPP is still up, I don't understand why I can't get any more data for OTA and why MQTT is disconnected. Any idea?

david-cermak commented 1 year ago

Could you please also enable PPP_DEBUG (it's just next to the LCP in menuconfig) and turn off the MQTT task.

What I find really strange is that the MQTT client got disconnected, while OTA still keeps trying.

This definitely looks like a bug, but need to find out if it's in the networking layers or in OTA.

bbinet commented 1 year ago

PPP_DEBUG was already enabled. Here are new logs with all DEBUG logs activated except for some verbose components (nvs, gdma, CMUX): https://gist.githubusercontent.com/bbinet/b0cf418a48f5851ab852d6f88d8c6dff/raw/b272e8197bc6316e3a73ec56b965f307cd148779/uart2.log

Also, here is my sdkconfig file if you want to try to reproduce the issue: https://gist.githubusercontent.com/bbinet/b0cf418a48f5851ab852d6f88d8c6dff/raw/b272e8197bc6316e3a73ec56b965f307cd148779/sdkconfig

I kept MQTT task running for now as it's not easy to turn off for me because I use MQTT to trigger the OTA update. I'll send you an update with MQTT task disabled soon.

I think that the MQTT client got disconnected because the MQTT keepalive timeout expires at MQTT level (see log: "mqtt_client: No PING_RESP, disconnected").

david-cermak commented 1 year ago

interesting, it looks like the link is still up and running, so the LCP echo replies are still being sent by the device.

I think that the MQTT client got disconnected because the MQTT keepalive timeout expires at MQTT level

Oh, okay. makes sense.

To workaround this issue, you can enable TCP keepalive checks for OTA as well, here an example:

https://github.com/espressif/esp-idf/blob/28167ea5a3cdd49227216c4131a9df7f4eda260e/examples/system/ota/simple_ota_example/main/simple_ota_example.c#L100C35-L100C35

bbinet commented 1 year ago

I've just checked and keep_alive_enable = true was already set.

I've already implemented a temporary workaround for the endless OTA: I've setup a timer to restart the ESP after 30min.

Do you have any idea why the connection may not be working whereas the PPP link is still up?

david-cermak commented 1 year ago

I've just checked and keep_alive_enable = true was already set.

Weird, this line should've appeared in your log, but I didn't see it anywhere...

I've already implemented a temporary workaround for the endless OTA: I've setup a timer to restart the ESP after 30min.

The TCP level keepalive mechanism should also work, but need to understand what's wrong in your configuration.

Do you have any idea why the connection may not be working whereas the PPP link is still up?

No idea, one theory could be that the internet connectivity got broken somewhere upstream? In that case you would be able to ping your remote IP, but cannot access the OTA endpoint.

bbinet commented 1 year ago

I've just checked and keep_alive_enable = true was already set.

Weird, this line should've appeared in your log, but I didn't see it anywhere...

You're right, this line does not appear, so the tcp keepalive is finally not enabled, but I've double checked, and the keep_alive_enable = true is correctly set: I've added a log message in esp_http_client_init to confirm (here: https://github.com/espressif/esp-idf/blob/master/components/esp_http_client/esp_http_client.c#L668). So maybe a bug: this keepalive config may be lost somewhere before the esp_tls_set_socket_options function is called.

Do you have any idea why the connection may not be working whereas the PPP link is still up?

No idea, one theory could be that the internet connectivity got broken somewhere upstream? In that case you would be able to ping your remote IP, but cannot access the OTA endpoint.

Ok, I will try to use my 1NCE SIM card in a 4G Mobile WiFi Hot Spot product to see if it comes from my upstream 1NCE operator, or if it was really an issue with my own esp+bg95 modem.

bbinet commented 1 year ago

Ok, I will try to use my 1NCE SIM card in a 4G Mobile WiFi Hot Spot product to see if it comes from my upstream 1NCE operator, or if it was really an issue with my own esp+bg95 modem.

So I've just tried to insert my SIM card in a 4G modem and then connect my esp32c3 to the 4G modem through WiFi: everything is running fine and OTA performs with a much higher data rate than when using my BG95 with esp_modem. So this is probably not an issue upstream with my operator...

david-cermak commented 1 year ago

Some updates about these issues (sorry for the late reply):

1) Default config (no TCP keepalive, nor LCP echo) disconnection:

2) LCP echo: This works correctly on my end, so either there's something wrong with your configuration (saw you're using rainmaker?) or your modem actually sends the LCP echo replies.

3) TCP keepalive: There's a bug in http client's implementation in IDF, I'll post a fix MR soon. Here's a quick patch to check:

diff --git a/components/esp_http_client/esp_http_client.c b/components/esp_http_client/esp_http_client.c
index 0deb011253..f10b487224 100644
--- a/components/esp_http_client/esp_http_client.c
+++ b/components/esp_http_client/esp_http_client.c
@@ -692,6 +692,14 @@ esp_http_client_handle_t esp_http_client_init(const esp_http_client_config_t *co
         goto error;
     }

+    if (config->keep_alive_enable == true) {
+        client->keep_alive_cfg.keep_alive_enable = true;
+        client->keep_alive_cfg.keep_alive_idle = (config->keep_alive_idle == 0) ? DEFAULT_KEEP_ALIVE_IDLE : config->keep_alive_idle;
+        client->keep_alive_cfg.keep_alive_interval = (config->keep_alive_interval == 0) ? DEFAULT_KEEP_ALIVE_INTERVAL : config->keep_alive_interval;
+        client->keep_alive_cfg.keep_alive_count =  (config->keep_alive_count == 0) ? DEFAULT_KEEP_ALIVE_COUNT : config->keep_alive_count;
+        esp_transport_ssl_set_keep_alive(ssl, &client->keep_alive_cfg);
+    }
+
     if (config->crt_bundle_attach != NULL) {
 #ifdef CONFIG_MBEDTLS_CERTIFICATE_BUNDLE

(the problem is that the keepalive was enabled correctly for TCP transport, but wasn't enabled for SSL transport which you're using)