espressif / esp-protocols

Collection of ESP-IDF components related to networking protocols
165 stars 115 forks source link

esp_modem "NO CARRIER" message not correctly handled (IDFGH-10017) #287

Closed FabITA1 closed 9 months ago

FabITA1 commented 1 year ago

Answers checklist.

General issue report

IDF version: v4.4.3 Esp_modem: v1.0.0 Esp module: ESP32-S3-WROOM-1 LTE module: Quectel EG915 (very similar to BG96) Data mode: ESP_MODEM_MODE_CMUX

The communication between cpu and LTE module works fine. But sometimes the data connection with the network drops and the modem generate a "NO CARRIER" message. Seems that this condition was not correctly handle (I make some mistake...) . I thought it generated a IP_EVENT_PPP_LOST_IP event that inform me about the new communication state. Instead, I see other errors caused by the drop of connection, like:

W (1773301) TRANSPORT_BASE: Poll timeout or error, errno=Connection already in progress, fd=57, timeout_ms=10000 E (1773301) MQTT_CLIENT: Writing didn't complete in specified timeout: errno=119 I (1773301) myMqttModule: MQTT_EVENT_ERROR E (1773301) MQTT_CLIENT: Error to resend data

There is a way to correctly handle the "NO CARRIER" message? Thanks

david-cermak commented 1 year ago

But you receive the IP_EVENT_PPP_LOST_IP event eventually (after all those mqtt errors), correct? If so, you just need a faster reaction from PPP protocol. To achieve that, you can enable LCP keepalive mechanism in lwip menuconfig (or even shorten the default interval)

[*] Enable LCP ECHO
(3)     Echo interval (s) (NEW)
(3)     Maximum echo failures (NEW)

If not, please make sure that you've registered on the specific event (or on all events from IP_EVENT base). To make it even more granular, you can also enable PPP protocols status events (again in lwip component config):

(Top) → Component config → LWIP
[* ] Enable Notify Phase Callback
FabITA1 commented 1 year ago

No, I receive no event in this situation I have registered the event halders in this way: ESP_ERROR_CHECK(esp_event_handler_register(IP_EVENT, ESP_EVENT_ANY_ID, &on_ip_event, NULL)) The Enable Notify Phase Callback is flagged,

FabITA1 commented 1 year ago

Attached a ESP32 log (Debug view) and the uart tx pin of the Quectel module. The NO CARRIER message don't trigger any event or message on the Esp log side. NOTE: I forced the disconnection by antenna disconnecting.

Image

david-cermak commented 1 year ago

Thanks for sharing the log. It seems like the device is just sending NO CARRIER message in text instead of gracefully exiting the PPP session. which is treated as garbage on ESP32 side (and dropped) and the PPP client still tries to maintains the connection.

I'm afraid that in this case, there's no way to tell that the connection failed and not able to post any event. The solution is to use the LCP sub-protocol as noted above and check for the PPPERR_PEERDEAD status. Here are the steps to register an event on this timeout error: 1) Enable LCP keepalive and status events in menuconfig:

[*] Enable LCP ECHO
[*] Enable Notify Phase Callback

2) Listen to the PPP_STATUS events

    ESP_ERROR_CHECK(esp_event_handler_register(NETIF_PPP_STATUS, ESP_EVENT_ANY_ID, &on_ppp_changed, NULL));

3) Check for NETIF_PPP_ERRORPEERDEAD event id in the handler:

static void on_ppp_changed(void *arg, esp_event_base_t base, int32_t event_id, void *data)
{
    if (event_id == NETIF_PPP_ERRORPEERDEAD) {
        ESP_LOGE(TAG, "Connection lost!");
    }
}
FabITA1 commented 1 year ago

David, many thanks for your support. Your suggestions are very useful. However, I have two questions:

  1. the LCP Echo option generate data traffic (= costs) on the LTE side? Because this is no good for my application.
  2. what should be the right behavior of the modem? At the message NO CARRIER I see two cmux frame (dlci 0 and 1): isn't that what we expect? Is it possible that I made some mistake in the setup settings? I find a little strange that it has a different behavior than others devices.
david-cermak commented 1 year ago

the LCP Echo option generate data traffic (= costs) on the LTE side? Because this is no good for my application.

LCP is a link protocol within the PPP session, so shouldn't generate anything above IP = no data traffic from the modem to cellular provider.

what should be the right behavior of the modem? At the message NO CARRIER I see two cmux frame (dlci 0 and 1): isn't that what we expect?

I would expect entering a disconnection state of the PPP session (phases of "terminate" or "dead"). As you can see the NO CARRIER comes from dlci 1, which is used for the PPP protocol, dlci 0 is the control terminal so maybe the modem is trying to exit the CMUX session while abandoning the subprotocols?

Is it possible that I made some mistake in the setup settings?

No, I can easily reproduce your issue just by disconnecting the UART wires to my device. Client's side has no knowledge on the server, cannot tell if it's been disconnecting without the LCP keepalive mechanism.

I find a little strange that it has a different behavior than others devices.

TBH, I'm not surprised. After providing support for this library I found a few deviations in some devices, usually related to switching modes and CMUX. This issue is related to both :smile:

FabITA1 commented 1 year ago

Thanks David. I have implemented the LCP as you suggested. It works better! I have set Echo interval to 4 sec and Max echo failure to 1. After few seconds of NO CARRIER event on uart, I obtein this:

E (146511) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 54 E (146511) MQTT_CLIENT: Poll read error: 119, aborting connection .............. E (158511) esp-netif_lwip-ppp: Connection timeout I (158511) Gest4G: PPP state changed event 9 E (158511) Gest4G: Connection lost! I (158511) esp_modem_netif: PPP state changed event 9

The Connection lost event has triggered 12 secs after the first Transport Base fail. It is normal?

david-cermak commented 1 year ago

Yes, 12 seconds sound about right (4 seconds to one 1 echo timeout, another 4 and a half to treat it as a failure).

But, the weird part is the first error message you shared:

E (146511) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 54
E (146511) MQTT_CLIENT: Poll read error: 119, aborting connection

Are you getting this immediately after receiving the NO-CARRIER message or after some delay (few seconds)? This seems like the lwip knows about the PPP disconnection. (this is something I wasn't able to simulate)

Could you please enable these options in lwip component config?

(Top) → Component config → LWIP
[*] Enable PPP debug log output
[*] Enable LWIP Debug

and share the log?

Also, is it possible for you to easily switch to IDF v5.0 (just to test the disconnection)? (maybe we're missing something on v4.x)

FabITA1 commented 1 year ago

The messages indicating trouble on connection appears few seconds after NO CARRIER (4-5 secs). This the new log:

pppos_write[1]: len=12 pppos_input[1]: got 18 bytes I (110531) Gest4G: rssi: 26, ber: 99 pppos_input[1]: got 14 bytes pppos_write[1]: len=12 I (114531) Gest4G: rssi: 99, ber: 99 No response to 1 echo-requests Serial link appears to be disconnected. ppp phase changed[1]: phase=11 ppp phase changed[1]: phase=9 sifvjcomp[1]: VJ compress enable=0 slot=0 max slot=0 sifdown[1]: err_code=9 ppp phase changed[1]: phase=6 pppos_send_config[1]: out_accm=FF FF FF FF ppp_send_config[1] pppos_recv_config[1]: in_accm=0 0 0 0 ppp_recv_config[1] sent [LCP TermReq id=0x2 "Peer not responding"] pppos_write[1]: len=27 E (118491) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 54 E (118491) MQTT_CLIENT: Poll read error: 119, aborting connection I (118531) Gest4G: rssi: 99, ber: 99 I (118611) GestMqtt: >>> MQTT disconnect I (122531) Gest4G: creg: 5 sent [LCP TermReq id=0x3 "Peer not responding"] pppos_write[1]: len=27 I (125491) GestComm: Retry connect I (125491) GestMqtt: MQTT other event id: 7 E (125491) esp-tls: [sock=54] connect() error: Host is unreachable E (125491) TRANSPORT_BASE: Failed to open a new connection: 32772 E (125491) MQTT_CLIENT: Error transport connect I (125491) GestMqtt: MQTT_EVENT_ERROR I (126491) GestComm: Retry connect I (126491) GestMqtt: MQTT other event id: 7 E (126491) esp-tls: [sock=54] connect() error: Host is unreachable E (126491) TRANSPORT_BASE: Failed to open a new connection: 32772 E (126491) MQTT_CLIENT: Error transport connect I (126491) GestMqtt: MQTT_EVENT_ERROR I (126531) Gest4G: rssi: 99, ber: 99 I (127491) GestComm: Retry connect I (127491) GestMqtt: MQTT other event id: 7 E (127491) esp-tls: [sock=54] connect() error: Host is unreachable E (127491) TRANSPORT_BASE: Failed to open a new connection: 32772 E (127491) MQTT_CLIENT: Error transport connect I (127491) GestMqtt: MQTT_EVENT_ERROR I (128491) GestComm: Retry connect I (128491) GestMqtt: MQTT other event id: 7 E (128491) esp-tls: [sock=54] connect() error: Host is unreachable E (128491) TRANSPORT_BASE: Failed to open a new connection: 32772 E (128491) MQTT_CLIENT: Error transport connect I (128491) GestMqtt: MQTT_EVENT_ERROR I (129491) GestComm: Retry connect I (129491) GestMqtt: MQTT other event id: 7 E (129491) esp-tls: [sock=54] connect() error: Host is unreachable E (129491) TRANSPORT_BASE: Failed to open a new connection: 32772 E (129491) MQTT_CLIENT: Error transport connect I (129491) GestMqtt: MQTT_EVENT_ERROR ppp phase changed[1]: phase=12 Connection terminated. ppp_link_terminated[1] ppp_link_end[1] ppp phase changed[1]: phase=0 E (130491) esp-netif_lwip-ppp: Connection timeout I (130491) Gest4G: PPP state changed event 9 E (130491) Gest4G: Connection lost! I (130491) esp_modem_netif: PPP state changed event 9 ppp_link_terminated[1]: finished.

My application receive only the mqtt disconnect event. Because I don't know the cause, I retry to re-connect many times. Only after 12 seconds I realize that the PPP was lost. If it is normal, I try to workaround

I have problem to switch now to IDF 5.0, sorry. I tried to build my application with 5.0 some months ago, but without success.... I should try it again, but I don't have much time right now...

FabITA1 commented 1 year ago

David, have you had the opportunity to look at the log? If you confirm me that this is the correct behavior, I'll try to find a workaround

david-cermak commented 1 year ago

This looks weird, could you please try to modify your handler:

static void on_ppp_changed(void *arg, esp_event_base_t base, int32_t event_id, void *data)
{
    if (event_id == NETIF_PPP_ERRORPEERDEAD || event_id == NETIF_PPP_ERRORCONNECT) {
        ESP_LOGE(TAG, "Connection lost!");
    }
}

to see if you get the connection-lost event (earlier)?

I can see in the log that the phase got updated correctly (after those 4-5 seconds), but for some reason it's not propagated as the connection error event.

FabITA1 commented 1 year ago

I have added the new event on the handler as you suggested, but unfortunately the result don't change.

FabITA1 commented 1 year ago

I have also discover now that, after enabled the LCP echo, the OTA process does not work anymore ☹ Here is what happening:

I (71897) HTTP_CLIENT: Body received in fetch header state, 0x3fcc5cab, 205 I (71897) esp_https_ota: Starting OTA... I (71897) esp_https_ota: Writing to partition subtype 16 at offset 0x210000 I (71897) FwUpg: Running firmware version: 002 E (78537) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 57 E (78537) MQTT_CLIENT: Poll read error: 119, aborting connection E (78547) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 58 E (78557) TRANSPORT_BASE: poll_read select error 0, errno = Success, fd = 58 E (78557) esp_https_ota: data read -1, errno 0 E (78557) FwUpg: Complete data was not received. E (78927) esp-netif_lwip-ppp: Connection timeout I (78927) Gest4G: PPP state changed event 9 E (78927) Gest4G: Connection lost! I (78927) esp_modem_netif: PPP state changed event 9

david-cermak commented 1 year ago

I'm sorry. the phase to look for early disconnection is Terminate, so the event-id should have been: NETIF_PPP_PHASE_TERMINATE.

It should be also possible to enable CONFIG_LWIP_NETIF_STATUS_CALLBACK and register a callback on link down event:

netif_set_status_callback(lwip_netif, netif_status_callback);

that should be called when the link gets down, somewhere around these lines in the log:

ppp phase changed[1]: phase=9
sifvjcomp[1]: VJ compress enable=0 slot=0 max slot=0
sifdown[1]: err_code=9

As for the OTA. This is a problem as the intensive flash operation might stall the CPU for longer than LCP timeout. Maybe its possible to adjust the timeouts to enable both OTA and timely responsiveness. Another option is to implement something like a two phase OTA, which would pause the network before flashing and resume it after flashing.

FabITA1 commented 1 year ago

Sorry for the delaied response. I tried both the terminate_event and the status_callback without significant improvements. In the end, I managed the problem at the application level by monitoring only the mqtt events.

Instead, what do you pratical mean with "two phase OTA, which would pause the network before flashing and resume it after flashing"?

It is not possbile to run-time change the LCP timeout parameters?

david-cermak commented 1 year ago

I tried both the terminate_event and the status_callback without significant improvements. In the end, I managed the problem at the application level by monitoring only the mqtt events.

This is weird, it must work even on v4.4. Have you enabled those events both in menuconfig and runtime with esp_netif_ppp_set_params() ? This should also work without LCP keepalive. I can create a simple example if you're still having trouble to get the events/callbacks.

what do you pratical mean with "two phase OTA, which would pause the network before flashing and resume it after flashing"?

I meant using native OTA interface directly (instead of https-ota component), similarly as you can see https://github.com/espressif/esp-idf/tree/master/examples/system/ota/native_ota_example You can perform OTA in two phases: 1) Check the binary (making just the https connection and reading headers to get sizes and versions of the binary) to decide if you're going to upgrade or not 1a) If yes, then turn off the network and bulk erase the partition 2) Turn the network back on and continue with OTA. This would still need to flash write, but that's not such an intensive operation to cause LCP timeouts. (can also create a simple example of this native two phase OTA)

FabITA1 commented 1 year ago

Thanks David! Ok, now i remember what the problem was... Yes, I enabled the status_callback in menuconfig: CONFIG_LWIP_NETIF_STATUS_CALLBACK=y but when I registered the callback function, I got this: error: implicit declaration of function 'netif_set_status_callback'; did you mean 'netif_status_callback'? [-Werror=implicit-function-declaration] I don't understood where it is declared.

perhaps a small example would be help me

david-cermak commented 1 year ago

I'm sorry, was suggesting direct interfacing lwIP API, which is not always very simple. Here's a quick demo of the two methods:

1) option: PPP events:

static void on_ppp_changed(void *arg, esp_event_base_t base, int32_t event_id, void *data)
{
    ESP_LOGI(TAG, "on_ppp_changed: Event %x", event_id);
}

....
   // in main (after creating `esp_netif`, before starting the DCE
    esp_netif_ppp_config_t ppp_config = { true, true };
    esp_netif_ppp_set_params(ppp_netif, &ppp_config);
    ESP_ERROR_CHECK(esp_event_handler_register(NETIF_PPP_STATUS, ESP_EVENT_ANY_ID, &on_ppp_changed, NULL));

you should get notified of all the PPP events. When the interface goes down, you should get:

D (55238) esp-netif_lwip-ppp: Phase Terminate
I (55248) modem_console: on_ppp_changed: Event 10b

2) option: Netif callback

#include "lwip/netif.h"

...
void netif_status_callback(struct netif *netif)
{
    ESP_LOGI(TAG, "netif(%p) is %d", netif, netif_is_up(netif));
}

...
static void on_ip_event(void *arg, esp_event_base_t event_base,
                        int32_t event_id, void *event_data)
{
    ESP_LOGD(TAG, "IP event! %d", event_id);
    if (event_id == IP_EVENT_PPP_GOT_IP) {
       ... here we're sure that the lwip netif has been created
        struct netif *lwip_netif = NULL;
        NETIF_FOREACH(lwip_netif) {
            if (lwip_netif->name[0] == 'p' && lwip_netif->name[1] == 'p') {
                break;
            }
        }
        if (lwip_netif != NULL) {
            netif_set_status_callback(lwip_netif, netif_status_callback);
        }

    }
}
FabITA1 commented 1 year ago

David, it works! I tried the option 1 (PPP events) and I obtain:

I (90598) Gest4G: PPP state changed event 267 E (90598) Gest4G: Connection lost! I (90598) Gest4G: PPP state changed event 265 I (90598) Gest4G: PPP state changed event 262 E (90598) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 57 E (90598) MQTT_CLIENT: Poll read error: 119, aborting connection

but if I disable the LCP Echo no events are triggered. So, I am forced to solve the OTA problem. Thanks again

david-cermak commented 1 year ago

So, I am forced to solve the OTA problem.

Does it also fail for you if you have the UART ISR in IRAM?

CONFIG_UART_ISR_IN_IRAM=y

If so, we should probably add more functions to IRAM, the reception part up to pppos_input and the lcp processing sections of PPP (LcpSendEchoRequest(), LcpEchoCheck(), LcpEchoTimeout())

(alternatively, you can implement this two phase OTA as mentioned previously using native OTA API -- I can help to put together a simple demo)

FabITA1 commented 1 year ago

David, I had already activated this option when I started working with the ota via modem, and this solve the problem. But the issue came back when I activated the LCP Echo for detect the PPP disconnection. If you can, any help is appreciated... Thanks

FabITA1 commented 1 year ago

David, I update the situation. I have tried to implement the dual-phase OTA as you suggest following the example you post some weeks ago: https://github.com/espressif/esp-idf/tree/master/examples/system/ota/native_ota_example

When I use the LAN interface (the other interface of my board) seems to works. However, I notice that it is slower than the previous implementation: 37 vs 23 seconds.

But when I try the fw update over PPP the result is this:

I (44680) FwUpg: Running partition type 0 subtype 0 (offset 0x00010000) I (46610) HTTP_CLIENT: Body received in fetch header state, 0x3fccf53f, 205 I (46610) FwUpg: Writing to partition subtype 16 at offset 0x210000 I (46610) FwUpg: New firmware version: 003.H I (46610) FwUpg: Running firmware version: 003.H I (46610) FwUpg: esp_ota_begin succeeded W (55990) uart_terminal: Ring Buffer Full W (56000) CMUX: Protocol mismatch: Missed trailing SOF, recovering... I (56010) CMUX: Protocol recovered E (62730) esp-tls-mbedtls: read error :-0x004C: E (62730) TRANSPORT_BASE: esp_tls_conn_read error, errno=Software caused connection abort W (62730) HTTP_CLIENT: esp_transport_read returned:-76 and errno:113 E (62730) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 57 E (62770) MQTT_CLIENT: Poll read error: 119, aborting connection E (62780) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 58 E (62780) FwUpg: Error: SSL data read error E (62780) FwUpg: Exiting task due to fatal error... I (62870) GestMqtt: >>> MQTT disconnect E (63190) esp-netif_lwip-ppp: Connection timeout I (63190) Gest4G: PPP state changed event 9 E (63190) Gest4G: Connection lost! I (63190) esp_modem_netif: PPP state changed event 9

Edit: I use the uart flow control mode +IFC=2,0 - I will check if it works effectively. But I don't know if this could be the problem.

david-cermak commented 1 year ago

@FabITA1 The IDF example with native-ota example follows the same workflow as the default https-ota example. To implement what I was referring to as a two-phase OTA you would have to 1) download the app descriptor to check if you're going to update or not 2) disconnect (as the next step would break the connection) 3) start the ota with OTA_SIZE_UNKNOWN parameter to bulk-erase the partition -- this will take a few seconds 4) resume network, reconnect 5) complete the OTA (flash write operations are shouldn't cause LCP keepalive to timeout) (so it's actually a multi-phase OTA)

here's a quick patch that modifies the IDF example accordingly:

--- a/examples/system/ota/native_ota_example/main/native_ota_example.c
+++ b/examples/system/ota/native_ota_example/main/native_ota_example.c
@@ -139,6 +140,7 @@ static void ota_example_task(void *pvParameter)
     int binary_file_length = 0;
     /*deal with all receive packet*/
     bool image_header_was_checked = false;
+    bool partition_erased = false;
     while (1) {
         int data_read = esp_http_client_read(client, ota_write_data, BUFFSIZE);
         if (data_read < 0) {
@@ -182,16 +184,32 @@ static void ota_example_task(void *pvParameter)
                     }
 #endif

-                    image_header_was_checked = true;
-
-                    err = esp_ota_begin(update_partition, OTA_WITH_SEQUENTIAL_WRITES, &update_handle);
-                    if (err != ESP_OK) {
-                        ESP_LOGE(TAG, "esp_ota_begin failed (%s)", esp_err_to_name(err));
-                        http_cleanup(client);
-                        esp_ota_abort(update_handle);
-                        task_fatal_error();
+                    if (!partition_erased) {
+                        partition_erased = true;
+                        // now we disconnect
+                        ESP_LOGW(TAG, "Disconnecting before bulk erase");
+                        esp_http_client_close(client);
+                        err = esp_ota_begin(update_partition, OTA_SIZE_UNKNOWN, &update_handle);
+                        if (err != ESP_OK) {
+                            ESP_LOGE(TAG, "esp_ota_begin failed (%s)", esp_err_to_name(err));
+                            http_cleanup(client);
+                            esp_ota_abort(update_handle);
+                            task_fatal_error();
+                        }
+                        ESP_LOGI(TAG, "esp_ota_begin succeeded..");
+                        // and reconnect
+                        ESP_LOGW(TAG, "Resuming network operations");
+                        err = esp_http_client_open(client, 0);
+                        if (err != ESP_OK) {
+                            ESP_LOGE(TAG, "Failed to open HTTP connection: %s", esp_err_to_name(err));
+                            esp_http_client_cleanup(client);
+                            task_fatal_error();
+                        }
+                        esp_http_client_fetch_headers(client);
+                        continue;
+                    } else {
+                        image_header_was_checked = true;
                     }
-                    ESP_LOGI(TAG, "esp_ota_begin succeeded");
                 } else {
                     ESP_LOGE(TAG, "received package is not fit len");
                     http_cleanup(client);

(note that you can place CMUX-exit and CMUX-enter at the placeholders -- warning logs about disconnection and reconnection , but it would probably also work without it, it's just expected to disconnect from the PPP server)

FabITA1 commented 1 year ago

@david-cermak thanks!! I have made the mods of my implementation, but the result is not stable. I have tried many times the upgrade process, but only 1 or 2 with success. In these cases, however, the update was very very slow and I see these message log every 10-15 seconds:

E (1004186) TRANSPORT_BASE: esp_tls_conn_read error, errno=No more processes W (1004186) HTTP_CLIENT: esp_transport_read returned:-26880 and errno:11

In all other cases, the process stucks almost immediately. Here an example of what happens:

I (66352) FwUpg: Running partition type 0 subtype 0 (offset 0x00010000) I (68232) HTTP_CLIENT: Body received in fetch header state, 0x3fcd0913, 205 I (68232) FwUpg: Writing to partition subtype 16 at offset 0x210000 I (68232) FwUpg: New firmware version: 003.H I (68232) FwUpg: Running firmware version: 003.H W (68232) FwUpg: Disconnecting before bulk erase W (68442) CMUX: Protocol mismatch: Missed trailing SOF, recovering... I (68442) CMUX: Protocol recovered W (68442) uart_terminal: Ring Buffer Full W (68442) CMUX: Protocol mismatch: Missed trailing SOF, recovering... I (68642) CMUX: Protocol recovered W (68852) CMUX: Protocol mismatch: Missed leading SOF, recovering... I (68852) CMUX: Protocol recovered W (68852) uart_terminal: Ring Buffer Full I (74972) FwUpg: esp_ota_begin succeeded.. W (74972) FwUpg: Resuming network operations I (77072) HTTP_CLIENT: Body received in fetch header state, 0x3fcd0913, 205 I (77072) FwUpg: New firmware version: 003.H I (77072) FwUpg: Running firmware version: 003.H E (82772) esp-tls-mbedtls: read error :-0x004C: E (82772) TRANSPORT_BASE: esp_tls_conn_read error, errno=Software caused connection abort W (82772) HTTP_CLIENT: esp_transport_read returned:-76 and errno:113 E (82772) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 57 E (82772) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=119 I (82772) GestMqtt: MQTT_EVENT_ERROR E (82772) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1 E (82772) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 58 E (82772) FwUpg: Error: SSL data read error E (82772) FwUpg: Exiting task due to fatal error... E (83142) esp-netif_lwip-ppp: Connection timeout I (83142) Gest4G: PPP state changed event 9 E (83142) Gest4G: Connection lost!

Maybe I didn't understand what to do?

[Off topic: but is not possible to simply deactivate/activate the LCP Echo option while the application is running?]

Thank you

david-cermak commented 1 year ago

This is weird, I don't see any terminate event, or other ppp transitions caused by LCP timeouts. Do you still have the UART interrupt in IRAM?

but is not possible to simply deactivate/activate the LCP Echo option while the application is running?

These are compile time configuration, but I think we can provide an lwip hook to let users update it runtime. I'll make an example.

FabITA1 commented 1 year ago

Yes, I confirm David, the Uart ISR in IRAM flag is set.

david-cermak commented 1 year ago

Yes, I confirm David, the Uart ISR in IRAM flag is set.

But it doesn't fail due to LCP timeout (at least cannot see the issue in your log). Does this native OTA work when you disable LCP keepalive?

provide an lwip hook to let users update it runtime.

This is not possible, unfortunately. We'd have to update IDF to support that (no need to update lwip tho). Could you please check if after applying this patch you can complete the OTA (after calling esp_netif_disable_lcp())?

diff --git a/components/esp_netif/include/esp_netif_ppp.h b/components/esp_netif/include/esp_netif_ppp.h
index 159388e80e..a78934ac7f 100644
--- a/components/esp_netif/include/esp_netif_ppp.h
+++ b/components/esp_netif/include/esp_netif_ppp.h
@@ -120,6 +120,8 @@ esp_err_t esp_netif_ppp_set_params(esp_netif_t *netif, const esp_netif_ppp_confi
  */
 esp_err_t esp_netif_ppp_get_params(esp_netif_t *netif, esp_netif_ppp_config_t *config);

+esp_err_t esp_netif_disable_lcp(esp_netif_t *esp_netif);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/components/esp_netif/lwip/esp_netif_lwip_ppp.c b/components/esp_netif/lwip/esp_netif_lwip_ppp.c
index 4bf697bc27..46dd95a10f 100644
--- a/components/esp_netif/lwip/esp_netif_lwip_ppp.c
+++ b/components/esp_netif/lwip/esp_netif_lwip_ppp.c
@@ -269,6 +269,16 @@ esp_err_t esp_netif_start_ppp(esp_netif_t *esp_netif)
     return ESP_OK;
 }

+esp_err_t esp_netif_disable_lcp(esp_netif_t *esp_netif)
+{
+    netif_related_data_t *netif_related = esp_netif->related_data;
+    lwip_peer2peer_ctx_t *ppp_ctx = (lwip_peer2peer_ctx_t *)netif_related;
+    assert(ppp_ctx->base.netif_type == PPP_LWIP_NETIF);
+    ppp_ctx->ppp->settings.lcp_echo_interval = 0;
+    ppp_ctx->ppp->settings.lcp_echo_fails = 0;
+    return ESP_OK;
+}
+
 esp_netif_recv_ret_t esp_netif_lwip_ppp_input(void *ppp_ctx, void *buffer, size_t len, void *eb)
 {
     struct lwip_peer2peer_ctx * obj = ppp_ctx;
FabITA1 commented 1 year ago

In the meantime, I'll answer the first question: Yes, disabling the LCP Echo the OTA works fine.

And now the second question: I have tried what you suggest. I hope maked no errors, but after the esp_netif_disable_lcp() call, I get this error repeatedly: E (62809) esp-netif_lwip-ppp: pppos_input_tcpip failed with -1 ....... .......