espressif / esp-azure

SDK to connect ESP8266 and ESP32 to Microsoft Azure IoT services
176 stars 92 forks source link

Loss of wifi AP connection results in Azure/TLS errors on recovery (IDFGH-7210) (CA-220) #127

Open rtheil-growlink opened 2 years ago

rtheil-growlink commented 2 years ago

Environment

Problem Description

ESP32 on esp-idf 4.3.2

I've adapted the Azure example project into our project using the port stuff from the example. I'm testing different disconnection scenarios and if I have a controller up and running and connected to azure, then I power down the wifi AP, the controller loses connectivity as expected, but is never able to recover once reconnected.

Here's the log file from the disconnection event. It starts with a confirmation of a message sent to Azure IoT hub. Then the wifi disconnection happens, and right away there's a tls error, "esp-tls-mbedtls: read error :-76". I've seen this in other cases and when this happens Azure client is never able to recover. Later, there are errors about, "tlsio_esp_tls_send_async without a prior successful open". This repeats.

I (191518) AzureService: Confirmation[15] received for message tracking id = 15 with result = IOTHUB_CLIENT_CONFIRMATION_OK
I (213958) wifi:bcn_timout,ap_probe_send_start
I (216468) wifi:ap_probe_send over, resett wifi status to disassoc
I (216468) wifi:state: run -> init (c800)
I (216468) wifi:pm stop, total sleep time: 178968508 us / 213361706 us
W (216468) wifi:<ba-del>idx
W (216478) wifi:<ba-del>idx
I (216478) wifi:new:<9,0>, old:<9,0>, ap:<255,255>, sta:<9,0>, prof:1
E (216488) esp-tls-mbedtls: read error :-76:
E (216488) UDP: stop
I (216568) gpio: GPIO[12]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 1| Pulldown: 0| Intr:0 
W (221558) wifi:Haven't to connect to a suitable AP now!
W (221558) wifi:Haven't to connect to a suitable AP now!
I (221558) WiFiClass: rssi: 0
I (230878) gpio: GPIO[12]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 1| Pulldown: 0| Intr:0 
I (241458) gpio: GPIO[12]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 1| Pulldown: 0| Intr:0 
W (251608) wifi:Haven't to connect to a suitable AP now!
W (251608) wifi:Haven't to connect to a suitable AP now!
I (251608) WiFiClass: rssi: 0
I (251618) gpio: GPIO[12]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 1| Pulldown: 0| Intr:0 
Error: Time:Thu Apr 14 19:22:24 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:22:24 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:22:24 PINGREQ
Error: Time:Thu Apr 14 19:22:24 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:22:24 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data

Once the wifi is plugged back in and started up, the controller will reconnect successfully to wifi AP, then throw a bunch more tls/azure errors, get an IP, then continue to throw the same tls and azure client errors. The controller is unable to recover from this and must be reset to get the Azure client working again.

-> 19:27:31 PINGREQ

I (561108) wifi:new:<9,0>, old:<9,0>, ap:<255,255>, sta:<9,0>, prof:1
I (561108) wifi:state: init -> auth (b0)
I (561118) wifi:state: auth -> assoc (0)
I (561128) wifi:state: assoc -> run (10)
I (561248) wifi:connected with gl-esp1, aid = 1, channel 9, BW20, bssid = 94:83:c4:17:0c:40
I (561248) wifi:security: WPA2-PSK, phy: bgn, rssi: -15
I (561248) wifi:pm start, type: 1

I (561318) wifi:AP's beacon interval = 102400 us, DTIM period = 1
Error: Time:Thu Apr 14 19:27:35 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:35 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:35 PINGREQ
I (563118) WiFiClass: rssi: -15
Error: Time:Thu Apr 14 19:27:35 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:35 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:35 PINGREQ
Error: Time:Thu Apr 14 19:27:35 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:35 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:35 PINGREQ
Error: Time:Thu Apr 14 19:27:35 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:35 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:35 PINGREQ
Error: Time:Thu Apr 14 19:27:35 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:35 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:35 PINGREQ
Error: Time:Thu Apr 14 19:27:35 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
Error: Time:Thu Apr 14 19:27:36 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:36 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
-> 19:27:36 PINGREQ
W (564588) wifi:<ba-add>idx:0 (ifx:0, 94:83:c4:17:0c:40), tid:6, ssn:2, winSize:64
W (564818) wifi:<ba-add>idx:1 (ifx:0, 94:83:c4:17:0c:40), tid:0, ssn:2, winSize:64
I (565558) esp_netif_handlers: sta ip: 192.168.1.71, mask: 255.255.255.0, gw: 192.168.1.1
I (565558) WiFiClass: on_wifi_got_ip 192.168.1.71
Error: Time:Thu Apr 14 19:27:40 2022 File:../port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open
Error: Time:Thu Apr 14 19:27:40 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:387 Failure sending control packet data
Error: Time:Thu Apr 14 19:27:40 2022 File:../lib/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:mqtt_client_publish Line:1175 Error: mqtt_client_publish send failed
Error: Time:Thu Apr 14 19:27:40 2022 File:../lib/azure-iot-sdk-c/iothub_client/src/iothubtransport_mqtt_common.c Func:publishTelemetryMsg Line:1001 Failed attempting to publish mqtt message

Worth noting that this is tls specific. Standard http requests from this controller work fine once the wifi has reconnected.

Also notable is that the azure sdk never calls the connection status callback method set with IoTHubDeviceClient_LL_SetConnectionStatusCallback().

Expected Behavior

When wifi AP is lost, TLS should recover and Azure client should be able to reconnect

Actual Behavior

TLS cannot recover until Azure client attempts to renew SAS token

hauserkristof commented 2 years ago

I have the exact same issue! Could somebody managed to avoid it?

rtheil-growlink commented 2 years ago

I switched to the Azure middleware for FreeRTOS. Worth the effort. https://github.com/Azure/azure-iot-middleware-freertos

Vivero commented 1 year ago

I was running into this same exact issue. Sometimes I'd get into this scenario even without turning off the AP. My device would just randomly disconnect and then never recover. The only workaround I was able to find was to modify a part of the MQTT code myself. Here's what that fix looks like (using my own fork of azure-umqtt-c): https://github.com/MvIt-Now/azure-umqtt-c/commit/bcb0b2387fd04cc32c2d53d4d65363c296188af8

afcec commented 1 year ago

I have also the same issue but unfortunately this repo seems to be dead...

rtheil-growlink commented 1 year ago

@afcec Look into Azure middleware. They have a samples repo that has everything you need for getting up and running on any espressif device. https://github.com/Azure/azure-iot-middleware-freertos

ClaesIvarsson commented 11 months ago

I just found a problem in tlsio_esp_tls.c When a problem occur ion the socket, the state goes to TLSIO_STATE_ERROR, however in most cases it is not reported upstream, which means azure sdk is not aware of the problem. My solution was to change the code to always call enter_tlsio_error_state when current code set the error state Happens only in the function tlsio_esp_tls_dowork:

static void tlsio_esp_tls_dowork(CONCRETE_IO_HANDLE tls_io)
{
    if (tls_io == NULL)
    {
        /* Codes_SRS_TLSIO_30_070: [ If the tlsio_handle parameter is NULL, tlsio_dowork shall do nothing except log an error. ]*/
        LogError("NULL tlsio");
    }
    else
    {
        TLS_IO_INSTANCE* tls_io_instance = (TLS_IO_INSTANCE*)tls_io;

        // This switch statement handles all of the state transitions during the opening process
        switch (tls_io_instance->tlsio_state)
        {
        case TLSIO_STATE_CLOSED:
            /* Codes_SRS_TLSIO_30_075: [ If the adapter is in TLSIO_STATE_EXT_CLOSED then  tlsio_dowork  shall do nothing. ]*/
            // Waiting to be opened, nothing to do
            break;
        case TLSIO_STATE_INIT:
            {
                int result = esp_tls_conn_new_async(tls_io_instance->hostname, strlen(tls_io_instance->hostname), tls_io_instance->port, &tls_io_instance->esp_tls_cfg, tls_io_instance->esp_tls_handle);
                if (result == 1) {
                    tls_io_instance->tlsio_state = TLSIO_STATE_OPEN;
                    tls_io_instance->on_open_complete(tls_io_instance->on_open_complete_context, IO_OPEN_OK);
                } else if (result == -1) {
                    enter_tlsio_error_state(tls_io_instance);
                }
            }
            break;
        case TLSIO_STATE_OPEN:
            if (dowork_read(tls_io_instance) < 0 && errno != EAGAIN)
            {
                enter_tlsio_error_state(tls_io_instance);
            }
            if (dowork_send(tls_io_instance) < 0 && errno != EAGAIN)
            {
                enter_tlsio_error_state(tls_io_instance);
            }
            break;
        case TLSIO_STATE_ERROR:
            /* Codes_SRS_TLSIO_30_071: [ If the adapter is in TLSIO_STATE_EXT_ERROR then tlsio_dowork shall do nothing. ]*/
            // There is nothing valid to do here but wait to be retried
            break;
        default:
            LogError("Unexpected internal tlsio state");
            break;
        }
    }
}