Azure-Samples / iot-middleware-freertos-samples

This repo has samples for dev kits using the Azure IoT middleware for FreeRTOS
MIT License
76 stars 46 forks source link

ESP-TLS : Failed to open new connection #375

Closed LucasIssardWatts closed 1 year ago

LucasIssardWatts commented 1 year ago

Hi everyone 👋 !

This issue is for a:

- [x] bug report
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Randomly, my ESP32 fails to open a connection with the IoT Hub and returns the error you can see below. This bug is relatively rare and unpredictable (which makes it particularly unpleasant to correct) but is blocking because the device remains in a disconnected state until the next restart. This problem only appears on one of my Wi-Fi networks, which is probably the least reliable. When the problem occurs, the ESP32 is still connected to the Internet, of course.

Log messages given by the failure

W (140390) esp-tls: Failed to open new connection in specified timeout
E (140390) TRANSPORT_BASE: Failed to open a new connection
E (140390) tls_freertos: Failed establishing TLS connection (esp_transport_connect failed)

W (140400) AZ IOT: Connection to the IoT Hub failed [6]. Retrying connection with backoff and jitter [317]ms.

E (142720) esp-tls: [sock=54] select() timeout
E (142720) esp-tls: Failed to open new connection
E (142720) TRANSPORT_BASE: Failed to open a new connection
E (142720) tls_freertos: Failed establishing TLS connection (esp_transport_connect failed)

W (142730) AZ IOT: Connection to the IoT Hub failed [6]. Retrying connection with backoff and jitter [450]ms.

E (145190) esp-tls: [sock=54] select() timeout
E (145190) esp-tls: Failed to open new connection
E (145190) TRANSPORT_BASE: Failed to open a new connection
E (145190) tls_freertos: Failed establishing TLS connection (esp_transport_connect failed)

W (145200) AZ IOT: Connection to the IoT Hub failed [6]. Retrying connection with backoff and jitter [1472]ms.

E (148680) esp-tls: [sock=54] select() timeout
E (148680) esp-tls: Failed to open new connection
E (148680) TRANSPORT_BASE: Failed to open a new connection
E (148680) tls_freertos: Failed establishing TLS connection (esp_transport_connect failed)

W (148690) AZ IOT: Connection to the IoT Hub failed [6]. Retrying connection with backoff and jitter [1756]ms.

E (164050) esp-tls: couldn't get hostname for :<myhostname>: getaddrinfo() returns 202, addrinfo=0x0
E (164050) esp-tls: Failed to open new connection
E (164050) TRANSPORT_BASE: Failed to open a new connection
E (164060) tls_freertos: Failed establishing TLS connection (esp_transport_connect failed)

W (164070) AZ IOT: Connection to the IoT Hub failed [6]. Retrying connection with backoff and jitter [2199]ms.

E (180050) esp-tls: couldn't get hostname for :<myhostname>: getaddrinfo() returns 202, addrinfo=0x0
E (180050) esp-tls: Failed to open new connection
E (180050) TRANSPORT_BASE: Failed to open a new connection
E (180060) tls_freertos: Failed establishing TLS connection (esp_transport_connect failed)

E (180070) AZ IOT: Connection to the IoT Hub failed, all attempts exhausted.

Here's an extract of the code where the bug appears (prvConnectToServerWithBackoffRetries function) :

/* Initialize reconnect attempts and interval. */
BackoffAlgorithm_InitializeParams( &xReconnectParams,
                                   sampleazureiotRETRY_BACKOFF_BASE_MS,
                                   sampleazureiotRETRY_MAX_BACKOFF_DELAY_MS,
                                   sampleazureiotRETRY_MAX_ATTEMPTS);

/* Attempt to connect to IoT Hub. If connection fails, retry after
 * a timeout. Timeout value will exponentially increase till maximum
 * attempts are reached. */
do
{
    LogInfo( ( "Creating a TLS connection to %s:%u.\r\n", pcHostName, port ) );
    /* Attempt to create a mutually authenticated TLS connection. */
    xNetworkStatus = TLS_Socket_Connect( pxNetworkContext,
                                         pcHostName, port,
                                         pxNetworkCredentials,
                                         sampleazureiotTRANSPORT_SEND_RECV_TIMEOUT_MS,
                                         sampleazureiotTRANSPORT_SEND_RECV_TIMEOUT_MS);

    if( xNetworkStatus != eTLSTransportSuccess )
    {
        /* Generate a random number and calculate backoff value (in milliseconds) for
         * the next connection retry.
         * Note: It is recommended to seed the random number generator with a device-specific
         * entropy source so that possibility of multiple devices retrying failed network operations
         * at similar intervals can be avoided. */
        xBackoffAlgStatus = BackoffAlgorithm_GetNextBackoff( &xReconnectParams, rand(), &usNextRetryBackOff );

        if( xBackoffAlgStatus == BackoffAlgorithmRetriesExhausted )
        {
            LogError( ( "Connection to the IoT Hub failed, all attempts exhausted." ) );
        }
        else if( xBackoffAlgStatus == BackoffAlgorithmSuccess )
        {
            LogWarn( ( "Connection to the IoT Hub failed [%d]. "
                       "Retrying connection with backoff and jitter [%d]ms.",
                       xNetworkStatus, usNextRetryBackOff ) );
            vTaskDelay( pdMS_TO_TICKS( usNextRetryBackOff ) );
        }
    }
} while( ( xNetworkStatus != eTLSTransportSuccess ) && ( xBackoffAlgStatus == BackoffAlgorithmSuccess ) );

Versions

Run on ESP32-WROOM-32E Complied with ESP-IDF v4.4 Project based on iot-middleware-freertos-samples\demos\projects\ESPRESSIF\esp32 & iot-middleware-freertos-samples\demos\sample_azure_iot

How can I improve reliability ?

Thanks for your help !

LucasIssardWatts commented 1 year ago

Any ideas :confused: ?

I think the issue is located in /demos/projects/ESPRESSIF/esp32/components/sample-azure-iot/transport_tls_esp32.c file, when trying to establish TLS connection but I don't understand why...

RLeclair commented 1 year ago

Hi @LucasIssardWatts, if the wi-fi connection your device is connecting to is unreliable, you can increase the number of attempts by modifying sampleazureiotRETRY_MAX_ATTEMPTS or increasing the delay in between attempts by modifying sampleazureiotRETRY_BACKOFF_BASE_MS and sampleazureiotRETRY_MAX_BACKOFF_DELAY_MS. These variables are in sample_azure_iot

LucasIssardWatts commented 1 year ago

Hi @RLeclair ! Thank you for your reply 👍.

I've already tried to fine tune these values but without significant improvement.

The only solution I've found to this problem is to restart Wi-Fi connection by using esp_wifi_stop() and esp_wifi_start() to make a new try. You'd think the problem is the ESP32's Internet connection. However, Wi-Fi connection seems correct, since I can synchronize SNTP without any problem.

RLeclair commented 1 year ago

Hi @LucasIssardWatts, does restarting the Wi-Fi connection always fix the issue (does the new try always work)? If not, that would point toward the network being unstable for long periods of time which might require a longer delay in between retries than what you tried.

We currently don't have plans on implementing our sample with network re-initialization as our code is not meant to be production-ready.

LucasIssardWatts commented 1 year ago

Hi, after further tests, restarting the Wi-Fi connection solves the problem most of the time but not always (it solves the problem 8 times /10). So, I implemented a network re-initialization process in my code to improve reliability. Thanks.