Azure / azure-iot-arduino

Azure IoT library for the Arduino
Other
168 stars 95 forks source link

azure-iot-arduino fails IoTHub connection after a couple hours #30

Closed GregTerrell closed 7 years ago

GregTerrell commented 7 years ago

I am is working on an embedded solution targeting Arduino architecture and WINC1500 (using the Adafruit Feather M0 with WINC1500 for our reference build). My application code works fine for several hours then fails with one of two scenarios… 1) the WINC1500 transmit LED (yellow) goes on solid; or 2) the connection to the IoT Hub is lost and a sequence of errors is produced (exact sequence depends on which WINC client: Adafruit_WINC1500 or WiFi101).

I can reproduce the issue with the Remote_Monitoring example provided in the Getting Started kit at https://github.com/Azure-Samples/iot-hub-c-m0wifi-getstartedkit and I have moved all testing from my code to the remote_monitoring example.

Scenario #2 errors with Adafruit_WIN1500 and AzureIoTHub (0.2.0) AzureIoTHub\src\httpapi.cpp Func:HTTPAPI_CreateConnection Line:39 HTTPS connection to loouq-iotos-preview.azure-devices.net failed AzureIoTHub\src\sdk\httpapiex.c Func:HTTPAPIEX_ExecuteRequest Line:476 unable to recover sending to a working state AzureIoTHub\src\sdk\iothubtransporthttp.c Func:DoEvent Line:1223 unable to HTTPAPIEX_ExecuteRequest

Scenario #2 errors with WINC1500 and azure-iot-arduino (master - Latest commit df88a5a on Oct 14) azure-iot-arduino-master\src\adapters\tlsio_arduino.c Func:tlsio_arduino_create Line:159 Host loouq-iotos-preview.azure-devices.net not found. azure-iot-arduino-master\src\adapters\httpapi_compact.c Func:HTTPAPI_CreateConnection Line:221 Create connection failed azure-iot-arduino-master\src\azure_c_shared_utility\httpapiex.c Func:HTTPAPIEX_ExecuteRequest Line:478 unable to recover sending to a working state azure-iot-arduino-master\src\sdk\iothubtransporthttp.c Func:DoEvent Line:1582 unable to HTTPAPIEX_SAS_ExecuteRequest

I also noted a huge slowdown in the IoTHubClient_LL_DoWork() method with the newer azure-iot-arduino, from a about 10 millis to a range of 4 to 9 seconds.

I tried switching to the WiFi101 driver (both 0.10.0 and github master latest as of 11/13/2016) with the same results. I attempted to use the WiFi101 driver due to the statement on the Adafruit WINC1500 github repo… “This fork is deprecated! Please use WiFi101, which now has pin-setting support Thak you! :)” and has not had commits since Sep 22nd.

size_t WiFiClient::write(const uint8_t *buf, size_t size)
{
    sint16 err;

    if (_socket < 0 || size == 0) {
        setWriteError();
        return 0;
    }

    // Network led ON (rev A then rev B).
    m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 0);
    m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 0);

    m2m_wifi_handle_events(NULL);

    #ifdef DEBUG_STREAM
        DEBUG_STREAM.print(F("WiFiClient()::write, ready to send: "));
        DEBUG_STREAM.println((char*)buf);
    #endif

    while ((err = send(_socket, (void *)buf, size, 0)) < 0) {

        #ifdef DEBUG_STREAM
            DEBUG_STREAM.print(F("WiFiClient()::write, send error="));
            DEBUG_STREAM.println(err);
        #endif

        // Exit on fatal error, retry if buffer not ready.
        if (err != SOCK_ERR_BUFFER_FULL) {
            setWriteError();
            m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 1);
            m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 1);
            return 0;
        }
        m2m_wifi_handle_events(NULL);
    }

    #ifdef DEBUG_STREAM
        DEBUG_STREAM.println(F("WiFiClient()::write, send operation completed"));
    #endif

    // Network led OFF (rev A then rev B).
    m2m_periph_gpio_set_val(M2M_PERIPH_GPIO16, 1);
    m2m_periph_gpio_set_val(M2M_PERIPH_GPIO5, 1);

    return size;
}

My questions…

Thanks, Greg

GregTerrell commented 7 years ago

After further testing using the 0.11.0 version of the WiFi101 client I have traced the lockup (solid activity LED) problem down to an infinite loop in SSL send function. I don’t see this as a direct defect in the azure-iot-arduino library, but the library’s actions are hitting a design constraint and\or defect with the WINC1500 code found inside WiFi101 (and the Adafruit WINC1500 library).

I haven’t been able to trace down any Atmel documentation, but it appears that the WINC1500 supports 16 outstanding transmits. The azure-iot-arduino library performs 18 sslClient_Writes() to complete an event send, this results in the TCP send window closing at the CRLF sent between the headers and the body; until outstanding sends are deemed complete via the send callback.

Typically the WiFi client waits appropriately for the send callbacks in order to finish sending the event POST. However, under some conditions the WINC deadlocks never processing send callbacks and consequently never returning to the task of sending out TCP frames. The WiFi101\WINC code in question is within the WINC hif_send() (in m2m_hif.c) and WiFiClient::write() (in WiFiClient.cpp) functions. This logic is waiting for a dma_addr (aka send buffer) from the WINC chip.

I have implemented a work around in our test environment, which consolidates the header and trailing CRLF writes inside SendHeadsToXIO() (in adapter\httpapi_compact.c) to a single send header + CRLF. By reducing the sends per event from 18 down to 11 the WINC code has now been running for over 75 hours, previous to the workaround never made it 8 hours.

GregTerrell commented 7 years ago

Our work around (below) initially solved the issue, however as I added back out user properties I discovered these are sent over the wire as additional HTTP headers, resulting in more than 16 frames to send an event and a reappearance of the issue.

Workaround...

//LOOUQ: compress header and trailing endline into one send()

if(((ret = snprintf(buf, sizeof(buf), "%s \r\n", header)) < 0) || (ret >= sizeof(buf)))
{
    /*Codes_SRS_HTTPAPI_COMPACT_21_027: [ If the HTTPAPI_ExecuteRequest cannot create a buffer to send the request, it shall not send any request and return HTTPAPI_STRING_PROCESSING_ERROR. ]*/
    result = HTTPAPI_STRING_PROCESSING_ERROR;
}
else 
{
    if (xio_send(http_instance->xio_handle, (const unsigned char*)buf, strlen(buf), NULL, NULL) != 0)
    {
    /*Codes_SRS_HTTPAPI_COMPACT_21_028: [ If the HTTPAPI_ExecuteRequest cannot send the request header, it shall return HTTPAPI_HTTP_HEADERS_FAILED. ]*/
        result = HTTPAPI_SEND_REQUEST_FAILED;
    }
}

// if (xio_send(http_instance->xio_handle, (const unsigned char*)header, strlen(header), NULL, NULL) != 0)
// {
//  /*Codes_SRS_HTTPAPI_COMPACT_21_028: [ If the HTTPAPI_ExecuteRequest cannot send the request header, it shall return HTTPAPI_HTTP_HEADERS_FAILED. ]*/
//  result = HTTPAPI_SEND_REQUEST_FAILED;
// }
// if (xio_send(http_instance->xio_handle, (const unsigned char*)"\r\n", (size_t)2, NULL, NULL) != 0)
// {
//  /*Codes_SRS_HTTPAPI_COMPACT_21_028: [ If the HTTPAPI_ExecuteRequest cannot send the request header, it shall return HTTPAPI_HTTP_HEADERS_FAILED. ]*/
//  result = HTTPAPI_SEND_REQUEST_FAILED;
// }
GregTerrell commented 7 years ago

Closing... this issue was rooted within the WiFi code. The AzureIotHub was able to cause the WiFi code to go into a tight retry loop after exhausting the send buffers within the WINC1500 chip. Under certain conditions, still unknown, the WINC1500 driver would stop attempting to service the chip and deadlock waiting for a send buffer.

A pending pull request https://github.com/arduino-libraries/WiFi101/pull/77 provides a better retry mechanism to the WINC no send buffer condition. My application can successfully run for, so far 24+ hours, with this WiFi101 code.