aws / amazon-freertos

DEPRECATED - See README.md
https://aws.amazon.com/freertos/
MIT License
2.54k stars 1.11k forks source link

OTA Update Fails To Connect Intermittently (BLE-BETA 17ba347) #501

Closed bgintz-rsc closed 4 years ago

bgintz-rsc commented 5 years ago

Describe the bug Failure to connect with broker. Error -116 pdFREERTOS_ERRNO_ETIMEDOUT

System information

Expected behavior A clear and concise description of what you expected to happen.

Screenshots or console output

See screenshot below: 6 305 [OTA] OTA demo version 0.9.2 7 305 [OTA] Creating MQTT Client... 8 807 [OTA] [ERROR][NET][8070] Failed to establish new connection: -116 9 807 [OTA] Failed to create MQTT client.

Note: "Failed to establish new connection" message is from: not_network_afr.c line 325. We modified the message to log socketStatus. (-116)

To reproduce Steps to reproduce the behavior: Since the failure is intermittent if the initial connect succeeds and OTA begins then restart the ESP32. Continue until a failure is observed

Additional context We attempted to resolve this (believing it might be due to network delay) by increasing the following timeouts in FreeRTOSIPConfig.h

define ipconfigSOCK_DEFAULT_RECEIVE_BLOCK_TIME ( 500 )

define ipconfigSOCK_DEFAULT_SEND_BLOCK_TIME ( 500 )

This "may" have helped (difficult to tell due to intermittent nature of issue). However, failure still observed with very long timeouts set - hangs waiting for the timeout.

Possible resolution: It appears that the connect for OTA when using the network manager (aws_iot_ota_update_demo.c) should be inside a loop as it is in was_ota_update_demo.c

I apologize for ott referencing the lines and functions within the file but need to post this now and cannot allocate additional time for documenting this now.

Thank you!

ravibhagavandas commented 5 years ago

@bgintz-rsc ,

We will look into reproducing the issue on ESP board using this commit. Could you confirm if you disabled BLE for your tests by removing AWSIOT_NETWORK_TYPE_BLE from configENABLED_NETWORKS here

bgintz-rsc commented 5 years ago

Yes - here is what we have in was_iot_network_config.h

define configSUPPORTED_NETWORKS ( AWSIOT_NETWORK_TYPE_WIFI )

define configENABLED_NETWORKS ( AWSIOT_NETWORK_TYPE_WIFI )

Note: At one point we needed to remove AWSIOT_NETWORK_TYPE_BLE from configSUPPORTED_NETWORKS in order to build.

ravibhagavandas commented 5 years ago

Thanks, I am able to reproduce the issue on this commit by running MQTT demo and after few restarts on ESP. However the issues occurs when I disconnect and reconnect fast to the AP ( after restarting ESP device in quick succession ). I will debug further by enabling IP stack logs/Wireshark.

bgintz-rsc commented 5 years ago

Thanks for the feedback. We have seen this occasionally occur on an initial connect. Also, each time we see the error we have verified that an IP address was obtained by the router. So this may not (but could be) the same failure

ravibhagavandas commented 5 years ago

I was able to dump the logs in both cases, and it seems both times the device got a valid ip address from the router. From logs, it seems that device is not getting a ACK for SYN from the server.

-- Failure case --

I (3595) event: sta ip: 192.168.1.34, mask: 255.255.255.0, gw: 192.168.1.1 I (3595) WIFI: SYSTEM_EVENT_STA_GOT_IP 9 346 [IP-task] vDHCPProcess: offer c0a80122ip 10 346 [IP-task] vDHCPProcess: acked c0a80122ip 11 348 [MQTTEcho] prvProcessDNSCache: add: 'xxxxxxxxx-ats.iot.us-west-2.amazonaws.com' @ ip 12 348 [MQTTEcho] FreeRTOS_connect: 64070 to 22d0de5cip:8883 13 348 [MQTTEcho] Socket 64070 -> 22d0de5cip:8883 State eCLOSED->eCONNECT_SYN 14 349 [IP-task] prvSocketSetMSS: 1400 bytes for 22d0de5cip:8883 15 349 [IP-task] prvWinScaleFactor: uxRxWinSize 1 MSS 1400 Factor 0 16 349 [IP-task] Connect[22d0de5cip:8883]: next timeout 1: 3000 ms 17 648 [IP-task] prvWinScaleFactor: uxRxWinSize 1 MSS 1400 Factor 0 18 648 [IP-task] Connect[22d0de5cip:8883]: next timeout 2: 6000 ms 19 849 [MQTTEcho] [ERROR][NET][8490] Failed to establish new connection. error = -116 [IP-task] FreeRTOS_closesocket[64070 to 22d0de5cip:8883]: buffers 60 socks 0

-- Success case ---

I (4165) WIFI: SYSTEM_EVENT_STA_GOT_IP 9 403 [IP-task] vDHCPProcess: offer c0a80122ip 10 403 [IP-task] vDHCPProcess: acked c0a80122ip 11 406 [MQTTEcho] prvProcessDNSCache: add: 'xxxxxxx-ats.iot.us-west-2.amazonaws.com' @ 22d01365ip 12 406 [MQTTEcho] FreeRTOS_connect: 64070 to 22d01365ip:8883 13 406 [MQTTEcho] Socket 64070 -> 22d01365ip:8883 State eCLOSED->eCONNECT_SYN 14 406 [IP-task] prvSocketSetMSS: 1400 bytes for 22d01365ip:8883 15 406 [IP-task] prvWinScaleFactor: uxRxWinSize 1 MSS 1400 Factor 0 16 406 [IP-task] Connect[22d01365ip:8883]: next timeout 1: 3000 ms 17 409 [IP-task] MSS change 1400 -> 1380 18 409 [IP-task] Change mss 1400 => 1380 19 409 [IP-task] TCP: active 64070 => 22d01365ip:8883 set ESTAB (scaling 1) 20 409 [IP-task] Socket 64070 -> 22d01365ip:8883 State eCONNECT_SYN->eESTABLISHED 21 423 [IP-task] MSS: sending 1380 22 820 [MQTTEcho] [INFO ][MQTT][8200] Establishing new MQTT connection.

ravibhagavandas commented 5 years ago

@bgintz-rsc Going through the debug logs the TCP/IP stack is sending TCP SYN packets to driver but waiting for server to respond with SYN-ACK and eventually times out.

Could you provide details on the signal strength of the Wi-Fi Access point in your setup ? This will help in investigating any possible issues with the WiFi driver.

To unblock, you could add following retry logic to connection creation within aws_iot_update_demo.c . This behavior was something which is missing comparing the old OTA demo. We will fix the same in our new version.

` static BaseType_t prxCreateNetworkConnection( void ) {

BaseType_t xStatus = pdFALSE;
uint32_t ulRetriesLeft = echoCONN_RETRY_LIMIT;
uint32_t ulRetryIntervalMS = ( echoCONN_RETRY_INTERVAL_SECONDS * 1000 );
do
{
    /* If no networks are available, block for a physical network connection */
    if( (AwsIotNetworkManager_GetConnectedNetworks() & otaDemoNETWORK_TYPES) == AWSIOT_NETWORK_TYPE_NONE)
    {
        /* Block for a Network Connection. */
        configPRINTF(("Waiting for a network connection.\r\n"));
        xSemaphoreTake(xNetworkAvailableLock, portMAX_DELAY);
    }

    /* At least one network type is available. Connect to the network type. */
    xStatus = xMqttDemoCreateNetworkConnection( &xConnection, otaDemoNETWORK_TYPES );

    if (xStatus == pdTRUE)
    {
        /* Connection to the network succeded. */
        break;
    }
    else
    {
        if ( ulRetriesLeft > 0)
        {
            configPRINTF(( "Failed to create a network connection, retry delay %d seconds, retries left % lu\r\n", 
                              echoCONN_RETRY_INTERVAL_SECONDS,
                              ulRetriesLeft ));
            vTaskDelay( pdMS_TO_TICKS( ulRetryIntervalMS));
        }
    }
} while( ulRetriesLeft-- > 0 );

return xStatus;

}`

ravibhagavandas commented 5 years ago

Captured wireshark trace of the packets and a TCP timeout occurs whenever device doesn't get response to a SYN packet from the broker endpoint. The timeout occurs only when there is a non-graceful shutdown of the socket, like resetting the device abruptly. I did a test by calling connection open and close in a loop with a delay of 1 seconds and did not see any failures.

After going through the OTA demo code, found that the demo isn't shutting down the TCP socket after a successful OTA before resetting the device here . This could probably be a reason for intermittent failures. I will do more tests and update with the results.

rrivadeneira-rsc commented 5 years ago

Thank you for the tests and retry logic, @ravibhagavandas. Sounds like you are getting close to the root of the issue.

We will be on the lookout for updates. Thanks again!

ravibhagavandas commented 5 years ago

After disconnecting MQTT and shutting down the socket for successful and failure OTA, I am not seeing the issue with connection timeouts.

Could you add following lines to disconnect MQTT and shutdown socket connection within the demo: IotMqtt_Disconnect( xConnection.xMqttConnection, 0); vMqttDemoDeleteNetworkConnection( &xConnection );

Lines to be added are here and here.

Please let us know how the test goes.

rrivadeneira-rsc commented 5 years ago

Hi Ravi,

Thank you for this. Just to make sure I see where these lines go correctly - they will go right before activating the new image (OTA_ActiveteNewImage), correct?

Thank you

ravibhagavandas commented 5 years ago

Yes, that's correct.

rrivadeneira-rsc commented 5 years ago

Perfect. I will try this and update as soon as possible.

cobusve commented 5 years ago

I am going to close this issue as we have not heard back for a long time now and Ravi was not able to reproduce this after making the change. If you do manage to reproduce this with the modified code please do come back and post here and we can re-open the issue.

We did add those lines to gracefully disconnect to the code base and it looks like this has resolved the issue.

chegewara commented 5 years ago

Hi, i am having the same issue (i think):

I (5004) WIFI: SYSTEM_EVENT_STA_CONNECTED
6 851 [IP-task] vDHCPProcess: offer c0a86405ip
I (8824) event: sta ip: 192.168.100.5, mask: 255.255.255.0, gw: 192.168.100.1
I (8824) WIFI: SYSTEM_EVENT_STA_GOT_IP
7 855 [IP-task] vDHCPProcess: offer c0a86405ip
I (16214) MAIN: free heap: 129672
W (16214) BT_APPL: bta_dm_disable BTA_DISABLE_DELAY set to 200 ms
I (16424) MAIN: free heap: 170208
8 1615 [MainShadowTask] [INFO ][Shadow][16150] Shadow library successfully initialized.
I (16434) SHADOW: Shadow_ClientConnect will now happen
9 2119 [MainShadowTask] [ERROR][NET][21190] Failed to establish new connection. Error is -116
10 2120 [MainShadowTask] [ERROR][MQTT][21200] Failed to establish new MQTT connection, error NETWORK ERROR.
I (21484) SHADOW: Shadow_ClientConnect unsuccessful, returned 2
I (21494) SHADOW: Failed to initialize shadow client.

Its my procedure:

This issue occurs very often, especially after re-flashing device. Device: esp32, branch: based on ble-beta

ravibhagavandas commented 5 years ago

Hi @chegewara,

ble-beta branch is not actively maintained anymore. We have released new version 201906.00 Major to the master branch which has the BLE library up to date .

To see the change log for the release you can visit this link: https://github.com/aws/amazon-freertos/blob/master/CHANGELOG.md

For a detailed information on the release including migrating your application, check this page: https://aws.amazon.com/blogs/iot/announcing-amazon-freertos-20190600/

Could you move to the master branch and post back if the issue still persists ?

pvyawaha commented 4 years ago

Hi @chegewara,

Is this issue resolved?

chegewara commented 4 years ago

I think it is. I switched to release-1.5 and i dont have any issues so far, thanks.

pvyawaha commented 4 years ago

Thank you for confirmation that this issue is resolved. Closing this issue now.