aws / amazon-freertos

DEPRECATED - See README.md
https://aws.amazon.com/freertos/
MIT License
2.54k stars 1.1k forks source link

[BUG] <Wrong order of transportTimeoutSetup and connectToServer in transport_secure_sockets.c> #3458

Closed maa1370 closed 2 years ago

maa1370 commented 2 years ago

Describe the bug transportTimeoutSetup() should be moved before connectToServer() in transport_secure_sockets, otherwise the system may get stuck while doing TLS handshake. I have tested this, and with current order, if internet connection is lost exactly at a specific location (which may seem not common, but can happen), system gets stuck. (and only continue working after a very long time, about 20 minutes)

System information

Expected behavior correct timeouts should be set before TLS handshake, so a problem in handshake occurs, socket receive (and maybe send) function does not cause the system to get stuck.

Steps to reproduce bug

  1. add these lines:
        TLS_PRINT( ( "TLS before negotiate\r\n" ) );
        vTaskDelay(pdMS_TO_TICKS(4000));

    before

        while( 0 != ( xResult = mbedtls_ssl_handshake( &pxCtx->xMbedSslCtx ) ) )

    in iot_tls.c (in libraries\freertos_plus\standard\tls\src). this will be needed later.

  2. use a windows 10 (11?) laptop, connect to internet, configure a WiFi hotspot, and configure the system to connect to this hotspot.
  3. start the system, and exactly after you saw "TLS before negotiate" in log console, disconnect your laptop from internet (but be sure to nut turn off the hotspot instead, windows 10 does not automatically turns the hotspot off.)
  4. the system get stuck in TLS handshake. by adding some logs, i can tell you the exact location: in ssl_tls.c (in libraries\3rdparty\mbedtls\library), in function mbedtls_ssl_fetch_input, at this:
                    ret = ssl->f_recv( ssl->p_bio,
                                       ssl->in_hdr + ssl->in_left, len );

    it seems it should get stuck for ever because receive function has no timeout set yet, but at least in my tests, handshake fails after about 20 minutes. (configTICK_RATE_HZ set to 100, maybe some overflow in timeout value has happened, or maybe this is correct behavior, don't know)

Solution reverse order of these two blocks in establishConnect() in transport_secure_sockets.c (in libraries\abstractions\transport\secure_sockets): move

    if( returnStatus == TRANSPORT_SOCKET_STATUS_SUCCESS )
    {
        /* Configure send and receive timeouts for the socket. */
        secureSocketStatus = transportTimeoutSetup( tcpSocket, pSocketsConfig->sendTimeoutMs, pSocketsConfig->recvTimeoutMs );

        if( secureSocketStatus != ( int32_t ) SOCKETS_ERROR_NONE )
        {
            LogError( ( "Failed to configure send and receive timeouts for socket: secureSocketStatus=%d.", secureSocketStatus ) );
            returnStatus = TRANSPORT_SOCKET_STATUS_INTERNAL_ERROR;
        }
    }

before

    if( returnStatus == TRANSPORT_SOCKET_STATUS_SUCCESS )
    {
        /* Establish the TCP connection. */
        returnStatus = connectToServer( tcpSocket,
                                        pServerInfo );
    }

Regards

AniruddhaKanhere commented 2 years ago

Hello @maa1370,

Thank you for reporting the issue to us. What you said makes sense to me. Would you like to create a PR or shall I do that for you?

Thanks, Aniruddha

maa1370 commented 2 years ago

@AniruddhaKanhere Thanks for responce.

as I'm not experienced with git, please create PR yourself. also please take a look at #3455 too, it is related to another problem in transport_secure_sockets.c i have reported, it may be easier to fix them together.

(also as you are who made PR 3406, please check my comment in issue 1646, it is not related to this issues, just wanted to write a note about it to be sure you have seen that)

Regards

aggarg commented 2 years ago

As far as I remember correctly, the small timeout are set after the connection is established because handshake fails if done otherwise. In any case, this is just a sample implementation and feel free to update it as per your application requirements.

maa1370 commented 2 years ago

@aggarg Thanks for response.

I disagree with you. I still think this is a bug. I think the correct method is this: If a high timeout before handshake is needed, set a high default timeout (60 seconds e.g.) before it, and after handshake, set the timeout from parameters. Trying to handshake without a timeout set, may cause to a never or very long timeout as i described, which i think should be considered a bug.

aggarg commented 2 years ago

Trying to handshake without a timeout set, may cause to a never or very long timeout as i described, which i think should be considered a bug.

That depends on your application requirements. If application is designed such that it can only function after it is connected to the MQTT broker, it would probably want to block for long during connect. Again, as I said before, this is just one sample implementation of the transport interface and you can update it as per your application requirements.

maa1370 commented 2 years ago

If application is designed such that it can only function after it is connected to the MQTT broker, it would probably want to block for long during connect.

i think the application should rely on its retry mechanism rather than a blocking handshake.

see the do while in mqtt_demo_mutual_auth as an example. that's a good example of a retry mechanism, much better than a blocking handshake. also consider two cases: 1-in my first post, after step 3, connect to laptop to internet again after a few seconds so the internet is reachable on hotspot for device again. yes, the handshake may continue (i tested, and it continues), but i think it is possible that it does not continue. i do not know the exact details really, so i'm not sure if the handshake continues or not, but are we sure the process always continues (to fail or success, not important) if the internet is reachable after X seconds? or we may have a infinite block (or maybe a very long delay in some conditions in reality, see my note about 20 minutes in my case) here? so why rely on a thing that we are not sure that behaves correctly or not when there is a better solution available? 2-consider the device has 3 WiFi Credentials saved in it's memory, and it want's to try a different one after a failed handshake. with current code and in condition i described in my first post, it will not be able to do that (or can only try this after 20 minutes in my case).

so i think a reasonable timeout for handshake is a must, because in any case having a timeout for handshake is beneficial.

i know i can change the code for my usecase (and i have already done it), but as described i think it may cause problems for others too, because the sample implementation is used by many people without going to details. and there is a good solution available. so why not make it better?