aws / aws-iot-device-sdk-embedded-C

SDK for connecting to AWS IoT from a device using embedded C.
MIT License
974 stars 622 forks source link

Excessive data consumption #1909

Closed coratron closed 2 months ago

coratron commented 2 months ago

I am using the following:

A. Before migrating the project to this new stack I had all the comms (unencrypted) running on the Modem by using AT commands. Data consumption for the operation: 1kB

B. After migrating to the embedded SDK (using PKCS#11 for managing certificates, etc). Data consumption for the operation: 19kB. I am not talking about the initial 30kB for the TLS handshake etc, but just the ongoing MQTT transmission.

C. From B. but changing the following #define TRANSPORT_SEND_RECV_TIMEOUT_MS 5000U to an absurd value of: #define TRANSPORT_SEND_RECV_TIMEOUT_MS 50000U The data consumption for the operation : 4kB

What is going on here?

The ideal goal is to achieve the minimum amount of data consumption as possible as we are using SIM cards that are billed by the MB. I would expect a higher consumption (mostly due to the handshake), but not the ongoing 4x compared to what we used to have before migrating the project to use this sdk and certificates.

Any ideas? The only thing I can see on the terminal at the moment is a periodic:

D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
D (690227) coreMQTT: MQTT PINGREQ packet size is 2.
D (690233) coreMQTT: sendBuffer: Bytes Sent=2, Bytes Remaining=0
D (690243) coreMQTT: Sent 2 bytes of PINGREQ packet.

I am wondering if this is causing some extra data consumption in the background.

YOSI-yoshidayuji commented 2 months ago

If the cause is that PINGREQ is occurring frequently, then I think the following patch is needed: https://github.com/FreeRTOS/coreMQTT/pull/235

coratron commented 2 months ago

If the cause is that PINGREQ is occurring frequently, then I think the following patch is needed: FreeRTOS/coreMQTT#235

thanks for your prompt reply @YOSI-yoshidayuji , I was precisely looking at the keep alive conditions and the other defines that can override the values. Then following your link did find out that indeed this is missing in that LTS release which most certainly would have an impact:

    if( status == MQTTSuccess )
    {
        pContext->lastPacketRxTime = pContext->getTime();
    }

I have added that manually for now and will do some tests, but may have to look at using the master branch. Unfortunately, nobody at ESP is maintaining the esp-aws-iot repo... The issues keep piling up hence why I sought help here.

coratron commented 2 months ago

No changes unfortunately, still 4kB.

The following was appearing every 50s. Not a coincidence.

D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.
D (690227) coreMQTT: MQTT PINGREQ packet size is 2.
D (690233) coreMQTT: sendBuffer: Bytes Sent=2, Bytes Remaining=0
D (690243) coreMQTT: Sent 2 bytes of PINGREQ packet.

That TRANSPORT_SEND_RECV_TIMEOUT_MS is what gets passed to static MbedtlsPkcs11Status_t configureMbedtls( MbedtlsPkcs11Context_t * pMbedtlsPkcs11Context, const char * pHostName, const MbedtlsPkcs11Credentials_t * pMbedtlsPkcs11Credentials, uint32_t recvTimeoutMs )

I have increased the value to 120s now and the message on the terminal indeed appears roughly every 120s. The data consumption is further reduced to 3kB.

Whatever is happening in the background in terms of transactions after this odd timeout occurs is what is driving this unwanted data consumption. Seems to be TLS related.

aggarg commented 2 months ago

Can you track where this log is coming from:

D (690224) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.

The reason I ask it might help us in finding out if some operation is being retried and resulting in more data consumption.

coratron commented 2 months ago

@aggarg

It comes from mbedtls_pkcs11_posix.c:

int32_t Mbedtls_Pkcs11_Recv( NetworkContext_t * pNetworkContext,
                             void * pBuffer,
                             size_t bytesToRecv )
{
    MbedtlsPkcs11Context_t * pMbedtlsPkcs11Context = NULL;
    int32_t tlsStatus = 0;

    assert( ( pNetworkContext != NULL ) && ( pNetworkContext->pParams != NULL ) );

    pMbedtlsPkcs11Context = pNetworkContext->pParams;
    tlsStatus = ( int32_t ) mbedtls_ssl_read( &( pMbedtlsPkcs11Context->context ),
                                              pBuffer,
                                              bytesToRecv );

    if( ( tlsStatus == MBEDTLS_ERR_SSL_TIMEOUT ) ||
        ( tlsStatus == MBEDTLS_ERR_SSL_WANT_READ ) ||
        ( tlsStatus == MBEDTLS_ERR_SSL_WANT_WRITE ) )
    {
        LogDebug( ( "Failed to read data. However, a read can be retried on this error. "
                    "mbedTLSError= %s : %s.",
                    mbedtlsHighLevelCodeOrDefault( tlsStatus ),
                    mbedtlsLowLevelCodeOrDefault( tlsStatus ) ) );

        /* Mark these set of errors as a timeout. The libraries may retry read
         * on these errors. */
        tlsStatus = 0;
    }
    else if( tlsStatus < 0 )
    {
        LogError( ( "Failed to read data: mbedTLSError= %s : %s.",
                    mbedtlsHighLevelCodeOrDefault( tlsStatus ),
                    mbedtlsLowLevelCodeOrDefault( tlsStatus ) ) );
    }
    else
    {
        /* Empty else marker. */
    }

    return tlsStatus;
}

which in turn is the function that I assigned to transport.recv

  /* Fill in TransportInterface send and receive function pointers.
   * For this demo, TCP sockets are used to send and receive data
   * from the network. pNetworkContext is an SSL context for OpenSSL.*/
  transport.pNetworkContext = pNetworkContext;
  transport.send = Mbedtls_Pkcs11_Send;
  transport.recv = Mbedtls_Pkcs11_Recv;
  transport.writev = NULL; 

in mqtt_operations.c, just like it is done in the fleet provisioning example:

https://github.com/aws/aws-iot-device-sdk-embedded-C/blob/main/demos/fleet_provisioning/fleet_provisioning_with_csr/mqtt_operations.c

aggarg commented 2 months ago

Okay - do we know what read operation fails. In other words, do we have a call stack when the read fails? Is there any possibility of capturing network traffic? Also, are you using QoS1? If yes, can you try QoS0?

coratron commented 2 months ago

This is over CAT-M1, I do not have the tools to capture this traffic unfortunately. I may have to switch over to WiFi if it comes to that but metering would be done differently (unable to use the original reference for comparison).

In any case, the increased data consumption appears to be the symptom of this timeout error.

Here is the call stack.

0x400e2ac5: Mbedtls_Pkcs11_Recv at mbedtls_pkcs11_posix.c:965 (discriminator 34)
0x400e8d8a: receiveSingleIteration coreMQTT/coreMQTT/source/core_mqtt.c:1658
0x400e92d0: MQTT_ProcessLoop at coreMQTT/coreMQTT/source/core_mqtt.c:3071
0x400dde31: processMqtt() see below...
0x400d9432: networkingTask(void*) 
void processMqtt(void)
{
    MQTTStatus_t mqttStatus = MQTTBadParameter;

    /* Acquire the mqtt mutex lock. */
    if(pthread_mutex_lock(&mqttMutex) == 0)
    {
        /* Loop to receive packet from transport interface. */
        mqttStatus = MQTT_ProcessLoop(&mqttContext);

        pthread_mutex_unlock(&mqttMutex);
    }
    if(mqttStatus != MQTTSuccess)
    {
        LogError(("Failed to process MQTT loop with error = %u.", mqttStatus));
    }
    else
    {
        LogInfo(("Processed MQTT loop successfully."));
    }
}

I call processMqtt() periodically to handle any pending MQTT traffic in my networkingTask What happens is the system stalls for the duration TRANSPORT_SEND_RECV_TIMEOUT_MS . When that ends up in failure (timeout), the system must be sending something.

I was using QoS1 but I have tried QoS0 and it produces the same results.

aggarg commented 2 months ago

I call processMqtt() periodically to handle any pending MQTT traffic in my networkingTask What happens is the system stalls for the duration TRANSPORT_SEND_RECV_TIMEOUT_MS . When that ends up in failure (timeout), the system must be sending something.

Does it mean that there is no data to read and this receive is expected to timeout? We need to still find out why the amount of data consumption is more. Can you log the number of bytes sent and bytes received in the transport send and receive functions and compare them?

coratron commented 2 months ago

@aggarg Thanks for bearing with me.

I would assume that this timeout indeed should not be a problem if there is no data to receive. The observation is the data consumption is directly proportional to the timeout condition, I did not know what data is being exchanged here. I finally managed to enable the mbedTLS debug logging - (platform specific vs library...).

Cornering the problem a little further. Here are the new observations.

I get 5 messages like this (one message for every processMqtt() call as discussed previously).

I (418920) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (418920) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned -26624 (-0x6800)
I (418924) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_fetch_input() returned -26624 (-0x6800)
I (418933) mbedTLS: |1| 0x3ffb49e4: ssl_get_next_record() returned -26624 (-0x6800)
I (418943) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_read_record() returned -26624 (-0x6800)

D (418951) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.

Backtrace: ... (same as before)

I (418985) mbedTLS: |2| 0x3ffb49e4: => read
I (418985) mbedTLS: |2| 0x3ffb49e4: => read record
I (418991) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (418992) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5

And then the 6th looks like this:

I (534185) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (534185) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned -26624 (-0x6800)
I (534192) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_fetch_input() returned -26624 (-0x6800)
I (534200) mbedTLS: |1| 0x3ffb49e4: ssl_get_next_record() returned -26624 (-0x6800)
I (534208) mbedTLS: |1| 0x3ffb49e4: mbedtls_ssl_read_record() returned -26624 (-0x6800)

D (534215) AWS: Failed to read data. However, a read can be retried on this error. mbedTLSError= SSL - The operation timed out : <No-Low-Level-Code>.

Backtrace: ... (same as before)

I (534240) mbedTLS: |2| 0x3ffb49e4: => write
I (534245) mbedTLS: |2| 0x3ffb49e4: => write record
I (534250) mbedTLS: |2| 0x3ffb49e4: => encrypt buf
I (534256) mbedTLS: |4| 0x3ffb49e4: dumping 'before encrypt: output payload' (2 bytes)
I (534265) mbedTLS: |4| 0x3ffb49e4: 0000:  c0 00                                            ..
I (534274) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used (internal)' (12 bytes)
I (534282) mbedTLS: |4| 0x3ffb49e4: 0000:  ed 28 2b 61 00 00 00 00 00 00 00 2c              .(+a.......,
I (534292) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used (transmitted)' (8 bytes)
I (534300) mbedTLS: |4| 0x3ffb49e4: 0000:  00 00 00 00 00 00 00 2c                          .......,
I (534313) mbedTLS: |4| 0x3ffb49e4: dumping 'additional data used for AEAD' (13 bytes)
I (534321) mbedTLS: |4| 0x3ffb49e4: 0000:  00 00 00 00 00 00 00 2c 17 03 03 00 02           .......,.....
I (534330) mbedTLS: |3| 0x3ffb49e4: before encrypt: msglen = 2, including 0 bytes of padding
I (534339) mbedTLS: |4| 0x3ffb49e4: dumping 'after encrypt: tag' (16 bytes)
I (534353) mbedTLS: |4| 0x3ffb49e4: 0000:  c4 51 34 23 49 fc d6 2d 2d d7 a6 d6 8f d4 8a 43  .Q4#I..--......C
I (534361) mbedTLS: |2| 0x3ffb49e4: <= encrypt buf
I (534363) mbedTLS: |3| 0x3ffb49e4: output record: msgtype = 23, version = [3:3], msglen = 26
I (534372) mbedTLS: |4| 0x3ffb49e4: dumping 'output record sent to network' (31 bytes)
I (534381) mbedTLS: |4| 0x3ffb49e4: 0000:  17 03 03 00 1a 00 00 00 00 00 00 00 2c f5 8a c4  ............,...
I (534394) mbedTLS: |4| 0x3ffb49e4: 0010:  51 34 23 49 fc d6 2d 2d d7 a6 d6 8f d4 8a 43     Q4#I..--......C
I (534402) mbedTLS: |2| 0x3ffb49e4: => flush output
I (534410) mbedTLS: |2| 0x3ffb49e4: message length: 31, out_left: 31
I (534418) mbedTLS: |2| 0x3ffb49e4: ssl->f_send() returned 31 (-0xffffffe1)
I (534423) mbedTLS: |2| 0x3ffb49e4: <= flush output
I (534434) mbedTLS: |2| 0x3ffb49e4: <= write record
I (534435) mbedTLS: |2| 0x3ffb49e4: <= write
I (534442) mbedTLS: |2| 0x3ffb49e4: => read
I (534450) mbedTLS: |2| 0x3ffb49e4: => read record
I (534452) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (534458) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (535394) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5
I (535394) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned 5 (-0xfffffffb)
I (535401) mbedTLS: |2| 0x3ffb49e4: <= fetch input
I (535404) mbedTLS: |4| 0x3ffb49e4: dumping 'input record header' (5 bytes)
I (535411) mbedTLS: |4| 0x3ffb49e4: 0000:  17 03 03 00 1a                                   .....
I (535421) mbedTLS: |3| 0x3ffb49e4: input record: msgtype = 23, version = [0x303], msglen = 26
I (535434) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (535442) mbedTLS: |2| 0x3ffb49e4: in_left: 5, nb_want: 31
I (535443) mbedTLS: |2| 0x3ffb49e4: in_left: 5, nb_want: 31
I (535450) mbedTLS: |2| 0x3ffb49e4: ssl->f_recv(_timeout)() returned 26 (-0xffffffe6)
I (535458) mbedTLS: |2| 0x3ffb49e4: <= fetch input
I (535463) mbedTLS: |4| 0x3ffb49e4: dumping 'input record from network' (31 bytes)
I (535474) mbedTLS: |4| 0x3ffb49e4: 0000:  17 03 03 00 1a 00 00 00 00 00 00 00 13 07 c7 bc  ................
I (535482) mbedTLS: |4| 0x3ffb49e4: 0010:  c1 42 da 3d de f3 a6 3c b5 ac 2a df 30 e3 a8     .B.=...<..*.0..
I (535493) mbedTLS: |2| 0x3ffb49e4: => decrypt buf
I (535499) mbedTLS: |4| 0x3ffb49e4: dumping 'additional data used for AEAD' (13 bytes)
I (535507) mbedTLS: |4| 0x3ffb49e4: 0000:  00 00 00 00 00 00 00 13 17 03 03 00 02           .............
I (535522) mbedTLS: |4| 0x3ffb49e4: dumping 'IV used' (12 bytes)
I (535525) mbedTLS: |4| 0x3ffb49e4: 0000:  ca 77 ab 33 00 00 00 00 00 00 00 13              .w.3........
I (535535) mbedTLS: |4| 0x3ffb49e4: dumping 'TAG used' (16 bytes)
I (535542) mbedTLS: |4| 0x3ffb49e4: 0000:  bc c1 42 da 3d de f3 a6 3c b5 ac 2a df 30 e3 a8  ..B.=...<..*.0..
I (535553) mbedTLS: |2| 0x3ffb49e4: <= decrypt buf
I (535563) mbedTLS: |4| 0x3ffb49e4: dumping 'input payload after decrypt' (2 bytes)
I (535571) mbedTLS: |4| 0x3ffb49e4: 0000:  d0 00                                            ..
I (535579) mbedTLS: |2| 0x3ffb49e4: <= read record
I (535582) mbedTLS: |2| 0x3ffb49e4: <= read
I (536393) mbedTLS: |2| 0x3ffb49e4: => read
I (536394) mbedTLS: |2| 0x3ffb49e4: => read record
I (536395) mbedTLS: |2| 0x3ffb49e4: => fetch input
I (536401) mbedTLS: |2| 0x3ffb49e4: in_left: 0, nb_want: 5

Then it starts all over.

aggarg commented 2 months ago

The observation is the data consumption is directly proportional to the timeout condition, I did not know what data is being exchanged here.

How are you measuring this data consumption?

coratron commented 2 months ago

Directly from the carrier (that's how they bill), I am using their API as well to get higher resolution - but can only see data in chunks of about 15 min.

coratron commented 2 months ago

@aggarg just found this 6th packet is happening for a ping request by enabling coreMQTT logs as well. D (302432) coreMQTT: MQTT PINGREQ packet size is 2.

It seems I am back to @YOSI-yoshidayuji 's suggestion. I will have to check if that patch has been fully implemented in the version I am using.

coratron commented 2 months ago

Ok, I think we can put this to rest. I have removed PINGREQ for now. Data consumption is 3kB per operation which is substantially higher than what I expected but will have to settle for that. I can't see what else but TLS is responsible for the added data.

Thanks for the help @aggarg and @YOSI-yoshidayuji

aggarg commented 2 months ago

I can't see what else but TLS is responsible for the added data.

I did not realize that you were not using encryption before. If you want to further dive deep, we would need some way to capture data. One way can be to setup MQTT broker on a PC and run capture on that.

Thank you for sharing your solution!

coratron commented 2 months ago

Thanks @aggarg

No problem, this will work for now. The next big thing will be to see if the folks at ESP will release a better wrapper for this SDK. Just for perspective, my binary has gone from 800kB to 1.5MB in size.