Azure / azure-iot-sdk-c

A C99 SDK for connecting devices to Microsoft Azure IoT services
https://azure.github.io/azure-iot-sdk-c
Other
588 stars 737 forks source link

Help needed with connectionStatusCallback not triggered when sending control packet data failed #2627

Closed pcoll-jcm closed 2 months ago

pcoll-jcm commented 5 months ago

We are encountering an issue with the Azure IoT SDK for ESP32.

SDK version: LTS_03_2024

When the following error trace appears:

Error: Time: File:./components/esp-azure/port/src/tlsio_esp_tls.c Func:tlsio_esp_tls_send_async Line:552 tlsio_esp_tls_send_async without a prior successful open

Error: Time: File:./components/esp-azure/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:sendPacketItem Line:393 Failure sending control packet data

the connectionStatusCallback with the result IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED is not being triggered, and the error appears indefinitely. In all other cases of connection and disconnection, the callback is triggered correctly. (so , in general, the callback worked excepting this case)

In this case, the hub sees the client as disconnected, but our software cannot take any action to reconnect without the callback, as it is our only tool to get connectivity feedback.

What other method can we use to confirm that the periodic control packet has been sent successfully?

It's crucial for us to keep data consumption minimal. We know that if any telemetry is sent in this state, your SDK reports that the message wasn't delivered and our software can start the reconnection procedure.. However, constantly sending messages for connectivity checks isn't feasible.

Essentially, we need a method to detect and handle this situation effectively.

Thank you.

pcoll-jcm commented 2 months ago

Good morning,

We've been investigating this issue on our devices. Our theory is that the problem comes from the MQTT connection between the device and IoT Hub.

Below there's the expected SDK behaviour based on the code review:

  1. We connect our device to IoT Hub
  2. Once connected, in order to check the IoT Hub connection the device executes a ping to the server every 240 seconds (4 minutes).
  3. If the ping is successful, the device interprets that it is connected, does nothing and waits for another 240 seconds to ping again. This is done by setting the TimeToPing counter variable again to 240 seconds.
  4. If the ping is unsuccessful, it waits for a ping response for 80 seconds and after that, the callback error is catched by the SDK.

Here's our steps to reproduce the behaviour at SDK level (We've been doing the tests with LTS_08_2024 version of SDK):

  1. We connect our device to IoT Hub
  2. The device starts the process to execute a ping to the server every 240 seconds (4 minutes).
  3. During a TimeToPing waiting time, we disable the device from IoTHub.
  4. After doing that, the IoTHub closes the MQTT connection to the device (the only trace we see in our ESP32 is esp-tls-mbedtls: read error :-0x004c). So in this state the connection is closed.
  5. Once the ping is executed again, an error appears 'Failure Sending control packet data', and the TimeToPing is setted again to 240 seconds, entering into an infinite loop and avoiding to catch the callback error and starting the reconnection process again. From this status, if we enable the device again, the reconnection process is never started and the device needs to be rebooted.

image (1)

From there, we started checking the code:

The ping is sended from the highlighted lines below (see mqtt_client.c -> mqtt_client_dowork

image (2)

The implementation of sendPacket controls some causes of failure when trying to execute. If it fails, it generates a MU_FAILURE that does not update the packetSendTimeMs.

image (3)

Coming back to the mqtt_client_dowork routine, the result of the sendPacketItem is not catched. Because of that, the tickcounter_get_current_ms is never executed.

image (4)

The code that generates the callback is never executed, so the device does not know it is not connected. And does not try to connect again.

image (5)

After reviewing the code, we applied a patch on the mqtt module in order to avoid this infinite loop (file mqtt_client_modified attached):

In the file mqtt_client_modified.c we have modified the mqtt_client.c to add some logs on our screen and apply the patch

If you define #define PATCHED_MQTT_PING_JCM the patch is applied

If you define #define LOG_TEST_PINGREQ the log to see what is happening is activated

Here is the behaviour with the patch applied, the device did not enter into the infinite loop and started the reconnection process successfully.

image (6)

In the following zip there are the original and modified files. The modified includes the patch code.

mqtt_client - original & modified.zip

pcoll-jcm commented 2 months ago

We opened a new issue as a bug as a continuation of this thread. You can find it here: https://github.com/Azure/azure-iot-sdk-c/issues/2649