Azure / azure-c-shared-utility

Azure C SDKs common code
Other
111 stars 203 forks source link

Overflow in get_time_ms() after ~25 days uptime on OpenWrt 19.07.4 and C library musl 1.1.24 #526

Closed knn closed 3 years ago

knn commented 3 years ago

We use the following hardware/software setup:

Our IoT devices aren't regularly restarted, but can typically run for multiple weeks or months at a time. We noticed that after about 25 days of uptime, the devices aren't usable anymore, because they persistently perform disconnect-reconnect cycles every few seconds.

The disconnect causes an invocation of the connection status callback with status IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED and reason IOTHUB_CLIENT_CONNECTION_NO_NETWORK, although the network is available, because a subsequent reconnection is successful.

We were able to rule out any WiFi connection problems, so we dug deeper into this issue.

Using the debug output of the MQTT communication, we detected that the SDK doesn't send PINGREQ packets to the IoT hub anymore:

09:40:17 CONNECT | VER: 4 | KEEPALIVE: 10 | FLAGS: 128 | USERNAME: <redacted>.azure-devices.net/<redacted>/?api-version=2020-09-30&DeviceClientType=iothubclient%2f1.6.0%20(native%3b%20Linux%3b%20mips) | CLEAN: 0
09:40:17 CONNACK | SESSION_PRESENT: true | RETURN_CODE: 0x0
09:40:17 SUBSCRIBE | PACKET_ID: 57028 | TOPIC_NAME: devices/<redacted>/messages/devicebound/# | QOS: 1 | TOPIC_NAME: $iothub/twin/res/# | QOS: 0 | TOPIC_NAME: $iothub/twin/PATCH/properties/desired/# | QOS: 0
09:40:17 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/<redacted>/messages/events/%24.ct=application%2Fjson&%24.ce=utf-8 | PACKET_ID: 57029 | PAYLOAD_LEN: 396
writev(1, [{iov_base=" PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/<redacted>/messages/events/%24.ct=application%2Fjson&%24.ce=utf-8 | PACKET_ID: 57030 | PAYLOAD_LEN: 676
09:40:18 SUBACK | PACKET_ID: 57028 | RETURN_CODE: 1 | RETURN_CODE: 0 | RETURN_CODE: 0
09:40:18 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: $iothub/twin/GET/?$rid=57031
09:40:18 PUBACK | PACKET_ID: 57029
09:40:18 PUBACK | PACKET_ID: 57030
09:40:18 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: QOS_VALUE_INVALID | TOPIC_NAME: $iothub/twin/res/200/?$rid=57031 | PAYLOAD_LEN: 341
09:40:18 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: <redacted> | PACKET_ID: 3526 | PAYLOAD_LEN: 346
09:40:18 ", iov_len=1013}, {iov_base="PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/<redacted>/messages/events/%24.ct=application%2Fjson&%24.ce=utf-8 | PACKET_ID: 57032 | PAYLOAD_LEN: 395", iov_len=206}], 2) = 1219
09:40:19 PUBACK | PACKET_ID: 57032

Missing PINGREQ packets from the client

09:40:35 CONNECT | VER: 4 | KEEPALIVE: 10 | FLAGS: 128 | USERNAME: <redacted>.azure-devices.net/<redacted>/?api-version=2020-09-30&DeviceClientType=iothubclient%2f1.6.0%20(native%3b%20Linux%3b%20mips) | CLEAN: 0
09:40:36 CONNACK | SESSION_PRESENT: true | RETURN_CODE: 0x0
09:40:36 SUBSCRIBE | PACKET_ID: 57033 | TOPIC_NAME: devices/<redacted>/messages/devicebound/# | QOS: 1 | TOPIC_NAME: $iothub/twin/res/# | QOS: 0 | TOPIC_NAME: $iothub/twin/PATCH/properties/desired/# | QOS: 0
09:40:36 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/<redacted>/messages/events/%24.ct=application%2Fjson&%24.ce=utf-8 | PACKET_ID: 57034 | PAYLOAD_LEN: 396
09:40:36 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/<redacted>/messages/events/%24.ct=application%2Fjson&%24.ce=utf-8 | PACKET_ID: 57035 | PAYLOAD_LEN: 676
writev(1, [{iov_base=" SUBACK | PACKET_ID: 57033 | RETURN_CODE: 1 | RETURN_CODE: 0 | RETURN_CODE: 0
09:40:37 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: $iothub/twin/GET/?$rid=57036
09:40:37 PUBACK | PACKET_ID: 57034
09:40:37 PUBACK | PACKET_ID: 57035
09:40:37 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: QOS_VALUE_INVALID | TOPIC_NAME: $iothub/twin/res/200/?$rid=57036 | PAYLOAD_LEN: 341
09:40:37 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: <redacted> | PACKET_ID: 3527 | PAYLOAD_LEN: 346
09:40:37 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/<redacted>/messages/events/%24.ct=application%2Fjson&%24.ce=utf-8 | PACKET_ID: 57037 | PAYLOAD_LEN: 395
09:40:37 ", iov_len=1013}, {iov_base="PUBACK | PACKET_ID: 57037", iov_len=25}], 2) = 1038

The culprit is this condition in the MQTT client, which doesn't evaluate to true even when the MQTT keep-alive timeout (10 seconds in our case) has passed. As already mentioned, this causes the SDK to cease sending PINGREQ packets to the IoT hub. The IoT hub subsequently terminates the connection, because it hasn't received a PINGREQ packet within the max. keep-alive waiting time. The disconnection on the IoT hub side leads to a reconnection on the client side.

We then inspected the Linux time modules tickcounter_linux.c, linux_time.h and linux_time.c (in our case, CLOCK_MONOTONIC is defined). The MQTT client uses the function tickcounter_get_current_ms() function that itself uses the function get_time_ms() which returns a count of milliseconds as a time_t type. Apparently, the count is the current clock converted to milliseconds:

time_t get_time_ms()
{
    struct timespec ts;
    if (get_time_ns(&ts) != 0)
    {
        LogError("Failed to get the current time");
        return INVALID_TIME_VALUE;
    }

    return (time_t)(ts.tv_sec * 1000 + ts.tv_nsec / 1000000);
}

(Link)

The time_t type is not portable, because it isn't defined to a fixed type in the C standard (Source).

For our musl version, time_t is defined to a long which is an int32_t integer on MIPS. This means time_t overflows after counting 2^31 milliseconds which is ~25 days. In fact, it may even overflow earlier, because the starting point of clock_gettime() is unspecified (Source):

CLOCK_MONOTONIC
    Clock that cannot be set and represents monotonic time since some unspecified starting point.

In our setup, the starting point is the current uptime of the OpenWrt Linux system. Note that the musl C library has a 64-bit time_t since version 1.2.0 (musl time64 Release Notes). This version, however, is not available on OpenWrt 19.07.4 and the RC of the next version 21.02.0 also uses musl 1.1.x (OpenWrt 21.02.0-rc1 Release Notes), so this issue can't be fixed by a library update.

Thus, we use the attached patch to fix this problem on our HW/SW setup. We've replaced the time_t with the int64_t datatype and replaced the difftime() invocation with a simple subtraction.

I haven't submitted a pull request, because I haven't had time to look at the other platforms supported by the Azure IoT C SDK.

Could someone please check if this problem also applies to the other platforms?

Index: azure-iot-sdk-c-LTS_01_2021_Ref01/c-utility/adapters/linux_time.c
===================================================================
--- azure-iot-sdk-c-LTS_01_2021_Ref01.orig/c-utility/adapters/linux_time.c
+++ azure-iot-sdk-c-LTS_01_2021_Ref01/c-utility/adapters/linux_time.c
@@ -61,7 +61,7 @@ int get_time_ns(struct timespec* ts)
     return err;
 }

-time_t get_time_ms()
+int64_t get_time_ms()
 {
     struct timespec ts;
     if (get_time_ns(&ts) != 0)
@@ -70,6 +70,6 @@ time_t get_time_ms()
         return INVALID_TIME_VALUE;
     }

-    return (time_t)(ts.tv_sec * 1000 + ts.tv_nsec / 1000000);
+    return (int64_t)(ts.tv_sec * 1000 + ts.tv_nsec / 1000000);
 }

Index: azure-iot-sdk-c-LTS_01_2021_Ref01/c-utility/adapters/linux_time.h
===================================================================
--- azure-iot-sdk-c-LTS_01_2021_Ref01.orig/c-utility/adapters/linux_time.h
+++ azure-iot-sdk-c-LTS_01_2021_Ref01/c-utility/adapters/linux_time.h
@@ -4,8 +4,8 @@
 #ifndef LINUX_TIME_H
 #define LINUX_TIME_H

-#include <time.h>
 #include <pthread.h>
+#include <stdint.h>

 #ifndef __MACH__
 extern clockid_t time_basis;
@@ -13,9 +13,9 @@ extern clockid_t time_basis;

 extern void set_time_basis(void);
 extern int get_time_ns(struct timespec* ts);
-extern time_t get_time_ms(void);
+extern int64_t get_time_ms(void);

-#define INVALID_TIME_VALUE      (time_t)(-1)
+#define INVALID_TIME_VALUE      (int64_t)(-1)

 #define NANOSECONDS_IN_1_SECOND 1000000000L
Index: azure-iot-sdk-c-LTS_01_2021_Ref01/c-utility/adapters/tickcounter_linux.c
===================================================================
--- azure-iot-sdk-c-LTS_01_2021_Ref01.orig/c-utility/adapters/tickcounter_linux.c
+++ azure-iot-sdk-c-LTS_01_2021_Ref01/c-utility/adapters/tickcounter_linux.c
@@ -12,7 +12,7 @@

 typedef struct TICK_COUNTER_INSTANCE_TAG
 {
-    time_t init_time_value;
+    int64_t init_time_value;
     tickcounter_ms_t current_ms;
 } TICK_COUNTER_INSTANCE;

@@ -57,7 +57,7 @@ int tickcounter_get_current_ms(TICK_COUN
     }
     else
     {
-        time_t time_value = get_time_ms();
+        int64_t time_value = get_time_ms();
         if (time_value == INVALID_TIME_VALUE)
         {
             result = MU_FAILURE;
@@ -65,7 +65,7 @@ int tickcounter_get_current_ms(TICK_COUN
         else
         {
             TICK_COUNTER_INSTANCE* tick_counter_instance = (TICK_COUNTER_INSTANCE*)tick_counter;
-            tick_counter_instance->current_ms = (tickcounter_ms_t)(difftime(time_value, tick_counter_instance->init_time_value));
+            tick_counter_instance->current_ms = (tickcounter_ms_t) time_value - tick_counter_instance->init_time_value;
             *current_ms = tick_counter_instance->current_ms;
             result = 0;
         }
danewalton commented 3 years ago

Hey @knn thanks for the investigative work and suggested fix. We have a PR to fix this with #557. Long term issues like this are often hard to find and repro so we appreciate the effort!

KevinKao809 commented 3 years ago

Hi danewalton,

May I know does this issue only impacts on Azure IoT Device (SDK) ? or also on Azure IoT Edge module? We understand both of them to share the same IoT device SDK.

danewalton commented 3 years ago

Yes any device which is 32 bit which uses this code could theoretically run into this issue. If you think you might be exposed to it I would suggest you update using the fix.