Closed knn closed 3 years ago
Hey @knn thanks for the investigative work and suggested fix. We have a PR to fix this with #557. Long term issues like this are often hard to find and repro so we appreciate the effort!
Hi danewalton,
Does this issue only affect the Azure IoT device SDK, or does it also affect Azure IoT Edge modules? Our understanding is that both share the same IoT device SDK.
Yes, any 32-bit device that uses this code could theoretically run into this issue. If you think you might be exposed to it, I would suggest updating to pick up the fix.
We use the following hardware/software setup:
Our IoT devices aren't regularly restarted, but can typically run for multiple weeks or months at a time. We noticed that after about 25 days of uptime, the devices aren't usable anymore, because they persistently perform disconnect-reconnect cycles every few seconds.
The disconnect causes an invocation of the connection status callback with status `IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED` and reason `IOTHUB_CLIENT_CONNECTION_NO_NETWORK`, although the network is evidently available, since a subsequent reconnection succeeds.
We were able to rule out any WiFi connection problems, so we dug deeper into this issue.
Using the debug output of the MQTT communication, we detected that the SDK doesn't send `PINGREQ` packets to the IoT hub anymore:
The culprit is this condition in the MQTT client, which doesn't evaluate to true even when the MQTT keep-alive timeout (10 seconds in our case) has passed. As already mentioned, this causes the SDK to cease sending `PINGREQ` packets to the IoT hub. The IoT hub subsequently terminates the connection, because it hasn't received a `PINGREQ` packet within the maximum keep-alive waiting time. The disconnection on the IoT hub side leads to a reconnection on the client side.
We then inspected the Linux time modules tickcounter_linux.c, linux_time.h and linux_time.c (in our case, `CLOCK_MONOTONIC` is defined). The MQTT client uses the function `tickcounter_get_current_ms()`, which itself uses the function `get_time_ms()`, which returns a count of milliseconds as a `time_t`. Apparently, the count is the current clock converted to milliseconds: (Link)
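For readers without the linked code at hand, the conversion described above can be sketched roughly as follows. This is an illustration only, not the SDK's actual implementation: `timespec_to_ms` and `monotonic_now_ms` are hypothetical names, and the sketch deliberately uses `int64_t` for the result so that the arithmetic itself cannot truncate.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: convert a timespec to whole milliseconds using 64-bit math. */
static int64_t timespec_to_ms(struct timespec ts)
{
    return (int64_t)ts.tv_sec * 1000 + (int64_t)ts.tv_nsec / 1000000;
}

/* Hypothetical helper: read CLOCK_MONOTONIC and return it in milliseconds,
 * or -1 if the clock cannot be read. */
static int64_t monotonic_now_ms(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
    {
        return -1;
    }
    return timespec_to_ms(ts);
}
```

A millisecond count grows a thousand times faster than a second count, so the width of the type that holds the result matters far more here than it does for a plain seconds-based timestamp.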
The `time_t` type is not portable, because it isn't defined to a fixed type in the C standard (Source).
For our musl version, `time_t` is defined as a `long`, which is an `int32_t` integer on MIPS. This means `time_t` overflows after counting 2^31 milliseconds, which is ~25 days. In fact, it may even overflow earlier, because the starting point of `clock_gettime()` is unspecified (Source):
In our setup, the starting point is the current uptime of the OpenWrt Linux system. Note that the musl C library has a 64-bit `time_t` since version 1.2.0 (musl time64 Release Notes). This version, however, is not available on OpenWrt 19.07.4, and the RC of the next version 21.02.0 also uses musl 1.1.x (OpenWrt 21.02.0-rc1 Release Notes), so this issue can't be fixed by a library update.
Thus, we use the attached patch to fix this problem on our HW/SW setup. We've replaced the `time_t` with the `int64_t` datatype and replaced the `difftime()` invocation with a simple subtraction.
I haven't submitted a pull request, because I haven't had time to look at the other platforms supported by the Azure IoT C SDK.
Could someone please check if this problem also applies to the other platforms?
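One quick way to check a given platform is to look at the width of `time_t` with the toolchain used for that target. The sketch below is only a suggestion; `time_t_bits` and `days_until_ms_overflow` are hypothetical helper names, and the second one assumes a signed integer millisecond counter that starts at zero.

```c
#include <limits.h>
#include <stddef.h>
#include <time.h>

/* Width of time_t in bits on the platform this is compiled for. */
static size_t time_t_bits(void)
{
    return sizeof(time_t) * CHAR_BIT;
}

/* Days until a signed millisecond counter of the given width overflows,
 * assuming it starts counting at zero. */
static double days_until_ms_overflow(unsigned bits)
{
    double max_ms = 1.0;
    for (unsigned i = 0; i + 1 < bits; i++)
    {
        max_ms *= 2.0; /* ends at 2^(bits - 1) */
    }
    return max_ms / (1000.0 * 60.0 * 60.0 * 24.0);
}
```

Printing `time_t_bits()` and `days_until_ms_overflow((unsigned)time_t_bits())` from a small `main` built for the target shows whether it is exposed: a 32-bit result corresponds to a little under 25 days, which matches the uptime at which we saw the reconnect cycles begin.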