Azure / azure-iot-sdk-c

A C99 SDK for connecting devices to Microsoft Azure IoT services
https://azure.github.io/azure-iot-sdk-c
Other
587 stars 739 forks source link

Condition for telemetry message resend #2499

Closed ClaesIvarsson closed 11 months ago

ClaesIvarsson commented 1 year ago

Environment: Wifi connection over mesh network (NAT) ESP32 MQTT protocol

I have noticed that some telemetry messages sent from user level will actually be sent twice to the IoT hub . Enabling Trace on the SDK stack shows a resend of the message, both will later get PUBACK response but only after the message is sent twice.

This happens only for some messages and not for all. At first I thought it was related to size of the payload but it is not consistent.

I have tried executing the IoTHubDeviceClient_LL_DoWork with different period but it is also not clear even though it seems to be less frequent with longer period (changed from 40 to 100 ms).

How long is the PUBACK timeout ? I have understood that there is no time set by the MQTT standard , even though I have seen 60s mentioned. In my case the retransmission happens within less than a second.

The environment with wifi connection and possible delays might be a reason for this to happen but I do not understand why the retransmission occurs so fast after first message being sent.

Any other explanation that come to mind ?

ericwolz commented 1 year ago

Timeout is documented in the below link. Please post your device logs.

https://github.com/Azure/azure-iot-sdk-c/blob/main/doc/connection_and_messaging_reliability.md

ClaesIvarsson commented 1 year ago

Timeout is documented in the below link. Please post your device logs.

https://github.com/Azure/azure-iot-sdk-c/blob/main/doc/connection_and_messaging_reliability.md

Thanks for you reply but I think that my questions regarding timeout management for PUBACK in MQTT is not answered in the documentation you are referring to.

I can't find anything related to timeout waiting for PUBACK when using MQTT.

OPTION_MESSAGE_TIMEOUT is marked as deprecated OPTION_RETRY_INTERVAL_SEC is used for connection retries OPTION_DO_WORK_FREQUENCY_IN_MS is not applicable when using IoTHubDeviceClientLL* api's

In the code, there is two timeouts define in ProcessPendingTelemetryMessages. These timeouts are not possible to change in user scope and is set to 60s and 120s.

Here is a log where I get duplication of telemetry messages -> (159454) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices//messages/events/%24.ct=application%2fjson&%24.ce=utf-8 | PACKET_ID: 18 | PAYLOAD_LEN: 950 -> (159508) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices//messages/events/%24.ct=application%2fjson&%24.ce=utf-8 | PACKET_ID: 18 | PAYLOAD_LEN: 950 -> (159559) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: $iothub/twin/PATCH/properties/reported/?$rid=19 | PAYLOAD_LEN: 412 -> (159564) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices//messages/events/%24.ct=application%2fjson&%24.ce=utf-8 | PACKET_ID: 20 | PAYLOAD_LEN: 196 <- (159573) PUBACK | PACKET_ID: 18 <- (159656) PUBACK | PACKET_ID: 18 <- (159790) PUBACK | PACKET_ID: 20 <- (159915) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: $iothub/twin/res/204/?$rid=19&$version=7661 | PAYLOAD_LEN: 0

Number in parentheses is millisec from start DoWork is executed approximately every 20ms and the delta times between original PUB and the second PUB is 54. For me this is strange that it happens so fast since all timeouts are on a completely different level.

I am just trying to understand why this happens! As telemetry messages are DELIVER_AT_LEAST_ONCE, it could happen that there are duplicates but I can't find an explanation when this will happen.

ericwolz commented 1 year ago

You have removed pertinent information from the SDK logs for us to debug your situation.

ClaesIvarsson commented 1 year ago
-> (159454) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/a584297f-a9eb-4eab-9fdc-d392fa863dc4/messages/events/%24.ct=application%2fjson&%24.ce=utf-8 | PACKET_ID: 18 | PAYLOAD_LEN: 950
-> (159508) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/a584297f-a9eb-4eab-9fdc-d392fa863dc4/messages/events/%24.ct=application%2fjson&%24.ce=utf-8 | PACKET_ID: 18 | PAYLOAD_LEN: 950
-> (159559) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: $iothub/twin/PATCH/properties/reported/?$rid=19 | PAYLOAD_LEN: 412
-> (159564) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/a584297f-a9eb-4eab-9fdc-d392fa863dc4/messages/events/%24.ct=application%2fjson&%24.ce=utf-8 | PACKET_ID: 20 | PAYLOAD_LEN: 196
<- (159573) PUBACK | PACKET_ID: 18
<- (159656) PUBACK | PACKET_ID: 18
<- (159790) PUBACK | PACKET_ID: 20
<- (159915) PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_MOST_ONCE | TOPIC_NAME: $iothub/twin/res/204/?$rid=19&$version=7661 | PAYLOAD_LEN: 0
ericwolz commented 1 year ago

There are no service logs with a device ID of a584297f-a9eb-4eab-9fdc-d392fa863dc4. What hub is this on? What are the UTC timestamp of these messages?

ClaesIvarsson commented 1 year ago

Do you suspect that this is invalid behavior or can it really happen under certain circumstances ?

I am using 1.11.0 of the azure-iot-sdk-c on an ESP32 WROVER-IE in a wifi mesh network with NAT (might impact delays in the network). command used for sending is IoTHubDeviceClient_LL_SendEventAsync I use only IoTHubDeviceClientLL* api's I always call these from the same task. DoWork is called every 50ms

I will not be able to provide a more detailed log until after my vacation, starting today. Back in 4 weeks.

ericwolz commented 1 year ago

This could happen if your clocks drift and there is a significant time jump when updating NTP. The ESP32 internal oscillator is not very good. The message callback should have status on what is going on.

ClaesIvarsson commented 1 year ago

OK, so how much is a significant time jump ?

ericwolz commented 1 year ago

greater than the message timeout.

ClaesIvarsson commented 1 year ago

So, I guess we are back to my original question: How long is the timeout for PUBACK of an MQTT message ? I cannot find any documentation stating this information.

ericwolz commented 1 year ago

The max timeout wait for PUBACK is 2 minutes.

https://github.com/Azure/azure-iot-sdk-c/blob/97fef570416467598100b782ef27ceadad9ca796/iothub_client/src/iothubtransport_mqtt_common.c#L50

ClaesIvarsson commented 11 months ago

Problem I have is not related to the 2 minute timeout, I have checked that there is no time jump that could explain the 2 minute timeout to expire.

ericwolz commented 11 months ago

You need to trace thru the code to debug it.