AzureIoTHubClient_SendTelemetry() returns success when the message is not successfully delivered

rtheil-growlink commented 2 years ago

The expectation is that when calling AzureIoTHubClient_SendTelemetry(), if it’s not able to deliver the telemetry to iothub at that moment, it will return something like eAzureIoTErrorFailed, and block until delivered or timed out. However, what I’m finding is that it always returns eAzureIoTSuccess until the azure client has detected the disconnection. I notice that these messages are then going into some sort of queue because minutes later when I restore the internet connection, it delivers these messages. This isn’t the behavior we want to implement. It’s important that it’s known at the time of calling SendTelemetry whether or not the delivery of the message was successful, even if the call is blocking.

The scenario I am trying to prevent: Controller is successfully connected to azure. Internet connection is lost (in my testing, a firewall rule to block the controller's internet). AzureIoTHubClient_SendTelemetry is called and returns success. Eventually azure client recognizes connectivity is lost, then shortly after, there’s a power failure. The telemetry will never be delivered because it’s in the azure client queue. If AzureIoTHubClient_SendTelemetry returns something other than success, we know at that moment to write the message to flash and it’ll be delivered later when connectivity is restored.

I have tested defining the macro azureiotconfigKEEP_ALIVE_TIMEOUT_SECONDS and that does seem to be working well to reduce the likelihood of the aforementioned scenario. I’ve lowered it to 10 seconds for testing, and it behaves as expected. However, even if it is set very low, my example scenario can still occur since we send telemetry upwards of every 30 seconds, and sometimes we’ll send 3 messages at that 30 second interval.

Is there any way to get AzureIoTHubClient_SendTelemetry() to block until it is confirmed the message is delivered to IoT Hub?

danewalton commented 2 years ago

Hi @rtheil-growlink I think you can achieve the behavior you are looking for by providing a callback for the telemetry "send", which will be invoked when a PUBACK is received for a sent message. For QOS0 and QOS1, there is no way for us to know within AzureIoTHubClient_SendTelemetry() whether the hub actually received the message or not. If it's sent with QOS0, there is no guarantee that the message will ever be delivered. With QOS1, we could have put a "wait" loop in the API which waited for the puback for the message, but we decided to give flexibility to the developer to care (or not) about the puback for the QOS1 scenario. If it's important, they can provide the callback. If not, then they can have simpler code without it.

Let us know if that helps!

rtheil-growlink commented 2 years ago

I think this becomes tough in my case because I'm implementing all this into an existing codebase that relies on a boolean response since older hardware submitted these messages via http(s). We'd know within a second or two if the delivery was successful, so when sending telemetry, in order to emulate the http method, there's an expectation of a boolean response in order to tell the call stack that the delivery was unsuccessful, so write to storage for later upload.

I think I will play with the idea of using a callback with a short wait after the SendTelemetry call, and see if I can make that work. Would a callback tied to SendTelemetry be called when AzureIoTHubClient_ProcessLoop is run?

danewalton commented 2 years ago

Correct. The callback is invoked here on an incoming event, processed in the _ProcessLoop():

https://github.com/Azure/azure-iot-middleware-freertos/blob/ccfda9716f39a70c6772b6fa22f8bf7fecacee90/source/azure_iot_hub_client.c#L145-L162

In theory the PUBACK should be quick so you could wrap the _Send() and _ProcessLoop() in one function and exit once the puback is returned, essentially making it a "sync" function. It's not a guarantee that the first message which the ProcessLoop will pick up is the puback, which is why we decided against putting it directly in the _SendTelemetry() API.

rtheil-growlink commented 2 years ago

Yeah, I think that all makes sense. I have azure running itself in its own freertos task, which constantly calls _ProcessLoop(), so since messages are sent in a separate task, I should be able to do a loop that waits for a few seconds while calling vTaskDelay(1). If success, return true, if timeout, return false.

Thanks for all the help with this stuff!

Azure / azure-iot-middleware-freertos

AzureIoTHubClient_SendTelemetry() returns success when the message is not successfully delivered #233