Azure / azure-iot-sdk-c

A C99 SDK for connecting devices to Microsoft Azure IoT services
https://azure.github.io/azure-iot-sdk-c
Other
585 stars 739 forks source link

IoTHubDeviceClient_UploadMultipleBlocksToBlob_Async can miss some errors #2569

Closed pulkomandy closed 3 months ago

pulkomandy commented 7 months ago

Environment

Problem description

We use IoTHubDeviceClient_UploadMultipleBlocksToBlob_Async to start an upload. However, there is an error during the upload setup. Logs extract:

[warning]: [azure] Failure in HTTP communication: server reply code is 404
[info]: [azure] HTTP Response:<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
                                                         <HTML><HEAD><TITLE>Not Found</TITLE>
                                                         <META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
                                                         <BODY><h2>Not Found</h2>
                                                         <hr><p>HTTP Error 404. The requested resource is not found.</p>
                                                         </BODY></HTML>
[warning]: [azure] HTTP failed response code was 404
[warning]: [azure] unable to HTTPAPIEX_ExecuteRequest
[warning]: [azure] error in IoTHubClient_LL_UploadToBlob_GetBlobCredentialsFromIoTHub
[warning]: [azure] Failed initializing upload in IoT Hub

These error come from https://github.com/Azure/azure-iot-sdk-c/blob/6b7753837ee18d2198759aa625d99081e255248f/iothub_client/src/iothub_client_ll_uploadtoblob.c#L908 and https://github.com/Azure/azure-iot-sdk-c/blob/6b7753837ee18d2198759aa625d99081e255248f/iothub_client/src/iothub_client_core_ll.c#L2881

In our case we are using asynchronous upload so this is called from here inside uploadMultipleBlock_thread:

https://github.com/Azure/azure-iot-sdk-c/blob/main/iothub_client/src/iothub_client_core.c#L2437

The thread function returns an error, but this is not propagated to the caller function that started the thread: https://github.com/Azure/azure-iot-sdk-c/blob/main/iothub_client/src/iothub_client_core.c#L2475 (and it can't be, because it's happening asynchronously in another thread).

There is no way for the caller code to know that the upload has failed. IoTHubClientCore_UploadMultipleBlocksToBlobAsync but the upload callback is never called. I think it could make sense in this case to call the callback (IOTHUB_CLIENT_FILE_UPLOAD_GET_DATA_CALLBACK_EX getDataCallbackEx) with an IOTHUB_CLIENT_FILE_UPLOAD_RESULT error code indicating a failed upload. Otherwise, our software has no way to know that the upload is failed and should be retried. In our case, we make sure to only have one upload at a time, so we stay blocked and never upload anything anymore if this happens.

ewertons commented 5 months ago

Hi @pulkomandy , could you please take a look at these new more-granular upload-to-blob API functions? https://github.com/Azure/azure-iot-sdk-c/blob/main/iothub_client/samples/iothub_client_sample_upload_to_blob_with_retry/iothub_client_sample_upload_to_blob_with_retry.c

They would give you more control over detecting intermediary errors and attemping retries.

ewertons commented 5 months ago

But it's well noted, your assessment of the issue is correct. We will look into fixing it. Update coming soon.

pulkomandy commented 5 months ago

Hello,

We have been using this patch to fix the problem:

From f9d46850bc024989c16110f7c7f8b1a25d51823f Mon Sep 17 00:00:00 2001
From: Nicolas Savatier <nicolas.savatier@viveris.fr>
Date: Thu, 1 Feb 2024 10:22:46 +0000
Subject: [PATCH] [LCB-1144] Add callback call on upload failure in Azure
 Worker Thread

---
 iothub_client/src/iothub_client_core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/iothub_client/src/iothub_client_core.c b/iothub_client/src/iothub_client_core.c
index 6da8b7216..08b911b7f 100644
--- a/iothub_client/src/iothub_client_core.c
+++ b/iothub_client/src/iothub_client_core.c
@@ -2431,10 +2431,22 @@ static int uploadMultipleBlock_thread(void* data)
     if (threadInfo->uploadBlobMultiblockSavedData.getDataCallback != NULL)
     {
         result = IoTHubClientCore_LL_UploadMultipleBlocksToBlob(llHandle, threadInfo->destinationFileName, threadInfo->uploadBlobMultiblockSavedData.getDataCallback, threadInfo->context);
+        if(result != IOTHUB_CLIENT_OK)
+        {
+            //An error happened during IoTHubClientCore_LL_UploadMultipleBlocksToBlob(),
+            //notify the caller by calling its callback function.
+            threadInfo->uploadBlobMultiblockSavedData.getDataCallback(result, NULL, NULL, threadInfo->context);
+        }
     }
     else
     {
         result = IoTHubClientCore_LL_UploadMultipleBlocksToBlobEx(llHandle, threadInfo->destinationFileName, threadInfo->uploadBlobMultiblockSavedData.getDataCallbackEx, threadInfo->context);
+        if(result != IOTHUB_CLIENT_OK)
+        {
+            //An error happened during IoTHubClientCore_LL_UploadMultipleBlocksToBlobEx(),
+            //notify the caller by calling its callback function.
+            threadInfo->uploadBlobMultiblockSavedData.getDataCallbackEx(result, NULL, NULL, threadInfo->context);
+        }
     }
     (void)markThreadReadyToBeGarbageCollected(threadInfo);

-- 
2.39.2

Sorry, we forgot to update the issue with the patch. We have confirmed that it works for us and solves the problem.

We know of the new granular API, but if we can avoid rewriting our code with a simple fix like this, it saves us a lot of time.

ewertons commented 5 months ago

Thank you a lot, @pulkomandy . I thought about the fix following these guidelines, but there is a catch there. It could result in the callback to the user app being called twice (in case the failure happens in IoTHubClient_LL_UploadToBlob_UploadMultipleBlocks). It will need a callback into the iothub_client_core layer first that can be used to control if a callback from iothub_client_ll_uploadtoblob hasn't been invoked already.

ewertons commented 5 months ago

Hi @pulkomandy , we merged a fix for this issue. Thank you very much for reporting it! We will close this issue early next week. If you have a chance to verify the fix from your side, we would appreciate it. Thanks, Azure IoT SDK Team.

pulkomandy commented 5 months ago

Hello,

Thanks for the fix. I don't know if we can fully verify it, since we noticed the error after a problem on the Azure IoT cloud side, which we can't reproduce predictably.

We will try to update the version of the SDK we use whenever possible (possibly waiting for the next release).

ewertons commented 5 months ago

Perfect. Thank you @pulkomandy , we will close this issue for now then.

pulkomandy commented 4 months ago

Hello @ewertons, we have started using this patch and it improves things, unfortunately we have hit another problematic case.

The following scenario happens:

At this point, our code decides that the upload is succesful, we delete the corresponding local file and the upload context.

However, later on there is an error in IotHubClient_LL_UploadToBlob_NotifyCompletion (this happens 1 minute later in our case since it is a timeout error):

 [azure] curl_easy_perform() failed: Timeout was reached
 [azure] (result = HTTPAPI_OPEN_REQUEST_FAILED (4))
 [azure] unable to recover sending to a working state
 [azure] unable to HTTPAPIEX_ExecuteRequest
 [azure] unable to execute send_http_request
 [azure] IoTHubClient_LL_UploadToBlob_NotifyIoTHubOfUploadCompletion failed
 [azure] Failed completing upload to blob.

since there were no errors during the upload phase, isCallbackInvokedWithError is not set, and so the error callback is called.

When using a synchronous upload function such as IoTHubClientCore_LL_UploadMultipleBlobsToBlockEx, this would be equivalent to the callback reporting a success, but the function eventually returning an error code.

As a result of this, we can't know for sure when an upload is really fully completed. Even after we received a notification with FILE_UPLOAD_OK and data or size == NULL, which should mean the upload is succesfully completed, we can still receive an error callback later on. So, we don't know when it is safe to consider an upload complete and successful.

I'm not sure what's the best option here, some ideas are:

pulkomandy commented 4 months ago

Hello,

I made some sequence diagrams to explain the problem more clearly:

image

And the modification I made:

image

ewertons commented 3 months ago

Hi @pulkomandy , I was investigating this latest issue and I think I got it wrong the first time. For iothub_client_core, this should be the logic:

That way, for the convenience layer the final callback will always be called after the whole upload to blob logic is completed. I'll post a PR with those changes, once I do that could you please validate it as well before we merged it?

pulkomandy commented 3 months ago

Hello,

Thanks for the update.

The solution you describe seems fine to me, and may be a bit simpler than what I suggested in my merge request. I am out of office this week, but that isn't a problem, let me know when you have a patch ready for testing and I'll forward it to my colleagues, or look into it when I'm back.

It will take some time before we ship a release to our customer and they are able to test it. Unfortunately, we could reproduce these problems only with on-the-field device and not with our testing platform.

ewertons commented 3 months ago

This is the initial PR. https://github.com/Azure/azure-iot-sdk-c/tree/ewertons/FixThreadedUploadToBlob

I could not run our pipeline against it yet, but it worked locally in my sample run.

ewertons commented 3 months ago

PR https://github.com/Azure/azure-iot-sdk-c/pull/2615 posted and pipeline tests run. Please review if possible, we will be merging it soon.

ewertons commented 3 months ago

HI @pulkomandy, The PR with the fix has been merged. Please let us know if you have any other feedback. Feel free to reopen this case if you need to. And thank you once more for your contributions, we really value your feedback and investment in this project. Thanks, Azure IoT SDKs Team