Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.

[BUG] JAVA SDK - BlobClient randomly attempts retry during binary data upload, resulting error 409 #37748

Open ylimc opened 11 months ago

ylimc commented 11 months ago

Describe the bug
In our application we configured the client timeout to 30 seconds for binary data uploads using the BlobClient upload method (note: this method sets overwrite to false). However, we discovered from the application log that once in a while the upload method would return error 409, complaining that the blob had already been created.

We looked at the application log closely and found that the application code invoking the upload method was blocked for 34 seconds (more than the 30-second timeout) before receiving error 409.

Upon checking the Blob Storage log, we then found that the client had made two upload attempts. The first took slightly over 30 seconds and successfully uploaded the binary data; a few seconds later a second upload of the same binary data appeared and resulted in error 409.

Due to this issue, whenever the application encounters a 4XX error it marks the upload as a failure. This passes the wrong information to the downstream application, since the binary data upload was actually successful.

Exception or Stack Trace N/A

To Reproduce
Steps to reproduce the behavior: based on our observations, this issue only happens when the upload takes slightly longer than the client-side timeout. In our case the timeout was set to 30 seconds and the first upload attempt took 31 seconds.

Code Snippet
blobClient.upload(BinaryData.fromString(content));

Expected behavior
When using the upload method without overwrite, the client should not retry on timeout. Instead, the upload attempt should fail, and it should be up to the application to decide whether or not to retry.

If the retry mechanism is built in and enabled by default, please make it explicitly configurable when using upload or its overloads, so that automatic client retries can be enabled or disabled.
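As a possible workaround for the behavior requested above (a sketch only, not verified against this exact failure mode), azure-storage-blob exposes RequestRetryOptions on the client builder; setting maxTries to 1 should stop the storage pipeline from re-sending the request after a timeout. The connection string is a placeholder:

```java
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.common.policy.RequestRetryOptions;
import com.azure.storage.common.policy.RetryPolicyType;

// RequestRetryOptions(retryPolicyType, maxTries, tryTimeoutInSeconds,
//                     retryDelayInMs, maxRetryDelayInMs, secondaryHost)
// maxTries = 1 means the initial attempt only, with no automatic retries.
RequestRetryOptions noRetry = new RequestRetryOptions(
        RetryPolicyType.FIXED, 1, 30, null, null, null);

BlobServiceClient client = new BlobServiceClientBuilder()
        .connectionString("<connection-string>") // placeholder
        .retryOptions(noRetry)
        .buildClient();
```

With this configuration a timed-out upload would surface as an exception to the application instead of being silently re-attempted by the pipeline.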

Screenshots N/A

Setup (please complete the following information):

Additional context N/A.

Information Checklist
Kindly make sure that you have added all of the information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

ibrahimrabab commented 11 months ago

Hi @ylimc Thank you for reaching out! Can you provide the exact code snippet that is being used to reproduce this behavior so that we can inspect the code and try to reproduce it on our end? Thanks!

ylimc commented 11 months ago

Hi @ibrahimrabab Thank you so much for your reply. Unfortunately, due to my company's policy I cannot share the exact code snippet with you.

However, the code we had wasn't much different from the example provided by this doc https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-upload-java.

Apart from the upload code, we only configured the read/write/response timeouts to 30 seconds when building the httpClient for the BlobServiceClient.

ibrahimrabab commented 11 months ago

@ylimc how much data are you uploading with the upload method? Also, can you try upgrading to the latest version of azure-storage-blob (12.25.0) and see if you are still seeing this issue?

ylimc commented 11 months ago

Hi @ibrahimrabab thank you for getting back to this issue. The data the application uploads with each attempt is very small: it is in JSON format and usually less than 50 lines.

We haven't upgraded to the latest 12.25.0 version yet, but we are planning to do so in the next release. Until then we cannot verify whether this issue is fixed.

ibrahimrabab commented 10 months ago

@ylimc Are you calling upload(BinaryData) on the same blob that has already been created, or calling it on new blobs only? Because calling upload() on an existing blob will throw a 409.

Also for your httpClient, can you share how you are setting that up with the 30 second timeout for the BlobServiceClient? Thanks!
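To illustrate the overwrite semantics being asked about, here is a tiny in-memory stand-in (plain Java, not the Azure SDK): upload(BinaryData) defaults overwrite to false and fails on an existing blob (the SDK throws a 409 BlobStorageException), while the upload(BinaryData, boolean) overload can replace existing content:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in illustrating BlobClient.upload() overwrite semantics.
// Class and method names mirror the SDK but this is only an illustration.
class FakeBlobStore {
    private final Map<String, String> blobs = new HashMap<>();

    // Mirrors upload(BinaryData): overwrite defaults to false, so an
    // existing blob causes a conflict (the real SDK throws a 409).
    void upload(String name, String content) {
        upload(name, content, false);
    }

    // Mirrors upload(BinaryData, overwrite): existing content is replaced
    // only when overwrite is true.
    void upload(String name, String content, boolean overwrite) {
        if (!overwrite && blobs.containsKey(name)) {
            throw new IllegalStateException("409 BlobAlreadyExists: " + name);
        }
        blobs.put(name, content);
    }

    String get(String name) {
        return blobs.get(name);
    }
}
```

This is why a hidden retry of an upload that actually succeeded produces a 409: the second attempt behaves exactly like an upload to a pre-existing blob.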

github-actions[bot] commented 10 months ago

Hi @ylimc. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

ylimc commented 10 months ago

Hi @ibrahimrabab sorry for the late reply.

The following code snippet is how we are setting the 30s timeout for the client.

import com.azure.core.http.HttpClient;
import com.azure.core.http.netty.NettyAsyncHttpClientBuilder;
import java.time.Duration;

HttpClient httpClient = new NettyAsyncHttpClientBuilder()
    .connectTimeout(Duration.ofSeconds(10))
    .readTimeout(Duration.ofSeconds(30))
    .writeTimeout(Duration.ofSeconds(30))
    .responseTimeout(Duration.ofSeconds(30))
    .build();

As for your question regarding the blobClient we use to call the upload() method: we use the same blobClient. We are using Spring Boot, so we created the blobClient as a singleton bean, and it should always be the same object being invoked.

We first create a singleton BlobContainerClient with the container name, then call getBlobClient(filePath) to get the actual BlobClient on which we call the upload() method. So we are probably reusing the same blobClient to call the upload method in this case.
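For reference, the wiring described above might look roughly like this (a sketch; the bean name, container name, and config class are hypothetical, while BlobContainerClient and getBlobClient are real azure-storage-blob APIs):

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BlobConfig {
    // Singleton container client, reused for every upload.
    @Bean
    public BlobContainerClient containerClient(BlobServiceClient serviceClient) {
        return serviceClient.getBlobContainerClient("my-container"); // placeholder name
    }
}

// At upload time, a per-path BlobClient is obtained from the singleton:
// BlobClient blobClient = containerClient.getBlobClient(filePath);
// blobClient.upload(BinaryData.fromString(content));
```

Note that getBlobClient returns a lightweight client for that blob path; the underlying HTTP pipeline and its retry policy come from the shared BlobServiceClient.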

ibrahimrabab commented 9 months ago

@ylimc The 409 error code could be occurring due to a race condition. What could be happening is that the upload may complete on the service side just as the connection times out, so the blob exists on the service, but the client side gets a TimeoutException. Because of this, the client may retry the upload, which then results in a 409 (conflict).

Can you add some logic to inspect the error more closely? Since you get back a 409 for BlobAlreadyExists, can you check either programmatically or in the portal whether the blob actually exists?

As for our side, we could provide functionality where, if the operation has HTTP preconditions and is known to mutate data on the service side, we won't retry. However, this would mean the TimeoutException that is currently caught and retried would instead be thrown from the SDK. Would that work better for you than receiving a BlobStorageException with a 409 status code and a BlobErrorCode (https://learn.microsoft.com/en-us/rest/api/storageservices/blob-service-error-codes)?
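One way to implement the programmatic check suggested above is to classify a 409 as a disguised success when the blob is actually present. The helper below is hypothetical; in real code its inputs would come from a caught com.azure.storage.blob.models.BlobStorageException (e.getStatusCode(), e.getErrorCode() compared to BlobErrorCode.BLOB_ALREADY_EXISTS) and from blobClient.exists():

```java
// Sketch: classify a failed upload as "actually succeeded" when the hidden
// retry raced a slow-but-successful first attempt, as described in this thread.
class UploadOutcome {
    // statusCode  <- BlobStorageException.getStatusCode()
    // errorCode   <- BlobStorageException.getErrorCode().toString()
    // blobExists  <- blobClient.exists()
    static boolean actuallySucceeded(int statusCode, String errorCode, boolean blobExists) {
        // A 409 BlobAlreadyExists where the blob is present on the service
        // means the first attempt won: the data is there.
        return statusCode == 409
                && "BlobAlreadyExists".equals(errorCode)
                && blobExists;
    }
}
```

With a check like this, the application could report success to downstream systems instead of propagating a misleading failure.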

ylimc commented 9 months ago

Hi @ibrahimrabab thanks for the updated info. What you described seems to be exactly what happened to us.

As to your question, we did check the Blob Storage side and verified that the blob existed when we encountered this 409 error. So we aren't really losing anything, as the data was successfully uploaded. However, the issue is that on any 4XX error the client side doesn't know the blob was actually uploaded, so it passes the wrong information to the downstream application, which causes confusion in the end.

Also, to your last comment: we'd prefer the SDK to throw an exception on timeout instead of silently retrying on its own. It should be up to the client side to decide whether to retry when encountering an exception.

In our case we have implemented retry logic for 5XX errors and timeouts, which means OUR application code would call upload again if it encountered a TimeoutException. So again, if the SDK can just throw the TimeoutException instead of silently retrying the blob upload for us, that would be the preferred solution.
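The application-side policy described here (retry on 5XX or timeout, never on 4XX) could be sketched as a small decision helper; the class and method names are hypothetical:

```java
// Sketch of the retry policy described above: retry only on timeouts and
// server-side (5XX) errors; any 4XX, including the 409 discussed in this
// thread, is surfaced to the caller instead of being retried.
class RetryPolicy {
    static boolean shouldRetry(boolean timedOut, int statusCode) {
        if (timedOut) {
            return true; // the application retries TimeoutException itself
        }
        return statusCode >= 500 && statusCode < 600;
    }
}
```

Having the SDK throw the TimeoutException (rather than retrying internally) would let a policy like this stay in full control of when a request is re-sent.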