Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.3k stars 1.96k forks source link

[BUG]`DataLakeFileClient.renameWithResponse` operation on Azure storage occasionally fails with `500` #39122

Open findinpath opened 6 months ago

findinpath commented 6 months ago

Describe the bug While performing a com.azure.storage.file.datalake.DataLakeFileClient.renameWithResponse call the operation sometimes fails without being able to trace its cause.

FYI The storage account on which we're performing the rename operation has the Hierarchical namespace enabled.

Exception or Stack Trace

Caused by: java.io.IOException: Rename file from abfs://***@***.dfs.core.windows.net/integrationtests/galaxy_objectstore_test11ermbmikg439w_26022024_0931/test-85f290a87e734cdcaac0e37493b9c6ef/_delta_log/00000000000000000001.json.d805ec93-f095-4040-a9f6-c016b29276c8.tmp to abfs://***@***.dfs.core.windows.net/integrationtests/galaxy_objectstore_test11ermbmikg439w_26022024_0931/test-85f290a87e734cdcaac0e37493b9c6ef/_delta_log/00000000000000000001.json failed
....
Caused by: com.azure.storage.file.datalake.models.DataLakeStorageException: Status code 500, "{"error":{"code":"InternalError","message":"Server encountered an internal error. Please try again after some time.\nRequestId:efc80591-101f-009e-3bfb-68f320000000\nTime:2024-02-26T21:33:40.3567187Z"}}"
    at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:733)
    at com.azure.core.implementation.MethodHandleReflectiveInvoker.invokeWithArguments(MethodHandleReflectiveInvoker.java:35)
    at com.azure.core.implementation.http.rest.ResponseExceptionConstructorCache.invoke(ResponseExceptionConstructorCache.java:51)
    at com.azure.core.implementation.http.rest.RestProxyBase.instantiateUnexpectedException(RestProxyBase.java:356)
    at com.azure.core.implementation.http.rest.AsyncRestProxy.lambda$ensureExpectedStatus$1(AsyncRestProxy.java:128)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
    at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2400)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:2196)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.onSubscribe(Operators.java:2070)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onSubscribe(FluxMapFuseable.java:96)
    at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:55)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:129)
    at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:129)
    at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
    at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1839)
    at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:292)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
    at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1839)
    at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:292)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
    at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1839)
    at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:292)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
    at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1839)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)
    at reactor.core.publisher.FluxDelaySubscription$DelaySubscriptionMainSubscriber.onNext(FluxDelaySubscription.java:189)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:129)
    at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1839)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)
    at reactor.core.publisher.MonoCreate$DefaultMonoSink.success(MonoCreate.java:172)
    at com.azure.core.http.okhttp.OkHttpAsyncHttpClient$OkHttpCallback.onResponse(OkHttpAsyncHttpClient.java:270)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
    ... 3 more
    Suppressed: java.lang.Exception: #block terminated with an error
        at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:100)
        at reactor.core.publisher.Mono.block(Mono.java:1742)
        at com.azure.storage.common.implementation.StorageImplUtils.blockWithOptionalTimeout(StorageImplUtils.java:147)
        at com.azure.storage.file.datalake.DataLakeFileClient.renameWithResponse(DataLakeFileClient.java:1384)
        at io.trino.filesystem.azure.AzureFileSystem.renameGen2File(AzureFileSystem.java:223)
        at io.trino.filesystem.azure.AzureFileSystem.renameFile(AzureFileSystem.java:209)

To Reproduce Steps to reproduce the behavior:

Code Snippet https://github.com/trinodb/trino/blob/eec77b60301a8a483cf9f20a5770f85ebf9f9be5/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/writer/AzureTransactionLogSynchronizer.java#L50-L55

https://github.com/trinodb/trino/blob/eec77b60301a8a483cf9f20a5770f85ebf9f9be5/lib/trino-filesystem-azure/src/main/java/io/trino/filesystem/azure/AzureFileSystem.java#L210-L232

Expected behavior I expect the rename operation to complete always successfully

Screenshots If applicable, add screenshots to help explain your problem.

Setup (please complete the following information):

If you suspect a dependency version mismatch (e.g. you see NoClassDefFoundError, NoSuchMethodError or similar), please check out Troubleshoot dependency version conflict article first. If it doesn't provide solution for the problem, please provide:

Additional context Add any other context about the problem here.

Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

joshfree commented 6 months ago

@ibrahimrabab @alzimmermsft could you please follow up

alzimmermsft commented 6 months ago

Thanks for filing this issue @findinpath, do you know if the SDK attempted retries when this status was returned by the service? 500 status codes are retried by default.

If retries are being attempted and exhausted, I'm not sure there is much the SDK could do in this space as this is the service repeatedly returned a service error status code. Options moving forward in this case would be modifying the retry options to find a set that will handle this as required (ex, more attempts, different backing off strategies, etc).

findinpath commented 6 months ago

do you know if the SDK attempted retries when this status was returned by the service?

@alzimmermsft here is the code we used to build the client

https://github.com/trinodb/trino/blob/eec77b60301a8a483cf9f20a5770f85ebf9f9be5/lib/trino-filesystem-azure/src/main/java/io/trino/filesystem/azure/AzureFileSystem.java#L486-L495

By looking into the azure-sdk-for-java code i noticed in https://github.com/Azure/azure-sdk-for-java/blob/21e4d0f0d859ad4564c88b0aec55d6c5bf0bfb02/sdk/storage/azure-storage-common/src/main/java/com/azure/storage/common/implementation/BuilderUtils.java that we use indeed the default retry mechanism:

https://learn.microsoft.com/en-us/azure/storage/blobs/storage-retry-policy#configure-retry-options

The fact that the error coming back is Server encountered an internal error. Please try again after some time. made me assume that we're dealing in fact with a Azure REST API issue.

Options moving forward in this case would be modifying the retry options to find a set that will handle this as required (ex, more attempts, different backing off strategies, etc).

Why don't we have then higher values for the default retry policy in azure-sdk-for-java? I'm not sure what values should a user of Azure REST API is supposed to set.

alzimmermsft commented 5 months ago

@findinpath nothing appears wrong in the way the client is configured.

As for the question around default retry values, there are a few levels to this. There are what is called out in the document you linked (which is for the .NET SDKs but the Java SDKs are effectively the same), where there is a baseline backoff, backoff strategy, maximum backoff, and retry count. The backoff time and strategy can be overridden based on a few HTTP headers (Retry-After, retry-after-ms, x-ms-retry-after-ms) if the service returns them, which would align better with expectations of the service. Unfortunately, the backing off strategy is fairly generic by default given the requirement to work with many different services and scenarios, which is why the service returning a backoff is the best choice but not always available.

The retrying options are left easy to override in a service by passing RetryOptions, or in the case of Storage RequestRetryOptions, with more appropriate values for your scenario. And unfortunately, this is another case where I'm not in a good spot to offer options as this is such a variable problem, but I can recommend an initial investigation into either a larger maximum back off or more retries.

@ibrahimrabab could you follow up further by looking into if there are certain reasons a 500 response would come back frequently? Given the amount of integration, performance, and stress testing we do I'd have expected us to see this more frequently but we haven't.