Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License

[BUG] Connection reset by Azure #41647

Open snalli opened 2 weeks ago

snalli commented 2 weeks ago

Describe the bug: We are consistently seeing connection-reset errors from the Azure SDK that impact our uploads to Azure.

Exception or Stack Trace:

2024/08/26 23:06:23.586 WARN [HttpClientConnect] [reactor-http-epoll-6] [ambry-vcr] [] [6aa3dc58-5554, L:/10.183.58.25:44730 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

To Reproduce: None; this happens no matter what we try.

Code Snippet: None; this occurs inside the Azure SDK, not in our application code.

Expected behavior: No connection resets.



snalli commented 2 weeks ago
2024/08/26 23:06:23.586 WARN [HttpClientConnect] [reactor-http-epoll-6] [ambry-vcr] [] [6aa3dc58-5554, L:/10.183.58.25:44730 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.638 WARN [HttpClientConnect] [reactor-http-epoll-28] [ambry-vcr] [] [85ecb0bc-1814, L:/10.183.58.25:56026 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.690 WARN [HttpClientConnect] [reactor-http-epoll-16] [ambry-vcr] [] [5dc5e28c-6459, L:/10.183.58.25:51920 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.770 WARN [HttpClientConnect] [reactor-http-epoll-33] [ambry-vcr] [] [d779f744-1939, L:/10.183.58.25:35598 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.878 WARN [HttpClientConnect] [reactor-http-epoll-29] [ambry-vcr] [] [c7f1791e-1840, L:/10.183.58.25:35570 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
snalli commented 2 weeks ago

@alzimmermsft is there a way to print out the full HTTP request/response when a connection reset is encountered?

alzimmermsft commented 2 weeks ago

Unfortunately, we don't have a simple mechanism that enables full request/response logging only when a connection reset error is encountered. There are, however, some always-on logging options that could help in this situation.

Given this is the Netty implementation, you can pass in a custom Reactor Netty client with wiretap logging enabled:

// Imports assumed by this snippet:
import com.azure.core.http.HttpClient;
import com.azure.core.http.netty.NettyAsyncHttpClientBuilder;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import io.netty.handler.logging.LogLevel;
import reactor.netty.transport.logging.AdvancedByteBufFormat;

// Wrap a Reactor Netty client that has wiretap logging enabled.
HttpClient azureHttpClient = new NettyAsyncHttpClientBuilder(reactor.netty.http.client.HttpClient.create()
    .wiretap("logger-name", LogLevel.INFO, AdvancedByteBufFormat.SIMPLE))
    .build();

BlobServiceClient serviceClient = new BlobServiceClientBuilder()
    .httpClient(azureHttpClient)
    .endpoint("blob-service-endpoint")
    .credential(auth)
    .buildClient();

This adds wire-level logging inside Netty at INFO level with simple formatting. It logs on all operations, so there will be a lot of output, but it should help with troubleshooting. More advanced options are available by configuring the logging handler directly, if needed.

Similarly, at a higher-level within Azure SDK code you can enable HTTP request-response logging: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/core/azure-core#http-request-and-response-logging

This also logs on every request by default, even when there isn't an error. When there is an error, it adds the associated client request ID to the log to help correlate the request with the error.
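As a minimal sketch of enabling that higher-level request/response logging on the blob client (the endpoint and `auth` credential are placeholders, and `HEADERS` is one of several available detail levels):

```java
import com.azure.core.http.policy.HttpLogDetailLevel;
import com.azure.core.http.policy.HttpLogOptions;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;

// Log request/response headers (including try counts) for every operation.
BlobServiceClient serviceClient = new BlobServiceClientBuilder()
    .endpoint("blob-service-endpoint") // placeholder endpoint
    .credential(auth)                  // placeholder credential
    .httpLogOptions(new HttpLogOptions().setLogLevel(HttpLogDetailLevel.HEADERS))
    .buildClient();
```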

Side question, when connection reset errors happen, is there a request retry occurring for the request?

snalli commented 2 weeks ago

@alzimmermsft I don't know if a retry is happening, as I have configured the azure-java-client to handle retries and it does not inform or log any retries. It only prints a failure, possibly after exhausting all retry attempts. How do I find out if retries are happening when a connection reset occurs? Furthermore, what causes connection resets: Azure machines restarting, the connection remaining idle, etc.? How do we debug without a ton of logging?

alzimmermsft commented 2 weeks ago

IDK if a retry is happening as I have configured the azure-java-client to handle retries and it does not inform or log any retries. It only prints a failure - possibly after exhausting all retry attempts. How do I find out if retries are happening when a connection reset occurs ?

If HTTP request and response logging is enabled (the link I shared), those logs include which request attempt is being tried, as a log key-value of tryCount: <attempt number>. If logging is also enabled for the retry policy, it logs a message when all attempts are exhausted; the classes for these logs will be either com.azure.core.http.policy.RetryPolicy or com.azure.storage.common.policy.RequestRetryPolicy.
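One way to surface those SDK logs, assuming no SLF4J provider is on the classpath (the SDK then falls back to its built-in console logger), is to raise the verbosity with an environment variable:

```shell
# Raise the Azure SDK fallback console logger to verbose.
# Only applies when no SLF4J provider is present; with SLF4J/Logback,
# raise the level on the RetryPolicy/RequestRetryPolicy loggers instead.
export AZURE_LOG_LEVEL=verbose
```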

Furthermore, what causes connection resets - azure machines restart, connection remains idle etc. ?

There can be many reasons for this. Both of the ones you mentioned are possible; another is that, when using a proxy, the proxy connection may have a limited lifetime that differs from what is configured by the SDK client or the Azure service.

How do we debug without a ton of logging ?

If it's possible, I'd recommend getting a reproduction with logging, even if it is a lot of logging, as the lead-up to the connection resets may be the cause and the connection reset just the symptom (though it's also possible the connection reset itself is the cause).

Are these connection reset logs just happening or are they resulting in errors during the application runtime?

snalli commented 2 weeks ago

It does lead to errors like this :

2024/08/27 19:46:52.540 WARN [HttpClientConnect] [reactor-http-epoll-31] [ambry-vcr] [] [5ea72d6b-505, L:/10.183.58.25:46036 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/27 19:46:52.540 ERROR [NettyAsyncHttpClient] [VcrReplicaThread-Intra-401-prod-lor1] [ambry-vcr] [] io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/27 19:48:47.795 WARN [HttpClientConnect] [reactor-http-epoll-26] [ambry-vcr] [] [4d07a6f2-587, L:/10.183.58.25:45974 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds.
        at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:200) ~[com.azure.azure-core-http-netty-1.13.7.jar:1.13.7]
        at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:184) ~[com.azure.azure-core-http-netty-1.13.7.jar:1.13.7]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:406) [io.netty.netty-transport-classes-epoll-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
snalli commented 2 weeks ago

I have two suspicions

  1. Some timeout is kicking in, causing the connection to terminate - either on our side or Azure's?
  2. We are streaming content to Azure that we receive from another machine. Perhaps we could buffer the content in memory before sending it out to Azure.
snalli commented 2 weeks ago

Option 2 has not worked. So for Option 1: is there a timeout kicking in that I can change?

alzimmermsft commented 2 weeks ago

It does lead to errors like this :

2024/08/27 19:46:52.540 WARN [HttpClientConnect] [reactor-http-epoll-31] [ambry-vcr] [] [5ea72d6b-505, L:/10.183.58.25:46036 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/27 19:46:52.540 ERROR [NettyAsyncHttpClient] [VcrReplicaThread-Intra-401-prod-lor1] [ambry-vcr] [] io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/27 19:48:47.795 WARN [HttpClientConnect] [reactor-http-epoll-26] [ambry-vcr] [] [4d07a6f2-587, L:/10.183.58.25:45974 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds.

This is a client-side timeout. By default, we configure our HTTP clients to time out after 60 seconds of inactivity on a request or response; the stack trace here shows a response timeout, meaning no response bytes were read from the network for 60 seconds. More information about these timeouts can be found here:

https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/core/azure-core#http-timeouts

That page also covers how to configure these timeouts through both environment configuration and code changes.
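As a minimal sketch of the code route, using the timeout setters on NettyAsyncHttpClientBuilder (the 120-second values are placeholders, not recommendations):

```java
import java.time.Duration;

import com.azure.core.http.HttpClient;
import com.azure.core.http.netty.NettyAsyncHttpClientBuilder;

// Raise the per-operation timeouts above their 60-second defaults.
HttpClient httpClient = new NettyAsyncHttpClientBuilder()
    .connectTimeout(Duration.ofSeconds(10))
    .writeTimeout(Duration.ofSeconds(120))    // inactivity while writing the request body
    .responseTimeout(Duration.ofSeconds(120)) // waiting for the first response byte
    .readTimeout(Duration.ofSeconds(120))     // inactivity while reading the response body
    .build();
```

The resulting client would then be passed to the service client builder via httpClient(...), as in the earlier wiretap example.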

This still doesn't answer why the connection is being reset by the remote peer. If this is still happening consistently, it may be useful to get a Wireshark capture of the network traffic for these requests when the exception happens. Additionally, if an x-ms-client-request-id can be captured for the failing request, it may be possible to trace what happened to it.
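For the capture, one possible way to record that traffic for later Wireshark analysis (the interface name and output path are assumptions to adjust; requires root privileges):

```shell
# Capture TCP traffic to/from the storage endpoint for offline analysis in Wireshark.
# eth0 and the output path are placeholders for this host.
sudo tcpdump -i eth0 -w /tmp/azure-conn-reset.pcap \
  'tcp port 443 and host ambryprodmainbackupwus2.blob.core.windows.net'
```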

snalli commented 1 week ago

Hmm ok, will try Wireshark