Open snalli opened 2 weeks ago
2024/08/26 23:06:23.586 WARN [HttpClientConnect] [reactor-http-epoll-6] [ambry-vcr] [] [6aa3dc58-5554, L:/10.183.58.25:44730 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.638 WARN [HttpClientConnect] [reactor-http-epoll-28] [ambry-vcr] [] [85ecb0bc-1814, L:/10.183.58.25:56026 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.690 WARN [HttpClientConnect] [reactor-http-epoll-16] [ambry-vcr] [] [5dc5e28c-6459, L:/10.183.58.25:51920 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.770 WARN [HttpClientConnect] [reactor-http-epoll-33] [ambry-vcr] [] [d779f744-1939, L:/10.183.58.25:35598 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/26 23:06:23.878 WARN [HttpClientConnect] [reactor-http-epoll-29] [ambry-vcr] [] [c7f1791e-1840, L:/10.183.58.25:35570 - R:ambryprodmainbackupwus2.blob.core.windows.net/10.4.49.4:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
@alzimmermsft is there a way to print out the full http request/response when a conn-reset is encountered ?
Unfortunately, we don't have a simple mechanism that enables full request-response logging when a connection reset error is encountered. There are some options available for logging that would always be on that could help in this situation.
Given this is the Netty implementation, you can configure a custom implementation where Netty enables logging:
HttpClient azureHttpClient = new NettyAsyncHttpClientBuilder(reactor.netty.http.client.HttpClient.create()
.wiretap("logger-name", LogLevel.INFO, AdvancedByteBufFormat.SIMPLE))
.build();
BlobServiceClient serviceClient = new BlobServiceClientBuilder()
.httpClient(azureHttpClient)
.endpoint("blob-service-endpoint")
.credential(auth)
.build();
This will add logging at the Netty level, logging at INFO level, with a simple formatting. This will log on all operations, meaning there will be a lot of logging but it should help troubleshoot. There are more advanced options here with configuring the logging handler directly, if needed.
Similarly, at a higher-level within Azure SDK code you can enable HTTP request-response logging: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/core/azure-core#http-request-and-response-logging
This too by default always logs, even if there isn't an error. When there is an error it will add the associated client request ID into the log to help correlate the request with the error.
Side question, when connection reset errors happen, is there a request retry occurring for the request?
@alzimmermsft IDK if a retry is happening as I have configured the azure-java-client to handle retries and it does not inform or log any retries. It only prints a failure - possibly after exhausting all retry attempts. How do I find out if retries are happening when a connection reset occurs ? Furthermore, what causes connection resets - azure machines restart, connection remains idle etc. ? How do we debug without a ton of logging ?
IDK if a retry is happening as I have configured the azure-java-client to handle retries and it does not inform or log any retries. It only prints a failure - possibly after exhausting all retry attempts. How do I find out if retries are happening when a connection reset occurs ?
If HTTP request and response logging is enabled (the link I shared) those logs will include which request attempt is being tried. There will be a log key-value of tryCount: <attempt number>
. If logging is also enabled for the retry policy it will log a message when all attempts are exhausted, the classes for these logs will either be com.azure.core.http.policy.RetryPolicy
or com.azure.storage.common.policy.RequestRetryPolicy
.
Furthermore, what causes connection resets - azure machines restart, connection remains idle etc. ?
There can be many reasons for this. Both of those you mentioned are possible, another possibility is when using a proxy the proxy connection may have a limited lifetime that is different than what is configured by the SDK client or the Azure service.
How do we debug without a ton of logging ?
If it's possible, I'd recommend getting a reproduction with logging even if it is a lot of logging. As the lead up to the connection resets may be the cause and the connection reset is just the symptom (but also possible for the connection reset to be the cause).
Are these connection reset logs just happening or are they resulting in errors during the application runtime?
It does lead to errors like this :
2024/08/27 19:46:52.540 WARN [HttpClientConnect] [reactor-http-epoll-31] [ambry-vcr] [] [5ea72d6b-505, L:/10.183.58.25:46036 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/27 19:46:52.540 ERROR [NettyAsyncHttpClient] [VcrReplicaThread-Intra-401-prod-lor1] [ambry-vcr] [] io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
2024/08/27 19:48:47.795 WARN [HttpClientConnect] [reactor-http-epoll-26] [ambry-vcr] [] [4d07a6f2-587, L:/10.183.58.25:45974 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error
java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds.
at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:200) ~[com.azure.azure-core-http-netty-1.13.7.jar:1.13.7]
at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:184) ~[com.azure.azure-core-http-netty-1.13.7.jar:1.13.7]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:406) [io.netty.netty-transport-classes-epoll-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final]
at java.lang.Thread.run(Thread.java:834) [?:?]
I have two suspicions
Option 2 has not worked. So for Option 1 - is there a timeout kicking that I can change ?
It does lead to errors like this :
2024/08/27 19:46:52.540 WARN [HttpClientConnect] [reactor-http-epoll-31] [ambry-vcr] [] [5ea72d6b-505, L:/10.183.58.25:46036 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error, the request cannot be retried as the headers/body were sent io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer 2024/08/27 19:46:52.540 ERROR [NettyAsyncHttpClient] [VcrReplicaThread-Intra-401-prod-lor1] [ambry-vcr] [] io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer 2024/08/27 19:48:47.795 WARN [HttpClientConnect] [reactor-http-epoll-26] [ambry-vcr] [] [4d07a6f2-587, L:/10.183.58.25:45974 - R:ambryprodmainbackupwus2.table.core.windows.net/10.4.49.5:443] The connection observed an error java.util.concurrent.TimeoutException: Channel response timed out after 60000 milliseconds. at com.azure.core.http.netty.implementation.AzureSdkHandler.responseTimedOut(AzureSdkHandler.java:200) ~[com.azure.azure-core-http-netty-1.13.7.jar:1.13.7] at com.azure.core.http.netty.implementation.AzureSdkHandler.lambda$startResponseTracking$2(AzureSdkHandler.java:184) ~[com.azure.azure-core-http-netty-1.13.7.jar:1.13.7] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:406) [io.netty.netty-transport-classes-epoll-4.1.94.Final.jar:4.1.94.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.94.Final.jar:4.1.94.Final] at java.lang.Thread.run(Thread.java:834) [?:?]
This is a client-side timeout happening. By default, we configure our HTTP clients to timeout after 60 seconds of inactivity on a request or response. This is a request timeout indicating that no bytes were written to the network request for 60 seconds, more information about these timeouts can be seen here:
https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/core/azure-core#http-timeouts
And that includes how you can configure these timeouts with both environment configuration changes and code changes.
This still doesn't answer the issue on the connection being reset being reset by the remote peer. If this is still something that can happen consistently it may be useful to get a Wireshark capture of the network traffic for these requests when the exception happens. Additionally, if an x-ms-client-request-id
can be captured for the request that fails it may be possible to track that to what is happening.
Hmm ok, will try wireshark
Describe the bug We are consistently seeing connection-reset errors from Azure SDK impacting our uploads to Azure
Exception or Stack Trace Add the exception log and stack trace if available
To Reproduce Steps to reproduce the behavior: None, happens no matter what we try
Code Snippet Add the code snippet that causes the issue. None, this is Azure SDK internal, not our application
Expected behavior A clear and concise description of what you expected to happen. Expected behaviour is to not have connection resets
Screenshots If applicable, add screenshots to help explain your problem.
Setup (please complete the following information):
If you suspect a dependency version mismatch (e.g. you see
NoClassDefFoundError
,NoSuchMethodError
or similar), please check out Troubleshoot dependency version conflict article first. If it doesn't provide solution for the problem, please provide:mvn dependency:tree -Dverbose
)Additional context Add any other context about the problem here.
Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report