dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET Core 2.1 "Connection reset by peer" errors #27401

Closed ryanpagel closed 4 years ago

ryanpagel commented 6 years ago

We recently switched from .NET Core 2.0 to 2.1 and have begun to see a significant number of "connection reset by peer" errors. Our APIs make a lot of outbound calls to other APIs, and that is where the errors occur. They appear to be caused by TCP RST packets breaking our connections. We are using HttpClientFactory.CreateClient() to create a client and are not doing anything else out of the ordinary. We noticed that many properties on SocketsHttpHandler are not documented, so we are not sure which knobs to turn. Turning off SocketsHttpHandler with AppContext.SetSwitch("System.Net.Http.UseSocketsHttpHandler", false); seems to have resolved the issues, but it doesn't explain why they occur with SocketsHttpHandler in the first place. I'm sorry if this is unclear. I'm happy to provide more information, debug traces, etc. if that would help.

We are running on the microsoft/dotnet:2.1-aspnetcore-runtime Docker image so we are on the latest minor version.
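For readers who want to try the same workaround, here is a minimal sketch of the switch described above. The `HandlerSwitch` class and its method names are hypothetical helpers introduced for illustration; only `AppContext.SetSwitch`, `AppContext.TryGetSwitch`, and the switch name come from the thread or the base class library. It assumes the switch is set at startup, before any `HttpClient` is created.

```csharp
using System;

public static class HandlerSwitch
{
    // Switch name quoted in the comment above.
    public const string SwitchName = "System.Net.Http.UseSocketsHttpHandler";

    // Assumption: call this once at startup, before any HttpClient exists.
    public static void DisableSocketsHttpHandler()
    {
        AppContext.SetSwitch(SwitchName, false);
    }

    // True only if the switch has been set explicitly and is false.
    public static bool IsSocketsHttpHandlerDisabled()
    {
        return AppContext.TryGetSwitch(SwitchName, out bool enabled) && !enabled;
    }
}
```

Calling `HandlerSwitch.DisableSocketsHttpHandler()` as the first statement in `Main` would be one way to apply it.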

davidfowl commented 6 years ago

Are you fully asynchronous? Is this an ASP.NET Core application? Have you seen increased request latency?

ryanpagel commented 6 years ago

Yes, we make all of our calls asynchronous. This is an ASP.NET Core app. We're still running our .NET Core migration of the site in a "beta" environment so it is not under load yet but there doesn't appear to be any increased latency from our initial testing.

davidfowl commented 6 years ago

Is this on Windows or Linux?

ryanpagel commented 6 years ago

This is Linux running in AWS.

ryanpagel commented 6 years ago

After a bit of testing with the SocketsHttpHandler turned off, it appears that we are getting similar errors although the exact exception is a bit different:

An error occurred while sending the request. System.Private.CoreLib

Failure when receiving data from the peer System.Net.Http
   at System.Net.Http.CurlHandler.ThrowIfCURLEError(CURLcode error)
   at System.Net.Http.CurlHandler.MultiAgent.FinishRequest(StrongToWeakReference`1 easyWrapper, CURLcode messageResult)

We have a non-Linux version of this site (.NET Fx 4.7.1) that is hitting the same APIs and doesn't produce any errors like this.

ryanpagel commented 6 years ago

I think I may have found the issue. The default MaxIdleTime for an outgoing HTTP request in .NET Core 2.0 was 100 seconds. The default with SocketsHttpHandler is 120 seconds. We're hitting an IIS server backend that has a default idle connection timeout of 120 seconds. I bumped our SocketsHttpHandler.PooledConnectionIdleTimeout down to 15 seconds and that seems to have solved it.
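One plausible reading of why this could matter: if the client keeps an idle pooled connection for as long as the server does, it may reuse a connection at the moment the server is closing it, and the request is then answered with a reset. A shorter client-side idle timeout makes the client drop the connection first. The sketch below shows the setting on a plain `HttpClient` without dependency injection. The `PooledClient` class is a hypothetical name, and the 15-second value is simply the one reported above, not a recommendation. Note that later in this thread the author reports the errors did not fully go away, so treat this as a mitigation to try.

```csharp
using System;
using System.Net.Http;

public static class PooledClient
{
    // Builds a handler that discards idle pooled connections after idleTimeout,
    // ideally well below the server's own idle timeout (120 s for the IIS
    // backend described in this thread).
    public static SocketsHttpHandler CreateHandler(TimeSpan idleTimeout)
    {
        return new SocketsHttpHandler
        {
            PooledConnectionIdleTimeout = idleTimeout
        };
    }

    // Assumption: 15 seconds, the value reported in the comment above.
    public static HttpClient CreateClient()
    {
        return new HttpClient(CreateHandler(TimeSpan.FromSeconds(15)));
    }
}
```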

davidfowl commented 6 years ago

This might be something worth documenting.

ryanpagel commented 6 years ago

It would be good to document the default values for PooledConnectionIdleTimeout and PooledConnectionLifetime, what each of them does, and how to modify them, like this:

services.AddHttpClient("default")
    .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler()
    {
        PooledConnectionIdleTimeout = TimeSpan.FromSeconds(15),
        PooledConnectionLifetime = TimeSpan.FromSeconds(60)
    });

Also, there is an extension method called SetHandlerLifetime(...), which I believe does the same as setting PooledConnectionLifetime as above. It would be good to document that as well.

karelz commented 6 years ago

We plan to document SocketsHttpHandler APIs. It is on our list. Note that SetHandlerLifetime is part of ASP.NET wrappers, so it should be raised there as well.

caesar-chen commented 5 years ago

Note that there is an issue in the API docs repo: https://github.com/dotnet/dotnet-api-docs/issues/843

caesar-chen commented 5 years ago

Merged in https://github.com/dotnet/dotnet-api-docs/pull/1546

karelz commented 5 years ago

Thanks @caesar1995 for making the change!

khaledsobhy83 commented 5 years ago

Can anyone explain why these differences in timeout would cause this exception? I am having exactly the same issue and want to make sure I am applying the correct fix. @ryanpagel

ryanpagel commented 5 years ago

@khaledsobhy83 Honestly, I gave up trying to find an elegant way to solve this problem and never could get the errors to go away. What we did was to implement low-level HTTP retries when we get back certain error codes. We're using the .NET Polly library and it seems to be working.
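The retry approach described above can be sketched without any external package. This is not Polly itself and not the author's actual code; it is a hand-rolled stand-in that retries on `HttpRequestException` with a linearly growing delay. The `RetryHelper` class, its parameters, and the choice of exception type are assumptions made for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RetryHelper
{
    // Runs operation up to maxAttempts times. When an attempt throws
    // HttpRequestException and attempts remain, waits baseDelay * attempt
    // and tries again. The exception from the final attempt propagates.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation, int maxAttempts, TimeSpan baseDelay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Linear backoff: baseDelay, 2 * baseDelay, ...
                var delay = TimeSpan.FromTicks(baseDelay.Ticks * attempt);
                await Task.Delay(delay).ConfigureAwait(false);
            }
        }
    }
}
```

A call might look like `await RetryHelper.ExecuteAsync(() => client.GetStringAsync(url), 3, TimeSpan.FromMilliseconds(200));`, where `client` and `url` are placeholders. Retrying is generally only safe for idempotent requests, and the delegate should build a fresh request on every attempt because an `HttpRequestMessage` cannot be sent twice.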

InCerryGit commented 5 years ago

If your Linux kernel version is below 4.14.36, you can try updating the kernel. More detail: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.14.37&id=b8d4055372b58aad4a51b67e176eabdcc238fde3

If you use nginx or other load-balancing software, check that the Connection: keep-alive HTTP header is being sent. You can add it like this: httpClient.DefaultRequestHeaders.Connection.ParseAdd("keep-alive");