dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.66k forks source link

Address "System.Net.Sockets.SocketException: Address already in use" on K8S/Linux using HttpClient/TCP #29327

Closed yuezengms closed 4 years ago

yuezengms commented 5 years ago

~Assumption: Duplicate of dotnet/runtime#27274 which was fixed by dotnet/corefx#32046 - goal: Port it (once confirmed it is truly duplicate).~ This is HttpClient/TCP spin off. UdpClient is covered fully by dotnet/runtime#27274.

Issue Title

"System.Net.Sockets.SocketException: Address already in use" on Linux

General

Our .net core(v 2.2.0) services are running on Azure Kubernettes Linux environment. Recently we experimenced a lot of error "System.Net.Http.HttpRequestException: Address already in use" while calling dependencies, e.g. Active Directory, CosmosDB and other services. Once the issue started, we kept getting the same errors and had to restart the service to get rid of it. Our http clients are using DNS address, not specific ip and port. The following is the call stack on one example. What can cause such issues and how to fix it?

System.Net.Http.HttpRequestException: Address already in use ---> 
System.Net.Sockets.SocketException: Address already in use#N#   
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)#N#   --- 
End of inner exception stack trace ---#N#   

at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)#N#   
at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)#N#   
at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)#N#   
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)#N#   
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)#N#   
at System.Net.Http.DiagnosticsHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)#N#   
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.HttpClientWrapper.GetResponseAsync()#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.AdalHttpClient.GetResponseAsync[T](Boolean respondToDeviceAuthChallenge)#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Http.AdalHttpClient.GetResponseAsync[T]()#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.SendHttpMessageAsync(IRequestParameters requestParameters)#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.SendTokenRequestAsync()#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.CheckAndAcquireTokenUsingBrokerAsync()#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.Internal.Flows.AcquireTokenHandlerBase.RunAsync()#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext.AcquireTokenForClientCommonAsync(String resource, ClientKey clientKey)#N#   
at Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext.AcquireTokenAsync(String resource, ClientCredential clientCredential)#N#   
karelz commented 5 years ago

This is AWESOME! Thanks a lot @antonioortizpola, fingers crossed that we will be now able to quickly root-cause it and fix it in 3.0/2.2! 🙏

antonioortizpola commented 5 years ago

@wfurt, were you able to reproduce the scenario? just to know if the repo code worked for you, or if there is something more that I can help with

wfurt commented 5 years ago

I'm still diffing through @antonioortizpola . I got services up and I could see

root@9262db4be2d7:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.11:42827        0.0.0.0:*               LISTEN
tcp        0      0 172.18.0.3:43294        5.153.231.4:80          TIME_WAIT
tcp        0      0 172.18.0.3:43286        5.153.231.4:80          TIME_WAIT
tcp        0      0 172.18.0.3:34212        151.101.52.204:80       TIME_WAIT
tcp        0      0 172.18.0.3:34218        151.101.52.204:80       TIME_WAIT
tcp        0      0 172.18.0.3:34214        151.101.52.204:80       TIME_WAIT
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:34452        0.0.0.0:*

but if I wait a little bit, all the connections go away.

root@9262db4be2d7:/app# netstat -natu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.11:42827        0.0.0.0:*               LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
udp        0      0 127.0.0.11:34452        0.0.0.0:*

I also could not find any direct usage of HttpClient. Everything seems to be wrapped in some high-level calls so I'll need to unwind that.

antonioortizpola commented 5 years ago

@wfurt, did you run over the 3 version or 2.1? that is the behavior that i got from the 2.1 version. In the core 3 version the sockets would never exit the CLOSE_WAIT state, even an hour after the test ended.

jarlehal commented 5 years ago

a little confused by the comment from @karelz. the fix should be applied to 3.0/2.2, as 2.1 does not have this socket issue, but the memory thing that should be fixed/merged in 2.2

karelz commented 5 years ago

@jarlehal typo, fixed, thanks for pointing it out.

stephentoub commented 5 years ago

@wfurt has been doing a good job digging into this, and shared with me that he noticed something suspicious, that in a repro when analyzing it with SOS there ended up being a small number of Sockets on the heap but a large number of SafeSocketHandles. Based on that, I have a theory that this is due to https://github.com/dotnet/corefx/pull/32845 / https://github.com/dotnet/corefx/pull/32793. I don’t think it actually caused the problem so much as the bug it was fixing was actually masking this problem that’s existed for a long time.

SocketsHttpHandler creates a Socket for each connection. Each Socket creates a SafeSocketHandle (that’s its name in 3.0; prior to that it was internal and named SafeCloseSocket), a SafeHandle that wraps the underlying file descriptor (there’s actually a secondary SafeHandle in the middle, but that’s not relevant). On Unix, when the Socket is connected, it’s registered with the SocketAsyncEngine, which is the code responsible for running the event loop interacting with the epoll handle. Whenever the epoll wait shows that there’s work available to be done, the event loop maps the relevant file descriptor back to the appropriate SafeSocketHandle so that the relevant work can be performed and callbacks invoked. In order to do that mapping, the SocketAsyncEngine stores a ConcurrentDictionary<IntPtr, SafeSocketHandle>, and the engines themselves are stored in a static readonly SocketAsyncEngine[] array… the punchline here is that these SafeSocketHandles end up being strongly rooted by a static array.

The other important piece of information is that there’s a Timer inside SocketsHttpHandler that runs periodically to check whether connections in the connection pool are still viable, and if they’re not, Dispose’s of them. The bug that the aforementioned issues fixed was that there was an unexpected cycle formed between the timer and the connection pool that ended up keeping everything alive indefinitely, resulting in a non-trivial memory leak. However, as a side effect of that leak, it meant that the timer would continue to run, and every time it fired, it would loop through all of the open connections and Dispose of the ones that were no longer viable. In the fullness of time, all of them would get Dispose’d. Disposing of the connection would dispose of the Socket which would Dispose of the SafeSocketHandle and remove it from the SocketAsyncEngine’s dictionary.

Now, with the aforementioned fixes, if code fails to Dispose of the HttpClient/SocketsHttpHandler when done with them and drops the references to them, the timer gets collected, as does the connection pool, as do all of the HttpConnection objects in the pool. None of those have finalizers, nor should they need them. But here’s the rub. Socket does have a finalizer, yet its finalizer ends up being a nop. Since the storing of the SafeSocketHandle into the static dictionary isn’t something that can be undone automatically by GC, we actually need a finalizer to remove that registration should everything get dropped. Since all those objects don’t have finalizers, and since Socket’s finalizer isn’t doing the unregistration, everything gets collected above the SafeSocketHandle, which then remains registered effectively forever, never being disposed of, and never closing its file descriptor.

I don’t know for certain whether this is the cause of this issue. It’s just a theory, and @wfurt is working through the repro, debugging, and testing out theories. If this doesn’t turn out to be the root cause here, I suspect it’s still a bug we need to fix. If it does turn out to be the root cause, I don’t think the fix is to revert the aforementioned fixes: they were valid, they just revealed this existing problem they had been masking by creating a different leak that in turn allowed the timer to dispose of these resources... plus, this issue would apply to all uses of Sockets that weren't disposed of, not just those used from SocketsHttpHandler. The actual fix would likely be to either use a weak reference when storing the SafeSocketHandle into the dictionary (which might be the right fix but could also potentially cause perf or otherwise unforeseen problems), or to ensure that a finalizer is put in place to undo that registration (most likely changing Socket’s finalizer accordingly on Unix).

In the meantime, assuming this is the cause, in addition to fixing it in System.Net.Sockets, code using HttpClient/HttpClientHandler/SocketsHttpHandler should also be Dispose'ing of those instances when done with them. If you just create a single HttpClient instance that's stored in a static, there's no real need to dispose of it, as everything will go away when the process ends. But if you're doing something that creates an instance, uses it for one or more requests, and then get rid of it, when getting rid of it it should be disposed.

cc: @geoffkizer, @tmds

wfurt commented 5 years ago

I'm making some progress. On the note above: If you add cenamOperationClient.Close() to GetSubscriberDetailsF() after response is received to close wfc, there are no lingering sockets at all @antonioortizpola With old platform handlers, socket can be closed independently but now usual reference counting works and disposing HttpClient when not used can lead to delayed release of resources. So as any reference to HttpResponseStream would keep underlying socket open. I think there is definitely issue with 2.2+ but there can be more than one reason for observed behavior.

vasicvuk commented 5 years ago

We have a same issue. All ports created with IHttpClientFactory stays in CLOSE_WAIT state

stephentoub commented 5 years ago

All ports created with IHttpClientFactory stays in CLOSE_WAIT state

@glennc, @rynowak, is HttpClientFactory disposing of all handlers it creates?

@vasicvuk, are you disposing of all HttpResponseMessages you're given and response Streams you're given?

antonioortizpola commented 5 years ago

@wfurt, thanks a lot for your investigation! I will change the code so the service closes the connection.

I am glad the repository could help to replicate the problem and I hope it could help others.

Please correct me if I am wrong, but I think the main problems are:

For the first group, please make sure you are using HttpClient correctly, most probably, it will fix the problem and improve your system.

For the second group, search for methods that could close the connection or IDisposable interfaces, and make tests monitoring your sockets (like with netstat -natu) to check if it can help fix the problem. Or check if you can reuse your connections.

If the problem persist tell us how are you using the client or socket or library, and if possible create a simple repository with a reproduction case.

wfurt commented 5 years ago

The repro was extremely useful, thanks @antonioortizpola.

In either case we should not leak OS resource and we do right now in some cases with 2.2 code. Disposing explicitly is best as everything is released when not needed. In the other case socket can stay opened until GC kicks in and that may take some time depending on many variables.

stephentoub commented 5 years ago

I think the main problems are

There are two issues here:

  1. There's a bug in Sockets on Unix where if you allow a connected Socket to be collected without it having been disposed, it'll leak the underlying handle.
  2. Consumers of HttpClient are sometimes not using it correctly, leading to the above bug getting triggered.

Fixing either of those is technically sufficient to address this issue, although even when (1) is fixed, it's important for (2) to be done, as the fix for (1) is still non-deterministic and could take an unbounded amount of time to kick-in.

vasicvuk commented 5 years ago

@stephentoub Hi, We are using HttpClientFactory and using for HttpResponseMessages. So as i understand issue is that socket is not beign disposed after some time if there is no strict dispose in code. I guess that using block will fix this?

tmds commented 5 years ago

I guess that using block will fix this?

yes, a using block causes Dispose to be called.

wfurt commented 5 years ago

I submitted fix to 3.0 master. It would be great if anybody can grab daily build and verify that it solves their issue.

Since this is somewhat generic error, there may be more than one issue under the cover. In either case any feedback would be useful.

kudos to @antonioortizpola who was able to isolate repro.

jarroda commented 5 years ago

Will it be possible to get this fix in 2.2?

wfurt commented 5 years ago

possibly. It will be easier to get approval if we can confirm that this fix solves observed issues. e.g try 3.0 before and after.

anilpras commented 5 years ago

For us having exact issue symptoms isolated due to the application code.

alxbog commented 5 years ago

This issue is making big troubles for us in production apps.

We reviewed all HttpClient usages in our code and migrated to HttpClientFactory. But it still happens from time to time.

The interesting thing is that we're having 2 similar apps (they use HttpClient the same way) on Azure Kubernetes and GKE (Google's Kubernetes) and the issue only happens on Azure.

Any plans to fix it in 2.2?

wfurt commented 5 years ago

2.2 port is waiting for verification @alxbog. We need to get enough evidence that dotnet/corefx#38499 fixes it or we need separate repro for 2.2. Until then it is unlikely we get permission for 2.x changes.

karelz commented 5 years ago

If someone is willing to try a private build with the fix ported to 2.2/2.1, that would help us prove it is worth porting. If you are that person let us know and we can provide privates with the change ported for testing ...

phillijw commented 5 years ago

Can anyone confirm if this documentation is correct? I've seen some comments suggesting we should be disposing HttpClients now even though the docs pretty much say the opposite: https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-3.0#httpclient-and-lifetime-management

karelz commented 5 years ago

@phillijw note that closed issues are not monitored. The docs you linked are for IHttpClientFactory (part of ASP.NET code). In general, you should not dispose HttpClient, but if you want to avoid stale DNS records problem, it is healthy to recycle your static instance on regular basis (HttpClientFactory does it for you).

stephentoub commented 5 years ago

Can anyone confirm if this documentation is correct? I've seen some comments suggesting we should be disposing HttpClients now even though the docs pretty much say the opposite

The short answer is: HttpClient is an IDisposable, and as with any IDisposable, you should Dispose of it whenever you're done with it.

The question then becomes "when should I be done with it?"

If you're creating your own HttpClient instance, e.g. new HttpClient() or new HttpClient(new SocketsHttpHandler()), then you should ideally be reusing the instance over and over and over, rather than creating a new one per request. That's because the underlying handler owns the connection pool, disposing of the handler will dispose of the connection pool, and the aformentioned constructors end up using the public HttpClient(HttpMessageHandler handler, bool disposeHandler) constructor with disposeHandler:true. You still want to Dispose of the HttpClient when you're done with it, but in the common case you shouldn't ever be "done" with it, as you just stash it into a static field and use it for all subsequent requests. If you do decide to be done with it at some point, such as if you want to replace it with a different instance for some reason, then you'd want to Dispose of it then. Again, in this way, it's no different from any other IDisposable: it owns resources, dispose of it when you no longer need those resources.

However, the docs you link to are for IHttpClientFactory. It muddies the waters a little, because it maintains and manages its own set of HttpMessageHandler instances, and its design is to give you back a new HttpClient instance every time you ask for one: the intent here is that you use that HttpClient for the lifetime of your request, at which point you're "done" with it, and as such per my previous comments at which point it should be disposed. That instance wraps one of these shared HttpMessageHandler instances, but was constructed with disposeHandler:false argument: https://github.com/aspnet/Extensions/blob/c2147ae6a07c5ebf6aa6ef2f8de86e0851fc13ca/src/HttpClientFactory/Http/src/DefaultHttpClientFactory.cs#L117-L134 such that disposing of the HttpClient won't dispose of the underlying shared handler.

phillijw commented 5 years ago

Thanks for the explanation @stephentoub. The clarity on disposeHandler is what I was really missing. I feel like the examples on the docs could be updated to maybe discuss that point a bit more or to show examples where the http client DOES get disposed. For instance, https://docs.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-3.0#use-ihttpclientfactory-in-a-console-app does not dispose even though it fits your example of where it should be disposed, I think.

stephentoub commented 5 years ago

does not dispose even though it fits your example of where it should be disposed

Yes, the client in the GetPage method in that sample should be disposed. Thanks for pointing that out. cc: @glennc, @rynowak

MrZoidberg commented 4 years ago

@karelz I experience this problem with .net core 2.2. Is it possible to get private libs to test the backport?

karelz commented 4 years ago

@wfurt can you please create a private build against 2.2 for @MrZoidberg to test?

rockstardev commented 4 years ago

UPDATE: Disregard this comment... the System.Net.Sockets.SocketException: Address already in use bug kept happening in both cases until. I've only solved it by moving my class from services.AddScoped into services.AddSingleton

PREVIOUS COMMENT (again, disregard): I'm getting this error a lot now in situation where I've switched from:

httpResponse = await _client.GetAsync(method);

to:

using (var req = create(HttpMethod.Get, finalUrl)) {
    httpResponse = await _client.SendAsync(req);
}

protected virtual HttpRequestMessage create(HttpMethod method, string url, string postBody = null) {
    var req = new HttpRequestMessage(method, url);
    return req;
}

I'm disposing of httpResponse after deserializing response in both cases. Any opinion on the code @karelz @stephentoub? I'm on .NET Core 2.2 and would prefer not upgrading to .NET Core 3.0, unless it would help you further debug and solve the problem.

antonioortizpola commented 4 years ago

The problem with the socket comes from the HttpClient, not HttpRequestMessage, how are you generating the HttpClient? are you using DI, HttpClientFactory?

adelhelal commented 4 years ago

Have you registered your HttpClient at bootstrap ? i.e. something like this:

services.AddHttpClient<IMyService, MyService>(c =>
{
    c.BaseAddress = new Uri("https://base_uri");
    c.Timeout = TimeSpan.FromSeconds(30);
}).SetHandlerLifetime(TimeSpan.FromSeconds(30));

Then from your MyService inject it in:

private readonly HttpClient _client;

public MyService(HttpClient client)
{
    _client = client;
}

Implementation should be as simple as:

using (var response = await _client.GetAsync(url))
{
    return await response.Content?.ReadAsStringAsync();
}

Since you're using the GET method the GetAsync(url) function should be sufficient.

This lets .NET Core handle all your connection pooling for you so you don't have to worry about managing it yourself.

rockstardev commented 4 years ago

I'm creating HttpClient by myself in both cases. However if I use it HttpRequestMessage and SendAsync - problems with System.Net.Sockets.SocketException: Address already in use start.

If I just use GetAsync - no problems.

@karelz @stephentoub can you please review this and provide any feedback? I'm hoping if you investigate in this direction that you can potentially solve the problem for others as well.

stephentoub commented 4 years ago

However if I use it HttpRequestMessage and SendAsync - problems with System.Net.Sockets.SocketException: Address already in use start.

Can you share the code you use with SendAsync? Are you calling it with ResponseHeadersRead?

OpenSourceAries commented 4 years ago

I'm hoping if you

GetAsync is just a wrapper around SendAsync.

https://github.com/dotnet/corefx/blob/092ede257ff5521c6bae20b27f1334536b1b985c/src/System.Net.Http/src/System/Net/Http/HttpClient.cs#L300-L304

draco111 commented 4 years ago

i got same error in production too(.netcore 2.2 ubuntu k8s) we follow the sample code with aspnet suggest doc and here is the demo code:

in Startup.cs

        public void ConfigureServices(IServiceCollection services)
        {
            services.AddMvc();

            // clients
            services.AddHttpClient<IAddressClient, AddressClient>()
                .AddHttpMessageHandler(handler => new TimingOutDelegatingHandler())
                .AddHttpMessageHandler(handler => new RetryPolicyDelegatingHandler());

here is the AddressClient.cs

public AddressClient(
            HttpClient httpClient,
            IOptions<UsersApiDomainConfig> usersApiConfig,
            ILogger<AddressClient> logger)
            : base(usersApiConfig)
        {
            _httpClient = httpClient;
            _logger = logger;
        }

here is the use of httpclient:

        public async Task SyncAddressAsync(AddressModel address)
        {
            var url = GetUrl(Path);
            var response = await _httpClient.PutAsync(
                url,
                new StringContent(JsonConvert.SerializeObject(address),
                    Encoding.UTF8,
                    HttpClientConstants.ApplicationJson));

i think we do as the doc suggest to

rockstardev commented 4 years ago

However if I use it HttpRequestMessage and SendAsync - problems with System.Net.Sockets.SocketException: Address already in use start.

Can you share the code you use with SendAsync? Are you calling it with ResponseHeadersRead?

Thanks for your help @stephentoub and @OpenSourceAries ... in the end I solved my problem in different fashion since I was still getting errors... updated original post: https://github.com/dotnet/corefx/issues/37044#issuecomment-545038295

chrisdrobison commented 4 years ago

I'm also seeing this exception with MongoDB driver in K8s, which isn't using the HttpClient. So is this fix this a thing or is the scope larger than originally anticipated?