dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Unknown socket error #64305

Open sliderhouserules opened 2 years ago

sliderhouserules commented 2 years ago

Description

We have an application based on the dotnet new webapi project template that just adds an IHostedService to do background processing. Kestrel is used so the service can answer health checks from Kubernetes, as well as feed new requests into the service to be queued. Our background processing logic involves pretty heavy interaction with AWS resources (SQS, S3, STS), and we are encountering SocketExceptions that have been very difficult to pin down.

The canonical guidance to use a static HttpClient instance application-wide does not work well when using the AWS clients in Amazon's SDK packages. The various AWS client constructors, and the config objects you can hand in, do not accept an instance of HttpClient. Instead, you must either rely on their built-in caching of HttpClient instance(s) that are created internally, or you have to provide an implementation that derives from their HttpClientFactory. This is fairly straightforward if you use IHttpClientFactory in .NET: the factory method you override simply calls CreateClient() on an IHttpClientFactory instance provided via DI.
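For illustration, a minimal sketch of that adapter, assuming the AWS SDK's abstract `Amazon.Runtime.HttpClientFactory` base class and its `CreateHttpClient` override (member names may differ slightly between SDK versions, so verify against the package you reference):

```csharp
using System.Net.Http;
using Amazon.Runtime;

// Hedged sketch: route the AWS SDK's HttpClient creation through the
// IHttpClientFactory registered in DI so it benefits from pooled handlers.
public class PooledAwsHttpClientFactory : Amazon.Runtime.HttpClientFactory
{
    private readonly IHttpClientFactory _factory;

    public PooledAwsHttpClientFactory(IHttpClientFactory factory) => _factory = factory;

    public override HttpClient CreateHttpClient(IClientConfig clientConfig) =>
        _factory.CreateClient("aws"); // named client registered via services.AddHttpClient("aws")
}
```

The instance can then be assigned to the AWS client configs (for example, an `AmazonSQSConfig.HttpClientFactory` property) so every AWS client in the service shares the pooled handlers.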

However, even when providing this HttpClientFactory to every AWS client instantiated anywhere in this service, the SocketException problem persists. I will provide the stack trace below, but the properties of the actual SocketException object are extremely unhelpful:

Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074

The twist here is that these errors have only been encountered on Linux/in containers. Extended test runs on Windows under load cannot reproduce the problem.

The other twist is that our Kestrel calls are producing System.InvalidOperationException: Handle is already used by another Socket. errors (they are logged at Warning level) in Kestrel's handling pipeline. It would appear that Kestrel and the handler/connection pool used underneath IHttpClientFactory are stomping on each other. If I am mischaracterizing the issue and there is a different cause, please let me know.

The fact that the runtime is throwing a SocketException with an error code that isn't even in the spec for TCP errors (nice article here) is the reason I am bringing it to this group. The runtime is burping up what would appear to be the equivalent of the default case in a switch statement.

Any help or guidance here is appreciated.

Reproduction Steps

The basic structure is a Web API project created with dotnet new webapi, with an IHostedService doing background processing that interacts heavily with AWS resources. I can provide application code privately, as needed, so as to not have to share proprietary/private logic and access keys publicly.

Expected behavior

Interaction with AWS resources from a web application should not cause SocketExceptions.

Actual behavior

We are encountering SocketExceptions that have error codes that aren't even part of the TCP specs:

Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074

Full stack trace:

System.Net.Http.HttpRequestException: Unknown socket error (sqs.us-west-2.amazonaws.com:443)
 ---> System.Net.Sockets.SocketException (0xFFFDFFFE): Unknown socket error
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.PolicyHttpMessageHandler.SendCoreAsync(HttpRequestMessage request, Context context, CancellationToken cancellationToken)
   at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext)
   at Polly.AsyncPolicy`1.ExecuteAsync(Func`3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext)
   at Microsoft.Extensions.Http.PolicyHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
   at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.SQS.Internal.ValidationResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)
   at <our code starts here>

Regression?

The application in its current form is a .NET 6 solution. In a prior version it was .NET 5, and the SocketExceptions were very rare, if not non-existent.

However, even in the .NET 5 solution, Kestrel was throwing the System.InvalidOperationException: Handle is already used by another Socket. warnings when handling basic health check calls.

Known Workarounds

We have implemented heavy retry policies using Polly, which allows the application to move past the errors, but it quite often takes a near complete restart of the background processing logic to clear up the SocketExceptions and allow the application to restart a job and create new connections to the AWS resources being used.

Configuration

The errors only appear running in containers on Linux.

Other information

N/A

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

wfurt commented 2 years ago

Any chance you can run it under strace to see what is happening at the OS level? The error will likely come from the kernel, so the base OS is probably more important than the container.
cc: @tmds

ghost commented 2 years ago

This issue has been marked needs-author-action since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.

tmds commented 2 years ago

The error code we get isn't helpful.

Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074

From the stacktrace we see the exception occurs during connect to sqs.us-west-2.amazonaws.com:443.

If you let .NET generate a trace for the app, the Private.InternalDiagnostics.System.Net* and System.Net* event sources may tell us more about when connect fails (e.g. during name resolution, or when making the socket connect).
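If attaching dotnet-trace to the pod is awkward, an in-process EventListener can capture the same event sources. A rough sketch (the filtering and console output here are just placeholders; wire it into your logger as needed):

```csharp
using System;
using System.Diagnostics.Tracing;

// Enables the System.Net* and Private.InternalDiagnostics.System.Net* event sources
// and writes their events to the console.
public sealed class NetDiagnosticsListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name.StartsWith("System.Net", StringComparison.Ordinal) ||
            eventSource.Name.StartsWith("Private.InternalDiagnostics.System.Net", StringComparison.Ordinal))
        {
            EnableEvents(eventSource, EventLevel.LogAlways);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventSource.Name}/{eventData.EventName}: {payload}");
    }
}
```

Keep a single instance alive for the lifetime of the app (for example, create it near the top of Program.cs) and it will log events for name resolution, connection establishment, and socket operations.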

The other exception is a different problem:

System.InvalidOperationException: Handle is already used by another Socket.

This exception can happen when multiple Socket instances perform async I/O on the same underlying file descriptor. Does your app do this? The stacktrace can tell us more.
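For context, a contrived sketch of that situation (illustrative only, not something to ship): two Socket instances driving async I/O on the same OS handle. On Linux, the second pending receive is expected to fail with exactly this message, either synchronously or when awaited.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

var first = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
first.Bind(new IPEndPoint(IPAddress.Loopback, 0));

// Wrap the same file descriptor in a second Socket; ownsHandle: false avoids a double close.
using var second = new Socket(new SafeSocketHandle(first.Handle, ownsHandle: false));

var buffer = new byte[16];
_ = first.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None); // registers the fd with the async engine

try
{
    await second.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None); // same fd, second registration
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message); // "Handle is already used by another Socket."
}
```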

sliderhouserules commented 2 years ago

From the stacktrace we see the exception occurs during connect to sqs.us-west-2.amazonaws.com:443.

If you let .NET generate a trace for the app, the Private.InternalDiagnostics.System.Net* and System.Net* event sources may tell us more about when connect fails (e.g. during name resolution, or when making the socket connect).

Are you talking about running strace as Tomas suggested? I don't know much about this utility yet, but I have access to open a shell into the pods so I can run this. Or are you talking about dotnet-counters or something? I could run that too, I've used it before.

The other exception is different problem:

System.InvalidOperationException: Handle is already used by another Socket.

This exception can happen when multiple Socket instances perform async I/O on the same underlying file descriptor. Does your app do this? The stacktrace can tell us more.

Yeah, I brought this part up because I have strong suspicions that this may be what is causing all of the problems. This part of the issue has existed since the inception of this application. If Kestrel is being allocated sockets in a way that is not copacetic to the way IHttpClientFactory/HttpClient is, then they could be stepping on each other.

tmds commented 2 years ago

Are you talking about running strace as Tomas suggested? I don't know much about this utility yet, but I have access to open a shell into the pods so I can run this. Or are you talking about dotnet-counters or something? I could run that too, I've used it before.

strace requires you to be privileged on the machine.

I mean dotnet-trace.

Could your file descriptor hunch maybe have something to do with how Linux has a lot of interaction with the file system (/proc/ folder and other system files)? Our application does nothing whatsoever with the file system.

Something causes the file descriptor to be registered twice. It could be your code or a 3rd party library. If you're not doing anything special with handles, you should try to figure out which 3rd party library is causing the issue.

I doubt it is .NET or ASP.NET Core because there would be reports about it.

sliderhouserules commented 2 years ago

This is the exception from the Handle is already used by another Socket error/warning:

System.InvalidOperationException: Handle is already used by another Socket.
   at System.Net.Sockets.SocketAsyncEngine.TryRegisterCore(IntPtr socketHandle, SocketAsyncContext context, Error& error)
   at System.Net.Sockets.SocketAsyncEngine.TryRegisterSocket(IntPtr socketHandle, SocketAsyncContext context, SocketAsyncEngine& engine, Error& error)
   at System.Net.Sockets.SocketAsyncContext.TryRegister(Error& error)
   at System.Net.Sockets.SocketAsyncContext.OperationQueue`1.StartAsyncOperation(SocketAsyncContext context, TOperation operation, Int32 observedSequenceNumber, CancellationToken cancellationToken)
   at System.Net.Sockets.SocketAsyncContext.ReceiveAsync(Memory`1 buffer, SocketFlags flags, Int32& bytesReceived, Action`5 callback, CancellationToken cancellationToken)
   at System.Net.Sockets.SocketAsyncEventArgs.DoOperationReceive(SafeSocketHandle handle, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.ReceiveAsync(SocketAsyncEventArgs e, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketReceiver.WaitForDataAsync(Socket socket)
   at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection.DoReceive()
   at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
   at System.IO.Pipelines.Pipe.ReadAsync(CancellationToken token)
   at System.IO.Pipelines.Pipe.DefaultPipeReader.ReadAsync(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1Connection.BeginRead(ValueTask`1& awaitable)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequestsAsync[TContext](IHttpApplication`1 application)

That doesn't look like anything to do with the file system unless that's under the hood.

tmds commented 2 years ago

That doesn't look like anything to do with the file system unless that's under the hood.

Yes, this stacktrace doesn't tell us much. We're interested to find out who registered the first Socket for this fd.

tmds commented 2 years ago

About Handle is already used by another Socket:

There is a related issue (https://github.com/dotnet/runtime/issues/56750, which should probably be closed). A blog post is mentioned there; in that case, the issue disappeared when removing a 3rd party library: https://zblesk.net/blog/aspnetcore-identity-litedb-breaks-on-ubuntu/.

cc @antonfirsov @karelz

About Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074.

Maybe you can make a small reproducer by making HttpClient perform a call against sqs.us-west-2.amazonaws.com:443.
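A minimal console reproducer in that spirit might look like this (endpoint and iteration count are arbitrary; the point is to log the same SocketException fields as in the report above):

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;

using var client = new HttpClient();
for (int i = 0; i < 1000; i++)
{
    try
    {
        using var response = await client.GetAsync("https://sqs.us-west-2.amazonaws.com/");
        Console.WriteLine($"{i}: {(int)response.StatusCode}");
    }
    catch (HttpRequestException ex) when (ex.InnerException is SocketException se)
    {
        Console.WriteLine($"{i}: {se.Message}; ErrorCode={se.ErrorCode}; " +
                          $"SocketErrorCode={se.SocketErrorCode}; NativeErrorCode={se.NativeErrorCode}");
    }
}
```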

antonfirsov commented 2 years ago

Having a repro for this would be extremely valuable; this is the 4th report of #56750, if I'm counting correctly.

tmds commented 2 years ago

@antonfirsov it doesn't happen often, but when it does, it is a pain to debug. And when I see this, I wonder: can there still be a bug in SocketAsyncEngine...

Thinking out loud.

We could add some envvar which causes the SocketAsyncEngine to track for each SocketAsyncContext the Environment.StackTrace that does the registration. Then, when we throw this exception, we could include that stacktrace.

Or we extend the event source logging so it contains the StackTrace and the user can try to match the handle values.
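A standalone sketch of what that tracking idea amounts to (a hypothetical helper, not actual runtime code): remember the stack trace of the first registration per handle and surface it when the same handle shows up again.

```csharp
using System;
using System.Collections.Concurrent;

static class HandleRegistrationTracker
{
    private static readonly ConcurrentDictionary<IntPtr, string> s_firstRegistration = new();

    public static void Register(IntPtr handle)
    {
        // If the handle was already registered, include the original registration's stack trace.
        if (!s_firstRegistration.TryAdd(handle, Environment.StackTrace))
        {
            throw new InvalidOperationException(
                $"Handle {handle} is already used by another Socket. First registered at:" +
                Environment.NewLine + s_firstRegistration[handle]);
        }
    }

    public static void Unregister(IntPtr handle) => s_firstRegistration.TryRemove(handle, out _);
}
```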

wfurt commented 2 years ago

Do you know, @tmds, if our tracing actually logs the actual OS error or only the translated one? This is the reason I suggested strace, as that gives the raw value from the kernel.

tmds commented 2 years ago

Do you know, @tmds, if our tracing actually logs the actual OS error or only the translated one? This is the reason I suggested strace, as that gives the raw value from the kernel.

I don't think so. The NativeErrorCode is very weird. I hope the dotnet-trace can help us figure out where in the connect it fails. I assume it's not easy to strace where this app is running due to needed privileges. Also if the error comes from name resolution, it wouldn't show in strace.

wfurt commented 2 years ago

For name resolution the tracing should be sufficient AFAIK.

karelz commented 2 years ago

Triage: This is not the first time we see problems like these. However, we need more info (logs) to make it actionable. @sliderhouserules are you in position to work with us here or privately to root cause it in your environment and help us improve .NET Core? We would appreciate any help here. Thanks!

ghost commented 2 years ago

This issue has been marked needs-author-action since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.

sliderhouserules commented 2 years ago

So... in our testing, the SocketException we can get to for logging on the back-end (Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074 -- that is my log statement) was occurring on around 25% to 30% of the outgoing calls our code was making. Around 75% or so of the incoming calls hitting Kestrel were logging the warning from the handling pipeline (Handle is already used by another Socket -- I don't know how to get access to those exceptions to be able to log more).

When I shelled into one of the containers and tried running strace, it would attach but then produce no output even though the app was working, both front-end and back-end (and producing errors/warnings). I'm sure I was doing something wrong. dotnet-trace would attach, but almost immediately detach (Detected a potentially broken trace. Continuing with best-efforts to convert the trace [/snip]). If I tried running it over and over, it would eventually work and produce some output, but not for long. Again, I'm sure I was doing something wrong.

So I scoured our code again from top to bottom, left to right, and made sure every call was using the IHttpClientFactory. I dug into the AWS source code to make sure every call they are making would use it, just to be sure. I then dug in and found that the main log sink/forwarder we were using is using WebRequest.Create! I got rid of it and worked with our infrastructure team to get Kubernetes forwarding logs from STDOUT. However, this didn't get rid of the errors.

Next, I dug into New Relic's .Net Agent source code. We are using RecordMetric and NoticeError quite a bit. I had thought that they were using something like RPC or something from previous trips down that rabbit hole, because their code is confusing. But then I finally found this -- WebRequest.Create... Ugh.

So I commented out every call to the NewRelic Agent, every using, every NuGet reference. The system has been under load for around 3 hours now and the errors have disappeared. (That's a bit of a lie, Kestrel has generated the warning given previously a few times in that three hours.)

tl;dr: New Relic's .NET Agent is using WebRequest.Create, and stripping NR from the app has remedied the problems. I'm going to file an issue with them.

So what's left here that maybe deserves some further scrutiny?

Providing a full repro that doesn't expose custom logic/credentials may be hard for me to figure out. We are doing heavy interaction with AWS as mentioned. A Kestrel app with an IHostedService hitting an external resource (GETs, PUTs and POSTs), properly using a shared HttpClient, plus anything using WebRequest instead of a shared HttpClient, should recreate this problem under load. Targeting and testing both .NET 5 and .NET 6 should show a difference in the frequency of the errors.
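As a rough sketch of that repro shape (all names and URLs below are placeholders): a hosted service that alternates between the pooled IHttpClientFactory path and a legacy WebRequest path under load.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public class BackgroundLoadService : BackgroundService
{
    private readonly IHttpClientFactory _httpClientFactory;

    public BackgroundLoadService(IHttpClientFactory httpClientFactory) =>
        _httpClientFactory = httpClientFactory;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // The "well-behaved" path: pooled handlers via IHttpClientFactory.
            var client = _httpClientFactory.CreateClient();
            using (await client.GetAsync("https://example.com/", stoppingToken)) { }

            // The "legacy" path, standing in for the log forwarder / agent.
#pragma warning disable SYSLIB0014 // WebRequest.Create is obsolete in .NET 6
            var request = WebRequest.Create("https://example.com/");
#pragma warning restore SYSLIB0014
            using (await request.GetResponseAsync()) { }
        }
    }
}
```

Registered with services.AddHttpClient() and services.AddHostedService&lt;BackgroundLoadService&gt;() in a stock webapi project, this keeps both client stacks active alongside Kestrel's health-check traffic.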

Sorry this is so long. Let me know what I can do further.

wfurt commented 2 years ago

may be related to #61798. @MihaZupan was looking into that. I don't see direct correlation but they certainly do many unusual things.

antonfirsov commented 2 years ago

may be related to #61798

If it's about https://github.com/newrelic/newrelic-dotnet-agent/issues/803, NewRelic agent 9.2.0 should contain the fix for that problem.

I can file an issue over on the aspnet board for this, just let me know.

#56750 was originally opened against aspnet and transferred to us, so this is not necessary.

Focusing on the Handle is already used by another Socket issue: I would go with the idea from https://github.com/dotnet/runtime/issues/64305#issuecomment-1023445687 and share a private build of System.Net.Sockets.dll that contains some extra logging. @sliderhouserules would you be able to try it? Since neither of us has contact info in our bios, and GH doesn't have direct messages, I have sent an invitation to a private repository named Repro-64305; we can collaborate there or just exchange email contacts in an issue.

sliderhouserules commented 2 years ago

Yes, I could try that. My email is my github username at gmail.

sliderhouserules commented 2 years ago

In relation to the New Relic issue already being fixed, we tried v9.3.0 and it isn't fixed. They're using WebRequest, so unless that plays nice with HttpClient, then they still have the problem.

Anyway, that's separate from you guys, and I've filed an issue on their board that you can see linked above, so I'll follow up with them.

wfurt commented 2 years ago

HttpWebRequest is built on top of HttpClient ... unlike .NET Framework where the relation is opposite.

sliderhouserules commented 2 years ago

Right, but they're not using HttpWebRequest, they're using WebRequest...

antonfirsov commented 2 years ago

The mentioned WebRequest.Create lines will create an HttpWebRequest, which is a subclass of WebRequest. NewRelic seems to use the synchronous path. What's weird is that the error and stack trace posted in the opening comment don't come from that sync path; however, it looks like getting rid of those synchronous NewRelic requests eliminates the weird unknown socket error coming from the bottom of the MetricsHandler call.

tmds commented 2 years ago

I would go with the idea from #64305 (comment), and share a private build of System.Net.Sockets.dll which contains some extra logging.

I've implemented the stacktrace tracking in https://github.com/tmds/runtime/tree/socketcontext_stacktrace. Below is the System.Net.Sockets.dll which you can place in a 6.0.101 sdk at shared/Microsoft.NETCore.App/6.0.1/System.Net.Sockets.dll. The exception message will include the stacktrace of the first registration.

System.Net.Sockets.dll.tar.gz

So I commented out every call to the NewRelic Agent, every using, every NuGet reference. The system has been under load for around 3 hours now and the errors have disappeared.

Perhaps the agent makes the request that causes the socket error, so removing it gets rid of the socket error.

It doesn't mean the agent does something wrong. It can still be an issue with .NET.

The weird NativeErrorCode suggests .NET has an issue understanding/dealing with an error that happens during connect.

There is no clear relation (for me) between newrelic/newrelic-dotnet-agent#803 and the unknown socket error.

sliderhouserules commented 2 years ago

The stack traces I've provided above are just a couple out of many. Here's one from a place where I catch a SocketException from a call to SQS, then spin up an HttpClient using IHttpClientFactory and just send a GET to GitHub. When that errors too, this is the stack trace.

System.Net.Http.HttpRequestException: Unknown socket error (api.github.com:443)
 ---> System.Net.Sockets.SocketException (0xFFFDFFFE): Unknown socket error
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
[my code here]
tmds commented 2 years ago

@sliderhouserules we're most interested in seeing the stacktrace of Handle is already used by another Socket.

For the Unknown socket error it would be interesting to get the dotnet traces.

This information can also help:

  1. Do you know if there are calls made against the host names that appear in the stacktrace that succeed?
  2. Does curl/wget work against them?
  3. Can you resolve the addresses for the hostname (e.g. nslookup api.github.com)?
  4. Are there only IPv4 or also IPv6 addresses?

antonfirsov commented 2 years ago

@sliderhouserules can you run an experiment by replacing System.Net.Sockets.dll with the one shared in https://github.com/dotnet/runtime/issues/64305#issuecomment-1024019280 and share some logs containing the registration stack trace (https://github.com/tmds/runtime/commit/9cb7381d245de47f9d27a7a2d848a48fd039652d)?

You can share it here or in private; I pinged you on your email, not sure if you saw it.

sliderhouserules commented 2 years ago

Sorry for the incommunicado. Had some priorities shift that pulled me to other tasks.

I'm also working through some unrelated errors we're getting with the 3rd party module that we use for the main processing in our application. It is unmanaged code that we call in-process. But... we can't get a straight answer out of our vendor as to whether the module is calling home for authentication. Working through setting up network tracing today, and will keep you posted on that.

I'm trying to get new stack traces from our managed code using the new dll @tmds provided above asap for you all. Shifting priorities mentioned above have had focus move to the pre-dotnet6 version of the app, which does not get these errors so prolifically (actually, not very often...). Haven't had a chance to throw load at it, and there are quite a few changes along with the upgrade to dotnet6, so that part of the cause is hard to pin down and test for well, until I can get some time in a test environment besides my workstation.

Have I mentioned that this only happens on Linux? I have never seen this happen on Windows, but I haven't tested much in the way of load because my workstation is the only place I can test that (I've done sustained testing, like letting it run over the weekend a couple times, but not much load). I'll try to pull some logs today. But yeah, running a Linux container locally I've seen the errors before. I've never seen them on Windows. So there's that.

sliderhouserules commented 2 years ago

For the Unknown socket error it would be interesting to get the dotnet traces.

As mentioned, this has been particularly hard to do. I'll keep trying.

This information can also help:

  1. Do you know if there are calls made against the host names that appear in the stacktrace that succeed?
  2. Does curl/wget work against them?
  3. Can you resolve the addresses for the hostname (e.g. nslookup api.github.com)?
  4. Are there only IPv4 or also IPv6 addresses?

My answers:

  1. Yes, almost all of these calls are to either AWS resources (S3, SQS, STS, etc.) or New Relic. The app would be dead in the water if all those calls were failing.
  2. Yep. Certainly haven't tested that under load, but I've logged into the pod/container and hit www.google.com just fine.
  3. I will try to specifically test some app-used hostnames from inside the container... Our app certainly can.
  4. I don't know if we have IPv6 being used for any actual comms out in k8s, but I know the pods all get a v4 address. As mentioned before (I think), the k8s health check has been causing the Kestrel problems from day 1. This may be an avenue to explore more. I have seen some weird, very intermittent, errors in the logs pertaining to IPv6 before. I'll try to track those down.
sliderhouserules commented 2 years ago

From what I answered above, a follow up.

Yes, almost all of these calls are to either AWS resources (S3, SQS, STS, etc.) or New Relic. The app would be dead in the water if all those calls were failing.

The app will sometimes recover after 1, 2, or 10 retries. It varies. However, sometimes the task retries so many times that my retry logic kills it altogether because it gets stuck.

wfurt commented 2 years ago

Actually, this fragment from TryCompleteConnect would set Unknown error regardless of the OS failure.

https://github.com/dotnet/runtime/blob/1df36c702a46363336b6ea5d0d9558d513e7be8e/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs#L709-L714

It has existed for a very long time, but the path above with a failed Interop.Sys.Poll perhaps has not. I wish we had an error trace there, but we don't. If there are experiments with an instrumented build, it would be nice to know if we are hitting any of the error paths here, @antonfirsov.

tmds commented 2 years ago

Actually, this fragment from TryCompleteConnect would set Unknown error regardless of the OS failure.

Any async connect that fails gets reported as SocketError.SocketError. We need to propagate the real error.

NativeErrorCode: -131074 is 0x20002.

https://github.com/dotnet/runtime/blob/1df36c702a46363336b6ea5d0d9558d513e7be8e/src/libraries/Common/src/Interop/Unix/Interop.Errors.cs#L105
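For reference, the numeric relationship (just the two's-complement view of the same value, nothing more):

```csharp
using System;

int nativeErrorCode = -131074;
Console.WriteLine($"0x{-nativeErrorCode:X}");       // prints 0x20002
Console.WriteLine($"0x{(uint)nativeErrorCode:X8}"); // prints 0xFFFDFFFE, matching the exception header
```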

tmds commented 2 years ago

SocketError.SocketError can also occur when we catch an Exception other than SocketException here:

https://github.com/dotnet/runtime/blob/59662add23fba7d49e1d276168d201d54edcc088/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs#L780-L783

which leads to

https://github.com/dotnet/runtime/blob/59662add23fba7d49e1d276168d201d54edcc088/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs#L430-L438

antonfirsov commented 2 years ago

Shouldn't we persist the native error code instead of mapping it there and back, inevitably losing information? It seems like a nontrivial refactor though, and requires a new SocketException constructor or a change to the current constructor's semantics:

https://github.com/dotnet/runtime/blob/73ddf6e50e20a81492209d14588a05ee9a2b68d4/src/libraries/System.Net.Primitives/src/System/Net/SocketException.cs#L19-L31

tmds commented 2 years ago

Shouldn't we persist the native error code instead of mapping it there and back,

Yes, we'd like NativeErrorCode to be a native error code, and not 0x20002.

tmds commented 2 years ago

Found another: https://github.com/dotnet/runtime/blob/bdae17fc0ec222e75e651d96af74e9667c01602e/src/libraries/System.Net.NameResolution/src/System/Net/NameResolutionPal.Unix.cs#L41-L43

There are a few Debug statements around these usages, so I've recompiled the dll I shared earlier in Debug config. Give this a try @sliderhouserules: System.Net.Sockets.dll.tar.gz

sliderhouserules commented 2 years ago

Awesome, I'll give it a try asap and let you know.

sliderhouserules commented 2 years ago

Didn't want to leave you all hanging...

I'm working to build a small app to repro this so I can hand something off to you that will help more than some random stack traces. For our main app we've backed off of the big .NET 6 push to a previous version of the code that isn't presenting the problem so far (I have tried, and it doesn't show up, and I've had little chance to test with the new dlls).

I should know within a couple days if I can successfully repro the problem.

MMOSimca commented 2 years ago

I have encountered this 'Unknown socket error' issue extensively in a similar yet slightly different setting.

My team migrated an ancient application with some questionable programming practices from:

- AWS Elastic Compute Cloud instance
- Windows Server 2019
- IIS
- .NET Core 2.2
- Public subnets with an Internet Gateway

to

- AWS Elastic Container Service (Fargate) task
- Amazon Linux 2 instance running Ubuntu 20.04 images provided by Microsoft in a container
- Kestrel
- .NET 6
- Private subnets with no internet access at all

We've also tightened up security extensively in security groups, IAM permissions, and other avenues. There are so many required changes that it is difficult to pinpoint where things went wrong.

We started to encounter random 500s (returned by Kestrel directly), 502s (from Application Load Balancer), and 504s (full timeouts, from Application Load Balancer) from our website. There's no obvious cause. All API calls seem to exhibit this behavior, seemingly at random.

Digging in deeper, we found errors like this:

2022-03-01 20:18:00.0408|ip-10-0-101-228|ERROR|System.AggregateException: One or more errors occurred. (Unknown socket error)
---> System.Net.Sockets.SocketException (0xFFFDFFFE): Unknown socket error
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location ---
at System.Net.Sockets.TcpClient.CompleteConnectAsync(Task task)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at MySql.Data.Common.StreamCreator.GetTcpStream(MySqlConnectionStringBuilder settings, MyNetworkStream& networkStream)
at MySql.Data.MySqlClient.NativeDriver.Open()
at MySql.Data.MySqlClient.Driver.Open()
at MySql.Data.MySqlClient.Driver.Create(MySqlConnectionStringBuilder settings)
at MySql.Data.MySqlClient.MySqlPool.CreateNewPooledConnection()
at MySql.Data.MySqlClient.MySqlPool.GetPooledConnection()
at MySql.Data.MySqlClient.MySqlPool.TryToGetDriver()
at MySql.Data.MySqlClient.MySqlPool.GetConnection()
at MySql.Data.MySqlClient.MySqlConnection.Open()
at -my code happens here-

I suspect the problem goes further than purely our MySQL RDS connection, though. The 502s and 504s suggest that sometimes Kestrel cannot properly respond to requests either, as if it is also experiencing similar problems. However, it doesn't log any errors to that effect at least with its default settings.

Unfortunately, I'm not assigned to this project any longer and can't afford to spend work hours debugging this issue. However, this has made me curious enough that if the project's new team hasn't solved it somehow by the weekend, I'll try out the modified DLL or see if I can get a dotnet-trace, as I still have access to the environment (and I'm sure they would greatly appreciate any help they could get, since they're pulling their hair out over this issue too).

tmds commented 2 years ago

I'll try out the modified DLL or see if I can get a dotnet-trace

:+1:, that will help us move this issue forward.

sliderhouserules commented 2 years ago

Do you have a .NET 5-compatible version of that dll? I had to abandon the work that seemed to dead-end with the SocketException issue, and we went back to the .NET 5 version of the app and moved forward from there, and that seemed to fix the issue. But now we're trying to make the final push into production with our project and the SocketExceptions are back.

If it would be better to open a formal support ticket with MS so that we can share our actual app with you, then we can do that.

tmds commented 2 years ago

Do you have a .NET 5-compatible version of that dll?

.NET 5 is supported for only 2 more months. It would be best if we can reproduce it with .NET 6.

If it would be better to open a formal support ticket with MS so that we can share our actual app with you, then we can do that.

I'm not a Microsoft employee, so that would exclude me from the investigation.

We should start by trying the dll and see if that makes us wiser.

sliderhouserules commented 2 years ago

Ah, I see. I'll get you some results as soon as I can. Thanks.

MMOSimca commented 2 years ago

I was having trouble getting your DLL to work, so I gave up for a while, but then I realized I'm dumb and had downloaded the first version of your DLL.

The latest one you posted works perfectly to give a more direct stack trace. It's a hell of a lot more useful than the vague 'unknown socket error' results anyway, for which Google will lead you exclusively to this thread.

2022-03-10 15:24:39.4513|ip-10-0-136-140|ERROR|JHTNQS: System.Net.Sockets.SocketException (125): Operation canceled
   at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName)
   at System.Net.Sockets.Socket.get_Available()
   at MySql.Data.MySqlClient.MySqlStream.SendPacket(MySqlPacket packet)
   at MySql.Data.MySqlClient.NativeDriver.CloseStatement(Int32 statementId)
   at MySql.Data.MySqlClient.Driver.CloseStatement(Int32 id)
   at MySql.Data.MySqlClient.PreparableStatement.CloseStatement()
   at MySql.Data.MySqlClient.MySqlCommand.Dispose()
   at -my code- (a much more direct and distinct trace here too)

We've tried dozens and dozens of things at this point to narrow down what's wrong, but having a new error message to look at will help a huge amount.

This still doesn't explain to me why Kestrel instantly returns errors back to our Application Load Balancer like crazy after one of these errors happens. We get -one- instance of this error in the logs and then the entire site becomes unusable as Kestrel bounces all traffic.

@sliderhouserules Trying to find a few more commonalities between us. Do you make RDS connections? Do you use an S3 client? How sync-based is your code (this legacy code is very heavily sync and awaits or .Results any async)? Do you use any custom middleware (we use a few)? Do you make any POST requests to external services with your IHttpClientFactory?

sliderhouserules commented 2 years ago

Basic breakdown of our application:

What we're dealing with at this point:

Incriminating details:

We cannot remove New Relic, as we lose all telemetry insight into our application, so we are living with robust (as we can make it) mitigation through retries and resiliency via SQS replays, and our error rate as such is around 0.1% once all retry logic has failed. For now we are living with this and just pushed to Beta in Production yesterday.

I have a few other things on my TODO list before I can get back to trying to deal with these SocketExceptions, and will definitely either revisit this thread, or will open an official support ticket with Microsoft for this issue (and will update here with any results/findings).

sliderhouserules commented 2 years ago

We get -one- instance of this error in the logs and then the entire site becomes unusable as Kestrel bounces all traffic.

Basically the same here. Once the application encounters a hard SocketException (not a Kestrel warning) then it all goes to pot and we have to have Kubernetes shut down the pod.

MMOSimca commented 2 years ago

Holy crap, my teammate solved it.

This was the entire fix: (screenshot of the code change — the `new SafeFileHandle(IntPtr.Zero, true)` line quoted below)

I'm still in disbelief about it.

How? Why? How did this code function perfectly for 100s of requests and then suddenly stop, ruining all traffic to and from the service? Why would it sometimes recover? Why was it crazily intermittent? What even IS a SafeHandle? I've never used one, personally.

Why did this function on Linux at all if it literally says Win32 in the file name?! Why did it sometimes stop functioning?! Why was it even more intermittent back in .NET Core 2.2?!

But beyond the questions, what I have most right now is joy. I'm so happy to be done with this. Thank God.

stephentoub commented 2 years ago

Why did this function on Linux at all if it literally says Win32 in the file name?!

That naming stems from 20 years ago when .NET Framework was introduced and ran solely on Windows. The name may still include Win32, but almost everything in that namespace was made to function well on Unix as well. We just can't change the name for compatibility reasons.

What even IS a SafeHandle?

It wraps a handle on Windows or a file descriptor on Unix. It basically provides a finalizer that can close such a handle/file descriptor if you forget to Dispose of it (i.e. close it) and it ensures that Dispose'ing it doesn't actually close the underlying handle while the SafeHandle is still in use by someone else, typically in interop scenarios.

new SafeFileHandle(IntPtr.Zero, true)

0 is a valid file descriptor, typically stdin. By creating a SafeFileHandle for file descriptor 0, when you then called handle.Dispose, you were calling close(0), effectively closing stdin.
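A tiny illustration of that failure mode (don't run this in a real service):

```csharp
using System;
using Microsoft.Win32.SafeHandles;

// Wrapping fd 0 in an "owning" SafeFileHandle and disposing it closes stdin on Unix,
// freeing fd 0 to be reused by the next handle the OS hands out (e.g. a socket).
var handle = new SafeFileHandle(IntPtr.Zero, ownsHandle: true);
handle.Dispose(); // effectively close(0) on Linux
```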

MMOSimca commented 2 years ago

But on Windows presumably that can't happen? Since this code ran on Windows fine for years.

I don't know much more than you stated above about SafeHandles, but are there any improvements the .NET team could make here to provide a warning when initializing a SafeHandle to a 0 value? Something like:

'Warning: using 0 for SafeHandles and then disposing them may result in improperly terminating input streams in Unix.'

Not sure if that makes sense (if there are a lot of valid use cases it could annoy people), but anything that could be done to assist anyone else out there who might happen to be as unfortunate as me with an issue this obscure would be greatly appreciated!