sliderhouserules opened 2 years ago
Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.
| Author: | sliderhouserules |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Net.Http`, `untriaged` |
| Milestone: | - |
Any chance you can run it under `strace` to see what is happening at the OS level? The error will likely come from the kernel, so the base OS is probably more important than the container.
cc: @tmds
This issue has been marked `needs-author-action` since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.
The error code we get isn't helpful:

`Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074`

From the stacktrace we see the exception occurs during connect to `sqs.us-west-2.amazonaws.com:443`.

If you let .NET generate a trace for the app, the `Private.InternalDiagnostics.System.Net*` and `System.Net*` event sources may tell us more about when connect fails (e.g. during name resolution, or when making the socket connect).
The other exception is a different problem:

`System.InvalidOperationException: Handle is already used by another Socket.`

This exception can happen when multiple `Socket` instances perform async I/O on the same underlying file descriptor. Does your app do this? The stacktrace can tell us more.
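For illustration, a minimal sketch of that failure mode on Linux; the UDP socket and the way the exception surfaces (synchronously here rather than on the returned task) are assumptions, and in a real app the second wrapper typically comes from code that re-wraps an existing handle:

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Two Socket instances over the same underlying file descriptor.
using var first = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
first.Bind(new IPEndPoint(IPAddress.Loopback, 0));

// Wrap the same OS handle in a second Socket (ownsHandle: false so disposing the
// wrapper does not close the fd out from under the first instance).
using var second = new Socket(new SafeSocketHandle(first.Handle, ownsHandle: false));

var buffer = new ArraySegment<byte>(new byte[16]);
_ = first.ReceiveAsync(buffer, SocketFlags.None);      // first async op registers the fd with the async engine
try
{
    _ = second.ReceiveAsync(buffer, SocketFlags.None);  // second registration of the same fd
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message); // "Handle is already used by another Socket." on Linux
}
```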
> From the stacktrace we see the exception occurs during connect to `sqs.us-west-2.amazonaws.com:443`.
>
> If you let .NET generate a trace for the app, the `Private.InternalDiagnostics.System.Net` and `System.Net` event sources may tell us more about when connect fails (e.g. during name resolution, or when making the socket connect).
Are you talking about running `strace` as Tomas suggested? I don't know much about this utility yet, but I have access to open a shell into the pods so I can run this. Or are you talking about `dotnet-counters` or something? I could run that too, I've used it before.
> The other exception is a different problem:
>
> `System.InvalidOperationException: Handle is already used by another Socket.`
>
> This exception can happen when multiple `Socket` instances perform async I/O on the same underlying file descriptor. Does your app do this? The stacktrace can tell us more.
Yeah, I brought this part up because I have strong suspicions that this may be what is causing all of the problems. This part of the issue has existed since the inception of this application. If Kestrel is being allocated sockets in a way that is not copacetic with the way `IHttpClientFactory`/`HttpClient` works, then they could be stepping on each other.
> Are you talking about running `strace` as Tomas suggested? I don't know much about this utility yet, but I have access to open a shell into the pods so I can run this. Or are you talking about `dotnet-counters` or something? I could run that too, I've used it before.
`strace` requires you to be privileged on the machine.

I mean `dotnet-trace`.
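If attaching a tracer is inconvenient, the event sources named above can also be observed in-process with an `EventListener`. A minimal sketch (the verbosity, keywords, and output format are assumptions):

```csharp
using System;
using System.Diagnostics.Tracing;

// Minimal in-process listener for the System.Net* and Private.InternalDiagnostics.System.Net*
// event sources. Instantiate it once at startup, e.g. `using var listener = new NetEventListener();`.
sealed class NetEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name.StartsWith("System.Net", StringComparison.Ordinal) ||
            eventSource.Name.StartsWith("Private.InternalDiagnostics.System.Net", StringComparison.Ordinal))
        {
            EnableEvents(eventSource, EventLevel.Verbose, EventKeywords.All);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventSource.Name}/{eventData.EventName}: {payload}");
    }
}
```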
Could your file descriptor hunch maybe have something to do with how Linux has a lot of interaction with the file system (/proc/ folder and other system files)? Our application does nothing whatsoever with the file system.
Something causes the file descriptor to be registered twice. It could be your code or a 3rd party library. If you're not doing anything special with handles, you should try to figure out which 3rd party library is causing the issue.
I doubt it is .NET or ASP.NET Core because there would be reports about it.
This is the exception from the `Handle is already used by another Socket` error/warning:
System.InvalidOperationException: Handle is already used by another Socket.
at System.Net.Sockets.SocketAsyncEngine.TryRegisterCore(IntPtr socketHandle, SocketAsyncContext context, Error& error)
at System.Net.Sockets.SocketAsyncEngine.TryRegisterSocket(IntPtr socketHandle, SocketAsyncContext context, SocketAsyncEngine& engine, Error& error)
at System.Net.Sockets.SocketAsyncContext.TryRegister(Error& error)
at System.Net.Sockets.SocketAsyncContext.OperationQueue`1.StartAsyncOperation(SocketAsyncContext context, TOperation operation, Int32 observedSequenceNumber, CancellationToken cancellationToken)
at System.Net.Sockets.SocketAsyncContext.ReceiveAsync(Memory`1 buffer, SocketFlags flags, Int32& bytesReceived, Action`5 callback, CancellationToken cancellationToken)
at System.Net.Sockets.SocketAsyncEventArgs.DoOperationReceive(SafeSocketHandle handle, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.ReceiveAsync(SocketAsyncEventArgs e, CancellationToken cancellationToken)
at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketReceiver.WaitForDataAsync(Socket socket)
at Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection.DoReceive()
at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
at System.IO.Pipelines.Pipe.ReadAsync(CancellationToken token)
at System.IO.Pipelines.Pipe.DefaultPipeReader.ReadAsync(CancellationToken cancellationToken)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1Connection.BeginRead(ValueTask`1& awaitable)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequestsAsync[TContext](IHttpApplication`1 application)
That doesn't look like anything to do with the file system unless that's under the hood.
> That doesn't look like anything to do with the file system unless that's under the hood.

Yes, this stacktrace doesn't tell us much. We're interested to find out who registered the first `Socket` for this fd.
About `Handle is already used by another Socket`:

There is a related issue (https://github.com/dotnet/runtime/issues/56750, which probably should be closed). A blog post is mentioned there; in that case, the issue disappeared when removing a 3rd party library: https://zblesk.net/blog/aspnetcore-identity-litedb-breaks-on-ubuntu/.
cc @antonfirsov @karelz
About `Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074`:

Maybe you can make a small reproducer by making `HttpClient` perform a call against `sqs.us-west-2.amazonaws.com:443`.
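A minimal shape for such a reproducer could look like the following; the endpoint, loop count, and logging are assumptions to adapt to the failing environment:

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;

// Hammer one endpoint with a single shared HttpClient and log any SocketException details.
using var client = new HttpClient();
for (int i = 0; i < 1000; i++)
{
    try
    {
        using HttpResponseMessage response = await client.GetAsync("https://sqs.us-west-2.amazonaws.com/");
        Console.WriteLine($"{i}: {(int)response.StatusCode}");
    }
    catch (HttpRequestException ex) when (ex.InnerException is SocketException se)
    {
        Console.WriteLine($"{i}: SocketErrorCode={se.SocketErrorCode}, NativeErrorCode={se.NativeErrorCode}");
    }
}
```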
Having a repro for this would be extremely valuable; this is the 4th report of #56750 if I'm counting correctly.
@antonfirsov it doesn't happen often, but when it does, it is a pain to debug. And when I see this, I wonder: can there still be a bug in `SocketAsyncEngine`...
Thinking out loud.

We could add some envvar which causes the `SocketAsyncEngine` to track, for each `SocketAsyncContext`, the `Environment.StackTrace` that does the registration. Then, when we throw this exception, we could include that stacktrace.

Or we extend the event source logging so it contains the `StackTrace` and the user can try to match the handle values.
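Roughly, the bookkeeping could look like this (a sketch of the idea only, not the actual `SocketAsyncEngine` change; the names are made up):

```csharp
using System;
using System.Collections.Concurrent;

// Remember who registered a given handle first, and surface that stack trace
// when a second registration is attempted.
static class RegistrationTracker
{
    private static readonly ConcurrentDictionary<IntPtr, string> s_firstRegistration = new();

    public static void OnRegister(IntPtr socketHandle)
    {
        string current = Environment.StackTrace;
        string first = s_firstRegistration.GetOrAdd(socketHandle, current);
        if (!ReferenceEquals(first, current))
        {
            throw new InvalidOperationException(
                $"Handle is already used by another Socket. First registered at:{Environment.NewLine}{first}");
        }
    }

    public static void OnUnregister(IntPtr socketHandle) => s_firstRegistration.TryRemove(socketHandle, out _);
}
```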
Do you know, @tmds, if our tracing actually logs the actual OS error or only the translated one? This is the reason why I suggested `strace`, as that gives the raw value from the kernel.
> Do you know, @tmds, if our tracing actually logs the actual OS error or only the translated one? This is the reason why I suggested `strace`, as that gives the raw value from the kernel.
I don't think so. The `NativeErrorCode` is very weird. I hope the dotnet-trace can help us figure out where in the connect it fails. I assume it's not easy to `strace` where this app is running due to the needed privileges. Also, if the error comes from name resolution, it wouldn't show in `strace`.
For name resolution the tracing should be sufficient AFAIK.
Triage: This is not the first time we see problems like these. However, we need more info (logs) to make it actionable. @sliderhouserules are you in a position to work with us here or privately to root cause it in your environment and help us improve .NET Core? We would appreciate any help here. Thanks!
This issue has been marked `needs-author-action` since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.
So... in our testing, the `SocketException` we can get to for logging on the back-end (`Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074` -- that is my log statement) was occurring around 25% to 30% of the time for every outgoing call our code was making. Around 75% or so of the incoming calls hitting Kestrel were logging the warning from the handling pipeline (`Handle is already used by another Socket` -- I don't know how to get access to those exceptions to be able to log more).

When I shelled into one of the containers and tried running `strace`, it would attach but then produce no output even though the app was working, both front-end and back-end (and producing errors/warnings). I'm sure I was doing something wrong. `dotnet-trace` would attach, but almost immediately detach (`Detected a potentially broken trace. Continuing with best-efforts to convert the trace [/snip]`). If I tried running it over and over, it would eventually work and produce some output, but not for long. Again, I'm sure I was doing something wrong.
So I scoured our code again from top to bottom, left to right, and made sure every call was using the `IHttpClientFactory`. I dug into the AWS source code to make sure every call they are making would use it, just to be sure. I then dug in and found that the main log sink/forwarder we were using is using `WebRequest.Create`! I got rid of it and worked with our infrastructure to get Kubernetes forwarding logs from STDOUT. However, this didn't get rid of the errors.

Next, I dug into New Relic's .Net Agent source code. We are using `RecordMetric` and `NoticeError` quite a bit. I had thought that they were using something like RPC from previous trips down that rabbit hole, because their code is confusing. But then I finally found this -- `WebRequest.Create`... Ugh.
So I commented out every call to the NewRelic Agent, every `using`, every NuGet reference. The system has been under load for around 3 hours now and the errors have disappeared. (That's a bit of a lie; Kestrel has generated the warning shown previously a few times in those three hours.)

tl;dr: New Relic's .Net Agent is using `WebRequest.Create`, and stripping NR from the app has remedied the problems. I'm going to file an issue with them.
So what's left here that maybe deserves some further scrutiny?

- The error code on these `SocketException`s isn't part of any TCP error spec that I have found. The message on the exception doesn't include the ErrorCodes at all. Quite unhelpful.
- Providing a full repro that doesn't expose custom logic/credentials may be hard for me to get figured out. We are doing heavy interaction with AWS as mentioned. A Kestrel app, with an `IHostedService` hitting an external resource (GETs, PUTs and POSTs), properly using a shared `HttpClient`, and then anything using `WebRequest` instead of a shared `HttpClient`, should recreate this problem under load. Targeting and testing both .Net 5 and .Net 6 should show a difference in frequency of the errors.
Sorry this is so long. Let me know what I can do further.
May be related to #61798. @MihaZupan was looking into that. I don't see a direct correlation, but they certainly do many unusual things.
> may be related to #61798
If it's about https://github.com/newrelic/newrelic-dotnet-agent/issues/803, NewRelic agent 9.2.0 should contain the fix for that problem.
I can file an issue over on the aspnet board for this, just let me know.
Focusing on the `Handle is already used by another Socket` issue:

I would go with the idea from https://github.com/dotnet/runtime/issues/64305#issuecomment-1023445687 and share a private build of `System.Net.Sockets.dll` which contains some extra logging. @sliderhouserules would you be able to try it? Since neither of us has contact info in their bio, and GH doesn't have direct messages, I have sent an invitation to a private repository named `Repro-64305`; we can collaborate there or just exchange email contacts in an issue.
Yes, I could try that. My email is my github username at gmail.
In relation to the New Relic issue already being fixed: we tried v9.3.0 and it isn't fixed. They're using `WebRequest`, so unless that plays nice with `HttpClient`, they still have the problem.

Anyway, that's separate from you guys, and I've filed an issue on their board that you can see linked above, so I'll follow up with them.
`HttpWebRequest` is built on top of `HttpClient`... unlike .NET Framework, where the relation is the opposite.
Right, but they're not using `HttpWebRequest`, they're using `WebRequest`...
The mentioned `WebRequest.Create` lines will create an `HttpWebRequest`, which is a subclass of `WebRequest`. NewRelic seems to use the synchronous path. What's weird is that the error + stack trace posted in the opening comment doesn't come from that sync path; however, it looks like getting rid of those synchronous NewRelic requests eliminates the weird unknown socket error coming from the bottom of the `MetricsHandler` call.
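For reference, a small illustration of that relationship (the URL is just an example from the thread; `WebRequest.Create` is marked obsolete in .NET 6 via SYSLIB0014 but still works):

```csharp
using System;
using System.Net;

#pragma warning disable SYSLIB0014 // WebRequest.Create is obsolete in .NET 6
WebRequest request = WebRequest.Create("https://api.github.com/");
#pragma warning restore SYSLIB0014

Console.WriteLine(request is HttpWebRequest); // True: http/https URIs produce an HttpWebRequest
// Calling request.GetResponse() here would take the synchronous path mentioned above,
// which on .NET (Core)/5/6 is layered on top of HttpClient internally.
```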
> `Handle is already used by another Socket`:
> I would go with the idea from #64305 (comment), and share a private build of `System.Net.Sockets.dll` which contains some extra logging.
I've implemented the stacktrace tracking in https://github.com/tmds/runtime/tree/socketcontext_stacktrace. Below is the `System.Net.Sockets.dll`, which you can place in a `6.0.101` SDK at `shared/Microsoft.NETCore.App/6.0.1/System.Net.Sockets.dll`. The exception message will include the stacktrace of the first registration.
About `Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074`:

> So I commented out every call to the NewRelic Agent, every using, every NuGet reference. The system has been under load for around 3 hours now and the errors have disappeared.
Perhaps the agent makes the request that causes the socket error, so removing it gets rid of the socket error.
It doesn't mean the agent does something wrong. It can still be an issue with .NET.
The weird `NativeErrorCode` suggests .NET has an issue understanding/dealing with an error that happens during connect.
There is no clear relation (for me) between newrelic/newrelic-dotnet-agent#803 and the unknown socket error.
The stack traces I've provided above are just a couple out of many. Here's one from where I have a catch on a SocketException from a call to SQS, and so I spin up an `HttpClient` using `IHttpClientFactory` and just send a GET to GitHub. When it errors too, this is the stack trace:
System.Net.Http.HttpRequestException: Unknown socket error (api.github.com:443)
---> System.Net.Sockets.SocketException (0xFFFDFFFE): Unknown socket error
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
[my code here]
@sliderhouserules we're most interested in seeing the stacktrace of `Handle is already used by another Socket`.

For the `Unknown socket error` it would be interesting to get the dotnet traces.
This information can also help:

- Do you know if there are calls made against the host names that appear in the stacktrace that succeed?
- Does `curl`/`wget` work against them?
- Can you resolve the addresses for the hostname (e.g. `nslookup api.github.com`)?
- Are there only IPv4 or also IPv6 addresses?

@sliderhouserules can you run an experiment by replacing `System.Net.Sockets.dll` with the one shared in https://github.com/dotnet/runtime/issues/64305#issuecomment-1024019280 and share some logs containing the registration stack trace (https://github.com/tmds/runtime/commit/9cb7381d245de47f9d27a7a2d848a48fd039652d)?

You can share it here or in private; I pinged you on your email, not sure if you saw it.
Sorry for being incommunicado. Had some priorities shift that pulled me to other tasks.
I'm also working through some unrelated errors we're getting with the 3rd party module that we use for the main processing in our application. It is unmanaged code that we call in-process. But... we can't get a straight answer out of our vendor as to whether the module is calling home for authentication. Working through setting up network tracing today, and will keep you posted on that.
I'm trying to get new stack traces from our managed code using the new dll @tmds provided above asap for you all. Shifting priorities mentioned above have had focus move to the pre-dotnet6 version of the app, which does not get these errors so prolifically (actually, not very often...). Haven't had a chance to throw load at it, and there are quite a few changes along with the upgrade to dotnet6, so that part of the cause is hard to pin down and test for well, until I can get some time in a test environment besides my workstation.
Have I mentioned that this only happens on Linux? I have never seen this happen on Windows, but I haven't tested much in the way of load because my workstation is the only place I can test that (I've done sustained testing, like letting it run over the weekend a couple times, but not much load). I'll try to pull some logs today. But yeah, running a Linux container locally I've seen the errors before. I've never seen them on Windows. So there's that.
> For the `Unknown socket error` it would be interesting to get the dotnet traces.
As mentioned, this has been particularly hard to do. I'll keep trying.
> This information can also help:
>
> - Do you know if there are calls made against the host names that appear in the stacktrace that succeed?
> - Does curl/wget work against them?
> - Can you resolve the addresses for the hostname (e.g. nslookup api.github.com)?
> - Are there only IPv4 or also IPv6 addresses?
From what I answered above, a follow up.
Yes, almost all of these calls are to either AWS resources (S3, SQS, STS, etc.) or New Relic. The app would be dead in the water if all those calls were failing.
The app will sometimes recover after 1, 2, 10 retries. It varies. However, sometimes the task retries so many times that my retry logic kills it altogether because it gets stuck.
Actually, this fragment from `TryCompleteConnect` would set the `Unknown` error regardless of the OS failure.

It has existed for a very long time, but the path above with a failed `Interop.Sys.Poll` perhaps has not. I wish we had an error trace there, but we don't. If there are experiments with the instrumented build, it would be nice to know if we are hitting any of the error paths here @antonfirsov.
> Actually, this fragment from TryCompleteConnect would set the Unknown error regardless of the OS failure.
Any async connect that fails gets reported as `SocketError.SocketError`. We need to propagate the real error.
`NativeErrorCode: -131074` is `0x20002`.
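The arithmetic, for anyone staring at these numbers: `-131074` has magnitude `0x20002`, and viewed as an unsigned 32-bit value it is `0xFFFDFFFE`, which matches the `(0xFFFDFFFE)` in the exception header. A quick check:

```csharp
using System;

Console.WriteLine($"0x{131074:X}");                      // 0x20002
Console.WriteLine($"0x{unchecked((uint)(-131074)):X8}"); // 0xFFFDFFFE
```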
`SocketError.SocketError` can also occur when we catch an `Exception` other than `SocketException` here:

which leads to
Shouldn't we persist the native error code instead of mapping it there and back, inevitably losing information? It seems like a nontrivial refactor though, and requires a new `SocketException` constructor, or a change to the current constructor's semantics:
> Shouldn't we persist the native error code instead of mapping it there and back, inevitably losing information?
Yes, we'd like `NativeErrorCode` to be a native error code, and not `0x20002`.
There are a few `Debug` statements around these usages, so I've recompiled the dll I shared earlier in Debug config. Give this a try @sliderhouserules: System.Net.Sockets.dll.tar.gz
Awesome, I'll give it a try asap and let you know.
Didn't want to leave you all hanging...
I'm working to build a small app to repro this so I can hand something off to you that will help more than some random stack traces. For our main app, we've backed off of the big .Net 6 push to a previous version of the code that isn't presenting the problem so far (I have tried, and it doesn't show up, and I've had little chance to test with the new dlls).
I should know within a couple days if I can successfully repro the problem.
I have encountered this 'Unknown socket error' issue extensively in a similar yet slightly different setting.
My team migrated an ancient application with some questionable programming practices from:
- AWS Elastic Compute Cloud instance
- Windows Server 2019
- IIS
- .NET Core 2.2
- Public subnets with an Internet Gateway
to
- AWS Elastic Container Service (Fargate) task
- Amazon Linux 2 instance running Ubuntu 20.04 images provided by Microsoft in a container
- Kestrel
- .NET 6
- Private subnets with no internet access at all
We've also tightened up security extensively in security groups, IAM permissions, and other avenues. So many changes were required that it is difficult to pinpoint where things went wrong.
We started to encounter random 500s (returned by Kestrel directly), 502s (from Application Load Balancer), and 504s (full timeouts, from Application Load Balancer) from our website. There's no obvious cause. All API calls seem to exhibit this behavior, seemingly at random.
Digging in deeper, we found errors like this:
2022-03-01 20:18:00.0408|ip-10-0-101-228|ERROR|System.AggregateException: One or more errors occurred. (Unknown socket error)
---> System.Net.Sockets.SocketException (0xFFFDFFFE): Unknown socket error
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location ---
at System.Net.Sockets.TcpClient.CompleteConnectAsync(Task task)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at MySql.Data.Common.StreamCreator.GetTcpStream(MySqlConnectionStringBuilder settings, MyNetworkStream& networkStream)
at MySql.Data.MySqlClient.NativeDriver.Open()
at MySql.Data.MySqlClient.Driver.Open()
at MySql.Data.MySqlClient.Driver.Create(MySqlConnectionStringBuilder settings)
at MySql.Data.MySqlClient.MySqlPool.CreateNewPooledConnection()
at MySql.Data.MySqlClient.MySqlPool.GetPooledConnection()
at MySql.Data.MySqlClient.MySqlPool.TryToGetDriver()
at MySql.Data.MySqlClient.MySqlPool.GetConnection()
at MySql.Data.MySqlClient.MySqlConnection.Open()
at -my code happens here-
I suspect the problem goes further than purely our MySQL RDS connection, though. The 502s and 504s suggest that sometimes Kestrel cannot properly respond to requests either, as if it is also experiencing similar problems. However, it doesn't log any errors to that effect at least with its default settings.
Unfortunately, I'm not assigned to this project any longer and can't afford to spend work hours debugging this issue. However, this has made me curious enough that if the project's new team hasn't solved it somehow by the weekend, I'll try out the modified DLL or see if I can get a dotnet-trace, as I still have access to the environment (and I'm sure they would greatly appreciate any help they could get, since they're pulling their hair out over this issue too).
> I'll try out the modified DLL or see if I can get a dotnet-trace
:+1:, that will help us move this issue forward.
Do you have a .Net 5-compatible version of that dll? I had to abandon the work that had seemed to dead end with the SocketException issue, and we went back to the .Net 5 version of the app and moved forward from there, and that seemed to fix the issue. But now we're trying to make the final push into production with our project and the SocketExceptions are back.
If it would be better to open a formal support ticket with MS so that we can share our actual app with you, then we can do that.
> Do you have a .Net 5-compatible version of that dll?
.NET 5 is supported for only 2 more months. It would be best if we can reproduce it with .NET 6.
> If it would be better to open a formal support ticket with MS so that we can share our actual app with you, then we can do that.
I'm not a Microsoft employee, so that would exclude me from the investigation.
We should start by trying the dll and see if that makes us wiser.
Ah, I see. I'll get you some results as soon as I can. Thanks.
I was having trouble getting your DLL to work, so I gave up for a while, but then I realized I'm dumb and had downloaded the first version of your DLL.

The latest one you posted works perfectly to give a more direct stack trace. It's a hell of a lot more useful than the vague 'unknown socket error' results anyway, for which Google will lead you exclusively to this thread.
2022-03-10 15:24:39.4513|ip-10-0-136-140|ERROR|JHTNQS: System.Net.Sockets.SocketException (125): Operation canceled
at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName)
at System.Net.Sockets.Socket.get_Available()
at MySql.Data.MySqlClient.MySqlStream.SendPacket(MySqlPacket packet)
at MySql.Data.MySqlClient.NativeDriver.CloseStatement(Int32 statementId)
at MySql.Data.MySqlClient.Driver.CloseStatement(Int32 id)
at MySql.Data.MySqlClient.PreparableStatement.CloseStatement()
at MySql.Data.MySqlClient.MySqlCommand.Dispose()
at -my code- (a much more direct and distinct trace here too)
We've tried dozens and dozens of things at this point to narrow down what's wrong, but having a new error message to look at will help a huge amount.
This still doesn't explain to me why Kestrel instantly returns errors back to our Application Load Balancer like crazy after one of these errors happens. We get -one- instance of this error in the logs and then the entire site becomes unusable as Kestrel bounces all traffic.
@sliderhouserules Trying to find a few more commonalities between us. Do you make RDS connections? Do you use an S3 client? How sync-based is your code (this legacy code is very heavily sync and awaits or .Results any async)? Do you use any custom middleware (we use a few)? Do you make any POST requests to external services with your IHttpClientFactory?
Basic breakdown of our application:
What we're dealing with at this point:
Incriminating details:
We cannot remove New Relic, as we would lose all telemetry insight into our application, so we are living with mitigation (as robust as we can make it) through retries and resiliency via SQS replays; our error rate as such is around 0.1% once all retry logic has failed. For now we are living with this and just pushed to Beta in Production yesterday.
I have a few other things on my TODO list before I can get back to trying to deal with these SocketExceptions, and will definitely either revisit this thread, or will open an official support ticket with Microsoft for this issue (and will update here with any results/findings).
> We get -one- instance of this error in the logs and then the entire site becomes unusable as Kestrel bounces all traffic.

Basically the same here. Once the application encounters a hard SocketException (not a Kestrel warning), it all goes to pot and we have to have Kubernetes shut down the pod.
Holy crap, my teammate solved it.
This was the entire fix:
I'm still in disbelief about it.
How? Why? How did this code function perfectly for 100s of requests and then suddenly stop, ruining all traffic to and from the service? Why would it sometimes recover? Why was it crazily intermittent? What even IS a SafeHandle? I've never used one, personally.
Why did this function on Linux at all if it literally says Win32 in the file name?! Why did it sometimes stop functioning?! Why was it even more intermittent back in .NET Core 2.2?!
But beyond the questions, what I have most right now is joy. I'm so happy to be done with this. Thank God.
> Why did this function on Linux at all if it literally says Win32 in the file name?!
That naming stems from 20 years ago when .NET Framework was introduced and ran solely on Windows. The name may still include Win32, but almost everything in that namespace was made to function well on Unix as well. We just can't change the name for compatibility reasons.
> What even IS a SafeHandle?
It wraps a handle on Windows or a file descriptor on Unix. It basically provides a finalizer that can close such a handle/file descriptor if you forget to Dispose of it (i.e. close it) and it ensures that Dispose'ing it doesn't actually close the underlying handle while the SafeHandle is still in use by someone else, typically in interop scenarios.
> `new SafeFileHandle(IntPtr.Zero, true)`

0 is a valid file descriptor, typically stdin. By creating a SafeFileHandle for file descriptor 0, when you then called handle.Dispose, you were calling `close(0)`, effectively closing stdin.
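A small sketch of that failure mode (Unix-only behavior; the second half simply shows the non-owning alternative):

```csharp
using System;
using Microsoft.Win32.SafeHandles;

// ownsHandle: true means "close this descriptor when disposed/finalized".
// On Unix, descriptor 0 is stdin, so disposing this wrapper calls close(0).
var bogus = new SafeFileHandle(IntPtr.Zero, ownsHandle: true);
bogus.Dispose(); // closes stdin; descriptor 0 can later be handed out again, e.g. to a socket

// If a pre-existing handle value must be wrapped, pass ownsHandle: false so that
// disposing the wrapper never closes the underlying descriptor.
var borrowed = new SafeFileHandle(IntPtr.Zero, ownsHandle: false);
borrowed.Dispose(); // safe: does not call close(0)
```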
But on Windows presumably that can't happen? Since this code ran on Windows fine for years.
I don't know much more than you stated above about SafeHandles, but are there any improvements the .NET team could make here to provide a warning when initializing a SafeHandle with a value of 0? Something like:
'Warning: using 0 for SafeHandles and then disposing them may result in improperly terminating input streams in Unix.'
Not sure if that makes sense (if there's a lot of valid use cases it could annoy people), but anything that could be done to assist anyone else in the world out there who might happen to be as unfortunate as myself with an issue so obscure would be greatly appreciated!
Description
We have an application based on the `dotnet new webapi` project template that just adds an IHostedService to do background processing. Kestrel is used in order to be able to answer health checks from Kubernetes, as well as feed new requests into the service to be queued. Our background processing logic involves pretty heavy interaction with AWS resources (SQS, S3, STS), and we are encountering SocketExceptions that have been very difficult to pin down.
The canonical guidance to use a static HttpClient instance application-wide does not work well when using the AWS clients in Amazon's SDK packages. The various AWS client constructors, and the config objects you can hand in, do not accept an instance of HttpClient. Instead, you must either rely on their built-in caching of the HttpClient instance(s) that are created internally, or you have to provide an implementation that derives from their HttpClientFactory. This is fairly straightforward if you use IHttpClientFactory in .Net: the Create() method override that you implement simply calls .Create() on an instance of IHttpClientFactory provided via DI.
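For reference, a sketch of that factory shim; the AWS-side type and member names here (`Amazon.Runtime.HttpClientFactory`, `CreateHttpClient`, and the `HttpClientFactory` property on the client config) should be checked against the SDK version in use, and the named client is arbitrary — the point is just to delegate to `IHttpClientFactory` from DI:

```csharp
using System.Net.Http;
using Amazon.Runtime;

// Delegates AWS SDK HttpClient creation to the pooled handlers behind IHttpClientFactory.
public class PooledAwsHttpClientFactory : Amazon.Runtime.HttpClientFactory
{
    private readonly IHttpClientFactory _factory;

    public PooledAwsHttpClientFactory(IHttpClientFactory factory) => _factory = factory;

    public override HttpClient CreateHttpClient(IClientConfig clientConfig) =>
        _factory.CreateClient("aws"); // "aws" is an arbitrary named-client choice
}

// Hypothetical usage: new AmazonSQSConfig { HttpClientFactory = new PooledAwsHttpClientFactory(factory) }
```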
However, even when providing this HttpClientFactory to all instances of any AWS client that is instantiated anywhere and everywhere in this service, it does not solve the SocketException problem. I will provide the stack trace below, but the properties of the actual SocketException object are extremely unhelpful:
`Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074`
The twist here is that these errors have only been encountered on Linux/in containers. Extended running tests on Windows cannot reproduce the problem when tested under load.
The other twist is that our Kestrel calls are producing `System.InvalidOperationException: Handle is already used by another Socket.` errors (they are logged at Warning level) in Kestrel's handling pipeline. It would appear that Kestrel and the HttpWebRequest pool used underneath IHttpClientFactory are stomping on each other. If I am mischaracterizing the issue and there is a different cause, please let me know.

The fact that the runtime is throwing a SocketException with an error code that isn't even in the spec for TCP errors (nice article here) is the reason I am bringing it to this group. The runtime is burping up what would appear to be the equivalent of the `default` case in a `switch` statement.

Any help or guidance here is appreciated.
Reproduction Steps
The basic structure is a Web API project created with `dotnet new webapi` with an `IHostedService` doing background processing, interacting heavily with AWS resources. I can provide application code privately, as needed, so as to not have to share proprietary/private logic and access keys publicly.

Expected behavior
Interaction with AWS resources from a web application should not cause SocketExceptions.
Actual behavior
We are encountering SocketExceptions that have error codes that aren't even part of the TCP specs:
`Unknown socket error; ErrorCode: -131074; SocketErrorCode: SocketError; NativeErrorCode: -131074`
Full stack trace:
Regression?
The application in its current form is a .Net 6 solution. In a prior version, it was .Net 5 and the SocketExceptions were very rare if not non-existent.
However, even in the .Net 5 solution, Kestrel was throwing the `System.InvalidOperationException: Handle is already used by another Socket.` warnings when handling basic health check calls.

Known Workarounds
We have implemented heavy retry policies using Polly, which allows the application to move past the errors, but it quite often takes a near complete restart of the background processing logic to clear up the SocketExceptions and allow the application to restart a job and create new connections to the AWS resources being used.
Configuration
The errors only appear running in containers on Linux.
Other information
N/A