@geoffkizer any idea? Is it Windows? What is the hex HResult? Did you look it up?
It's a Windows App Service, yes - the HResult is https://errorcodelookup.com/?type=hresult&code=80004005 (E_FAIL)
I just had this happen in production again on a fresh deploy, and had to restart the process, whereupon socket operations worked again. If there's anything I could or should be doing to enable some advanced logging or diagnostics - with .NET Core or Azure - let me know. Obviously I'm keen to solve this - it keeps taking out production!
I am not sure we have such advanced production diagnostics yet. @wfurt @davidsh @geoffkizer can you please advise what kind of logs to collect to troubleshoot this?
Is there any chance you can reproduce this when running under strace, @kierenj? That would tell us what is going on at the OS level (strace -f -t -e trace=network app).
Any chance you are running into system limits - like open descriptors per process?
I was also wondering about overall TCP sockets, but that would not change if you restart the process.
If it fails again you can use lsof or use /proc/
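For reference, a hedged sketch of one way to watch the descriptor count from inside the app itself; the class and method names are illustrative, the count is approximate, and on Windows HandleCount covers all handles, not just sockets:

```csharp
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

static class DescriptorCheck
{
    // Approximate count of open descriptors/handles for the current process.
    // On Linux this lists /proc/self/fd; on Windows it falls back to the
    // process handle count (which includes non-socket handles).
    public static int OpenDescriptorCount()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
            return Directory.GetFileSystemEntries("/proc/self/fd").Length;

        return Process.GetCurrentProcess().HandleCount;
    }
}
```

Logging this periodically (or when socket operations start failing) would show whether the process is creeping toward a per-process limit.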
I'm on Windows - I used Azure tools to check the number of TCP connections and it wasn't particularly high (1000-ish?) - as above :)
@geoffkizer @stephentoub
Type name: System.Net.Sockets.SocketException
.HResult = -2147467259
.Message = "Unknown error (0xffffffff)"
.NativeErrorCode = -1
.Source = "System.Private.CoreLib"
Do you know where -1 could be returned from our Sockets code? Normally we would get a positive error code number from winsock.
Sorry to be a pest - but this one just hit us in production at prime-time again, another process restart was needed whereby everything started running again. Is there anything I/we can do to help track this down?
@kierenj I don't think we have actionable info. First step would be to get environment where you can sort of reproduce it (even if it takes days to hit it there). Otherwise we will have to come up with some kind of logging (potentially not existing yet) to track down more ...
It's happened on other non-production projects too in the past - if it was to reoccur and we didn't need to restart the process, is there anything I could do at that point - share enough detail to enable one of the team to jump on with a remote debugger?
Would it be any use for me to try to raise with the Azure team, in case it's an Azure thing, not a .NET Core thing?
Nothing comes to mind. After it happens it may be already too late, so remote debugger session may not be the answer anyway. Do your projects have anything in common? Given that you are the only person reporting this, I wonder if it may be some component in your projects causing it ... or you may be just lucky :( Given that it is unclear what exactly we're tracking down, I don't know who else to loop in to help ...
My general approach in mysterious things like this one, is to get some sort of repro, then instrument the binaries (Socket in this case) to gather additional info.
They all use our own framework (built on top of ASP.NET Core), and they all use SE.Redis. Maybe I'll try something like having one of these apps try some socket ops and, if they don't fail, killing the process so it's restarted, until it does fail... Then I could try swapping bits out to narrow it down.
I'll try an Azure ticket too perhaps. I wonder if there could be some low-level anti-DDoS protection in socket code or similar that we might be falling afoul of.
I had a quick look through the sockets code to see where a -1 might come from; maybe I'll do some more digging there. But if you guys aren't sure, I'm not so confident :)
I have a similar issue running Lambda in AWS. The root cause was that I created an HttpClient for every request. It was fixed by reusing one HttpClient instance. You may want to check whether your code creates too many HttpClient instances.
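For anyone hitting the HttpClient-per-request pattern described above, a minimal sketch of the fix - a single shared instance (in ASP.NET Core, IHttpClientFactory is the usual alternative); the class name is illustrative:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class ApiCaller
{
    // One instance for the lifetime of the process; HttpClient is thread-safe
    // for sending, and reusing it lets the underlying connection pool and
    // sockets be reused instead of exhausted.
    private static readonly HttpClient Client = new HttpClient();

    public static Task<string> GetAsync(string url) => Client.GetStringAsync(url);
}
```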
Interesting, we do use separate instances, but I can't understand how this would manifest as a fresh process either immediately and permanently having this socket exception on both HTTP and Redis connections, or never having any problems at all, with each restart rolling the dice once more?
Any ideas how this would manifest with this error? Wouldn't it simply mean not enough free client ports being available (which I figure would be a different exception to this) due to closed sockets remaining in CLOSE_WAIT? Don't CLOSE_WAIT sockets persist when processes are closed (since it's part of the TCP spec at a fairly low level)? (A quick way to snapshot connection states is sketched below.)
Would you be able to share the exception/stack trace you received @nixinwang ?
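On the port-exhaustion question above, a minimal sketch that counts TCP connections by state using System.Net.NetworkInformation (machine-wide, not per-process); the class name is illustrative:

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

public static class TcpStateSnapshot
{
    // Prints a count of active TCP connections grouped by state
    // (Established, TimeWait, CloseWait, ...), similar to what netstat shows.
    public static void Print()
    {
        var connections = IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        foreach (var group in connections.GroupBy(c => c.State))
            Console.WriteLine($"{group.Key}: {group.Count()}");
    }
}
```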
As an update - I've captured some profiler traces and sent them to Azure support, who are investigating.
We can repro this easily enough on our production systems here. I'd be happy to jump on a screen-share or allow access if anyone here wanted to dig in, or look at our code.
@davidsh may have some recommendations what to look for.
Triage: We made lots of sockets changes in 3.0. There is a chance this is addressed. Can you please try it on 3.0?
This was intermittent and occurs only very rarely. We don't have the option of upgrading to 3.0 in this instance, unfortunately. (I'm happy to close the issue if you'd like - it is still a problem, but we automatically detect and restart the worker process when it occurs)
@karelz I have a few 3.0 upgraded services and still encounter this error
@vsadams do you have the ability to reproduce it and diagnose deeper? Ideally a small transferable repro if possible ...
I do not. It seems to be very sporadic: some days we will get hundreds, some days we will get zero. We cannot reliably reproduce it.
@vsadams are you running on an Azure App Service, as I am/was too?
@kierenj I am. Windows app service running webjobs and an asp.net core api.
I am afraid there is not much we can do at this moment. I think we would need someone able to try it out on .NET Core 3.1, and possibly on custom .NET 5 builds (self-contained) with additional logging, to discover what is going on. Nothing else comes to mind. Closing for now as suggested above -- we have 2 people hitting it. If anyone is in a position to experiment, please let us know and we can try to get to the bottom of this ...
I am having the same issue. I am running a .net core app as background service in Ubuntu server. Please check the stack trace below.
Unknown error -1) ---> System.Net.Http.HttpRequestException: Unknown error -1 ---> System.Net.Sockets.SocketException: Unknown error -1
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
at System.Threading.Tasks.ValueTask`1.get_Result()
at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Threading.Tasks.ValueTask`1.get_Result()
at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
at System.Threading.Tasks.ValueTask`1.get_Result()
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
What version are you on @buddalasunil999? If you can reproduce it, can you check the OS call with strace and get the errno?
@wfurt I am using .NET Core 2.1. This is intermittent. It might be happening at times when we're making more HTTP requests.
That may be https://github.com/dotnet/runtime/issues/28630, which was fixed in 3.0 - strace would tell for sure. You should give it a try with 3.1.
The original (Windows) issue would seem to be unrelated, unfortunately. We have another project showing the error now, but on this project it's interspersed with some SQL exceptions carrying a different Win32 error:
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.) ---> System.ComponentModel.Win32Exception (10055): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
(Relevant bit: System.ComponentModel.Win32Exception (10055): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full)
I don't know if this could be a clue ..?
I had this same issue today and a Google search brought me here.
On my workstation I have an ActiveMQ Artemis broker running (just http://localhost:5672) and was developing a .NET Core 3.1 CLI app with the MassTransit.ActiveMQ library.
After work, back at home, I continued working and suddenly the .NET Core app could not connect to the broker anymore. I got this same weird System.Net.Sockets.SocketException with "Unknown error (0xffffffff)" in "System.Private.CoreLib".
I fixed it by closing my VPN connection (which I had running for other work stuff on the network drive) and restarting the .NET Core app.
Perhaps this is also helpful information.
If this happens to anybody, it would be great to set a breakpoint on ConnectEx and check the return code from the call. That would allow us to triage whether something is wrong at the .NET layer or this is caused by some underlying OS condition. At the same time, I would suggest checking the TCP table and maybe the system/security logs.
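If a native breakpoint isn't an option, one managed-side alternative (my suggestion, not something proposed in the thread) is to subscribe to the runtime's "System.Net.Sockets" EventSource, which exists in .NET Core 3.0+; the exact events and payloads vary by runtime version, and the class name here is illustrative:

```csharp
using System;
using System.Diagnostics.Tracing;

public sealed class SocketEventLogger : EventListener
{
    // Enable the runtime's socket telemetry as soon as its EventSource appears.
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Net.Sockets")
            EnableEvents(eventSource, EventLevel.Verbose, EventKeywords.All);
    }

    // Dump each event name and payload so connect failures show their details.
    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload == null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventName}: {payload}");
    }
}
```

Keeping a single instance alive for the process lifetime (e.g. var listener = new SocketEventLogger(); at startup) is enough to start receiving events.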
.NET Core SDK (3.1.201)
I'm hitting this with all new code (AWS head/upload) on CentOS around 6 times out of 10 - the problem is pretty persistent.
It seems that the issue is strongly related to the rate of task creation. Also, it appears that the runtime in general has some bugginess and can get stuck outright, spinning on CPU, when a large number of tasks are created in rapid succession.
I have implemented 2 relatively simple programs, in Go and C#, both uploading a large number of files to S3 using the corresponding SDKs (one upload per task/goroutine). Go can easily max out the S3 service, hitting the service RPS limit. .NET Core can get stuck. :-)
Things appear to work ok when I introduce semaphore based limiters here and there.
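A minimal sketch of the kind of semaphore-based limiter described above, capping how many uploads run concurrently so tasks (and sockets) aren't created faster than they can be serviced; UploadOneAsync is a placeholder for the real S3 SDK call:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledUploader
{
    public static async Task UploadAllAsync(IEnumerable<string> files, int maxConcurrency = 16)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = files.Select(async file =>
        {
            await gate.WaitAsync();                 // block task start beyond the cap
            try { await UploadOneAsync(file); }     // placeholder for the actual upload
            finally { gate.Release(); }
        }).ToList();

        await Task.WhenAll(tasks);
    }

    private static Task UploadOneAsync(string file) => Task.CompletedTask; // stub
}
```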
@oakad, repro? And why do you think it's related to this closed issue?
Because it's exactly the error I was getting. Same location, same message.
Can you share your repro? I'm confused by the description of the problem: this issue is about an exception, but I interpret your description to be about a hang ("gets stuck").
Here's a stack trace - version .NET Core SDK (3.1.201), as I said.
As for a repro, it's a bit tricky.
Exceptions were encountered: System.AggregateException: One or more errors occurred. (Bad value for ai_flags)
---> System.Net.Http.HttpRequestException: Bad value for ai_flags
---> System.Net.Sockets.SocketException (0xFFFFFFFF): Bad value for ai_flags
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RedirectHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.S3.Internal.AmazonS3ResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.S3.Internal.AmazonS3ExceptionHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)
My other comment was somewhat unrelated. If we look at the issue at hand, that is, the socket error, it appears to be strongly bound to the rate at which those AWS SDK async tasks are created.
My other problem was apparently due to memory pressure - it was connected to the OS OOM handler being invoked (but no internal OOM exception).
I also encountered a similar problem today, both on .NET Core 3.1 and .NET 5.0 Preview 7, while running Ubuntu 20.04 LTS.
I was sending an HTTP GET request to http://www.example.com/Pages/Home.aspx (not actually example.com) and received back a 302 status code with Location: https://www.example.comPages/Home.aspx, which is obviously a bug in the endpoint I was communicating with. However, the HttpWebRequest class that I used for executing HTTP requests had AllowAutoRedirect = true set, so it tried to follow the redirect. That either caused the low-level getaddrinfo syscall to fail or something to that effect. Whatever the low-level reason was, it resulted in .NET throwing a System.Net.WebException: Bad value for ai_flags. (A workaround sketch follows the stack trace below.)
Full exception stack trace:
System.Net.WebException: Bad value for ai_flags Bad value for ai_flags ---> System.Net.Http.HttpRequestException: Bad value for ai_flags ---> System.Net.Sockets.SocketException: Bad value for ai_flags
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
at System.Net.HttpWebRequest.SendRequest()
at System.Net.HttpWebRequest.GetResponse()
--- End of inner exception stack trace ---
at System.Net.HttpWebRequest.GetResponse()
at <REDACTED_LOCAL_METHOD>() in <REDACTED_LOCAL_CLASS>
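For the redirect scenario above, a hedged workaround sketch: disable automatic redirects and validate the Location header yourself before following it, so a malformed value never reaches the DNS layer. This uses HttpClient rather than the HttpWebRequest shown in the trace, and the class name is illustrative:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SafeRedirectClient
{
    private static readonly HttpClient Client =
        new HttpClient(new HttpClientHandler { AllowAutoRedirect = false });

    public static async Task<HttpResponseMessage> GetAsync(string url)
    {
        var response = await Client.GetAsync(url);

        // Follow a 3xx redirect only if the Location header looks sane.
        if ((int)response.StatusCode >= 300 && (int)response.StatusCode < 400 &&
            response.Headers.Location is Uri location)
        {
            Console.WriteLine($"Redirected to: {location}");
            if (location.IsAbsoluteUri)
                return await Client.GetAsync(location);
        }

        return response;
    }
}
```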
Type name: System.Net.Sockets.SocketException
.HResult = -2147467259
.Message = "Unknown error (0xffffffff)"
.NativeErrorCode = -1
.Source = "System.Private.CoreLib"

The stack trace for the outer exception:
We started receiving this error a few hours ago (after weeks / months of no problem), and a process restart was required to fix it. It occurred on every outbound HTTP request until the process was killed. Process memory use was not particularly high.
Azure App Service. Kudu versioning:
Other versions:
From csproj:
I checked the TCP Connections diagnostics tool on Azure - there were only ~1000 connections being used (so I guess not port exhaustion?).
Interestingly - we're using StackExchange.Redis and at the time of the first exception, I do see that there was a connection error with Redis too.
What can I do to assist? The problem occurred in production, and while I have app-level logging, there's not much else to go on.
Edit: the issue just re-occurred on the same server, after the process restart