JeringTech / Javascript.NodeJS

Invoke Javascript in NodeJS, from C#

Intermittent error: An attempt was made to access a socket in a way forbidden by its access permissions #96

Closed JohnElliffe closed 3 years ago

JohnElliffe commented 3 years ago

We are getting the following issue intermittently:

An invocation attempt failed. Retries remaining: 5. Exception: System.Net.Http.HttpRequestException: An attempt was made to access a socket in a way forbidden by its access permissions.
 ---> System.Net.Sockets.SocketException (10013): An attempt was made to access a socket in a way forbidden by its access permissions.
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DiagnosticsHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Jering.Javascript.NodeJS.HttpNodeJSService.TryInvokeAsync[T](InvocationRequest invocationRequest, CancellationToken cancellationToken)
   at Jering.Javascript.NodeJS.OutOfProcessNodeJSService.TryInvokeCoreAsync[T](InvocationRequest invocationRequest, CancellationToken cancellationToken)

We have a .NET Core 3.1 web application running on an Azure Windows App Service server (Web sockets Off, Node.js version 14.15.0). As well as hosting our website, this server uses JeringTech Javascript.NodeJS to run JavaScript that creates PDFs. Currently it runs well for around 5 days and then throws the error above, and we need to restart the app server to get things going again. The Microsoft NodeServices package gave the same behaviour. Our configuration is simple; Startup.cs contains: services.AddNodeJS(); services.Configure(options => options.NumRetries = 5);
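In full, the relevant part of ConfigureServices looks roughly like this (the Configure call's generic type argument got stripped by the issue formatting; OutOfProcessNodeJSServiceOptions is assumed, since that is the options class NumRetries belongs to):

```csharp
using Jering.Javascript.NodeJS;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
    services.AddNodeJS();

    // Retry each failed invocation up to 5 times.
    services.Configure<OutOfProcessNodeJSServiceOptions>(options => options.NumRetries = 5);

    // ...the rest of the web app's service registrations.
}
```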

JeremyTCD commented 3 years ago

Hey thanks for reporting the issue!

At present the library uses HTTP requests to communicate with Node.js, and with this approach there have been several reports of intermittent socket problems.
Having looked at these issues, and considering that they also occur with Microsoft's INodeServices, I think the root cause is likely in .NET/Node.js framework logic, which is beyond what we can realistically fix.

Short of updating the library to use some alternative protocol like WebSockets or gRPC, the best workaround for now is to improve our retry logic. You mention that you have to restart the application after a failed invocation:

An invocation attempt failed. Retries remaining: 5.

Does the program crash/hang after this? Or does it count down retries till all are exhausted? If it crashes, are any further exceptions logged?

Edit: I'm updating the library to use HTTP/2 by default. Might solve socket issues since HTTP/2 multiplexes parallel requests over the same connection.
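For context, this is roughly what opting an HttpClient into HTTP/2 looks like on .NET Core 3.x (a generic sketch, not the library's actual implementation):

```csharp
using System;
using System.Net;
using System.Net.Http;

// .NET Core 3.x only allows HTTP/2 over TLS by default; this switch permits
// unencrypted ("prior knowledge") HTTP/2, which is what a localhost connection
// to a Node.js process would use.
AppContext.SetSwitch("System.Net.Http.SocketsHttpHandler.Http2UnencryptedSupport", true);

var client = new HttpClient
{
    // Ask for HTTP/2 on every request so parallel invocations are multiplexed
    // over a single TCP connection instead of opening one socket per request.
    DefaultRequestVersion = HttpVersion.Version20
};
```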

JohnElliffe commented 3 years ago

Thanks for replying Jeremy.

We have NumRetries set to 5. The Node.js service tries 5 times, logging the exception each time, and then stops as expected. Subsequent calls to this function also try 5 times and stop, until we restart the app server; then everything is OK till the next time. Would there be any point in increasing the retry number?

We will try using MultiProcess, but I am worried this will hide the problem rather than fix it.

JeremyTCD commented 3 years ago

Sure. Socket issues have been the bane of this library, and I'm happy to get any additional information that could help resolve them.

Anyway I updated my comment:

I'm updating the library to use HTTP/2 by default. Might solve socket issues since HTTP/2 multiplexes parallel requests over the same connection.

I suspect switching to HTTP/2 has a non-negligible chance of fixing socket issues. Now that .NET Core 3 supports HTTP/2, it's at least worth a try.

Will publish a new package today; if issues persist after this change, we can look more closely at retries.

JohnElliffe commented 3 years ago

Thanks Jeremy, we will give the new package a try when it is ready.

JeremyTCD commented 3 years ago

Hey, I've published 6.0.0-beta.0. It uses HTTP/2 to communicate with Node.js. Let me know how it goes!

JohnElliffe commented 3 years ago

We have deployed 6.0.0-beta.0 to our Dev and Test environments. I will keep an eye on it over the coming week.

JeremyTCD commented 3 years ago

Hi @JohnElliffe have you had any trouble with 6.0.0-beta.0?

JohnElliffe commented 3 years ago

Hi @JeremyTCD, I was just about to let you know that we haven't had a problem with 6.0.0-beta.0 when we received:

An invocation attempt failed. Retries remaining: 8. Exception: System.Net.Http.HttpRequestException: An attempt was made to access a socket in a way forbidden by its access permissions.
 ---> System.Net.Sockets.SocketException (10013): An attempt was made to access a socket in a way forbidden by its access permissions.
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.GetHttp2ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.DiagnosticsHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Extensions.Http.Logging.LoggingScopeHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Jering.Javascript.NodeJS.HttpNodeJSService.TryInvokeAsync[T](InvocationRequest invocationRequest, CancellationToken cancellationToken)
   at Jering.Javascript.NodeJS.OutOfProcessNodeJSService.TryInvokeCoreAsync[T](InvocationRequest invocationRequest, CancellationToken cancellationToken)

After killing the Node process, processing begins again as normal.

I note that this call passes through some Microsoft.Extensions.Http.Logging methods. Could our problems be related to logging? We use console.log to log within the JavaScript, and Azure logging from C#.

JeremyTCD commented 3 years ago

Thanks for the information!

I don't think it's logging.

This library communicates with Node.js over an HTTP connection. The Node.js process starts an HTTP server on a port of its choice, and the C# process sends messages to it. The SocketException you're getting suggests that the port we're using to communicate with Node.js is either in use by another process or blocked (could be by your firewall or OS).
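As a quick sanity check on your side (this isn't part of the library, and the port number below is just a placeholder for whatever port the NodeJS process is using), you can ask .NET whether anything is already listening on a given local port:

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

public static class PortCheck
{
    // Returns true if some process on this machine already has a TCP listener on the port.
    public static bool IsPortInUse(int port)
    {
        return IPGlobalProperties.GetIPGlobalProperties()
            .GetActiveTcpListeners()
            .Any(endpoint => endpoint.Port == port);
    }

    public static void Main()
    {
        // Placeholder port; substitute the one your NodeJS process is bound to.
        Console.WriteLine(IsPortInUse(12345));

        // Note: on Windows, a port can also be unavailable because it falls inside an
        // excluded range; `netsh interface ipv4 show excludedportrange protocol=tcp`
        // lists those ranges.
    }
}
```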

I'm shooting in the dark here, but these seem like plausible causes of your exception:

  • The Node.js server died and was restarted on a port that is blocked.
  • Something is changing port permissions on your system, i.e. the Node.js server didn't die but the port it was using got blocked.

I'll try 2 things:

  • Adding an option to retry failed invocations in a new Node.js process, so a stuck process doesn't have to be killed manually.
  • Logging the port used to communicate with Node.js, so we can tell whether it changes when the issue occurs.

JohnElliffe commented 3 years ago

Thanks, those will be a couple of useful improvements. From my experience, if you are able to kill the Node process, the next TryInvokeCoreAsync invocation will create a new one.

JeremyTCD commented 3 years ago

Hey, I've published 6.0.0-beta.2. I've added a new option, OutOfProcessNodeJSServiceOptions.NumProcessRetries. Basically, if retries in the original NodeJS process fail, a new process is created and invocations are retried there, so you don't need to kill NodeJS processes manually:

https://github.com/JeringTech/Javascript.NodeJS/blob/b96c2f9f24291b79422e83e30ca9843cfd780a31/src/NodeJS/NodeJSServiceImplementations/OutOfProcess/OutOfProcessNodeJSServiceOptions.cs#L27-L39
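Enabling it from Startup.cs looks something like this (a sketch; the values are just examples, tune them to your workload):

```csharp
using Jering.Javascript.NodeJS;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
    services.AddNodeJS();
    services.Configure<OutOfProcessNodeJSServiceOptions>(options =>
    {
        options.NumRetries = 5;        // retries within the current NodeJS process
        options.NumProcessRetries = 1; // if those all fail, retry in a fresh NodeJS process
    });
}
```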

Also, on connecting to a NodeJS process, the port number is now logged at LogLevel.Information.
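If those messages don't show up in your logs, make sure the library's categories are allowed through at Information level, e.g. (assuming the default logger category names, which follow the Jering.Javascript.NodeJS namespace):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

public void ConfigureServices(IServiceCollection services)
{
    services.AddLogging(builder =>
    {
        // Surface the library's Information-level messages (e.g. the NodeJS endpoint/port on connect).
        builder.AddFilter("Jering.Javascript.NodeJS", LogLevel.Information);
    });
}
```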

Let me know how it goes!

JohnElliffe commented 3 years ago

Great, thanks Jeremy, I'll give the new version a try.

JohnElliffe commented 3 years ago

Hi Jeremy, your latest change works well. This morning we had the error invoking Node. After the number of retries was exhausted, a connection was made on a new port and the request was processed successfully. Thank you very much for all your help.

JeremyTCD commented 3 years ago

Good to hear! Thanks for the update.

Any clues on what went wrong with the port?

JeremyTCD commented 3 years ago

Closing since resolved. If the port issue can be avoided/mitigated through changes to this library, please open a new issue.

johnrom commented 3 years ago

@JeremyTCD some extra context here from another person experiencing this issue. I am not using this package, but I'm using NodeServices, which this is based on. We started running into this issue very frequently lately on App Service, while it was rare before. It's not because of a restart -- or at least the .NET server doesn't think it should be using a new port. It generally happens early in the morning, when the website goes from little-to-no traffic and ramps up to normal traffic.

I'm shooting in the dark here, but these seem like plausible causes of your exception:

  • The Node.js server died and was restarted on a port that is blocked.
  • Something is changing port permissions on your system, i.e. Node.js server didn't die but the port it was using got blocked.

As you can see in the Log Analytics query below, requests are successful on a given port for a while in the early morning, and then stop when the website starts to ramp up to normal traffic. We've had this same set of circumstances twice. I'm not sure whether the Node server restarted and moved to a new port without the web server updating its port, whether the Node server restarted but was unable to bind to the same port, or whether the OS randomly decided to block these requests for some reason. I don't see any traces from NodeServices, unfortunately; maybe they use the Debug log level or don't log restarts at all.

Right before the issue starts, there was a period of 5 minutes without any requests to the Node server.

(Image: Log Analytics query results with columns Time, success, target, count.)

We'll be looking into migrating to this package sometime soon to acquire your fix above, so maybe I'll be able to get more info as to the root cause during that time.

JeremyTCD commented 3 years ago

Hey! Thanks for the extra context. Did some Googling.

Investigating Socket exception for ASP.NET applications in Azure App Service:

When an application initiates an outbound connection to a database or a remote service, it uses a TCP connection from a range of allowed TCP connections on the machine. In some scenarios, the number of available outbound ports might get exhausted, and when the application tries to initiate a new connection request, it might fail with this error:

Exception Details: System.Net.Sockets.SocketException : An attempt was made to access a socket in a way forbidden by its access permissions.

COMMON REASONS THAT CAUSE THIS ERROR:

  • Using client libraries which are not implemented to re-use TCP connections.

  • Application code or the client library is leaking TCP socket handles.

  • Burst load of requests opening too many TCP socket connections at once.

  • In the case of a higher-level protocol like HTTP, this is encountered if the Keep-Alive option is not leveraged.

The above link provides a way to diagnose socket issues; I also found a Stack Overflow answer with alternative diagnostic approaches: https://stackoverflow.com/a/63829129/13102448.

Further information would be much appreciated. We should rule out socket leaks, though I think that is unlikely given that we're using a singleton HttpClient instance (as recommended).

My suspicion is that App Service is messing up. Limiting new connections seems fair ("as advertised"), but I don't get why open connections are affected. You might want to bring that up with App Service support.

That said, we will still work to improve the robustness of this library. The new retry logic in 6.0.0-beta.2 should ensure your invocations run eventually. Feedback on the release will help confirm this and polish things up!

Edit: More info - Deep Dive into TCP Connections in App Service Diagnostics

johnrom commented 3 years ago

I read through all that before finally coming across this issue (besides the deep dive article). Interestingly, the site we are experiencing this with doesn't have many pages, so our diagnostics only show 6 to 10 open TCP connections at a given time over the loopback interface. There is a diagnostic blade called "SNAT Port Exhaustion" under the App Service diagnostics, which seems to rule this out by returning all green (including loopback).

I think if there are any circumstances where the Node instance tries to restart while reusing the same port, that is where I would start. Alternatively, I wonder if the connection could close automatically after an idle timeout, with the server then continuing to try to use that connection but failing to reopen it somehow?

I'll check out the last link on Monday, thanks!

JeremyTCD commented 3 years ago

Thanks for the extra information. Looks like loopback TCP connections aren't the issue.


As far as timeouts/connection management goes, we've set things up to use a single connection (per Node.js process) indefinitely.

Client - HttpClient lifetimes:

https://github.com/JeringTech/Javascript.NodeJS/blob/e4889844fda5821adcc2a98f9d69b830afe0d3b9/src/NodeJS/NodeJSServiceCollectionExtensions.cs#L41-L43

Server - server.setTimeout([msecs][, callback]), server.timeout:

https://github.com/JeringTech/Javascript.NodeJS/blob/e4889844fda5821adcc2a98f9d69b830afe0d3b9/src/NodeJS/Javascript/Servers/OutOfProcess/Http/Http20Server.ts#L30-L33

Protocol - Http/2 Spec, Connection Management:

HTTP/2 connections are persistent. For best performance, it is expected that clients will not close connections until it is determined that no further communication with a server is necessary (for example, when a user navigates away from a particular web page) or until the server closes the connection.

Clients SHOULD NOT open more than one HTTP/2 connection to a given host and port pair, where the host is derived from a URI, a selected alternative service [ALT-SVC], or a configured proxy.

...

Servers are encouraged to maintain open connections for as long as possible but are permitted to terminate idle connections if necessary. When either endpoint chooses to close the transport-layer TCP connection, the terminating endpoint SHOULD first send a GOAWAY (Section 6.8) frame so that both endpoints can reliably determine whether previously sent frames have been processed and gracefully complete or terminate any necessary remaining tasks.


As for Node.js process restarts, we leave it to the Node.js server to select an open port. We log both the process ID and the endpoint on Node.js process restart, at Information level. Any information on restarts vis-à-vis socket exceptions would be useful, e.g. are socket exceptions always preceded by a restart?


It is entirely possible we've misconfigured something or some part of the system isn't working as expected. That said, given the pattern you described of exceptions occurring when App Service is ramping up, perhaps you could consider an investigation like this:

My suspicion is that the problem has to do with:

Burst load of requests opening too many TCP socket connections at once.

We're using reserved VM instances, so we haven't really tinkered with App Service. Help with ensuring that the library works as well as possible on App Service would be greatly appreciated.

johnrom commented 3 years ago

That sounds like a pretty good test. I'll try and test it out when we are able to prioritize this migration.

johnrom commented 2 years ago

Just to circle back here we did finally make the migration to this library, and we are still experiencing intermittent Socket Exceptions in App Service.

We implemented the process retries but had to roll them back because the retries increased the pressure on the Node app itself. Additionally, we had a situation where we ended up getting logs for "Retries left for this process: Infinity", maybe due to concurrent requests / multi-threading. We instead wrapped the Node calls with Polly policies so that tripping a circuit breaker calls SwitchToNewProcess().
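Roughly, the wrapper looks like this (a sketch only: the module name and argument are hypothetical, the thresholds are examples, and it assumes INodeJSService.MoveToNewProcess() is the library method referred to above):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Jering.Javascript.NodeJS;
using Polly;
using Polly.CircuitBreaker;

public class ResilientPdfGenerator
{
    private readonly INodeJSService _nodeJSService;
    private readonly AsyncCircuitBreakerPolicy _circuitBreaker;

    public ResilientPdfGenerator(INodeJSService nodeJSService)
    {
        _nodeJSService = nodeJSService;
        _circuitBreaker = Policy
            .Handle<HttpRequestException>()
            .Or<InvocationException>()
            .CircuitBreakerAsync(
                exceptionsAllowedBeforeBreaking: 3,
                durationOfBreak: TimeSpan.FromSeconds(30),
                // When the circuit opens, discard the current NodeJS process so the next
                // invocation after the break gets a fresh process (and a fresh port).
                onBreak: (exception, breakDelay) => _nodeJSService.MoveToNewProcess(),
                onReset: () => { });
    }

    public Task<string> GeneratePdfAsync(object data)
    {
        // generatePdf.js and its argument are hypothetical; substitute your own module.
        return _circuitBreaker.ExecuteAsync(
            () => _nodeJSService.InvokeFromFileAsync<string>("generatePdf.js", args: new object[] { data }));
    }
}
```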

We still haven't identified the cause of the socket issue in the first place.

JeremyTCD commented 2 years ago

Thanks for looking into the socket issue.

About "Retries left for this process: Infinity", there was indeed a bug. Fixed in this pull request, fix released in 6.2.0.

Polly + MoveToNewProcess sounds good though. More flexible than the built-in system.

johnrom commented 2 years ago

I'm using 6.3.0, and I'm pretty sure I was using 6.3.0 when it happened.

johnrom commented 2 years ago

Created a new issue for the infinite loop here: #134