dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License

Connectivity issue when upgrading from .NET 7.0.1 to .NET 7.0.4 on Windows server #48965

Open haludi opened 1 year ago

haludi commented 1 year ago

Is there an existing issue for this?

Describe the bug

Our product is a distributed system consisting of multiple ASP.NET Core servers. The servers keep TCP connections open between them and also communicate over HTTP (version 1.1).

The issue:

The affected server stops accepting HTTP requests from the other servers, and even from itself. The Microsoft logs (configuration below) on the target server show no errors. On the source server, we got:

Raven.Client.Exceptions.RavenException: An exception occurred while contacting ***URL***.
System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. (***URL***:443)
---> System.Net.Sockets.SocketException (10061): No connection could be made because the target machine actively refused it.

(Full stack trace below.) The issue starts between one hour and a couple of hours after a restart. We checked the Azure firewall and the OS firewall.

Some details:

One of our customers hit this issue in a production environment when upgrading our product. The customer had tested the upgrade in a test environment, but the issue did not occur there. We upgraded the servers again to collect more information, gathering Microsoft logs (configuration below) and a tcpdump capture. The Microsoft logs showed no errors. In the tcpdump capture collected on the target server (for 3 minutes), none of the HTTPS packets sent to the port the server is listening on received a response packet [Conversation completeness: Incomplete, SYN_SENT (1)].

Between our product versions, nothing changed in how we handle server communication apart from the upgrade from .NET 6 to .NET 7. So we created the same build with one difference, targeting .NET 6 instead, and the issue was solved. Because this is a production environment, we deployed the .NET 7 version twice after the first incident to collect the logs, and the issue happened again both times.

We also saw the issue in another customer's system. This time the .NET upgrade was from .NET 7.0.1 to .NET 7.0.4, and again the .NET 6 build solved the issue.

Both customers’ servers were on:

Additional finding:

tcpdump was collected for 3 minutes while the issue happened

Microsoft log configuration:

{
    "Microsoft.AspNetCore.Server.Kestrel": "Debug",
    "Microsoft.AspNetCore.Server.Kestrel.BadRequests": "Debug",
    "Microsoft.AspNetCore.Server.Kestrel.Connections": "Debug",
    "Microsoft.AspNetCore.Server.Kestrel.Http2": "Debug",
    "Microsoft.AspNetCore.Server.Kestrel.Http3": "Debug",

    "Microsoft.AspNetCore.Server.Kestrel.Transport.Quic": "Debug",
    "Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets": "Debug",

    "Microsoft.AspNetCore.Hosting.WebHost": "Debug",
    "Microsoft.AspNetCore.Hosting.Diagnostics": "Information"
}

Expected Behavior

No response

Steps To Reproduce

We couldn't reproduce it.

Exceptions (if any)

Exception Stack Trace:

Raven.Client.Exceptions.RavenException: An exception occurred while contacting ***URL***.
System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. (***URL***:443)
---> System.Net.Sockets.SocketException (10061): No connection could be made because the target machine actively refused it.
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|281_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.HttpConnectionWaiter`1.WaitForConnectionAsync(Boolean async, CancellationToken requestCancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Raven.Client.Http.RequestExecutor.SendAsync[TResult](ServerNode chosenNode, RavenCommand`1 command, SessionInfo sessionInfo, HttpRequestMessage request, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1098
   at Raven.Client.Http.RequestExecutor.SendRequestToServer[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, HttpRequestMessage request, String url, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1050.
The server at ***URL*** responded with status code: ServiceUnavailable.
---> System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. (***URL***:443)
---> System.Net.Sockets.SocketException (10061): No connection could be made because the target machine actively refused it.
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|281_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.HttpConnectionWaiter`1.WaitForConnectionAsync(Boolean async, CancellationToken requestCancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Raven.Client.Http.RequestExecutor.SendAsync[TResult](ServerNode chosenNode, RavenCommand`1 command, SessionInfo sessionInfo, HttpRequestMessage request, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1098
   at Raven.Client.Http.RequestExecutor.SendRequestToServer[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, HttpRequestMessage request, String url, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1050
   --- End of inner exception stack trace ---
   at Raven.Client.Http.RequestExecutor.ThrowFailedToContactAllNodes[TResult](RavenCommand`1 command, HttpRequestMessage request) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1177
   at Raven.Client.Http.RequestExecutor.SendRequestToServer[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, HttpRequestMessage request, String url, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1050
   at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 919
   at Raven.Client.Http.RequestExecutor.HandleServerDown[TResult](String url, ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, HttpRequestMessage request, HttpResponseMessage response, Exception e, SessionInfo sessionInfo, Boolean shouldRetry, RequestContext requestContext, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1560
   at Raven.Client.Http.RequestExecutor.SendRequestToServer[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, HttpRequestMessage request, String url, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1050
   at Raven.Client.Http.RequestExecutor.ExecuteAsync[TResult](ServerNode chosenNode, Nullable`1 nodeIndex, JsonOperationContext context, RavenCommand`1 command, Boolean shouldRetry, SessionInfo sessionInfo, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 919
   at Raven.Server.Utils.ReplicationUtils.GetTcpInfoAsync(String url, String databaseName, String databaseId, Int64 etag, String tag, X509Certificate2 certificate, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Server\Utils\ReplicationUtils.cs:line 51
   at Raven.Server.Utils.ReplicationUtils.GetTcpInfoAsync(String url, String databaseName, String tag, X509Certificate2 certificate, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Server\Utils\ReplicationUtils.cs:line 36
   at Raven.Client.Util.AsyncHelpers.RunSync[T](Func`1 task) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Util\AsyncHelpers.cs:line 135
   at Raven.Server.Utils.ReplicationUtils.GetTcpInfo(String url, String databaseName, String tag, X509Certificate2 certificate, CancellationToken token) in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Server\Utils\ReplicationUtils.cs:line 30
   at Raven.Server.ServerWide.Maintenance.ClusterMaintenanceSupervisor.ClusterNode.ListenToMaintenanceWorker() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Server\ServerWide\Maintenance\ClusterMaintenanceSupervisor.cs:line 274

.NET Version

.NET 7.0.4

Anything else?

No response

ayende commented 1 year ago

Hi, is there any additional information that we can provide?

adityamandaleeka commented 1 year ago

Sorry for the delay getting to this. This is interesting... I'm fairly sure none of the changes between 7.0.1 and 7.0.4 (at least on the Kestrel side) would affect this. I see you mentioned that downgrading to 6.0 solved it. How confident are you that 7.0.1 doesn't have the issue?

It might be helpful to see 'Trace' level logs for "Microsoft.AspNetCore". And maybe a dump while the server is in the bad state?

haludi commented 1 year ago

Hi, thank you for the reply. We had two incidents of this at two separate customers. One of them upgraded our product from a version built on .NET 6, and the other upgraded from a version built on .NET 7.0.1. For both customers, the issue happened only in the production environment and not in the test environment. For both of them, installing our latest version built on .NET 6 resolved the issue.

Because this is a production environment, a dump is not possible. The next time it happens, I will collect Trace logs for Microsoft.AspNetCore. Asking our customer to reproduce the issue on purpose to collect the data can be problematic (we already asked once), but I can check. Just to be sure, you want 'Trace' for all sub-loggers of Microsoft.AspNetCore, right?

ayende commented 1 year ago

Is there any additional information that we can provide? This has caused us to downgrade several customers to .NET 6.0, which is obviously not a long-term strategy.

adityamandaleeka commented 1 year ago

As mentioned above, Trace level logs for "Microsoft.AspNetCore" would be the best next step (since @haludi said dumps are not going to be possible).
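For readers following along, here is a minimal sketch of what that request could look like, assuming the host reads the standard `Logging:LogLevel` section from an `appsettings.json` file. That is an assumption: the configuration quoted in the issue uses a flat key/value layout, so the product may route log levels through its own settings, and the `Default` entry here is only illustrative.

```json
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Trace"
    }
  }
}
```

Log filter rules in Microsoft.Extensions.Logging match by category prefix, and the longest matching prefix wins. A single `Microsoft.AspNetCore` entry therefore covers its sub-loggers, but any more specific entry left in place (such as the `Microsoft.AspNetCore.Server.Kestrel` entries set to `Debug` in the original configuration) would still override it for that category.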

danmoseley commented 5 months ago

Also curious whether you can test on .NET 8

haludi commented 5 months ago

There is a plan to do so, but since we have seen this only in production environments and couldn't reproduce it in a test environment, we are very limited.

danmoseley commented 5 months ago

In many/most cases the update is just to flip the target framework and rebuild. Do you have the ability to do that, and deploy (perhaps temporarily and limited, just enough to see whether it's fixed)?

That would give another data point and, depending on your processes and limitations, might be a way to get the fix.
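As a rough illustration of "flipping the target framework", assuming an SDK-style project file (the actual project layout is not shown in this thread, and a real project file would contain more than this), the change is typically a one-line edit:

```xml
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <!-- Previously net7.0; only this value changes for the trial build. -->
    <TargetFramework>net8.0</TargetFramework>
  </PropertyGroup>
</Project>
```

Package references pinned to 7.x versions may also need updating before the rebuild succeeds.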

haludi commented 5 months ago

Currently, our customers are running on .NET 6 and don't experience issues there. We created a custom build for them as a workaround until we find another solution. Our stable release currently runs on .NET 7 and should be upgraded to .NET 8 in the next two months. Asking those customers to upgrade to .NET 8 and potentially break their production is problematic, but we will gather more information if another incident happens.