dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
838 stars 279 forks source link

After migration from .NET7 to .NET8 SqlException 0x80131904 started to appear randomly when connecting to Azure SQL databases #2400

Open JiriZidek opened 6 months ago

JiriZidek commented 6 months ago

After migration from .NET7 to .NET8 SqlException 0x80131904 started to appear randomly when connecting to Azure SQL databases

The error happens in two flavors:

We observed that this error happens usually after a longer (>minutes) period of inactivity (kestrell web app not being hit). After first occurence, the successive DB operations succeed. We use SqlClient in our code indirectly through EF.

Version information

Microsoft.Data.SqlClient version: 5.2.0 (but same observed for 5.1.5) - Microsoft.Data.SqlClient.dll 5.20.324059.2 dated 28-FEB-2024 size 911480 bytes Target framework: .NET 8 - 8.0.3 (but same observed for 8.0.2, 8.0.1 & 8.0.0) Operating system: Linux docker image - same result on mcr.microsoft.com/dotnet/aspnet:8.0 and mcr.microsoft.com/dotnet/aspnet:8.0-jammy - hosted in AKS, similar behavior for backend service using image runtime:8.0

Relevant code

Original code - in .NET7 - now randomly failing .NET8:

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder) {
            if(!optionsBuilder.IsConfigured) {
                optionsBuilder.UseSqlServer(DbConfiguration.ConnectionString);
            }
        }

Improved code - no better, same problems:

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder) {
            if(!optionsBuilder.IsConfigured) {
                optionsBuilder.UseSqlServer(DbConfiguration.ConnectionString,
                    opt => opt.UseAzureSqlDefaults(DbConfiguration.UseAzureDefaults));
            }
        }

Failing place examples (they are ranadom in fact, in various programes):

using(var ctx = new LicensingDbContext()) {
    var header = await (from h in ctx.Headers.AsNoTracking() where h.Id == licenceId && h.Status == Status.Issued select h).FirstOrDefaultAsync();
using var ctx = new DeploymentDbContext();
ctx.Events.Add(new Event { EventType = EventType.TenantResource_Operational, ScriptId = ScriptId, Reference = (int)ScriptData.Process, ResourceType = trdb.ResourceType, ResourceId = trdb.TenantResourceId, TenantId = trdb.TenantId, TimeUtc = DateTime.UtcNow, DetailText = $"PG SQL K8s Database: {dbname}" });
await ctx.SaveChangesAsync(cancel);

Connection string

Connection string looks similar to Server=tcp:SOME-sql.database.windows.net,1433;Initial Catalog=MyDB;Persist Security Info=False;User ID=some-admin;Password=123456789;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;

Full stack traces

fail: Microsoft.AspNetCore.Server.Kestrel[13]
      Connection id "0HN1UUOR4NG60", Request id "0HN1UUOR4NG60:00000002": An unhandled exception was thrown by the application.
      Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught)
       ---> System.IO.IOException: Unable to write data to the transport connection: Connection reset by peer.
       ---> System.Net.Sockets.SocketException (104): Connection reset by peer
         at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
         --- End of inner exception stack trace ---
         at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
         at System.IO.Stream.Write(ReadOnlySpan`1 buffer)
         at Microsoft.Data.SqlClient.SNI.SslOverTdsStream.Write(ReadOnlySpan`1 buffer)
         at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
         at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
         at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Send(SNIPacket packet)
         at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
         at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, SqlCommand command, Boolean callerHasConnectionLock, Boolean asyncClose)
         at Microsoft.Data.SqlClient.TdsParserStateObject.SNIWritePacket(PacketHandle packet, UInt32& sniError, Boolean canAccumulate, Boolean callerHasConnectionLock, Boolean asyncClose)
         at Microsoft.Data.SqlClient.TdsParserStateObject.WriteSni(Boolean canAccumulate)
         at Microsoft.Data.SqlClient.TdsParserStateObject.WritePacket(Byte flushMode, Boolean canAccumulate)
         at Microsoft.Data.SqlClient.TdsParser.TdsLogin(SqlLogin rec, FeatureExtension requestedFeatures, SessionData recoverySessionData, FederatedAuthenticationFeatureExtensionData fedAuthFeatureExtensionData, SqlConnectionEncryptOption encrypt)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.Login(ServerInfo server, TimeoutTimer timeout, String newPassword, SecureString newSecurePassword, SqlConnectionEncryptOption encrypt)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, TimeoutTimer timeout, Boolean withFailover)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool, Func`3 accessTokenCallback)
         at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
         at Microsoft.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
         at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
         at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
         at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
         at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
         at Microsoft.EntityFrameworkCore.Query.ShapedQueryCompilingExpressionVisitor.SingleOrDefaultAsync[TSource](IAsyncEnumerable`1 asyncEnumerable, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.ShapedQueryCompilingExpressionVisitor.SingleOrDefaultAsync[TSource](IAsyncEnumerable`1 asyncEnumerable, CancellationToken cancellationToken)
         at Atlantis.LicensingWeb.Controllers.TelemetryController.FindLicenceIdDetails(Guid licenceId, LicensingDbContext ctx) in /src/Cloud/LicensingWeb/Controllers/TelemetryController.cs:line 113
         at Atlantis.LicensingWeb.Controllers.TelemetryController.GetMyIp(Guid licenceId, String appName) in /src/Cloud/LicensingWeb/Controllers/TelemetryController.cs:line 91
         at lambda_method262(Closure, Object)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.AwaitableObjectResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Awaited|12_0(ControllerActionInvoker invoker, ValueTask`1 actionResultValueTask)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeNextResourceFilter>g__Awaited|25_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ResourceExecutedContextSealed context)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|20_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
         at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
         at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
         at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
         at Atlantis.General.Web.SecurityFeaturesMiddleware.InvokeAsync(HttpContext context) in /src/General/Web/SecurityFeaturesMiddleware.cs:line 38
         at Microsoft.AspNetCore.Builder.Extensions.UsePathBaseMiddleware.InvokeCore(HttpContext context, PathString matchedPath, PathString remainingPath)
         at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)
2024-03-12T10:20:22+01:00       Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 0 - Success)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, SqlCommand command, Boolean callerHasConnectionLock, Boolean asyncClose)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.SqlInternalConnectionTds.CompleteLogin(Boolean enlistOK)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, TimeoutTimer timeout, Boolean withFailover)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool, Func`3 accessTokenCallback)
2024-03-12T10:20:22+01:00          at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
2024-03-12T10:20:22+01:00          at Microsoft.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
2024-03-12T10:20:22+01:00          at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
2024-03-12T10:20:22+01:00          at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
2024-03-12T10:20:22+01:00          at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
2024-03-12T10:20:22+01:00          at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
2024-03-12T10:20:22+01:00       --- End of stack trace from previous location ---
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransactionAsync(IsolationLevel isolationLevel, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransactionAsync(CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(IEnumerable`1 commandBatches, IRelationalConnection connection, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IList`1 entriesToSave, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(StateManager stateManager, Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)
2024-03-12T10:20:22+01:00          at Atlantis.DeploymentService.Processes.K8sNameSpaceWithPg_UpDown_Process.SetUpAsync(CancellationToken cancel) in /src/Cloud/DeploymentService/Processes/K8sNameSpaceWithPg_UpDown_Process.cs:line 230
2024-03-12T10:20:22+01:00          at Atlantis.DeploymentService.DeploymentProcessor.PerformSetUp(IProcess process) in /src/Cloud/DeploymentService/DeploymentProcessor.cs:line 358

We hoped the problem would be solved by 5.2.0, but it does not look to be the case.

JiriZidek commented 6 months ago

Please remove lavel "sqlite" - it was added by mistake, but I cannot change it anymore to "ms sqlclient" or similar.

JRahnama commented 6 months ago

@JiriZidek Can you capture some EventSource logs. Also can you confirm by reverting back to Net7 the issue does not happen.

JRahnama commented 6 months ago

I recommend you contact Azure support and have the networking team look into what is going on during those periods of failures.

JiriZidek commented 6 months ago

I recommend you contact Azure support and have the networking team look into what is going on during those periods of failures.

I am afraid there are no "periods". The problem appears randomly, but always after restarting the container. So from this point of view it is reproducible.

JiriZidek commented 6 months ago

@JiriZidek Can you capture some EventSource logs. Also can you confirm by reverting back to Net7 the issue does not happen.

Yes - we tested old version, which differs only by being .NET7 and all the nuget modules from December (EF, SqlClient, image aspnet:7.0) and yes, when restarted it does not yield this error. EventSource - I'll try my best, but I am in Kubernetes POD - so in fact I have really limited ways how to do that. In fact I use only text log output captured Kubernetes. The pages refered are more for Windows installation.

JiriZidek commented 6 months ago

I realized one important point - I run the very same code as Azure Service, using Linux plan and .NET8 - and in this case there is not such problem with SqlClient.

JRahnama commented 6 months ago

I realized one important point - I run the very same code as Azure Service, using Linux plan and .NET8 - and in this case there is not such problem with SqlClient.

This confirms that your best option would be contacting Azure support.

JiriZidek commented 6 months ago

This confirms that your best option would be contacting Azure support.

But how this plays with .NET7 works vs. .NET8 does not ? In same place, same AKS cluster ? I can put two PODs, one old, one new in parallel to decide if it is likely the network problem. Let you know. I'd like to avoid being pingponged back from Azure support.

Crossbow78 commented 5 months ago

Just a note to confirm we're seeing the exact same issue in almost an identical setup:

Exception message:

A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught) Unable to write data to the transport connection: Connection reset by peer. Connection reset by peer

Note that the error occurs incidentally (less than 0.5% of the calls), and we're seeing the exceptions since the day that the .NET upgrade was deployed.

JiriZidek commented 5 months ago

So positively confirmed. AKS, same K8s namespace. One POD with .NET8 and .NET 7.

.NET8 - multiple errors: image

.NET7 - no problem image

So for me this is NOT Azure problem. It is problem if .NET8 or problem of Debian 12 (which is bases of the image). But since I have observed same bad behavior on Unbuntu based images I bet on .NET8 or Microsoft.Data.SqlClient. I guess there is something wrong with TLS handshake.

popcatalin81 commented 5 months ago

I'm having the same issue, there are no concurrent connections to the DB and no async operations happening in parallel on the same context.

An exception occurred while iterating over the results of a query for context type 'DataAccess.ChassiContext'.
      System.InvalidOperationException: The connection does not support MultipleActiveResultSets.
         at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__211_0(Task`1 result)
         at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
         at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
      --- End of stack trace from previous location ---
         at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
         at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
      System.InvalidOperationException: The connection does not support MultipleActiveResultSets.
         at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__211_0(Task`1 result)
         at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
         at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
      --- End of stack trace from previous location ---
         at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
         at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
      --- End of stack trace from previous location ---
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
         at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
patohara-rkt commented 5 months ago

We experienced the same SqlException (0x80131904), however in a slightly different environment, and I beleive I understand the source of this exception, at least in our environment.

Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught)

TLDR

This exception occurs when there is a proxy between the client and the SQL Server AG Listener, and the proxy accepts TCP connections for both the primary and secondary SQL Server instances. The mechanism SqlClient uses to determine which host is primary is by opening a Socket to all IPs returned by the DNS lookup, and the first socket to open is considered the primary and the other connections are disposed. (See ParallelConnectAsync and ParallelConnectHelper from SNITcpHandle.cs). Under normal circumstances, without the proxy, the secondary instance is not listening on port \<listener-ip>:1433 and therefore the client cannot open a socket. But when a proxy is between the client and server, the client can open the socket and even perform the TCP handshake with the proxy for both the primary and secondary SQL Server Instances. If the socket is opened to the secondary first, then as soon as any data is sent this exception is thrown.

Environment

pod.yaml

---
apiVersion: v1
kind: Pod
metadata:
  labels:
    sidecar.istio.io/inject: "true"
  annotations:
    proxy.istio.io/config: |
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"
  name: dotnet8
spec:
  containers:
  - image: mcr.microsoft.com/dotnet/sdk:8.0
    name: dotnet8
    imagePullPolicy: Always
    command: ["sleep"]
    args: ["3600"]

service.yaml

---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: sql-ag
spec:
  hosts:
  - "sql-server-ag-listener.fqdn"
  exportTo:
  - "."
  location: MESH_EXTERNAL
  ports:
  - number: 1433
    name: sql
    protocol: TCP
  resolution: DNS

Description

The first issue with this, is that Istio will create a Virtual IP (VIP) for the ServiceEntry ex: 240.240.240.10 and any DNS queries performed within the pod for the listener name will resolve to this VIP. In addition to the VIP, the Istio proxy sidecar will do DNS resolution on that hostname and put both the primary and secondary IPs into a pool and perform load balancing between all IPs in the pool. This means sometimes when SqlClient will get sent to the secondary instance.

Disable the VIP and Set the Service resolution to NONE

    annotations:
    proxy.istio.io/config: |
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
        ISTIO_META_DNS_AUTO_ALLOCATE: "false"
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
spec:
  resolution: NONE

With the VIP disabled the the pod's DNS queries will now resolve to the 2 SQL Listener IPs, and with the ServiceEntry resolution set to NONE, connections will be routed/forwarded to the IP requested.

This however does not solve the root of the issue. Istio Proxy is listening and accepting connection on 1433. SqlClient attempts to open 2 connections to the 2 IPs and because Istio is listening both sockets will succeed. If the primary is opened first then all is good. If the secondary is opened first then will perform the TCP handshake and then TLS / TDS handshake which is where the exception is thrown.

Bypass Istio

    annotations:
      traffic.sidecar.istio.io/excludeOutboundPorts: '1433'

This will cause any traffic on port 1433 to completely bypass Istio and traffic will go directly to SQL Server without issue.

Possible Fix

In SqlClient, delay the decision for which IP/Host is considered primary to after the TLS handshake / TDS Handshake is attempted instead of using Socket.Connect() and perform the handshake in parallel for all IPs. If a socket is opened to the secondary listener ip it will fail at the handshake, otherwise the Socket.Connect() will timeout and in the mean time the primary handshake can proceed, in which case discard the rest of the connections.

JiriZidek commented 5 months ago
  • Note: Adding "Pooling=False;" to the connection string will turn off the connection pooling and cause the issue to occur more frequently.

👍 In my recent experiments this makes app almost unusable with Azure SQL. I planned to use this in POC project.

patohara-rkt commented 5 months ago

The only reason I mentioned Pooling=False is because it is useful if anyone attempts to re-create the issue and wants to reliably reproduce it.

It is unlikely you would want to Pooling=False in a real world application

chticer commented 5 months ago

I am also facing the same issue as well.

An exception occurred while iterating over the results of a query for context type 'musicstreamingplayer.DatabaseModels.igdbapi.IGDBAPIAppContext'.
Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): The wait operation timed out.
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
   at Microsoft.Data.SqlClient.SqlDataReader.Read()
   at Microsoft.EntityFrameworkCore.Storage.RelationalDataReader.Read()
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.Enumerator.MoveNext()
ClientConnectionId:d2f1ae4f-531d-47a7-90f3-97eca52f1501
Error Number:-2,State:0,Class:11
chticer commented 5 months ago

Ignore my comment from before, the issue I am facing is not related to this issue.

cheenamalhotra commented 2 months ago

Checking in, has someone tested with MDS 5.2.1?

JiriZidek commented 2 months ago

Checking in, has someone tested with MDS 5.2.1?

Yes - no change, the same problem.

cheenamalhotra commented 2 months ago

@kf-gonzalez2 @JRahnama could you please help prioritize the issue investigations? Thanks.

JiriZidek commented 2 months ago

Just some notes after few month of living with this bug:

kf-gonzalez2 commented 1 month ago

@JiriZidek Can you share some event source traces logs please?

JRahnama commented 1 month ago

@JiriZidek Can you comare OpenSSL versions in Azure Service (Linux) and Debian one?

patohara-rkt commented 1 month ago

@JiriZidek Does you AKS cluster use Istio Proxy, or any other proxy that sits between the .NET8 application and the SQL DB?

JiriZidek commented 1 month ago

@JiriZidek Does you AKS cluster use Istio Proxy, or any other proxy that sits between the .NET8 application and the SQL DB?

No, nothing special. No Istio. There is normal outbound in AKS: image

image

JiriZidek commented 1 month ago

@JiriZidek Can you comare OpenSSL versions in Azure Service (Linux) and Debian one?

AKS Docker is based on this: https://mcr.microsoft.com/en-us/product/dotnet/aspnet/about tag 8:0 and shows: OpenSSL 3.0.13 30 Jan 2024 (Library: OpenSSL 3.0.13 30 Jan 2024)

Azure App Service - gives this: OpenSSL 3.0.11 19 Sep 2023 (Library: OpenSSL 3.0.11 19 Sep 2023) image

JiriZidek commented 1 month ago

@JiriZidek Can you share some event source traces logs please?

I'd love to do. but I do not know how to do it within a POD. Any pointer how to do it ?

arellegue commented 1 month ago

@JiriZidek There is an update to .Net8.0 that came out last July 9th, 2024 along with updated containers. Could you please test with that version just to see if the issue is resolved if that version is used, please?

https://github.com/dotnet/core/blob/main/release-notes/8.0/8.0.7/8.0.7.md

JiriZidek commented 1 month ago

@JiriZidek There is an update to .Net8.0 that came out last July 9th, 2024 along with updated containers. Could you please test with that version just to see if the issue is resolved if that version is used, please?

https://github.com/dotnet/core/blob/main/release-notes/8.0/8.0.7/8.0.7.md

Negative - still happens:

ENV: .NET=8.0.7 CULT: UTC:07/23/2024 16:10:56 LOC:07/23/2024 18:10:56

fail: Microsoft.AspNetCore.Server.Kestrel[13]
      Connection id "0HN5B69LGCITJ", Request id "0HN5B69LGCITJ:00000001": An unhandled exception was thrown by the application.
      Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught)
       ---> System.IO.IOException: Unable to write data to the transport connection: Connection reset by peer.
       ---> System.Net.Sockets.SocketException (104): Connection reset by peer
         at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
         --- End of inner exception stack trace ---
         at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
         at System.IO.Stream.Write(ReadOnlySpan`1 buffer)
         at System.Net.Sockets.NetworkStream.Write(ReadOnlySpan`1 buffer)
         at Microsoft.Data.SqlClient.SNI.SslOverTdsStream.Write(ReadOnlySpan`1 buffer)
         at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
         at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
         at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)
         at Microsoft.Data.SqlClient.SNI.SNIPacket.WriteToStream(Stream stream)
         at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Send(SNIPacket packet)
         at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
         at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, SqlCommand command, Boolean callerHasConnectionLock, Boolean asyncClose)
         at Microsoft.Data.SqlClient.TdsParserStateObject.SNIWritePacket(PacketHandle packet, UInt32& sniError, Boolean canAccumulate, Boolean callerHasConnectionLock, Boolean asyncClose)
         at Microsoft.Data.SqlClient.TdsParserStateObject.WriteSni(Boolean canAccumulate)
         at Microsoft.Data.SqlClient.TdsParserStateObject.WritePacket(Byte flushMode, Boolean canAccumulate)
         at Microsoft.Data.SqlClient.TdsParser.TdsLogin(SqlLogin rec, FeatureExtension requestedFeatures, SessionData recoverySessionData, FederatedAuthenticationFeatureExtensionData fedAuthFeatureExtensionData, SqlConnectionEncryptOption encrypt)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.Login(ServerInfo server, TimeoutTimer timeout, String newPassword, SecureString newSecurePassword, SqlConnectionEncryptOption encrypt)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, TimeoutTimer timeout, Boolean withFailover)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
         at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool, Func`3 accessTokenCallback)
         at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
         at Microsoft.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
         at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
         at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
         at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
         at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
Shakeebmm commented 1 month ago

@JiriZidek Anything related to IPv4/IPv6 connections ? https://stackoverflow.com/questions/15488922/connection-to-sql-server-works-sometimes

JiriZidek commented 1 month ago

@JiriZidek Anything related to IPv4/IPv6 connections ? https://stackoverflow.com/questions/15488922/connection-to-sql-server-works-sometimes

AKS uses IPv4 only and Azure SQL does not support IPv6 at all (afaik - in NSLOOKUP the DB name does not have IPv6 address). As for "MultiSubnetFailover=True" I'll give it a try...

arunprakashn commented 3 weeks ago

We are facing this same issue after upgrading our AKS k8s app to .net8. Any solutions yet?

Shakeebmm commented 3 weeks ago

I am facing the same issue even with .NET 6 and .NET 8 in lower environments. And same code works in higher environments. Looks like something to do with env configs...

alexavrutin commented 2 weeks ago

I'm experiencing the same issue with my application after upgarding to .Net 8.0

kf-gonzalez2 commented 2 weeks ago

Setting this up in Kubernetes is getting complex to test, has anybody being able to reproduce the issue in a different environment such as a plain Ubuntu VM or container?

JiriZidek commented 2 weeks ago

Just to summarize where the problem IS NOT EXISTING so it makes no sense to do the test there:

AppServices tells: .NET 8.0.7 (but this is irrelvant to problem afaik) PRETTY_NAME="Debian GNU/Linux 12 (bookworm)" NAME="Debian GNU/Linux" VERSION_ID="12" VERSION="12 (bookworm)" OpenSSL 3.0.13 30 Jan 2024 (Library: OpenSSL 3.0.13 30 Jan 2024)

JiriZidek commented 2 weeks ago

And AKS POD where the problem IS EXISTING:

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)" NAME="Debian GNU/Linux" VERSION_ID="12" VERSION="12 (bookworm)" OpenSSL 3.0.13 30 Jan 2024 (Library: OpenSSL 3.0.13 30 Jan 2024)

.NET runtimes installed: Microsoft.AspNetCore.App 8.0.8 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.NETCore.App 8.0.8 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

And now find the seven differences (no, .NET version is not the answer).

JiriZidek commented 2 weeks ago

MultiSubnetFailover=True

This has no effect so far (a week of observation).

bdschaap commented 2 weeks ago

I'm receiving the Connection reset by peer error but with Unable to read data rather than Unable to write data. It seems to randomly occur. I received the error with Microsoft.Data.SqlClient v5.1.5. I also tried v5.2.2 which was just released but continue to get the error.

Environment .NET 8, on-prem SQL Server 2022 configured with DNN for FCI, Linux container running on Kubernetes.

I can also reproduce it by running the pod on AKS in Azure hitting the same on-prem SQL Server 2022 infrastructure.

Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Receive(SNIPacket& packet, Int32 timeoutInMilliseconds)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, SqlCommand command, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at Microsoft.Data.SqlClient.TdsParser.ConsumePreLoginHandshake(SqlConnectionEncryptOption encrypt, Boolean trustServerCert, Boolean integratedSecurity, Boolean& marsCapable, Boolean& fedAuthRequired, Boolean tlsFirst, String serverCert)
   at Microsoft.Data.SqlClient.TdsParser.Connect(ServerInfo serverInfo, SqlInternalConnectionTds connHandler, TimeoutTimer timeout, SqlConnectionString connectionOptions, Boolean withFailover)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool, Func`3 accessTokenCallback)
   at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.<>c__DisplayClass48_0.<CreateReplaceConnectionContinuation>b__0(Task`1 _)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
JiriZidek commented 2 weeks ago

Connection reset by peer error but with Unable to read data rather than Unable to write data.

There is also a difference you get "during the pre-login handshake", in my case I get "during the login process".

oatsoda commented 1 week ago

Also seeing this error sporadically after upgrading from .NET 6 to .NET 8.

Running in an Azure Linux AppService (B1 Tier). Connecting to several SQL Failover groups (i.e. more than one database, each is itself a failover group). Appears to happening under load...but not certain - that may just be increasing the chances of occuring.

Using EF IDbContextFactory.

Container used in AppService:

Creating container with image: appsvc/dotnetcore:8.0_20240819.2.tuxprod from registry: 10.1.0.5:13209 and fully qualified image name: 10.1.0.5:13209/appsvc/dotnetcore:8.0_20240819.2.tuxprod

Exception:

Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the login process. (provider: TCP Provider, error: 35 - An internal exception was caught)
 ---> System.IO.IOException: Unable to write data to the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.Write(ReadOnlySpan`1 buffer)
   at System.Net.Sockets.NetworkStream.Write(ReadOnlySpan`1 buffer)
   at Microsoft.Data.SqlClient.SNI.SslOverTdsStream.Write(ReadOnlySpan`1 buffer)
   at System.Net.Security.SslStream.WriteSingleChunk[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Security.SslStream.WriteAsyncInternal[TIOAdapter](ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at Microsoft.Data.SqlClient.SNI.SNITCPHandle.Send(SNIPacket packet)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.SNIWritePacket(PacketHandle packet, UInt32& sniError, Boolean canAccumulate, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.WriteSni(Boolean canAccumulate)
   at Microsoft.Data.SqlClient.TdsParserStateObject.WritePacket(Byte flushMode, Boolean canAccumulate)
   at Microsoft.Data.SqlClient.TdsParser.TdsLogin(SqlLogin rec, FeatureExtension requestedFeatures, SessionData recoverySessionData, FederatedAuthenticationFeatureExtensionData fedAuthFeatureExtensionData, SqlConnectionEncryptOption encrypt)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.Login(ServerInfo server, TimeoutTimer timeout, String newPassword, SecureString newSecurePassword, SqlConnectionEncryptOption encrypt)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean ignoreSniOpenTimeout, TimeoutTimer timeout, Boolean withFailover)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool)
   at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
vitaliitk-tr commented 1 week ago

Has the same error after .NET update from 6 to 8. In my case it was related to k8s CPU throttling(container is running in very limited environment). Increasing CPU requests solved the problem completely.

oatsoda commented 1 week ago

@vitaliitk-tr What CPU configuration were you using before and after?

I am seeing this in a Azure App Service for Linux (which I believe uses a container under the hood) on a B1 Tier - which is 1 Core and 1.75 GB. I'm going to try scaling it up to see if it fixes things.

@JiriZidek Did you find more CPU fixed things? I think you said it didn't happen on Azure App Services Linux Image - can you confirm what hardware you were using for that?

vitaliitk-tr commented 1 week ago

@oatsoda in my case it was 25 millicores(0.025 of core) for small worker application. I think throttling also may occur in Azure App Service(found some articles about that). The fact that this error is occurred randomly is also lead to the thought that it's somehow related to resource throttling

JiriZidek commented 1 week ago

Did you find more CPU fixed things?

In fact in K8s I have 2 vCPU and 16GB (E2 machines), do not consider it to be week - and it happens. In AppService I have P0V3 which is 1 vCPU and 4GB and it does not happen (to me!). So hard to say.

oatsoda commented 1 week ago

I tried upgrading SqlClient from 5.1.5 to 5.2.2 today and it didn't solve the issue BTW.

However, scaling up from a single core B1 to a 2 core B2 does seem to have solved the problem.

@JiriZidek I suspect if you scaled down your AppService you would see it also. Though I can't explain why you are seeing it on your 2 core K8s.

oatsoda commented 1 week ago

After scaling up to the B2 I have since found that some of the slowness was caused by one of the Sql Azure databases reaching 100% DTU. Applying a missing index has solved this.. so this may have been increasing the load on my App Service. I've just tried scaling back down to B1 and so far I am not seeing the error (though it is early days).

That does imply that this is an error that occurs under load. Or lack of CPU. Or both.

JiriZidek commented 1 week ago

That does imply that this is an error that occurs under load. Or lack of CPU. Or both.

Based on earlier repro steps (mentioned at beginning of this thread) I do think that load just eposes the problem. Consider, that when you disable connection pooling, the problem appears almost immediately. It happens in the moment of connecting to SQL, so it depends on app behavior - if it uses on regular basis a steady number o sqlconnecetions, then after initial problem there is no up/down, they are living in a connection pool so you do not see any problem.

JRahnama commented 1 week ago

~Root Cause Description:~

~1. We are using DNN (Distributed Network Nodes) in this scenario.~ ~2. DNNs always have multiple IP addresses, with one being the primary and the rest being secondary.~ ~3. When the driver receives the IP addresses, sockets are successfully created for all of them since they all resolve to the server name.~ ~4. During the handshake process, if a socket connected to a secondary IP address is used, the server will reset the connection, causing it to close. However, if connected to the primary IP address, everything works fine.~

~This explains why the issue occurs intermittently and primarily on idle servers. On non-idle servers, good connections are added to the pool and reused, reducing the likelihood of failure. On idle servers, the pool is destroyed, leading to the creation of new connections, which can trigger the intermittent issue. For other types such as VNN and DNS this does not happen. Also this explains why setting Pooling to false cause the issue to happen more often.~

~@JiriZidek and @oatsoda does this align with your findings?~

This was not an accurate diagnosis. Even if the user were using an Availability Group ,which is not the case here, all Primary and Secondary IP addresses are valid for connection. It is up to the user to decide whether the Secondary should handle Read-Only requests by database-side setup and application intent.

bdschaap commented 1 week ago

Root Cause Description:

  1. We are using DNN (Distributed Network Nodes) in this scenario.
  2. DNNs always have multiple IP addresses, with one being the primary and the rest being secondary.
  3. When the driver receives the IP addresses, sockets are successfully created for all of them since they all resolve to the server name.
  4. During the handshake process, if a socket connected to a secondary IP address is used, the server will reset the connection, causing it to close. However, if connected to the primary IP address, everything works fine.

This explains why the issue occurs intermittently and primarily on idle servers. On non-idle servers, good connections are added to the pool and reused, reducing the likelihood of failure. On idle servers, the pool is destroyed, leading to the creation of new connections, which can trigger the intermittent issue. For other types such as VNN and DNS this does not happen. Also this explains why setting Pooling to false cause the issue to happen more often.

@JiriZidek and @oatsoda does this align with your findings?

That's my exact scenario. I've found that fails frequently on Linux but works fine on Windows. It also works when using VNN rather than DNN.

I was thinking it might have been an issue with MultiSubnetFailover's parallel connect or perhaps that linux error codes are different than windows where either could be leading to different behavior. I get linux error code 104.

JRahnama commented 1 week ago

@bdschaap on windows by default we are using native SNI, which apparently is not facing this issue, you will probably will see the same issue if you enable managed SNI switch by adding AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows", true) to your application.

One of our to do tasks is looking into native code to see how that handles this scenario. and also go through the changes that we made between 5.1 and 5.2 that caused this issue. 5.1.6 does not have this issue.

Edit: Can you guys confirm this is not happening on 5.1?