dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
847 stars 282 forks source link

Linux Core 3.1 connecting to SQL Server Availability Group: System.AggregateException: One or more errors occurred. (One or more errors occurred. (A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - No error information))) #1464

Open jesanfafon opened 2 years ago

jesanfafon commented 2 years ago

Describe the bug

We're hosting a dotnet core 3.1 app on Linux in kubernetes. It presently connects successfully to a SQL Server 2014 instance, but we're consolidating onto a SQL Server 2014 Availability Group cluster. When the application tries to connect to the Availability Group Listener, it is initially successful, but on subsequent calls we frequently get the included exception.

Suspecting this might be a diagnosable protocol and/or platform issue, we ran wireshark on the primary instance of the SQL Server availability group. Wireshark doesn't show much because the connection to the server is being encrypted, but it does show that as long as TCP sessions don't enter their idle state, the app is able to successfully reuse them. If the TCP session idles, the app sends a tcp syn packet, which the server responds to correctly, but the app resets the connection and returns the included stack trace. Other than that detail, the overall network traffic looks like a normal encrypted SQL Server connection.

This is our only dotnet core application connecting to sql server that runs on Linux, so we've not been able to identify what the issue is.

Exception message: System.AggregateException: One or more errors occurred. (One or more errors occurred. (A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - No error information)))
Stack trace: ---> System.AggregateException: One or more errors occurred. (A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - No error information))
 ---> Microsoft.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - No error information)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at Microsoft.Data.SqlClient.TdsParser.ConsumePreLoginHandshake(Boolean encrypt, Boolean trustServerCert, Boolean integratedSecurity, Boolean& marsCapable, Boolean& fedAuthRequired)
   at Microsoft.Data.SqlClient.TdsParser.Connect(ServerInfo serverInfo, SqlInternalConnectionTds connHandler, Boolean ignoreSniOpenTimeout, Int64 timerExpire, Boolean encrypt, Boolean trustServerCert, Boolean integratedSecurity, Boolean withFailover, SqlAuthenticationMethod authType)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean ignoreSniOpenTimeout, TimeoutTimer timeout, Boolean withFailover)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(ServerInfo serverInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString connectionOptions, SqlCredential credential, TimeoutTimer timeout)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(TimeoutTimer timeout, SqlConnectionString connectionOptions, SqlCredential credential, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken, DbConnectionPool pool)
   at Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
ClientConnectionId:d058d1ce-1e25-4ae2-bc7e-eb1fc48a959e
   --- End of inner exception stack trace ---

<remainder of stack trace is our code>

Further technical details

Microsoft.Data.SqlClient version: 4.0.0 (also happens on 3.0.1) .NET target: Core 3.1 SQL Server version: SQL Server 2014 Operating system: Docker Container, mcr.microsoft.com/dotnet/aspnet:3.1 We're using the OpenSSL mitigations outlined in https://github.com/dotnet/runtime/issues/30667 Connection string: Data Source=<sql AG Listener endpoint>; Initial Catalog=<our database>; User ID=<our uid>; Connect Timeout=120; MultipleActiveResultSets=True; MultiSubnetFailover=True; TrustServerCertificate=true;

JRahnama commented 2 years ago

@jesanfafon is it possible to try without MultipleActiveResultSets=True; or set it to MultipleActiveResultSets=false;

jesanfafon commented 2 years ago

@JRahnama I tried running the app with that set to false; it looks like the team is using a library for graphql that requires MARS enabled.

jesanfafon commented 2 years ago

If you think that it has a high chance of fixing our issues, we can try isolating part of the app to verify.

jesanfafon commented 2 years ago

Any other ideas?

jesanfafon commented 2 years ago

In doing some more testing today, the application works fine when connected directly to the primary server of the availability group by ip address. It appears to be only when using the listener endpoint that this issue occurs.

JRahnama commented 2 years ago

@jesanfafon this makes me wonder if you need to have same certificate or a valid certificate on each server? Do you have that setup and working?

jesanfafon commented 2 years ago

@JRahnama we doublechecked the certs on our testing database and issued new ones but are seeing the same issue with the new certs

JRahnama commented 2 years ago

@jesanfafon I am going to setup an environment as yours, not sure if it is doable on my side, but that would be time consuming. I most probably will report back here somewhere next week. Just quick reminder of the Note part in this MS docs.

jesanfafon commented 2 years ago

Good to hear! I spent some time with our Ops team on Friday working through the certificate requirements in that Note in the docs. We weren't able to get that configured correctly, but should that matter with TrustServerCertificate enabled? We're able to connect to either member of the availability group on its own, but not if both servers are listed in the DNS entry (as is the case with an availability group listener).

I'll keep working with our Ops team to get this configured correctly, since it is a more correct configuration. I'm unclear on what behavior it would change, and why our .NET Framework apps can connect to the same cluster without issue.

Is there any chance porting this application to Windows might resolve the issue?

ssadiq1 commented 1 year ago

Was there a resolution to this issue? We're seeing this as well. We use entity framework and make about 13 or so quick DB counts and it keeps erroring within the step with this error. We're using .net 6, and system.data.sqlclient

ErikEJ commented 1 year ago

@ssadiq1 which System.Data.SqlClient version?

ssadiq1 commented 1 year ago

@ssadiq1 which System.Data.SqlClient version? 4.8.4

Dannieib commented 1 year ago

What was the resolution to this? I am encountering this too. Hello @jesanfafon @JRahnama

jesanfafon commented 1 year ago

What we wound up doing for our applications is building a secondary tool that watches the Availability Group Endpoint and updates the membership of our writable servers in a load balancer.

Then, the applications that have issues with connecting to the endpoint (e.g. dotnet on linux, nodejs, etc) can connect to the load balancer instead.

I can probably get permission to open source the code that's watching the endpoint if you're interested in that.