dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
861 stars 289 forks source link

Intermittent EF Core/Azure SQL Connection Issues #1527

Open jrpharis opened 2 years ago

jrpharis commented 2 years ago

It was suggested on my original issue I move this here.


We have .NET 6 API using EF Core hosted in Azure App Services using Azure SQL DB hosted in the same Azure resource group and region. We're seeing intermittent connection issues where the API throws an exception not being able to connect to the DB. However, these issues last no more than a couple minutes at most and there are successful requests to the API that hit the DB immediately before and after, as well as during the issue (i.e. not all the requests into the API have this issue when we're seeing it). No action is needed to correct the issue, it resolves itself. However, we'd like to identify root cause given the fact its a production environment saying the database can't be reached.

We first see an error log (via Datadog) saying an error occurred trying to connect to our DB:

"RenderedMessage": "An error occurred using the connection to database '\"dbName\"' on server '\"serverConnection\"'.",

"EventId": {
  "Id": 20004,
  "Name": "Microsoft.EntityFrameworkCore.Database.Connection.ConnectionError"
}

There will then be one of the following two errors in the logs for the request:

"RenderedMessage": "An exception occurred while iterating over the results of a query for context type '\"RH.Data.RhDbContext\"'.\"
\"\"Microsoft.Data.SqlClient.SqlException (0x80131904): Cannot open server 'dbName' requested by the login. Client with IP address 'apiHostIpAddress' is not allowed to access the server.  To enable access, use the Windows Azure Management Portal or run sp_set_firewall_rule on the master database to create a firewall rule for this IP address or address range.  It may take up to five minutes for this change to take effect.
   at Microsoft.Data.ProviderBase.DbConnectionPool.CheckPoolBlockingPeriod(Exception e)
   at Microsoft.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
ClientConnectionId:1c859098-a855-4d4f-8f4d-435ad3560dad
Error Number:40615,State:1,Class:14\"",

"EventId": {
    "Id": 10100,
    "Name": "Microsoft.EntityFrameworkCore.Query.QueryIterationFailed"
},

This error suggests making changes to the firewall rules to ensure the host can access the DB. The firewall rules are correct, but further more there are requests during the time period we're seeing this issue successfully connecting to the DB.

"RenderedMessage": "An exception occurred while iterating over the results of a query for context type '\"RH.Data.RhDbContext\"'.\"
   \"\"System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Threading.Thread.StartInternal()
   at System.Threading.Thread.Start()
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at Microsoft.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenDbConnectionAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenDbConnectionAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()\"",

"EventId": {
  "Id": 10100,
  "Name": "Microsoft.EntityFrameworkCore.Query.QueryIterationFailed"
},

When this error occurs, we'll see the API crash and Azure will spin the instance back up.

It does not make any sense the DB connection issues pop up for some requests but not others, and then just simply stop sometimes in a couple seconds, but up to a couple minutes. Any direction to determine root cause of this would be helpful.

JRahnama commented 2 years ago

Hi @jrpharis, can you provide your connection string options and the query that the failure happens on mostly? Also version of the driver you are using for Microsoft.Data.SqlClient and the platform ( Windows or Linux)

jrpharis commented 2 years ago

Platform is Windows

Not pulling in Microsfot.Data.SqlClient directly, but using Microsoft.EntityFrameworkCore.SqlServer version 6.0.0. I also viewed implementation in VS Code and found Assembly Microsoft.Data.SqlClient, Version=2.0.20168.4

Here is the connection string we have set in our API's config

Server=tcp:*******.database.windows.net,1433;Database=******;User ID=******;Password=********;Connection Timeout=30;MultipleActiveResultSets=True;

I'm not seeing any logs around the error that say what query it was that failed. The error will happen on a number of different APIs, which all hit the DB with different queries. So I don't think it's a single query resulting in this problem.

Wraith2 commented 2 years ago

Exception of type 'System.OutOfMemoryException' was thrown.

This one isn't an sqlclient problem, it's a whole app problem you'll need to look into. The other one might has the same cause if there isn't enough memory available to create an instance of something.

jrpharis commented 2 years ago

@Wraith2 Given that before every out-of-memory exception there is an error stating there was an issue trying to connect to the DB, seems to me it's likely the two are related. We don't ever see the out-of-memory problem outside of seeing issues connecting to the DB. So it doesn't really seem like it's a whole app problem.

Wraith2 commented 2 years ago

So we fail to connect and then you get an oom error and you think it's the connection failure causing the oom? Interesting perspective. See if you can get dumps of the process to investigate what's eaten all the memory. As far as the sql connection problem without a repro there's little to be done because there's nowhere to start investigating from.

jrpharis commented 2 years ago

@Wraith2 I'm not saying there isn't something I can investigate about the memory. But given I only see OOM exceptions when there is a DB connection issue, I find it hard to believe they aren't connected somehow. But what I'm really looking for is why I'm seeing the error to connect to the DB for 10 seconds to 4 minutes, then all of a sudden it's working again. All the while there are also other requests that are connecting to the DB just fine.

The OOM exception is also not the only exception we're seeing after the log about not being able to connect. So it seems like there's some issue with how EF is connecting to SQL, or with how SQL Server is handling the connection.

The crashes are infrequent, but I'll look into setting something up in our Azure App Service to capture the crash.

JRahnama commented 2 years ago

@jrpharis by any chance, if the application structure allows it, is it possible to test with MARS set to false?

creed-maxeta commented 2 years ago

Has any progress been made here? I am experiencing the same issue.

Wraith2 commented 2 years ago

Has any progress been made here? I am experiencing the same issue.

Can you provide a reproduction? If not then there's still nowhere to start investigating from.

Halodolo commented 1 year ago

Hello

Can we get this issue addressed?

Did the Triage Board ever happen?

Halodolo commented 1 year ago

Has any progress been made here? I am experiencing the same issue.

@creed-maxeta How similar is your issue?

Do you see SNAT exhaustions as well?

Wraith2 commented 1 year ago

an we get this issue addressed?

Sure, if can you provide a way to reproduce it.

jrpharis commented 1 year ago

I am currently working with Azure Support through my company's Azure account on this issue. We continue to see random DB connection errors ("An error occurred using the connection to database '"*"' on server '"*****"'." from Microsoft.EntityFrameworkCore.Database.Connection.ConnectionError) immediately followed by OOM exceptions usually thrown from within the Microsoft.Data.ProviderBase namespace/class.

We have seen app crashes in the past but have not seen it since I enabled the app crash monitoring.

One current theory is SNAT port exhaustion, as we have seen errors for SNAT exhaustion that correlate to when we see the DB connection/OOM exception issues. However, more often than not we do not SNAT port exhaustion. Even still, I'm currently working on implementing VNET/NAT Gateway around our App Service/DB infrastructure to try to limit the SNAT exhaustion even (per Azure Support's suggestion)

** As @Wraith2 mentioned before, it's an interesting take to say the DB connection is causing the OOM exception. However, given we ONLY see the OOM when we see the DB connection issue, and that the stracktrace for the OOM exception is always deep into the Microsoft.Data/EntityFrameworkCore namespaces, it seems there is a strong correlation there.

ErikEJ commented 1 year ago

Are you using Serilog?

jrpharis commented 1 year ago

@ErikEJ Yes, we use Serilog

ErikEJ commented 1 year ago

Maybe this? https://github.com/RehanSaeed/Serilog.Exceptions/issues/100

jrpharis commented 1 year ago

@ErikEJ I don't believe this would be related. We are not using the Serilog.Exceptions enricher from that ticket. And it doesn't seem like we're even hitting the DB due to the DB connection issue. However, the OOM exception does usually say it failed to iterate over the results. So I suppose the driving issue of the ticket linked above could possibly be at play. I've increased the level of logging for a few of the EF namespaces to see if I can get more context.

Halodolo commented 1 year ago

Hello @ErikEJ

Do you have other suggestions? It appears that issue occurs randomly.

The only useful information I found was that certain options exist for connection issues.

Here https://learn.microsoft.com/en-us/ef/core/miscellaneous/connection-resiliency.

Which retries the connection after the failure. Do you have other docs that may help?

ConaxLiu commented 1 year ago

I was researching on the problem I had today, where I started getting Login failed for user '' from my ef core db context and the error messages described in this post followed. I'm not sure if my issue is related to the one raised in this post but I'll contribute anyway just in case...

In my DBContext I have this code: public DbFloralPortalContext(DbContextOptions<DbFloralPortalContext> options) : base(options) { var conn = (SqlConnection)Database.GetDbConnection(); var defaultCredential = new DefaultAzureCredential(); var token = defaultCredential.GetToken(new Azure.Core.TokenRequestContext(new[] { "https://database.windows.net/.default" })); conn.AccessToken = token.Token; } It ran without problem. But at the point of querying the database, I get the login error message.

Just when I was about to try reboot my computer, I noticed I had a Remote Desktop Connection opened and connected to one of my other computers. I had a slight suspicion that this may be the cause of the issue so I closed the RD and tried to run my code again. It worked!

The reason I had the suspicion is because I had noticed recently, when I have Remote Desktop Connection open, it affected YouTube video, in a way that it kept buffering the video eventhough I have a very high speed internet connection.

I tried open the RD again but this time the code still runs without problem. So I think perhaps only when the RD is opened for a good amount of time then it would start affecting other programs that uses internet connection.

May be total bull, but for a issue raised a year ago and still no solution, maybe worth to check when the issue occured, was an RD window opened?

jrpharis commented 1 year ago

@ConaxLiu For the issue I raised, this was happening in an Azure Web App deployed to an Azure App Service Plan. So having a remote desktop connection open wouldn't be a possible culprit.

sven5 commented 1 year ago

Today, I had the same errors with .NET6 App on App Service for Windows Plan. I'm using several Parallel.ForEachAsync loops in my code. Do you also use it @jrpharis?

Xriuk commented 1 year ago

@Wraith2 I'm having the same problem and it's hard to reproduce because it happens randomly. Also the exception itself is not much helpful: An error occurred using the connection to database '{database}' on server '{server}'. so I don't even know where to investigate.

I only have another exception related to this request:

An exception occurred while iterating over the results of a query for context type
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass33_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
lonchanick commented 1 year ago

This is weird, I have the same problem in a simple CRUD API using EF core and SQL server .. :|

lonchanick commented 1 year ago

This has to do with Task and async

aldrashan commented 7 months ago

@Wraith2 I'm having the same problem and it's hard to reproduce because it happens randomly. Also the exception itself is not much helpful: An error occurred using the connection to database '{database}' on server '{server}'. so I don't even know where to investigate.

I only have another exception related to this request:

An exception occurred while iterating over the results of a query for context type
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass33_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()

This makes sense, I enabled logging for Microsoft.EntityFrameworkCore.* and then these Exceptions started showing up, while my code ran just fine and no one was complaining about any errors. That Exception apparently happens when a task gets aborted via a CancellationToken.