dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
847 stars 282 forks source link

Possible lock contention/thread pool starvation when acquiring access token causing timeout exceptions #2152

Open kwlin opened 1 year ago

kwlin commented 1 year ago

Describe the bug

We are experiencing connection pool problems during high load and when the Azure Access Token is expired. The following exception will be thrown:

Exception message: System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
Stack trace:

System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
   at ... 

This exception will be thrown for about 20 minutes and after that duration our operations of our component will proceed normal as expected. It seems that exception occur during startup of our application or when the Azure Access Token is expired.

To reproduce

I've included a Visual Studio solution with a reproduction of the issue with similar exceptions.

HighLoadAndAzureActiveDirectory.zip

Expected behavior

We expect that when getting a SQL Connection, the SQL Connection Pool shouldn't throw InvalidOperationException during high load or when the access token is expired.

Further technical details

Our system is a .NET 6 & CoreWCF application hosted in Service Fabric. We are using Managed Identity to connect to an Azure SQL database. We should handle more than 100 concurrenct SOAP requests during normal operations (and all the concurrent SOAP requests must connect to our Azure SQL Database with the Managed Identity). We don't have special configurations for Pool Size; just default values.

We are using the following libraries:

Additional context We are seeing lock contentions/thread pool starvation when acquiring an access token for a SQL Connection. I've provided a reproduction project with similar effect when multiple tasks are opening a SQL connection to Azure SQL Database with a Azure Active Directory.

Following are printscreens of our debugging session:

parallel stacks - TryGetConnection

parallel stacks - GetFedAuthToken

It seems there are two threads are blocked in GetFedAuthToken, while 49 threads trying to get a connection from DbConnectionPool (TryGetConnection),

Browsing in the Microsoft.Data.SqlClient github repository, we noticed the following code:

                                _fedAuthToken = Task.Run(async () => await authProvider.AcquireTokenAsync(authParamsBuilder)).GetAwaiter().GetResult().ToSqlFedAuthToken();

https://github.com/dotnet/SqlClient/blob/7b2e2fc9f002e8b270894595f5f92670670053d7/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlInternalConnectionTds.cs#L2416C50-L2416C50

Could this code create a lock contention/thread pool starvation? We've noticed before this construction can cause lock contention/thread pool starvation. In some places this construction is used to go from async to sync: https://github.com/aspnet/AspNetIdentity/blob/main/src/Microsoft.AspNet.Identity.Core/AsyncHelper.cs

PS: I'm not sure if I'm using the correct terminlogy lock contention vs thread pool starvation.

kwlin commented 1 year ago

My colleague @MarcWils noticed:

If we put the following line of code before Task.Run in our test project, the application will run just fine:

ThreadPool.SetMinThreads(100, 100);

var tasks = Enumerable ...

It seems when there is enough threads available to handle the tasks, then the timeout exceptions doesn't occur in our reproduction project.


While debugging he also caught this in his attention: There are a lot of threads waiting when trying to get a SqlConnection. See the printscreen

image

This seems like a bad practice.

JRahnama commented 1 year ago

@kwlin this seems like a duplicate of issue #422

kwlin commented 1 year ago

@JRahnama thanks for your reply.

I've changed our test project so MARS is explicit disabled in connectionstring & running on Windows Machine, but the exception still occur.

JRahnama commented 1 year ago

@JRahnama thanks for your reply.

I've changed our test project so MARS is explicit disabled in connectionstring & running on Windows Machine, but the exception still occur.

that sounds like thread starvation issue and the only workaround for now would be limiting threads as you have done. We have noticed if you set the maxpoolsize to 20 that will help too. Also there has been some improvements on Net7 and after version. We will go through the repro and see if there is anything new beside thread starvation issue.

MarcWils commented 1 year ago

Hello, I am a colleague of Ka-Wai. We looked at the issue together last week.

I believe there's a conceptual design flaw in SqlnternalConnectionTds calling authProvider.AcquireTokenAsync. And actually trying to go from async to sync. There are many ways to do so. But none of them are ideal.

The problem we're facing is a burst scenario where we have many incoming connections as soon as the applications starts. Debugging this scenario lead us to SqlConnection.Open which keeps worker pool threads locked until one them is able to get the authentication token. Problem here is that a Task is scheduled to fetch the authentication token. And as all of the threads are busy waiting, the task scheduler can not delegate the task to one of the worker pool threads. The task can successfully run when the thread pool ramps up more threads. However, there's a delay involved. In our scenario (burst scenario and max. pool size = 100), it can take up to 20 minutes until the authentication is finally succesfully retrieved. Ideally, SqlConnection.Open would not have have asynchronous or task based code down the callstack.

Restricting the max. pool size to a lower value may be a workaround for this issue. However, this may cause other issues in the application.

Setting the ThreadPool.MinThreads might be a workaround. But this can also impact application performance.

SqlConnection.OpenAsync doesn't seem to suffer from this specific thread locking issue. Threads aren't locked. Tasks are waiting until one the first SqlConnection is able to get the authentication token. So burst scenario's work like a charm.

2105 and #2149 may be related to this.

JRahnama commented 1 year ago

@MarcWils, have you tried opening one connection first as pool warming and try the rest after?

kwlin commented 1 year ago

@JRahnama we tried that.

But when the Access Token expired (after 24 hours+) we observe same issue with thread starvation.

David-Engel commented 1 year ago

I believe there's a conceptual design flaw in SqlnternalConnectionTds calling authProvider.AcquireTokenAsync. And actually trying to go from async to sync. There are many ways to do so. But none of them are ideal.

You are correct here. However, MSAL doesn't offer any sync APIs and itself only uses async HTTP client APIs. So, unfortunately, SqlClient is eventually limited to calling an async API from within the sync Connection.Open() flow.

I was looking for better ways of handling the unfortunate pattern (I've read the common "solutions", which all have at least one pitfall) and thinking about using a single thread synchronization context that would dedicate a thread to handle all the async token acquisition tasks to keep them off the system thread pool. I got the idea looking at how NpgSql was handling a similar situation. https://github.com/npgsql/npgsql/blob/v8.0.0-preview.4/src/Npgsql/SingleThreadSynchronizationContext.cs

MarcWils commented 1 year ago

I was looking for better ways of handling the unfortunate pattern (I've read the common "solutions", which all have at least one pitfall) and thinking about using a single thread synchronization context that would dedicate a thread to handle all the async token acquisition tasks to keep them off the system thread pool. I got the idea looking at how NpgSql was handling a similar situation. https://github.com/npgsql/npgsql/blob/v8.0.0-preview.4/src/Npgsql/SingleThreadSynchronizationContext.cs

Seems like a creative solution which might work. And if I understand it correctly, the authentication token is cached at level of the connection pool. If a cached token is found, no new call will be made to the authentication token provider. We use managed identities where tokens have a lifetime of 24 hours. In this scenario, the SingleThreadSynchronizationContext would create a new thread when the first SQL connection is opened to get the token. Subsequent SQL connections can use the cached token and the SingleThreadSynchronizationContext would let the thread terminate. After 24 hours, a new thread is started to refresh the token. Which after a short while will also terminate. The overhead seems very limited. At least for the sync .Open.

For our use case, @kwlin is working very hard to refactor all SQL calls to async. That should provide better performance overall.

vonzshik commented 1 year ago

For our use case, @kwlin is working very hard to refactor all SQL calls to async. That should provide better performance overall.

Fair warning to not be overly enthusiastic for async since SqlClient has quite a few issues with it (#2118).

kwlin commented 1 year ago

@vonzshik thanks for the headsup :)

JljHook commented 1 year ago

We had exactly same issues with a major outage today using Web Apps, user assigned identity and SQL Managed instance. We've changed back to using SQL logins for now. In our conn string we have min pool size 5 and max 300. Any suggestion what to try or wait for a fix? We are working on .net 6->7 upgrade would this help?

kwlin commented 1 year ago

@JljHook you could try use the async methods of Microsoft.Data.SqlClient in your solution instead sync methods (like OpenAsync instead Open). That seems to fix our problems.

Keep in mind of the caveats provided by @vonzshik #2118

JljHook commented 1 year ago

@kwlin I gathered you fixed your issues with async methods? Is that considered a workaround? Seems like safest option would be to wait a fix for this?

David-Engel commented 11 months ago

Adding some investigation results:

I tried the Single Threaded Context path and that got me through the async tasks within MDS itself. But as soon as execution goes into MSAL or Azure.Identity, their async tasks are all created with ConfigureAwait(false) (standard/best practice for libraries), which moves those tasks off my Single Threaded Context and subsequently hangs with the same problem - no available threads in the system thread pool to service their tasks. So seems we might be stuck. We need sync-only APIs in MSAL and Azure Identity in order to avoid this issue. 😞