kwlin opened 1 year ago
My colleague @MarcWils noticed:
If we put the following line of code before Task.Run in our test project, the application runs just fine:

```csharp
ThreadPool.SetMinThreads(100, 100);
var tasks = Enumerable ...
```
It seems that when there are enough threads available to handle the tasks, the timeout exceptions don't occur in our reproduction project.
While debugging, he also noticed that a lot of threads are waiting while trying to get a SqlConnection. See the screenshot.
This seems like a bad practice.
@kwlin this seems like a duplicate of issue #422
@JRahnama thanks for your reply.
I've changed our test project so MARS is explicitly disabled in the connection string and we're running on a Windows machine, but the exception still occurs.
> @JRahnama thanks for your reply.
> I've changed our test project so MARS is explicitly disabled in the connection string and we're running on a Windows machine, but the exception still occurs.
That sounds like a thread starvation issue, and the only workaround for now would be limiting threads as you have done. We have noticed that setting Max Pool Size to 20 helps too. There have also been some improvements in .NET 7 and later versions. We will go through the repro and see if there is anything new besides the thread starvation issue.
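For reference, a minimal sketch of that Max Pool Size workaround (the server, database, and authentication values below are placeholders, not from this issue):

```csharp
using Microsoft.Data.SqlClient;

// Hypothetical values; MaxPoolSize is the point here. Capping the pool
// limits how many threads can pile up behind token acquisition at once.
var builder = new SqlConnectionStringBuilder
{
    DataSource = "tcp:myserver.database.windows.net",
    InitialCatalog = "mydb",
    Authentication = SqlAuthenticationMethod.ActiveDirectoryManagedIdentity,
    MaxPoolSize = 20
};

using var connection = new SqlConnection(builder.ConnectionString);
connection.Open();
```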
Hello, I am a colleague of Ka-Wai. We looked at the issue together last week.
I believe there's a conceptual design flaw in SqlInternalConnectionTds calling authProvider.AcquireTokenAsync and then trying to go from async to sync. There are many ways to do so, but none of them are ideal.
The problem we're facing is a burst scenario where we have many incoming connections as soon as the application starts. Debugging this scenario led us to SqlConnection.Open, which keeps worker pool threads locked until one of them is able to get the authentication token. The problem here is that a Task is scheduled to fetch the authentication token, and as all of the threads are busy waiting, the task scheduler cannot delegate the task to one of the worker pool threads. The task can successfully run once the thread pool ramps up more threads. However, there's a delay involved. In our scenario (burst scenario and max pool size = 100), it can take up to 20 minutes until the authentication token is finally successfully retrieved. Ideally, SqlConnection.Open would not have asynchronous or task-based code down the call stack.
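To illustrate the mechanism, here is a minimal, hypothetical repro of sync-over-async starvation (no SQL involved; FetchTokenAsync is a made-up stand-in for the token request):

```csharp
using System.Diagnostics;

var sw = Stopwatch.StartNew();

// Queue 64 blocking work items, mimicking many concurrent SqlConnection.Open
// calls that each block a worker thread while waiting on async token work.
var workers = Enumerable.Range(0, 64).Select(_ => Task.Run(() =>
{
    // Stand-in for Open -> GetFedAuthToken: block synchronously on async work.
    FetchTokenAsync().GetAwaiter().GetResult();
})).ToArray();

await Task.WhenAll(workers);
Console.WriteLine($"Done after {sw.Elapsed}"); // typically tens of seconds on
                                               // default settings, not ~100 ms

// Stand-in for the async token request (e.g. MSAL over HTTP). Its continuation
// also needs a thread pool thread, which is exactly what the blocked callers
// are hogging.
static async Task<string> FetchTokenAsync()
{
    await Task.Delay(100);
    return "token";
}
```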
Restricting the max. pool size to a lower value may be a workaround for this issue. However, this may cause other issues in the application.
Setting ThreadPool.MinThreads might be a workaround, but this can also impact application performance.
SqlConnection.OpenAsync doesn't seem to suffer from this specific thread locking issue. Threads aren't locked; tasks simply wait until the first SqlConnection is able to get the authentication token. So burst scenarios work like a charm.
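For comparison, a minimal sketch of the burst scenario using OpenAsync (the connection string values are placeholders, not from this issue):

```csharp
using Microsoft.Data.SqlClient;

// Hypothetical burst: 100 concurrent opens at startup. Awaiting callers yield
// their threads instead of blocking them, so the token-acquisition task can
// still be scheduled on the thread pool.
var connectionString =
    "Server=tcp:myserver.database.windows.net;Database=mydb;" +
    "Authentication=Active Directory Managed Identity;";

var opens = Enumerable.Range(0, 100).Select(async _ =>
{
    await using var connection = new SqlConnection(connectionString);
    await connection.OpenAsync(); // yields instead of blocking a worker thread
    // ... execute commands ...
});

await Task.WhenAll(opens);
```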
@MarcWils, have you tried opening one connection first as pool warming and trying the rest after?
@JRahnama we tried that.
But when the access token expired (after 24+ hours) we observed the same issue with thread starvation.
> I believe there's a conceptual design flaw in SqlInternalConnectionTds calling authProvider.AcquireTokenAsync and then trying to go from async to sync. There are many ways to do so, but none of them are ideal.
You are correct here. However, MSAL doesn't offer any sync APIs and itself only uses async HTTP client APIs. So, unfortunately, SqlClient is ultimately limited to calling an async API from within the sync Connection.Open() flow.
I was looking for better ways of handling the unfortunate pattern (I've read the common "solutions", which all have at least one pitfall) and thinking about using a single-thread synchronization context that would dedicate a thread to handling all the async token acquisition tasks, to keep them off the system thread pool. I got the idea from looking at how Npgsql handles a similar situation: https://github.com/npgsql/npgsql/blob/v8.0.0-preview.4/src/Npgsql/SingleThreadSynchronizationContext.cs
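For illustration, a minimal sketch of such a context (hypothetical; loosely modeled on the linked Npgsql file, not its actual code):

```csharp
using System.Collections.Concurrent;

// A SynchronizationContext that runs all posted continuations on one
// dedicated thread, keeping token acquisition off the system thread pool.
internal sealed class SingleThreadSynchronizationContext : SynchronizationContext, IDisposable
{
    private readonly BlockingCollection<(SendOrPostCallback Callback, object? State)> _queue = new();
    private readonly Thread _thread;

    public SingleThreadSynchronizationContext(string threadName)
    {
        _thread = new Thread(WorkLoop) { Name = threadName, IsBackground = true };
        _thread.Start();
    }

    // async/await continuations arrive here and are queued for the dedicated thread.
    public override void Post(SendOrPostCallback d, object? state) => _queue.Add((d, state));

    private void WorkLoop()
    {
        SetSynchronizationContext(this); // continuations posted from this thread stay here
        foreach (var (callback, state) in _queue.GetConsumingEnumerable())
            callback(state);
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
        _thread.Join();
    }
}
```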
> I was looking for better ways of handling the unfortunate pattern (I've read the common "solutions", which all have at least one pitfall) and thinking about using a single-thread synchronization context that would dedicate a thread to handling all the async token acquisition tasks, to keep them off the system thread pool. I got the idea from looking at how Npgsql handles a similar situation: https://github.com/npgsql/npgsql/blob/v8.0.0-preview.4/src/Npgsql/SingleThreadSynchronizationContext.cs
Seems like a creative solution which might work. If I understand it correctly, the authentication token is cached at the level of the connection pool: if a cached token is found, no new call is made to the authentication token provider. We use managed identities, where tokens have a lifetime of 24 hours. In this scenario, the SingleThreadSynchronizationContext would create a new thread when the first SQL connection is opened to get the token. Subsequent SQL connections can use the cached token, and the SingleThreadSynchronizationContext would let the thread terminate. After 24 hours, a new thread is started to refresh the token, which after a short while will also terminate. The overhead seems very limited, at least for the sync .Open.
For our use case, @kwlin is working very hard to refactor all SQL calls to async. That should provide better performance overall.
> For our use case, @kwlin is working very hard to refactor all SQL calls to async. That should provide better performance overall.
Fair warning not to be overly enthusiastic about async, since SqlClient has quite a few issues with it (#2118).
@vonzshik thanks for the heads-up :)
We had exactly the same issue, with a major outage today, using Web Apps, a user-assigned identity, and SQL Managed Instance. We've changed back to using SQL logins for now. In our connection string we have Min Pool Size=5 and Max Pool Size=300. Any suggestions on what to try, or should we wait for a fix? We are working on a .NET 6 to 7 upgrade; would this help?
@JljHook you could try using the async methods of Microsoft.Data.SqlClient instead of the sync methods (like OpenAsync instead of Open). That seems to fix our problems.
Keep in mind the caveats provided by @vonzshik in #2118.
@kwlin I gather you fixed your issues with the async methods? Is that considered a workaround? It seems like the safest option would be to wait for a fix for this?
Adding some investigation results:
I tried the single-threaded context path, and that got me through the async tasks within MDS itself. But as soon as execution goes into MSAL or Azure.Identity, their async tasks are all created with ConfigureAwait(false) (standard best practice for libraries), which moves those tasks off my single-threaded context and subsequently hangs with the same problem: no available threads in the system thread pool to service their tasks. So it seems we might be stuck. We would need sync-only APIs in MSAL and Azure.Identity in order to avoid this issue. 😞
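To make that escape concrete, a tiny illustration (reusing the hypothetical SingleThreadSynchronizationContext sketched above): once a library method awaits with ConfigureAwait(false), everything after that await resumes on the thread pool, regardless of the caller's context.

```csharp
// Illustrative only: ConfigureAwait(false) discards the captured context.
var context = new SingleThreadSynchronizationContext("TokenThread");
SynchronizationContext.SetSynchronizationContext(context);

await LibraryCallAsync();

static async Task LibraryCallAsync()
{
    await Task.Delay(10).ConfigureAwait(false);
    // We are now on a thread pool thread, not the dedicated thread, so a
    // starved pool still stalls everything after this await.
    Console.WriteLine(SynchronizationContext.Current is null
        ? "Resumed on the thread pool"
        : "Resumed on the custom context");
}
```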
Describe the bug
We are experiencing connection pool problems during high load and when the Azure access token has expired. An InvalidOperationException (connection pool timeout) will be thrown.
This exception is thrown for about 20 minutes, after which our component's operations proceed normally as expected. The exception seems to occur during startup of our application or when the Azure access token has expired.
To reproduce
I've included a Visual Studio solution that reproduces the issue with similar exceptions.
HighLoadAndAzureActiveDirectory.zip
Expected behavior
We expect that the SQL connection pool shouldn't throw an InvalidOperationException when getting a SQL connection during high load or when the access token has expired.
Further technical details
Our system is a .NET 6 & CoreWCF application hosted in Service Fabric. We are using Managed Identity to connect to an Azure SQL database. We have to handle more than 100 concurrent SOAP requests during normal operations (and all of the concurrent SOAP requests must connect to our Azure SQL database with the Managed Identity). We don't have any special configuration for pool size; just default values.
We are using the following libraries:
Additional context
We are seeing lock contention/thread pool starvation when acquiring an access token for a SQL connection. I've provided a reproduction project with a similar effect when multiple tasks open a SQL connection to an Azure SQL Database with Azure Active Directory authentication.
The following are screenshots of our debugging session:
It seems two threads are blocked in GetFedAuthToken, while 49 threads are trying to get a connection from DbConnectionPool (TryGetConnection).
Browsing the Microsoft.Data.SqlClient GitHub repository, we noticed the following code:
https://github.com/dotnet/SqlClient/blob/7b2e2fc9f002e8b270894595f5f92670670053d7/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlInternalConnectionTds.cs#L2416C50-L2416C50
Could this code create lock contention/thread pool starvation? We've noticed before that this construction can cause lock contention/thread pool starvation. In some places this construction is used to go from async to sync: https://github.com/aspnet/AspNetIdentity/blob/main/src/Microsoft.AspNet.Identity.Core/AsyncHelper.cs
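For illustration, the general shape of that construction (simplified; not the actual SqlClient or AspNetIdentity code):

```csharp
using System;
using System.Threading.Tasks;

public static class AsyncHelper
{
    // Blocks the calling thread until the async work completes. Under a burst,
    // many callers block here while the inner task still needs a thread pool
    // thread to run on -- which is the starvation scenario described above.
    public static TResult RunSync<TResult>(Func<Task<TResult>> func)
        => Task.Run(func).GetAwaiter().GetResult();
}
```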
PS: I'm not sure if I'm using the correct terminology, lock contention vs. thread pool starvation.