dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
844 stars 280 forks source link

TransactionAbortedException when performing queries in parallel inside a transaction scope #1675

Closed joostmeijles closed 9 months ago

joostmeijles commented 2 years ago

Describe the bug

When performing multiple queries in parallel, and each pair of 2 queries are inside a transaction scope, an unexpected transaction error is thrown (see below).

Exception message: System.Transactions.TransactionAbortedException: The transaction has aborted.
Stack trace:
Unhandled exception. System.Transactions.TransactionAbortedException: The transaction has aborted.
 ---> System.Transactions.TransactionPromotionException: Failure while attempting to promote transaction.
 ---> System.InvalidOperationException: The requested operation cannot be completed because the connection has been broken.
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote()
   --- End of inner exception stack trace ---
   at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote()
   at System.Transactions.TransactionStatePSPEOperation.PSPEPromote(InternalTransaction tx)
   at System.Transactions.TransactionStateDelegatedBase.EnterState(InternalTransaction tx)
   --- End of inner exception stack trace ---
   at System.Transactions.TransactionStateAborted.CheckForFinishedTransaction(InternalTransaction tx)
   at System.Transactions.EnlistableStates.Promote(InternalTransaction tx)
   at System.Transactions.Transaction.Promote()
   at System.Transactions.TransactionInterop.ConvertToDistributedTransaction(Transaction transaction)
   at System.Transactions.TransactionInterop.GetExportCookie(Transaction transaction, Byte[] whereabouts)
   at Microsoft.Data.SqlClient.SqlInternalConnection.GetTransactionCookie(Transaction transaction, Byte[] whereAbouts)
   at Microsoft.Data.SqlClient.SqlInternalConnection.EnlistNonNull(Transaction tx)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.ActivateConnection(Transaction transaction)
   at Microsoft.Data.ProviderBase.DbConnectionPool.PrepareConnection(DbConnection owningObject, DbConnectionInternal obj, Transaction transaction)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInterna
l& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.InternalOpenAsync(CancellationToken cancellationToken)
--- End of stack trace from previous location ---

To reproduce

Below code reproduces the error. It runs 100K times 2 queries in parallel where each 2 queries are inside a single transaction scope. Note that after each query the database connection is disposed (and thus closed). This is important to trigger the error. When we change the code, and use 1 connection for both queries and only open 1 connection the error does not occur (see also code comments below).

using System.Transactions;
using Microsoft.Data.SqlClient;

static async Task PerformTransactionWithQuery(int num)
{
    try
    {
        using (new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
        {
            // Adding "max pool size=1000;" to the connection string seems to trigger the problem less often
            string connStr = @"Server=.\SQLEXPRESS;Database=master;Trusted_Connection=True;Encrypt=False";

            var query = "SELECT COUNT(*) FROM sys.dm_tran_active_transactions";

            await using (var dbConn = new SqlConnection(connStr))
            {
                await dbConn.OpenAsync();

                await using (var command1 = new SqlCommand(query, dbConn))
                {
                    await command1.ExecuteScalarAsync();
                }
            } // Connection is disposed (and thus closed)

            await using (var dbConn = new SqlConnection(connStr)) 
            {
                // Reopening the connection triggers the following error:
                // System.Transactions.TransactionAbortedException: The transaction has aborted.
                //
                // NB. Using a single connection and opening it once does NOT trigger the error
                await dbConn.OpenAsync();

                await using (var command2 = new SqlCommand(query, dbConn))
                {
                    await command2.ExecuteScalarAsync();
                }
            } // Connection is disposed (and thus closed)

            //Do not complete transaction
        }
    }
    catch (Exception e)
    {
        Console.WriteLine($"Failed {num}");
        throw;
    }
} 

var tasks = Enumerable.Range(0, 100000).ToList().Select(PerformTransactionWithQuery);
await Task.WhenAll(tasks);

Expected behavior

Being able to use multiple connections (with the same connection string) in sequence within the same transaction without errors.

Further technical details

Microsoft.Data.SqlClient version: 4.1.0 .NET target: .NET 6 SQL Server version: SQL Server 2019 Operating system: Windows 11

DavoudEshtehari commented 2 years ago

@joostmeijles Thank you for the repro. I'll back to you if I had any questions to reproduce the issue.

joostmeijles commented 2 years ago

@DavoudEshtehari any update on an ETA to verify that this is a bug and get it fixed?

DavoudEshtehari commented 2 years ago

@joostmeijles I confirm it's reproducible with MDS and we'll return to it after the next release. I'm afraid we can't provide any estimation for a fix while it's in the investigation stage. We'll keep updating here with any findings, and we welcome any contribution as this is an open-source driver.

JRahnama commented 2 years ago

@joostmeijles can you test with a reduced MaxPoolSize to 20? I wonder if this is something similar to #422?

joostmeijles commented 2 years ago

@joostmeijles can you test with a reduced MaxPoolSize to 20? I wonder if this is something similar to #422?

Changing the pool size to 20 does not change much, but setting it very large (e.g. 1000) seems to trigger the problem less often.

JRahnama commented 2 years ago

@joostmeijles one more question, were you able to see the same problem on Windows 10?

JRahnama commented 2 years ago

@joostmeijles regarding the comment in the repro

// NB. Using a single connection and opening it once does NOT trigger the error

does this mean if you open a connection first and then try the rest of the code is working fine or just a single connection?

joostmeijles commented 2 years ago

@joostmeijles regarding the comment in the repro

// NB. Using a single connection and opening it once does NOT trigger the error

does this mean if you open a connection first and then try the rest of the code is working fine or just a single connection?

It means changing the code to:

await using (var dbConn = new SqlConnection(connStr))
{
    await dbConn.OpenAsync();

    await using (var command1 = new SqlCommand(query, dbConn))
    {
        await command1.ExecuteScalarAsync();
    }

    await using (var command2 = new SqlCommand(query, dbConn))
    {
        await command2.ExecuteScalarAsync();
    }
} // Connection is disposed (and thus closed)

does not trigger the error.

joostmeijles commented 2 years ago

@joostmeijles one more question, were you able to see the same problem on Windows 10?

Yes, same.

JRahnama commented 2 years ago

@joostmeijles couple of things I noticed:

  1. If we set the pooling to false the issue does not show up (this confirms the issue only occurs with pooling)
  2. if you set SqlConnection.ClearAllPools(); before the second using of dbConn your code should be working fine as well.
  3. If we start with one connection first and then try to open the rest the connections the issue does not com up. The above makes me suspicious that we need to adjust the code to do pool warming. Too many parallel connections are trying to create the very same pool. If we let one connection to get opened a pool is stablished and the rest will use that.
David-Engel commented 2 years ago

Edit: Timeout isn't the issue here.

It sounds to me like your transaction is timing out. With 100,000 parallel operations and a limited connection pool and thread pool, there will likely be a significant delay between the first query completing, the connection closing, and the second connection being obtained from the pool during the transaction. Idle transactions aren't kept around forever. The default transaction timeout is only one minute.

Useful blog post: https://docs.microsoft.com/en-za/archive/blogs/dbrowne/using-new-transactionscope-considered-harmful

joostmeijles commented 2 years ago

It sounds to me like your transaction is timing out. With 100,000 parallel operations and a limited connection pool and thread pool, there will likely be a significant delay between the first query completing, the connection closing, and the second connection being obtained from the pool during the transaction. Idle transactions aren't kept around forever. The default transaction timeout is only one minute.

Useful blog post: https://docs.microsoft.com/en-za/archive/blogs/dbrowne/using-new-transactionscope-considered-harmful

An error possibly occur within seconds, so doesn't seem to hit a timeout. Also, the 100K parallel operations are just to make it easier to reproduce. I have seen the issue also with less (~100) parallel operations.

Creating the transaction scope as follows (as mentioned in the blog post + async option) results in the same error.

static TransactionScope CreateTransactionScope()
{
    var transactionOptions = new TransactionOptions();
    transactionOptions.IsolationLevel = IsolationLevel.ReadCommitted;
    transactionOptions.Timeout = TransactionManager.MaximumTimeout;
    return new TransactionScope(TransactionScopeOption.Required, transactionOptions, TransactionScopeAsyncFlowOption.Enabled);
}
joostmeijles commented 2 years ago

@joostmeijles couple of things I noticed:

  1. If we set the pooling to false the issue does not show up (this confirms the issue only occurs with pooling)
  2. if you set SqlConnection.ClearAllPools(); before the second using of dbConn your code should be working fine as well.
  3. If we start with one connection first and then try to open the rest the connections the issue does not com up. The above makes me suspicious that we need to adjust the code to do pool warming. Too many parallel connections are trying to create the very same pool. If we let one connection to get opened a pool is stablished and the rest will use that.

Without connection pooling (on the connection string or by clearing the pool before the second connection) System.PlatformNotSupportedException: This platform does not support distributed transactions. will be thrown immediately.

DavoudEshtehari commented 1 year ago

Just posting an update:

I could reproduce this issue with Microsoft.Data.SqlClient since version 1.1.4 against supported target frameworks (.NET/.NET Core or .NET Framework). And System.Data.SqlClient doesn't throw "TransactionAbortedException".

This issue has been introduced by PR #543 for issue #237, which explains why it doesn't occur against SDS.

DavoudEshtehari commented 1 year ago

@joostmeijles Using the NuGet package from the pipeline that is introduced with PR #1814 I experienced some improvement with this issue. May I ask you to verify it with the provided link, please?

joostmeijles commented 1 year ago

@DavoudEshtehari thanks for the update! Sounds as a good change, but unfortunately I am still getting an error, although its a different one:

Unhandled exception. System.InvalidOperationException: The requested operation cannot be completed because the connection has been broken.
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Initialize()
   at System.Transactions.TransactionStatePSPEOperation.PSPEInitialize(InternalTransaction tx, IPromotableSinglePhaseNotification promotableSinglePhaseNotification, Guid promoterType)
   at System.Transactions.TransactionStateActive.EnlistPromotableSinglePhase(InternalTransaction tx, IPromotableSinglePhaseNotification promotableSinglePhaseNotification, Transaction atomicTransaction, Guid promoterType)
   at System.Transactions.Transaction.EnlistPromotableSinglePhase(IPromotableSinglePhaseNotification promotableSinglePhaseNotification, Guid promoterType)
   at System.Transactions.Transaction.EnlistPromotableSinglePhase(IPromotableSinglePhaseNotification promotableSinglePhaseNotification)
   at Microsoft.Data.SqlClient.SqlInternalConnection.EnlistNonNull(Transaction tx)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.ActivateConnection(Transaction transaction)
   at Microsoft.Data.ProviderBase.DbConnectionPool.PrepareConnection(DbConnection owningObject, DbConnectionInternal obj, Transaction transaction)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.InternalOpenAsync(CancellationToken cancellationToken)
DavoudEshtehari commented 1 year ago

@joostmeijles thank you for sharing the result. Also, I can see some improvements in the exception occurrence. With the proposed change it is less likely to happen (by increasing to over 1M iterations), which is not a constant behaviour and depends on the available machine's resources. Can you verify if you see the same progress, please?

joostmeijles commented 1 year ago

It fluctuates, doing 100K iterations. Some runs pass, others still bring up the error 1 or 2 times.

sdrapkin commented 1 year ago

I can reproduce it under both MDS-5.1.0-preview2.22314.2 and SDS-4.8.5 on Windows10, under NET5, NET6, and NET7. Making a query is not even required to reproduce (might take a few runs). Uncommenting ts.Complete() stops triggering the exceptions. IMHO this is a bug.

LinqPad:

async Task Main()
{
    var tasks = Enumerable.Range(0, 100_000).Select(PerformWork);
    await Task.WhenAll(tasks);
    "DONE!".Dump();
}
const string connString = "Server=.\\SQL2019;Database=TempDB;Application Name=TestApp;Trusted_Connection=True;TrustServerCertificate=True;Pooling=True;Encrypt=True;Max Pool Size = 5000;";

static TransactionScope CreateTransactionScope() =>
    new TransactionScope(TransactionScopeOption.Required,
        new TransactionOptions { IsolationLevel = System.Transactions.IsolationLevel.ReadCommitted }, TransactionScopeAsyncFlowOption.Enabled);

static async Task PerformWork(int num)
{
    try
    {
        using (var ts = CreateTransactionScope())
        {
            for (int i = 0; i < 2; ++i)
            {
                await using (var dbConn = new SqlConnection(connString))
                {
                    await dbConn.OpenAsync();
                    await Task.Delay(1);
                } // Connection is disposed (and thus closed)
            }//for
            //ts.Complete();
        }//ts
    }//try
    catch (Exception e)
    {
        Console.WriteLine($"Failed {num}");
        e.Dump(); Environment.Exit(-1);
        throw;
    }
}
cheenamalhotra commented 1 year ago

@joostmeijles @sdrapkin

What you're trying to do is unsafe and I don't agree this is bug. Old behavior before v1.1.4 was a bug where accounting was not done for externally aborted transactions. I would request a read on https://github.com/dotnet/SqlClient/issues/237#issuecomment-569856610 and https://github.com/dotnet/SqlClient/issues/237#issuecomment-569965788 to understand the root cause.

How distributed transactions work is that they are decoupled from your application and are running within the MS DTC service. Which means if your app tries to be super-fast and expect MS DTC to keep coordinating with SQL Server for the MISSED commit/rollbacks, it needs time to do so, as it needs to abort the transaction on the server.

From driver end, if you closed a POOLED connection, server-side connection is kept open. So now if you happen to re-open the connection BEFORE MSDTC could terminate your previous transaction, you find that AFTER opening the connection, transaction gets aborted. It should never happen that the same OLD transaction continues to be used, and I hope you'd agree with me on that! So, someone must keep things in order. Either it's your app, or the driver.

If you want driver to consistently terminate your connection alongwith Distributed transaction, DON'T use connection pool! If you want to use connection pool for performance needs, you MUST manage your transaction lifecycle and cannot leave them to be aborted by server.

Likewise, in the above repro, if you do call ts.Complete() things do just fine!

I would certainly recommend documentation on this somewhere so it's not a surprise to end users as the default behavior of driver is with pooling enabled, which is clearly not intuitive to users. cc @David-Engel

cheenamalhotra commented 1 year ago

Also, to address @DavoudEshtehari 's PR https://github.com/dotnet/SqlClient/pull/1814 and why it made visible difference:

When the PR changed Pool design from Stack to a Queue, it's clear that instead of fetching the latest connection from stack, you'd always get the oldest connection, which gives more time to the recently added connection in the pool to terminate transaction.

Since it's just pushing the latest connection to end of queue, it's not a solution IMO. You can still reproduce this behavior with a single connection going in and out of the pool if you limit your pool size.

joostmeijles commented 1 year ago

@cheenamalhotra

As far as I understand the provided example code should not result in a distributed transaction as the connection gets closed before the next connection is opened.

Please note that I am not using the MS DTC service (its not running on my machine). And I think that the .NET version I use does not support distributed transactions?

sdrapkin commented 1 year ago

@cheenamalhotra There are no distributed transactions in my LinqPad example. MSDTC escalation does not happen because only one connection-string is used. If you replace await dbConn.OpenAsync(); with dbConn.Open(); (ie. merely switch from async to sync opening of a connection), this code works just fine (ie. even with ts.Complete() commented out).

When the 1st connection-open request occurs, it should enlist that connection into the Transaction. Then 1st-connection-close occurs (but the inner-connection is not really closed - just returns to the pool). Then the 2nd connection-open request should receive the exact same already-open inner-connection back from the pool (ie. the same one durable resource is enlisted). This is exactly what happens when dbConn.Open() is used. There should be no observable behavior differences between dbConn.Open(); and await dbConn.OpenAsync(); - yet the async version crashes.

cheenamalhotra commented 1 year ago

If you replace await dbConn.OpenAsync(); with dbConn.Open(); (ie. merely switch from async to sync opening of a connection), this code works just fine (ie. even with ts.Complete() commented out).

This is interesting and I'd agree I didn't look into that to be the possibility, so it certainly points to something missing out with OpenAsync(), which is very strange since both APIs 'should' follow the same path. 🤔

There are no distributed transactions

I possibly was imagining .NET 7, but never mind. It's still a case of delegated transaction as we're looking at System.Transactions.TransactionPromotionException

Let me dive into it and will get back.

cheenamalhotra commented 1 year ago

I reproed again with Open() and it does seem to happen, maybe not everytime but it does sometimes fail (notice the stacktrace):

System.Transactions.TransactionAbortedException: The transaction has aborted.
 ---> System.Transactions.TransactionPromotionException: Failure while attempting to promote transaction.
 ---> System.InvalidOperationException: The requested operation cannot be completed because the connection has been broken.
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlInternalConnectionTds.cs:line 992
   at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote() in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlDelegatedTransaction.cs:line 157
   --- End of inner exception stack trace ---
   at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote() in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlDelegatedTransaction.cs:line 228
   at System.Transactions.TransactionStatePSPEOperation.PSPEPromote(InternalTransaction tx)
   at System.Transactions.TransactionStateDelegatedBase.EnterState(InternalTransaction tx)
   --- End of inner exception stack trace ---
   at System.Transactions.TransactionStateAborted.CheckForFinishedTransaction(InternalTransaction tx)
   at System.Transactions.EnlistableStates.Promote(InternalTransaction tx)
   at System.Transactions.Transaction.Promote()
   at System.Transactions.TransactionInterop.ConvertToDistributedTransaction(Transaction transaction)
   at System.Transactions.TransactionInterop.GetExportCookie(Transaction transaction, Byte[] whereabouts)
   at Microsoft.Data.SqlClient.SqlInternalConnection.GetTransactionCookie(Transaction transaction, Byte[] whereAbouts) in E:\sqlclient\src\Microsoft.Data.SqlClient\src\Microsoft\Data\SqlClient\SqlInternalConnection.cs:line 753
   at Microsoft.Data.SqlClient.SqlInternalConnection.EnlistNonNull(Transaction tx) in E:\sqlclient\src\Microsoft.Data.SqlClient\src\Microsoft\Data\SqlClient\SqlInternalConnection.cs:line 590
   at Microsoft.Data.SqlClient.SqlInternalConnection.Enlist(Transaction tx) in E:\sqlclient\src\Microsoft.Data.SqlClient\src\Microsoft\Data\SqlClient\SqlInternalConnection.cs:line 460
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.Activate(Transaction transaction) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlInternalConnectionTds.cs:line 887
   at Microsoft.Data.ProviderBase.DbConnectionInternal.ActivateConnection(Transaction transaction) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\ProviderBase\DbConnectionInternal.cs:line 221
   at Microsoft.Data.ProviderBase.DbConnectionPool.PrepareConnection(DbConnection owningObject, DbConnectionInternal obj, Transaction transaction) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\ProviderBase\DbConnectionPool.cs:line 1333
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\ProviderBase\DbConnectionPool.cs:line 1317
   at Microsoft.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\ProviderBase\DbConnectionPool.cs:line 1132
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\ProviderBase\DbConnectionFactory.cs:line 122
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Common\src\Microsoft\Data\ProviderBase\DbConnectionInternal.cs:line 341
   at Microsoft.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Common\src\Microsoft\Data\ProviderBase\DbConnectionClosed.cs:line 39
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlConnection.cs:line 1844
   at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides) in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlConnection.cs:line 1333
   at Microsoft.Data.SqlClient.SqlConnection.Open() in E:\sqlclient\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlConnection.cs:line 1310

The problem seems to be the fact the driver or System.Transactions (to be precise) even attempts to promote transaction, where it should not and does not in a normal flow. The trigger scenario is when below call fails to enlist connection into delegated transaction and returns false which is possible if transaction is not yet aborted:

https://github.com/dotnet/SqlClient/blob/eb1b92285152ea36533dc506a0ba82ea6fb7248b/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlInternalConnection.cs#L529

The transaction promotion is triggered by System.Transactions when the driver attempts to grab transaction cookie here:

https://github.com/dotnet/SqlClient/blob/eb1b92285152ea36533dc506a0ba82ea6fb7248b/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/SqlInternalConnection.cs#L586

As you can see it leads to calling at System.Transactions.TransactionInterop.ConvertToDistributedTransaction(Transaction transaction) in the stacktrace above.

cc @roji do you know why would System.Transactions try to make the transaction distributed, calling promote, in .NET 6?

DavoudEshtehari commented 1 year ago

Since it's just pushing the latest connection to end of queue, it's not a solution IMO. You can still reproduce this behavior with a single connection going in and out of the pool if you limit your pool size.

@cheenamalhotra Thank you for taking time on this issue. I agree PR #1814 is not a solution. As you mentioned using a Queue data structure can increase the chance of a transaction completion, which I've proposed it as an improvement to reduce the possibitity of a race condition not preventing it.

Wraith2 commented 1 year ago

alleviating this problem but not fixing it while degrading performance in other scenarios doesn't seem like a good tradeoff.

roji commented 1 year ago

@cheenamalhotra I was also confronted with unexpected escalation to a distributed transaction recently (https://github.com/dotnet/efcore/issues/29054), and was assuming it was related to the .NET 7.0 distributed transaction changes... It's interesting that this occurs on .NET 6.0.

I'm not sure how this could be happening. I do know that SqlInternalConnection.EnlistNonNull would only call SqlInternalConnection.GetTransactionCookie (which causes the escalation) if the call to Transaction.EnlistPromotableSinglePhase return false (barring bugs in Sys.Tx of course). This would mean that there are somehow two distinct connections at play; some sort of bug which messes up the TransactedConnectionPool could conceivably cause this, where e.g. the second open incorrectly returns a different connection. However, that doesn't explain why that connection is broken, as indicated by exception...

I suspect the exception we're seeing is the visible, end result of some earlier problem that we're not seeing... It's probably a good idea to concentrate on why a connection was broken in the first place, and ideally trying to step through the SqlClient code to see exactly what's happening (not trivial as this is non-deterministic). Since this happens in .NET 6.0, I wouldn't assume a Sys.Tx issue unless there's a stronger indication.

I unfortunately am too taken up by 7.0 stabilization work at the moment to dive into this myself, but would be happy to follow along and try to help out...

cheenamalhotra commented 1 year ago

Hi @roji

The promotion design I believe is as-is in .NET 6 and is easily reproducible with non-pooled connections (code you may run in .NET 6). It's just not supported, so throws error.

The aborted transaction issue is due to Sys.Tx failing to enlist transaction in second connection if the transaction is still in use, which means we try to get cookie and that hits promotion which doesn't complete either.

I'm not sure how can the driver prevent enlistment failures to be honest, since it's out of driver's control. There's is no guarantee with change of pool design or anything else we come up with. We cannot enlist in implicit transaction during promotion failure, as otherwise connection will go out of transaction scope leading to #237. We absolutely want the connection to be able to enlist in the scoped transaction, and best bet for that is apps taking control of their transaction scope, by calling complete() or not pooling connections so it can terminate instantly and set transaction free, if transaction cannot be completed.

roji commented 1 year ago

@cheenamalhotra what I'm missing here, is why a second connection is involved here at all... Since the first connection is closed before the second is opened, it should be in SqlClient's "transacted" pool, and handed out when the second open occurs. When this works correctly (and it does most of the time), SqlClient only ever enlists one connection to a given Transaction; but here it seems like we're seeing the enlistment of a second connection.

(Of course, this requires pooling to be on, which is why there's an immediate, predictable exception without pooling: two unpooled connections get enlisted. But with pooling, only one should ever get enlisted).

The aborted transaction issue is due to Sys.Tx failing to enlist transaction in second connection if the transaction is still in use, which means we try to get cookie and that hits promotion which doesn't complete either.

But the exception shows that the connection has been broken - do we know when/why that occurred?

Unhandled exception. System.Transactions.TransactionAbortedException: The transaction has aborted.
 ---> System.Transactions.TransactionPromotionException: Failure while attempting to promote transaction.
 ---> System.InvalidOperationException: The requested operation cannot be completed because the connection has been broken.

In any case, I agree that a change in pool design isn't the way forward here. We need to understand the root cause, i.e. if there's a bug in SqlClient (causing it to somehow enlist two connections when it shouldn't) or in Sys.Tx.

roji commented 1 year ago

I've spent of time and I think I've understood what's going on.

Consider the following code:

await Task.WhenAll(Enumerable.Range(0, 10).Select(_ => Task.Run(() =>
{
    for (var i = 0; i < 1_000_000; i++)
    {
        using var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled);
        using var dbConn = new SqlConnection(ConnectionString);
        dbConn.Open();

        using var command = new SqlCommand("SELECT COUNT(*) FROM sys.dm_tran_active_transactions", dbConn);
        command.ExecuteScalar();
    }
})));
  1. Some transaction scope is started, a connection is opened and enlisted to it, some operation is performed, and the connection is disposed.
  2. When the scope is disposed, Sys.Tx first calls SqlDelegatedTransaction.Rollback.
  3. Rollback cleans everything up, and then calls connection.CleanupConnectionOnTransactionCompletion(_atomicTransaction). This calls DbConnectionInternal.DetachTransaction, which calls DelegatedTransactionEnded, which puts the connection back into the pool for general usage.
    • This makes sense: the transaction has completed, and the connection was closed before the TransactionScope was disposed, so it was in "stasis", waiting for the transaction to complete before becoming available again.
  4. Since the connection was made available to everyone, some other thread concurrently takes the connection out of the (general) pool, and starts another unrelated transaction scope on it.
  5. Meantime, Sys.Tx isn't done: after calling Rollback, it also fires the Transaction.TransactionCompleted event, where we're registered DbConnectionInternal.TransactionCompletedEvent. Note that at this point the connection is in concurrent use by the 2nd user.
  6. The event handler path gets us to TransactionEnded on the new SqlDelegatedTransaction. It sees that connection isn't null, and then dooms the connection; this code path was meant for when a distributed transaction ends before rollback/commit (see comment on TransactionEnded), but it gets mistakenly invoked here.

As a quick test, I've commented out the event handler registration (this line); this means that SqlClient no longer listens to Transaction.TransactionCompleted. Before this change, the repro in this issue consistently failed, whereas after the change it consistently completes successfully. Since the bug apparently isn't present in the .NET Framework version, I'd look at how it's working and what changes were made to the .NET Core one which led to this.

I'm not yet quite sure why this bug would lead to promotion to a distributed transaction as we're seeing above, but I'm guessing that with the dooming of arbitrary connections, the state could easily get screwed up and lead to undefined results.

roji commented 1 year ago

Note the similarity with #1800, which is also about concurrent interaction between SqlDelegationTransaction's enlistment callbacks (Rollback, SinglePhaseCommit) and the Transaction.TransactionCompleted event. I'd recommend sitting down and getting a good understanding of the situation - and especially of the changes that occurred since the fork with .NET Framework.

DavoudEshtehari commented 1 year ago

@roji thank you for the update on this issue. Unfortunately, this bug exists against the .NET Framework as well: https://github.com/dotnet/SqlClient/issues/1675#issuecomment-1289817442.

roji commented 1 year ago

@DavoudEshtehari ah OK, I missed that.

In general, it seems that the connection cannot be returned to the (general) pool until the transaction has fully completed; since SqlClient registers for the TransactionCompleted event, that's when this should happen - not earlier. Alternatively, it's interesting to ask why SqlClient registers for the event in the first place - the transaction enlistment (SqlDelegatedTransaction) should already receive all the necessary events. I'm guessing there's a reason why the event registration is done, but removing that would certainly simplify a lot of things (and prevent bugs such as this and #1800).

Anyway, I'd be happy to help with any further thinking on this.

ByteMe-web commented 1 year ago

We have a similar issue to this being investigated. We are running under the .NET Framework, and create a number of threads to process a set of items. Each work item creates its own transaction, and runs a bunch of sql statements. The sql calls are not async and MSTDC is not involved. All work items commit or are in progress i.e. no rollbacks or absent tx.Complete(). In order to temporarily get around this problem I have added the .NET ThreadId to the "Application Name" connection string when opening the connection so that connection pooling is only done per thread.
Not proposing as a solution by any means but may aid with investigation.

We do make use of Transaction events but these are handled on the same thread. I would have thought these would be dealt with before going back into the pool?

sdrapkin commented 1 year ago

Here is a stacktrace with line numbers, using

The line that throws: https://github.com/dotnet/SqlClient/blob/b3cfdbbbf9f126f2d0447feb97ae592f19d1b498/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlInternalConnectionTds.cs#L992

DavoudEshtehari commented 1 year ago

I'd like to share my findings:

cc: @roji

roji commented 1 year ago

@DavoudEshtehari sure, it makes sense that removing the dooming in the TransactionEnded event handler makes the error go away (although there are reasons why it is there).

But I think the more fundamental questions here are whether the connection should be put back into the pool before the TransactionCompleted event is fired (letting other threads start using it); or more broadly, why SqlClient even subscribes to TransactionCompleted at all, i.e. why is the SqlDelegatedTransaction enlistment not sufficient to receive all events (e.g. from MSDTC) relating to the transaction.

Wraith2 commented 1 year ago

But I think the more fundamental questions here are whether the connection should be put back into the pool before the TransactionCompleted event is fired (letting other threads start using it);

I can't think of any good reason for that to happen. It simply doesn't make sense.

jetizz commented 1 year ago

Any viable workaround for this issue? We're getting occasional errors (no exact metrics, about one in 10k connections), and have no idea how to bypass the issue for now. We tried increasing pool size, but it did not help. Is the only way to avoid this - to disable connection pooling? Performance hit will be considerable, and I'd like to avoid it if possible.

Exception we're getting is the same as in original post. Environment is asp.net, with each request opening up to 10-20 short lived connections.

sdrapkin commented 1 year ago

I've tested it on:

It still throws.

InvalidOperationException - The requested operation cannot be completed because the connection has been broken.

Stacktrace:
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction(TransactionRequest transactionRequest, String name, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest) in D:\a\_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlInternalConnectionTds.cs:line 1001
   at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote() in D:\a\_work\1\s\src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\SqlDelegatedTransaction.cs:line 161
DavoudEshtehari commented 1 year ago

@jetizz This occurs when pooling is enabled, and the event source logs show it's between different threads. So far, a workaround with pooling would be directing connection requests to different connection pools based on the thread in use to avoid conflict between different threads in presence of a transaction. The ApplicationName property in the connection string can be used for this purpose.

ByteMe-web commented 1 year ago

@jetizz This occurs when pooling is enabled, and the event source logs show it's between different threads. So far, a workaround with pooling would be directing connection requests to different connection pools based on the thread in use to avoid conflict between different threads in presence of a transaction. The ApplicationName property in the connection string can be used for this purpose.

@DavoudEshtehari - Yes that's what I noted earlier. However what I didn't notice at the time was that this causes the transaction to be distributed. Running under .NET 6 on linux this will then fail as distributed transactions are not allowed.

DavoudEshtehari commented 1 year ago

@ByteMe-web It seems I missed your comment, at least we agree on it. And thank you for the update.

DavoudEshtehari commented 1 year ago

Can you guys try including timeout with TransactionScope and share the result? I don't see consistent results, but significant effects with Timeout = TimeSpan.FromSeconds(180).

jetizz commented 1 year ago

@DavoudEshtehari Tried setting the timeout to 180 seconds, issue still occurs. Also tried 60 seconds and 10 minutes (TransactionManager.DefaultTimeout and TransactionManager.MaximumTimeout) - both appear to make no difference. As @ByteMe-web said, we can't apply previously stated workaround as we're running under linux (aspnet:7.0-bullseye-slim as base to be precise). Again, I do not have a suite to measure and reliably repro this issue, but reading from logs, it occurs 2-3 times per 15k transactions. We do however use up to 20 (very short lived) connections during single transaction. Thought of rewriting how connections are used (roughly - scope to request instead of transient), but it would require significant effort. It should result in lower pool usage but no guarantee it would fully resolve this issue, so I'm reluctant.

arical commented 1 year ago

Mostly noting this down as another point of data for the team.

After upgrading our project to EF 7/SqlClient 5.1.0 from EF 6/SqlClient 5.0.1, we've started receiving this exception intermittently on our CI while running our test suite. It happens roughly once in 3 runs, which is annoying enough to be noticeable given we have had a fairly stable build up until this point.

Setup:

DavoudEshtehari commented 1 year ago

I can't see the issue with the repro by upgrading to Windows 11 version 21H2 (OS Build 22000.1936). Please, share your results.

Update: By increasing the number of cnn.OpenAsyn() calls to 20 on each thread, it started to break with the same exception after a few runs.

DavoudEshtehari commented 9 months ago

This package from the PR #2301 can be used for verification. Please, give it a try as this issue varies across different machines.

sdrapkin commented 9 months ago

@DavoudEshtehari I have tested Nuget 5.20.0-pull.101801 (Windows 10/64bit with .NET 8.0.1) and it consistently does not throw (ie. works). It also works for me on latest .NET 7 and .NET 6. :green_heart:

I have also tested the exact same repro-code with Nuget 5.2.0-preview4.233342.2, and it consistently throws on every run. :boom: Thus, so far this PR is very encouraging, thanks for the work.

Can this fix please be ported to the System.Data.SqlClient as well?