dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
848 stars 283 forks source link

Deadlock when using SinglePhaseCommit with distributed transactions #1800

Closed roji closed 1 year ago

roji commented 2 years ago

In .NET 7.0, the OleTx/MSDTC distributed transactions support has been ported over from .NET Framework (https://github.com/dotnet/runtime/issues/715). This means that it will now be possible to use distributed transactions with .NET 7.0, on Windows only.

We've received a bug report (https://github.com/dotnet/runtime/issues/76010) with two separate console programs, enlisting which propagate a transaction. The initiating program (the "outer" transaction) enlists only a single SqlConnection to the transaction, and hangs when the transaction is committed. This happens quite consistently, but not 100% reliably. To reproduce this, simply run the .NET 7 version of the application in https://github.com/dotnet/runtime/issues/76010 in two separate windows; the first one should hang.

When the program hangs, two threads are stuck with the following stack traces:

Stack 1

SinglePhaseEnlistment.Committed()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\d42a13b4dda85c7ff6fe99f383b4a1b279abd39735cb45449b7401d99c387\SinglePhaseEnlistment.cs:line 62
SqlDelegatedTransaction.SinglePhaseCommit()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\7c62ff4f1cfc3e61c3e65f286d919f9f0bbb2ef6a822b99cb37623d6d3a891\SqlDelegatedTransaction.cs:line 389
TransactionStateDelegatedCommitting.EnterState()
CommittableTransaction.Commit()
TransactionScope.InternalDispose()
TransactionScope.Dispose()
Program.StartTransaction()
async Program.Main()
AsyncMethodBuilderCore.Start<System.__Canon>()
Program.Main()
Program.<Main>()

Stack 2:

DbConnectionInternal.DetachTransaction()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\71847c97c677567f13e9df7e4f1f44a99f349762bb77c34f819752ee84b71\DbConnectionInternal.cs:line 413
DbConnectionInternal.CleanupConnectionOnTransactionCompletion()at C:\Users\shrojans\AppData\Roaming\JetBrains\Rider2022.2\resharper-host\SourcesCache\71847c97c677567f13e9df7e4f1f44a99f349762bb77c34f819752ee84b71\DbConnectionInternal.cs:line 436
DbConnectionInternal.TransactionCompletedEvent()
TransactionStatePromotedCommitted.EnterState()
InternalTransaction.DistributedTransactionOutcome()
RealOletxTransaction.FireOutcome()
OutcomeEnlistment.InvokeOutcomeFunction()
OletxTransactionManager.ShimNotificationCallback()
PortableThreadPool.CompleteWait()
ThreadPoolWorkQueue.Dispatch()
PortableThreadPool.WorkerThread.WorkerThreadStart()
[Native to Managed Transition]

Here is the general flow leading up to this state:

  1. When the TransactionScope is disposed, we get to SqlDelegatedTransaction.SinglePhaseCommit(), which is the SqlClient part that interacts with System.Transactions. a. SqlDelegatedTransaction.SinglePhaseCommit() locks connection (the SqlInternalConnectionTds) b. It then sends Commit to SQL Server. Since the transaction is delegated, the triggers the commit with MSDTC, which causes commit notifications to be sent to the enlisted parties. This includes the running program, which starts thread 2 below running concurrently. b. Finally, it proceeds to call enlistment.Committed(), which is SinglePhaseEnlistment.Committed(), while keeping the lock on SqlInternalConnectionTds. That method then takes _internalEnlistment.SyncRoot, which is the InternalTransaction.

  2. The above commit causes us to get a notification from the native layer (MSDTC) that the transaction committed (triggered above in 1b). a. We get to InternalTransaction.DistributedTransactionOutcome(), which locks the InternalTransaction. b. We then get to DbConnectionInternal.DetachTransaction() (DbConnectionInternal is registered as a listener on the Transaction.TransactionCompleted event), which attempts to lock this (the SqlInternalConnectionTds).

So thread 1 locks the SqlInternalConnectionTds and then the InternalTransaction, while thread 2 - which runs concurrently - locks InternalTransaction and then SqlInternalConnectionTds. This produces a deadlock.

I'm not an expert in this code, but moving the call to enlistment.Committed() (code) outside of the lock could be a way to resolve this deadlock.

Note that I couldn't reproduce the deadlock in .NET Framework; there's likely some timing differences which make the deadlock manifest far more frequently on .NET Core, but AFAICT the bug is in Framework as well. Note that this repro uses SinglePhaseCommit - only one connection is enlisted to the TransactionScope in the application. I recommend also checking having two connections to force 2PC - the same bug could be present in that flow as well.

Many thanks to @nathangreaves for providing the original repro code.

/cc @ajcvickers

JRahnama commented 2 years ago

@roji thanks. We will resume back on the issues on Tuesday. Monday is a holiday in Canada.

roji commented 2 years ago

I did a bit more investigation into this.

First, here's a minimal repro console app - target net7.0 with SDK 7.0.100-rc.1.22431.12 or later. As you can see it's very trivial - anyone using distributed transactions with .NET 7.0 is likely to hit it:

while (true)
{
    using var transaction = new CommittableTransaction();

    // Uncommenting the following makes the deadlock go away as a workaround. If the transaction is promoted before
    // the first SqlClient enlistment, it never goes into the delegated state.
    // _ = TransactionInterop.GetTransmitterPropagationToken(transaction);

    await using var conn = new SqlConnection(...);
    await conn.OpenAsync();
    conn.EnlistTransaction(transaction);

    // Enlisting the second transaction causes causes the transaction to be promoted.
    // After this, the transaction state will be "delegated" (delegated to SQL Server), and the commit below will
    // trigger a call to SqlDelegatedTransaction.SinglePhaseCommit.
    await using var conn2 = new SqlConnection("Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0;Encrypt=false");
    await conn2.OpenAsync();
    conn2.EnlistTransaction(transaction);

    Console.WriteLine("Committing transaction...");
    transaction.Commit();
    Console.WriteLine("Transaction committed.");
}

The deadlock was introduced in #1042, when the lock scope in SinglePhaseCommit was extended to include the call to enlistment.Committed(); moving that call out of the lock scope resolves this deadlock. I don't understand #1042 well enough to know whether moving the call out is the correct fix (or whether it would reintroduce the bug that #1042 fixed) - @cheenamalhotra I think your expertise here would be required :) In any case, I've submitted #1801, which fixes the deadlock.

Note that the netfx version of SqlDelegatedTransaction doesn't include this change (and so neither did System.Data.SqlClient), which explains why this deadlock didn't occur with .NET Framework.

roji commented 2 years ago

One additional note on this... SqlClient currently calls CleanupConnectionOnTransactionCompletion twice when a transaction completes:

  1. It's called directly from SqlDelegatedTransaction.SinglePhaseCommit (on this line).
  2. It's also called from DbConnectionInternal.TransactionCompletedEvent, which is triggered when the transaction is committed.

Removing one of these should make the deadlock go away. Alternatively, detecting inside that cleanup has already occured - and specifically that the transaction has already been detached - would allow skipping taking the lock, and again resolve the deadlock. This may be safer than reducing the scope of the connection lock in SinglePhaseCommit as indicated above (although I don't have the full picture here).

cheenamalhotra commented 2 years ago

Hi @roji

Thanks for your findings! I'm a little occupied this week, and haven't got chance to dig deep, yet.

I hope you have already, but if not, I'd like to confirm #237 doesn't reoccur (fixed by #543 - found cause here) as that's the main cause we had to limit enlistment for externally Aborted transactions.

I'll try getting back to you in a day or two!

nathangreaves commented 2 years ago

Hi @cheenamalhotra, please could we have an update on a rough timescale for this? It's a blocker for our org for upgrading from .net framework to .net 7

llermanos commented 1 year ago

Hello @cheenamalhotra . @roji mentioned a workaround of using TransactionInterop.GetTransmitterPropagationToken(Transaction.Current) . I don't know if it is correct to temporary apply this "fix" , or to wait for the bug to be solved. Any ETA or time to update?. I have been able to migrate 60% of our solution from .NET 4.8 to 7.0, but we are blocked by this issue. Thanks!

cheenamalhotra commented 1 year ago

@nathangreaves @llermanos

Very sorry for being caught up for longer than I thought.. I hope this would be fixed soon so you can be unblocked! Added some notes here: https://github.com/dotnet/SqlClient/pull/1801#issuecomment-1312361757

cc @David-Engel