dotnet / SqlClient

Microsoft.Data.SqlClient provides database connectivity to SQL Server for .NET applications.
MIT License
853 stars 286 forks source link

Azure SQL DTC bug #2970

Open Abrissirba opened 1 week ago

Abrissirba commented 1 week ago

We have an issue that have scared us and that has taken us some time to understand. I am not sure if this is an issue that belongs to SqlClient or System.Transaction or Azure SQL

The bug first appeared when moving our on-prem solution up to Azure. Going from windows to Azure Containers Apps running on linux and moving from SQL Server On-Prem to an Azure SQL database.

Our code is using TransactionScope to make sure everything is commited in one batch.

A regular flow is that at the begining of execution, a TransactionScope is created and record is inserted into a table (using ADO.NET) and then the flow continues to handle busniess logic. (Outbox with TransactionScope in NServicBus)

The business logic code use Entity Framework 6 to communicate with the DB.

At the end of execution the first connection will try to update the outbox record it created and complete the TransactionScope. Since we have two connecitons open simultanously the transaction will escalate to DTC. From my understanding Azure SQL will handle this DTC transaction in something called Elastic Transaction.

During our first tests we had a low SKU on the database and some parts of our db communication (heavy inserts) took very long time to execute. At most it could take up to 40 minutes.

During these long running transactions we got an error at then end when updating the outbox record. This would throw an error leading to the transaction being rolled back. So far so good.

What suprised us was that we could see data in the database that had been commited during this execution that should have been rolled back but instead had been commited. Scary and confusing!

After investigating the error, stating a broken connection, we found out that we used the proxy connection policy in azure. This connection policy seems to close connections that has been idle for 30 minutes. Redirect is the recomended policy by microsoft but proxy is the defualt in many scenarios, including when you use private endpoints which is the case for us. After changing to redirect, and increasing the SKU so that the inserts doesn't take so long to execute, we have not seen this behaviour any more.

However, we could not sleep as good as we would like knowing that this error might still occour, ending up corrupting the database.

So I started digging into the code to find a way to recreate this. This is what I have found so far

This only happens when we use an Azure SQL database. If we use a local db on windows, which will escalate using MSDTC, an error is thrown when the first connection/session dies.

I have been able to reproduce with the example below using the lastest version of both Microsoft.Data.SqlClient and System.Data.SqlClient against an Azure SQL db.

The code below can be used to recreate the issue.

using Microsoft.Data.SqlClient;
using System.Transactions;
// The sql below should be executed in ssms or something similiar to get information about the connections/transactions
// The select query will return the current active transactions
// the transaction with is_local = 1 should then be killed to simulate a connection error of the connection that holds the dtc coordination.
//
//SELECT
//    st.session_id,
//    st.transaction_id,
//  st.is_local,
//  st.is_enlisted,
//  s.client_interface_name
//FROM sys.dm_tran_session_transactions st
//JOIN sys.dm_exec_sessions s ON st.session_id = s.session_id
//WHERE st.transaction_id IN 
//(select Transaction_id from sys.dm_tran_active_transactions where name = 'user_transaction')

//kill <id of the session that has is_local set to 1>

var options = new TransactionOptions
{
    IsolationLevel = System.Transactions.IsolationLevel.Serializable,
    Timeout = TimeSpan.Zero // TimeSpan.Zero is default of `TransactionOptions.Timeout`
};
var scope = new TransactionScope(TransactionScopeOption.RequiresNew, options, TransactionScopeAsyncFlowOption.Enabled);
TransactionManager.ImplicitDistributedTransactions = true;

var connectionString = "Server=.;Database=<DB-name>;Trusted_Connection=True;TrustServerCertificate=True";

var name = "Test8";

// First connection, this will have the transaction as local
{
    var conn1 = new SqlConnection(connectionString);
    conn1.Open();
    var cmd = conn1.CreateCommand();
    cmd.CommandText = $"update Countries set Name = '{name}' where Id = 1";
    _ = await cmd.ExecuteNonQueryAsync();
    //conn1.Close(); // adding this line will prevent the transaction to be escalated to distributed.
                     // In our case we cannot not do this since this connection is handled by third-party code. 
}

// Second connectionm, this will éscalate the transaction as distributed
// This should fail if the transaction dies
{
    var conn2 = new SqlConnection(connectionString);
    conn2.Open();
    {
        // if this block is removed/commented out, an error will be thrown (as expected)
        // It seems like the open/close somehow breaks out of the transactionscope.
        //Microsoft.Data.SqlClient.SqlException - Distributed transaction completed. Either enlist this session in a new transaction or the NULL transaction.
        conn2.Close();
        conn2.Open();
    }

    // KILL THE SESSION THAT HOLDS THE is_local TRANSACTION HERE
    var cmd2 = conn2.CreateCommand();
    cmd2.CommandText = $"update Countries set Name = '{name}' where Id = 2";
    _ = await cmd2.ExecuteNonQueryAsync();
}

scope.Complete();
scope.Dispose();
mdaigle commented 3 days ago

Hi, @Abrissirba, I'm taking a look at this and will keep you updated on my findings. In the meantime, can you please let me know which version of Microsoft.Data.SqlClient you're using?

Thanks!

Abrissirba commented 2 days ago

I used 5.2.2 of Microsoft.Data.SqlClient

mdaigle commented 1 day ago

Hi @Abrissirba, I wanted to share where I'm at to keep you updated. Your repro code helped me verify the issue (thank you!) and I've determined a root cause. I'm currently evaluating potential solutions and will share again when we're ready to go forward with a fix.

Abrissirba commented 20 hours ago

Thanks for the update and happy that my repro code helped.