RetryPolicy is causing transaction exception for connectivity issue

SeanFeldman commented 5 years ago

When an ambient transaction exists, operations such as SendAsync enlist into existing ambient transaction. RetryPolicy is taking that operation later and executes. When an operation is failing to execute due to connectivity problems, RetryPolicy re-executes that operation, which includes repeated transaction enlistment. This in turn throws an exception System.InvalidOperationException: 'Local transactions are not supported with other resource managers/DTC.' thrown by AmqpTransactionManager.EnlistAsync(), completely masquerading connectivity problem.

Stacktrace

at Microsoft.Azure.ServiceBus.Core.MessageSender.OnSendAsync(IList`1 messageList) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\Core\MessageSender.cs:line 562 at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\RetryPolicy.cs:line 83 at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\RetryPolicy.cs:line 105 at Microsoft.Azure.ServiceBus.Core.MessageSender.SendAsync(IList`1 messageList) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\Core\MessageSender.cs:line 252 at ReproApp.Program.Main(String[] args) in C:\Users\Sean\Desktop\RetryPolicyRepro\ReproApp\Program.cs:line 33 at ReproApp.Program.

(String[] args)

Expected Behavior

Operation retried due to connectivity problem should not fail with incorrect transaction exception
Transaction for the same operation should not be enlisted more than once

Steps to reproduce this issue:

Run the following code (bellow, available at this repo)
Block the outgoing TCP port 5671

static async Task Main(string[] args)
{
    var connectionString = Environment.GetEnvironmentVariable("AzureServiceBus_ConnectionString");

    var management = new ManagementClient(connectionString);

    if (!await management.QueueExistsAsync("queue"))
    {
        await management.CreateQueueAsync("queue");
    }

    var sender = new MessageSender(connectionString, "queue");

    while (true)
    {
        using (var tx = new TransactionScope(TransactionScopeOption.RequiresNew, TransactionScopeAsyncFlowOption.Enabled))
        {
            Debugger.Break();
            await sender.SendAsync(new Message(Encoding.Default.GetBytes(DateTime.Now.ToString("s"))));
            tx.Complete();
        }
    }

    await management.CloseAsync();
    await sender.CloseAsync();
}

Versions

OS platform and version: Windows 10
.NET Version: .NET Framework 4.6.1 / .NET Core 2.2
NuGet package version or commit ID: 3.2.0

SeanFeldman commented 5 years ago

Unfortunately it’s difficult for me to provide exact numbers related to how many of our customers this impacts, but just from an anecdotal perspective, just yesterday we had support cases related to this exception from 3 4 separate customers.

nemakam commented 5 years ago

Well, for transactions, you cannot really have a retry within the same transaction. The only way out would be to use RetryPolicy.None and have an uber retry outside of the transaction. Within a transaction its difficult to even define a retry. The first retry needs to fail. The fix we will have to do in SDK is to throw an exception if they have a retry policy set within a transaction.

SeanFeldman commented 5 years ago

SBMP client did not have this issue. How did this work with SBMP client?

SeanFeldman commented 5 years ago

Looks like the old client didn't have this issue and was bubbling connectivity exception up right away (MessagingCommunicationException). I've added a second project with the old client to show that.

nemakam commented 5 years ago

SBMP client did not have this issue as the RetryPolicy in the older client never retried if there was an active transaction. That needs to be implemented in this client as well.

Explaining it with details for the benefit of everyone -

using (new txn)
{
                // retryable operation.
}

In case of SendAsync(), which is an easier point to talk about, lets say first send failed and the second send goes through. Lets say the first send failed with CommunicationException. At this point we don’t know whether the first send completed or not. So what is expected in this case as a client? Should the second send even happen within the same transaction since the first one failed? Transaction rules dictate all-or-none. Do the two operations behave as independent operations or a single operation?

Now if we talk about CompleteAsync(), it gets even more tricky. The first complete() went through all the way to broker, but again the client encountered CommunicationException. So client doesn’t know if the first one actually reached broker or not. In this case, client cannot retry. Lets say it retries, first complete will not succeed till the transaction is committed/aborted. On the service the second complete() is “blocked” on the first complete. And the transaction on the client will not complete till the second complete succeeds. It’s a deadlock and the only way out is when after 2 minutes the transaction times out. It’s a pretty bad experience for customers since every retry will timeout for sure after 2 minutes.

The fix is simple – NO RETRIES during transaction case. The RetryPolicy should understand whether the operation is within a transaction scope or not. If it is, it should never retry. And this should be documented properly.

If you are stuck with this problem,
the immediate fix – make sure you don’t use any retry policy (i.e., use RetryPolicy.None instead). Or almost immediate fix – raise a PR here where a retry doesn’t happen if transaction is active. I hope that makes sense.

Azure / azure-service-bus-dotnet