Azure / azure-amqp

AMQP C# library

Intermittent connection failures when using AMQPS over IPv6 on .NET 6 #242

Closed peolivei2 closed 1 year ago

peolivei2 commented 1 year ago

Issue

Establishing new client connections fails intermittently when using the library to connect to an AMQPS endpoint over IPv6 on .NET 6. If we switch the client code back to .NET Core 2.1, or connect to an IPv4 broker, the errors no longer happen.

Repro steps

  1. Start the local test broker on an IPv6 amqps address, like the following:

    .\bin\Debug\TestAmqpBroker\net6.0\TestAmqpBroker.exe amqps://[::1]:10196 /cert:localhost

  2. Run the following client code using .NET 6:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Amqp;
    using Microsoft.Azure.Amqp.Transport;

    namespace MyApp
    {
        internal class Program
        {
            static HashSet<int> ints = new HashSet<int>();

            static void Main(string[] args)
            {
                Run().Wait();
            }

            static async Task Run()
            {
                Uri uri = new Uri("amqps://[::1]:10196/");

                // Open 100 connections in a row, each with a fresh factory.
                for (int i = 0; i < 100; ++i)
                {
                    AmqpConnectionFactory factory = new AmqpConnectionFactory();
                    factory.Settings.TransportProviders.Add(new TlsTransportProvider(new TlsTransportSettings()
                    {
                        // Accept any certificate; the test broker uses a local cert.
                        CertificateValidationCallback = (a, b, c, d) => true,
                        CheckCertificateRevocation = false,
                        Protocols = System.Security.Authentication.SslProtocols.Tls12
                    }));

                    await factory.OpenConnectionAsync(uri, TimeSpan.FromSeconds(30));
                    Console.WriteLine("Success");
                }
            }
        }
    }


This code simply opens a connection 100 times. On .NET Core 2.1 it works fine, but on .NET 6 it fails after a few iterations with the following exception:

    System.IO.IOException : Transport 'tls4' is valid for write operations.
    ---- System.InvalidOperationException : This operation is only allowed using a successfully authenticated context.


## Investigation
After a lengthy investigation, we traced the root cause of the race condition to the following call at [TcpTransportInitiator.cs:44](https://github.com/Azure/azure-amqp/blob/master/src/Transport/TcpTransportInitiator.cs#L44):
```c#
bool connectResult = Socket.ConnectAsync(SocketType.Stream, ProtocolType.Tcp, connectEventArgs);
```

When this call returns true, everything works, which always seems to be the case on .NET Core 2.1 or when connecting to an IPv4 broker. However, when it returns false, indicating that the connect completed synchronously, the library breaks. On .NET 6, the call occasionally returns false for IPv6 sockets.
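
For reference, here is a minimal sketch of the usual way to handle both return values of `Socket.ConnectAsync`. It is not the library's code: `ConnectSketch`, `ConnectOnceAsync`, and `Complete` are hypothetical names used only for illustration. The idea is that the `Completed` event is raised only when the call returns true, so a false return has to be handled inline, and the result must be delivered exactly once either way.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
static class ConnectSketch
{
    // Connects to the endpoint and completes the returned task once,
    // whether the connect finishes synchronously or asynchronously.
    public static Task<Socket> ConnectOnceAsync(EndPoint endpoint)
    {
        var tcs = new TaskCompletionSource<Socket>(TaskCreationOptions.RunContinuationsAsynchronously);
        var args = new SocketAsyncEventArgs { RemoteEndPoint = endpoint };

        // Raised only when ConnectAsync returns true (operation pending).
        args.Completed += (_, e) => Complete(e, tcs);

        bool pending = Socket.ConnectAsync(SocketType.Stream, ProtocolType.Tcp, args);
        if (!pending)
        {
            // Completed synchronously: the Completed event will not fire,
            // so the result is processed here instead.
            Complete(args, tcs);
        }

        return tcs.Task;
    }

    static void Complete(SocketAsyncEventArgs e, TaskCompletionSource<Socket> tcs)
    {
        if (e.SocketError == SocketError.Success)
        {
            tcs.TrySetResult(e.ConnectSocket);
        }
        else
        {
            tcs.TrySetException(new SocketException((int)e.SocketError));
        }

        e.Dispose();
    }
}
```

A caller would simply `await ConnectSketch.ConnectOnceAsync(endpoint)`. Because `Complete` is reached from exactly one of the two paths, the connected socket is handed out once and never aborted by a duplicate completion.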

More specifically, when the call above returns false, it causes the following path on AmqpTransportInitiator.cs:367 to be executed twice:

    if (!thisPtr.CompleteSelf(args.CompletedSynchronously, args.Exception))
    {
        if (args.Transport != null)
        {
            // completed by timer
            args.Transport.Abort();
        }
    }

The first execution completes the operation. The second, because the operation has already completed, calls Transport.Abort(), which disposes the connection and causes the failures shown above.
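
The effect of running that path twice can be reproduced in isolation. The sketch below is a simplified model, not the library's implementation: `FakeTransport`, `ConnectOperation`, `TryComplete`, and `DoubleCompletionDemo` are hypothetical stand-ins, and `TryComplete` only approximates the "first caller wins" behavior that `CompleteSelf` appears to have in the snippet above.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a transport; it only records whether Abort() ran.
sealed class FakeTransport
{
    public bool Aborted { get; private set; }

    public void Abort() => Aborted = true;
}

// Hypothetical stand-in for an operation that may complete only once.
sealed class ConnectOperation
{
    int completed; // 0 = pending, 1 = completed

    // Returns true only for the first caller.
    public bool TryComplete() => Interlocked.CompareExchange(ref completed, 1, 0) == 0;

    // Same shape as the callback quoted above: a completion attempt that
    // loses the race aborts the transport it was handed.
    public void OnTransportReady(FakeTransport transport)
    {
        if (!TryComplete())
        {
            transport?.Abort();
        }
    }
}

static class DoubleCompletionDemo
{
    // Returns (abortedAfterFirst, abortedAfterSecond).
    public static (bool, bool) Run()
    {
        var transport = new FakeTransport();
        var operation = new ConnectOperation();

        operation.OnTransportReady(transport);  // first notification completes the operation
        bool afterFirst = transport.Aborted;

        operation.OnTransportReady(transport);  // duplicate notification aborts the same transport
        bool afterSecond = transport.Aborted;

        return (afterFirst, afterSecond);
    }
}
```

In this model `DoubleCompletionDemo.Run()` returns `(false, true)`: the transport survives the first completion and is aborted by the second, even though it is the very transport the first completion already handed to the caller.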