OctopusDeploy / Halibut

| Public | A secure communication stack for .NET using JSON-RPC over SSL.
Other
12 stars 44 forks source link

System.OperationCanceledException: The ReadAsync operation was cancelled #513

Closed sburmanoctopus closed 11 months ago

sburmanoctopus commented 11 months ago

[sc-60671]

Background

While monitoring Async Halibut connection errors, we saw a lot of Error initializing TCP client: <error> errors.

Looking at the stack trace, they tended to be caused by this:

System.OperationCanceledException: The ReadAsync operation was cancelled.
   at Halibut.Transport.Streams.NetworkTimeoutStream.ThrowIfAlreadyTimedOut()
   at Halibut.Transport.Streams.NetworkTimeoutStream.get_CanTimeout()
   at Halibut.Transport.Streams.NetworkTimeoutStream.FlushAsync(CancellationToken cancellationToken)
   at System.Net.Security.SslStream.DisposeAsync()
   at Halibut.Transport.SecureListener.ExecuteRequest(TcpClient client)
   at Halibut.Transport.SecureListener.HandleClient(TcpClient client)

Given the large number of these, as well as the fact that it is failing while disposing in the SslStream, we wanted to know why this was happening, and fix it if possible.

Results

Related to OctopusDeploy/Issues#8266

Before

What Is Failing Now

Like the stack trace implies, we are disposing the SslStream created in the SecureListener.ExecuteRequest method.

An SslStream will call FlushAsync on the underlying stream when it is configured to leave the underlying stream open (which is what we do here).

The underlying stream in this case is our custom made NetworkTimeoutStream, and if it has faulted due to a timed out, calling FlushAsync will cause it to rethrow the exception that caused the timeout.

Why Did It Succeed Before NetworkTimeoutStream?

NetworkTimeoutStream is new, but Tentacle is old, and we never used to see these errors before.

Old Tentacle did the same thing as we do now, but we used a standard NetworkStream instead of a NetworkTimeoutStream.

This means that we used to FlushAsync when disposing the SslStream just like we do now.

The difference is that if you look at the implementation of NetworkStream, the Flush/FlushAsync methods are overridden to do nothing.

So it doesn't matter what faulted state the NetworkStream is in, flushing it would never throw an exception. Even if the NetworkStream was disposed...

Disposal Wrong Order

While looking at this we noticed that our code would create the NetworkTimeoutStream, give it to the SslStream, but then dispose the NetworkTimeoutStream first, then dispose of the SslStream (i.e. in the wrong order).

This means that the SslStream is calling FlushAsync on a disposed stream.

As we said earlier, this didn't fail before, because of how NetworkStream implements its flush methods.

But it didn't feel right leaving it that way either.

Is NetworkTimeoutStream Timing Out Correctly?

We believe that NetworkTimeoutStream is doing what it is designed to do. It times out, and throws an exception when that happens.

Then, any subsequent use of the stream also causes the original exception to the rethrown.

So in this case, we believe that a legitimate timeout occurred, and the NetworkTimeoutStream is doing what it was designed to do, and rethrowing the original exception (in this case, the System.OperationCanceledException)

Double Logging

Sometimes, we would see the same exception logged twice (with a different "top level" error message).

This happens because the real exception is logged when the real error occurs, but then because of the disposal of the SslStream, the same exception is rethrown and logged by the outer try/catch.

After

Making Flush Do Nothing

We cannot escape the fact that SslStream will call FlushAsync when it disposes. So the options were:

We first tried the first option. That worked, but felt wrong because it requires knowledge of how Microsoft have implemented NetworkStream.

So we made a stream wrapper that hides errors on disposal.

Fixing Disposal Order

Although not a requirement for this fix, it felt wrong disposing the network stream before the SSL stream.

So we also swapped the ordering so that the SSL stream is disposed first.

How to review this PR

I was not sure if a test for this explicit condition was worthwhile. If it was easy to write, I would have done it. But the only options available where

Both felt awkward, and not worth the effort. But let me know if you think it's worth writing.

Quality :heavy_check_mark:

Pre-requisites

shortcut-integration[bot] commented 11 months ago

This pull request has been linked to Shortcut Story #60671: Halibut initial communication failure of type ErrorInInitialisation occurred: "Error initializing TCP client".

pawelpabich commented 11 months ago

Thanks for a great PR description. It helped me orient myself quickly 👍.

Your solution seems to solve the problem, but it relies on undocumented behaviours of SSLStream and NetworkStream. MS is free to change those behaviours at any time, and this will make our code invalid.

Could we approach this problem from a different angle? The error is raised when we dispose the SSLStream. Do we care whether the disposal fails or succeeds? If not, we could use an approach similar to SafelyCloseStreamAsync to make sure no errors are propagated when we dispose of the SSLStream.

Thoughts?