Closed sburmanoctopus closed 11 months ago
This pull request has been linked to Shortcut Story #60671: Halibut initial communication failure of type ErrorInInitialisation occurred: "Error initializing TCP client".
Thanks for a great PR description. It helped me orient myself quickly 👍.
Your solution seems to solve the problem, but it relies on undocumented behaviours of SSLStream and NetworkStream. MS is free to change those behaviours at any time, and this will make our code invalid.
Could we approach this problem from a different angle? The error is raised when we dispose the SSLStream. Do we care whether the disposal fails or succeeds? If not, we could use an approach similar to SafelyCloseStreamAsync to make sure no errors are propagated when we dispose of the SSLStream.
Thoughts?
[sc-60671]
Background
While monitoring Async Halibut connection errors, we saw a lot of
Error initializing TCP client: <error>
errors.Looking at the stack trace, they tended to be caused by this:
Given the large number of these, as well as the fact that it is failing while disposing in the
SslStream
, we wanted to know why this was happening, and fix it if possible.Results
Related to OctopusDeploy/Issues#8266
Before
What Is Failing Now
Like the stack trace implies, we are disposing the
SslStream
created in theSecureListener.ExecuteRequest
method.An
SslStream
will callFlushAsync
on the underlying stream when it is configured to leave the underlying stream open (which is what we do here).The underlying stream in this case is our custom made
NetworkTimeoutStream
, and if it has faulted due to a timed out, callingFlushAsync
will cause it to rethrow the exception that caused the timeout.Why Did It Succeed Before
NetworkTimeoutStream
?NetworkTimeoutStream
is new, but Tentacle is old, and we never used to see these errors before.Old Tentacle did the same thing as we do now, but we used a standard
NetworkStream
instead of aNetworkTimeoutStream
.This means that we used to
FlushAsync
when disposing theSslStream
just like we do now.The difference is that if you look at the implementation of
NetworkStream
, theFlush
/FlushAsync
methods are overridden to do nothing.So it doesn't matter what faulted state the
NetworkStream
is in, flushing it would never throw an exception. Even if theNetworkStream
was disposed...Disposal Wrong Order
While looking at this we noticed that our code would create the
NetworkTimeoutStream
, give it to theSslStream
, but then dispose theNetworkTimeoutStream
first, then dispose of theSslStream
(i.e. in the wrong order).This means that the
SslStream
is callingFlushAsync
on a disposed stream.As we said earlier, this didn't fail before, because of how
NetworkStream
implements its flush methods.But it didn't feel right leaving it that way either.
Is
NetworkTimeoutStream
Timing Out Correctly?We believe that
NetworkTimeoutStream
is doing what it is designed to do. It times out, and throws an exception when that happens.Then, any subsequent use of the stream also causes the original exception to the rethrown.
So in this case, we believe that a legitimate timeout occurred, and the
NetworkTimeoutStream
is doing what it was designed to do, and rethrowing the original exception (in this case, theSystem.OperationCanceledException
)Double Logging
Sometimes, we would see the same exception logged twice (with a different "top level" error message).
This happens because the real exception is logged when the real error occurs, but then because of the disposal of the
SslStream
, the same exception is rethrown and logged by the outer try/catch.After
Making Flush Do Nothing
We cannot escape the fact that
SslStream
will callFlushAsync
when it disposes. So the options were:NetworkTimeoutStream
ignore when it has timed out, and just call flush on the underlying stream.NetworkTimeoutStream
behave like a regularNetworkStream
, and do nothing when we flushWe first tried the first option. That worked, but felt wrong because it requires knowledge of how Microsoft have implemented
NetworkStream
.So we made a stream wrapper that hides errors on disposal.
Fixing Disposal Order
Although not a requirement for this fix, it felt wrong disposing the network stream before the SSL stream.
So we also swapped the ordering so that the SSL stream is disposed first.
How to review this PR
I was not sure if a test for this explicit condition was worthwhile. If it was easy to write, I would have done it. But the only options available where
Both felt awkward, and not worth the effort. But let me know if you think it's worth writing.
Quality :heavy_check_mark:
Pre-requisites