aspnet / AspNetKatana

Microsoft's OWIN implementation, the Katana project
Apache License 2.0
963 stars 332 forks source link

Access to context.Request.CallCancelled causes intermittent write failure for http/2 over TLS with Content-Length #430

Closed maksimkim closed 2 years ago

maksimkim commented 2 years ago

Final payload write intermittently fails with the error: System.IO.IOException ---> System.Net.HttpListenerException: The I/O operation has been aborted because of either a thread exit or an application request at System.Net.HttpResponseStream.EndWrite(IAsyncResult asyncResult) at System.IO.Stream.<>c.b__53_1(Stream stream, IAsyncResult asyncResult) at System.Threading.Tasks.TaskFactory1.FromAsyncTrimPromise1.Complete(TInstance thisRef, Func`3 endMethod, IAsyncResult asyncResult, Boolean requiresSynchronization) --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.Owin.Host.HttpListener.RequestProcessing.ExceptionFilterStream.d36.MoveNext() --- End of inner exception stack trace --- at Microsoft.Owin.Host.HttpListener.RequestProcessing.ExceptionFilterStream.d36.MoveNext()

when the following conditions are applied:

Repro:

OS: Windows 10/Window Server 2016 .NET Framework: 4.6.2+ Owin host: HttpListener

I assume it's somehow related to a native http.sys methods invocation inside DisconnectHandler

Tratcher commented 2 years ago

Notes:

That exception message maps to: ERROR_OPERATION_ABORTED 995 (0x3E3) https://docs.microsoft.com/en-us/windows/win32/debug/system-error-codes--500-999-

You're setting the response Content-Length and writing the entire body at once, so the native API that would be returning this error would be HttpSendHttpResponse. https://docs.microsoft.com/en-us/windows/win32/api/http/nf-http-httpsendhttpresponse

The biggest difference setting the Content-length causes is that the server knows this is the last write so it doesn't set HTTP_SEND_RESPONSE_FLAG_MORE_DATA and do another zero length call to HttpSendResponseEntityBody later to finish the body. That might mean the issue is specific to HttpSendHttpResponse or that it's a matter of timing.

Accessing CallCancelled shouldn't have any effect on response writes, it must be causing some state change inside Http.Sys by calling HttpWaitForDisconnectEx.

maksimkim commented 2 years ago

Modified the server example to split response payload into 2 chunks and can confirm that only the last write fails. The first one never fails.

maksimkim commented 2 years ago

Update. 1) Issue is reproducible on both Win10 and Win11 2) Decrypted TLS shows the following pattern for each request: image

Meaning TCP connections are not reused and gracefully closed by the client after each request. Also there's 3rd 0-length DATA frame which is sent by http.sys on it's own. App doesn't send it.

3) Issue has a clear correlation with accessing context.Request.CallCancelled. Without it no errors pop up in the application on any OS.

Tratcher commented 2 years ago

I was able to repro this using Win11 and the curl implementation from WSL. for i in {1..5000}; do curl -k https://localhost:9090 --resolve localhost:9090:172.23.208.1 --output /dev/null; done

Confirmed it goes away if I comment out the line accessing CallCancelled. It's not clear how that would affect the last write since the token isn't passed to the write.

Interestingly it does repro against Katana, but not against AspNetCore's HttpSys server (6.0, 5.0, nor 2.1 on .NET Core 2.1 or .NET Framework 4.x). The AspNetCore implementation is derived from the Owin and HttpListener one. https://github.com/aspnet/AspNetKatana/blob/dev/src/Microsoft.Owin.Host.HttpListener/DisconnectHandler.cs https://github.com/dotnet/aspnetcore/blob/release/2.1/src/Servers/HttpSys/src/NativeInterop/DisconnectListener.cs

I don't see any significant differences in the disconnect registration logic. The biggest differences I can find between the server implementations is that AspNetCore uses ThreadPoolBoundHandle instead of ThreadPool.BindHandle to set up its overlapped operations.

Variations tried:

If I check the token state before and after the write like this:

            try
            {
                if (token.IsCancellationRequested) Console.WriteLine("Token Canceled before write.");
                await res.Body.WriteAsync(this.payload, 0, this.payload.Length);
                if (token.IsCancellationRequested) Console.WriteLine("Token Canceled after write.");
            }
            catch (Exception ex)
            {
                Console.WriteLine("Exception. IsTokenCanceled? " + token.IsCancellationRequested + "; " + ex.InnerException?.Message);
                throw;
            }

I find that the token is frequently canceled after the write even in success cases. The token is always canceled after when I get this exception, so they're certainly related.

Token Canceled after write.
Token Canceled after write.
Token Canceled after write.
Token Canceled after write.
Token Canceled after write.
Token Canceled after write.
Token Canceled after write.
Token Canceled after write.
Token Canceled after write.
Exception. IsTokenCanceled? True; The I/O operation has been aborted because of either a thread exit or an application request

But why is the token cancelled? I see a TLS alert warning in that trace.

In AspNetCore the token is also frequently canceled after the write finishes, but there's no write exception.

Tratcher commented 2 years ago

Offline update: I captured some ETW traces and they showed that HttpCancelHttpRequest was called after HttpWaitForDisconnectEx, and that's why HttpSendHttpResponse failed. I was able to backtrack that to this callstack that shows it being called indirectly from our HttpWaitForDisconnectEx callback.

AspNetCore does not immediately call Abort when the token fires. That explains the different behaviors across libraries. https://github.com/dotnet/aspnetcore/blob/207854d30d1c2e9ea93d1df64044c3724ce5d5cd/src/Servers/HttpSys/src/RequestProcessing/RequestContext.cs#L57

There’s still the question of what’s causing HttpWaitForDisconnectEx to fire, and why it happens so quickly that HttpSendHttpResponse fails even though the client is able to consume the response. I did some testing with an alternate client that re-uses the connection and saw a similar result. Here’s what I think is happening:

Tratcher commented 2 years ago

I've confirmed with the Windows team that this is the expected behavior, and that the race between HttpWaitForDisconnectEx and HttpSendHttpResponse completing is all about threading.

Best case we'd disconnect the CancellationToken from Abort like in AspNetCore, but that's a risky change at this point.

AspNetCore would be vulnerable to this if you passed the CancellationToken to WriteAsync, however, you'd only ever see a TaskCancelledException, the HttpSys error would be supressed.