dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.

Unexpected exception in TimingPipeFlusher.TimeFlushAsync when using HttpProtocols.Http1AndHttp2 #8843

Closed. silmon27 closed this issue 5 years ago.

silmon27 commented 5 years ago

I'm getting a couple dozen "Unexpected exception in TimingPipeFlusher.TimeFlushAsync" exceptions in my logs every day. This happens only when HTTP/2 support is enabled in Kestrel.

System.InvalidOperationException: Can't GetResult unless awaiter is completed.
   at System.IO.Pipelines.ThrowHelper.ThrowInvalidOperationException_GetResultNotCompleted()
   at System.IO.Pipelines.Pipe.GetFlushAsyncResult()
   at System.IO.Pipelines.Pipe.DefaultPipeWriter.GetResult(Int16 token)
   at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.TimingPipeFlusher.TimeFlushAsync(MinDataRate minRate, Int64 count, IHttpOutputAborter outputAborter, CancellationToken cancellationToken)

I can't seem to trigger it myself. The exceptions occur on a production server.

The app is running in Kubernetes from the mcr.microsoft.com/dotnet/core/aspnet:2.2.3-alpine image.

jkotalik commented 5 years ago

To confirm, you are saying this only occurs when using HTTP/2, correct? The title and issue conflict.

silmon27 commented 5 years ago

The exceptions only occur when ListenOptions.Protocols is set to HttpProtocols.Http1AndHttp2; they don't occur when the default HttpProtocols.Http1 is used. I haven't tried forcing it to HTTP/2 only.

silmon27 commented 5 years ago

As expected, forcing HttpProtocols.Http2 results in the same exception as HttpProtocols.Http1AndHttp2. It's also not related to Alpine; the same thing happens with the regular image. I will try to enable verbose logging and see if I can find out more.
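For reference, this is roughly how I'm enabling the verbose logging (a minimal sketch assuming console logging; it goes inside BuildWebHost and needs using Microsoft.Extensions.Logging;):

// Illustrative only: raise Kestrel's log level to Trace so the frame-level
// [VRB] messages show up in the output.
builder.ConfigureLogging(logging => {
  logging.AddConsole();
  logging.AddFilter("Microsoft.AspNetCore.Server.Kestrel", LogLevel.Trace);
});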

silmon27 commented 5 years ago

I enabled verbose logging for a couple of days and found that most cases look something like this (shortened):

2019-04-04 09:39:32.629 +00:00 [DBG] Connection id "0HLLO7GQRO35H" started.
2019-04-04 09:39:32.695 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "SETTINGS" frame for stream ID 0 with length 18 and flags "NONE"
2019-04-04 09:39:32.696 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "WINDOW_UPDATE" frame for stream ID 0 with length 4 and flags 0x0
2019-04-04 09:39:32.696 +00:00 [VRB] Connection id "0HLLO7GQRO35H" received "SETTINGS" frame for stream ID 0 with length 12 and flags "NONE"
2019-04-04 09:39:32.696 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "SETTINGS" frame for stream ID 0 with length 0 and flags "ACK"
2019-04-04 09:39:32.696 +00:00 [VRB] Connection id "0HLLO7GQRO35H" received "WINDOW_UPDATE" frame for stream ID 0 with length 4 and flags 0x0
2019-04-04 09:39:32.696 +00:00 [VRB] Connection id "0HLLO7GQRO35H" received "HEADERS" frame for stream ID 1 with length 323 and flags "END_STREAM, END_HEADERS"
2019-04-04 09:39:32.696 +00:00 [INF] Request starting HTTP/2 GET https://www.site.com/  
2019-04-04 09:39:32.696 +00:00 [VRB] This request accepts compression.
2019-04-04 09:39:32.715 +00:00 [VRB] Connection id "0HLLO7GQRO35H" received "SETTINGS" frame for stream ID 0 with length 0 and flags "ACK"
2019-04-04 09:39:32.725 +00:00 [DBG] Response compression is not enabled for the Content-Type 'text/html'.
2019-04-04 09:39:32.725 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "HEADERS" frame for stream ID 1 with length 320 and flags "END_HEADERS"
2019-04-04 09:39:32.725 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "DATA" frame for stream ID 1 with length 16384 and flags "NONE"
2019-04-04 09:39:32.725 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "DATA" frame for stream ID 1 with length 5155 and flags "NONE"
2019-04-04 09:39:32.725 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "DATA" frame for stream ID 1 with length 0 and flags "END_STREAM"
2019-04-04 09:39:32.726 +00:00 [INF] Request finished in 29.2291ms 200 text/html; charset=utf-8
2019-04-04 09:39:32.757 +00:00 [VRB] Connection id "0HLLO7GQRO35H" received "HEADERS" frame for stream ID 3 with length 54 and flags "END_STREAM, END_HEADERS"
2019-04-04 09:39:32.757 +00:00 [INF] Request starting HTTP/2 GET https://www.site.com/file.css  
2019-04-04 09:39:32.757 +00:00 [VRB] This request accepts compression.
[...]
2019-04-04 09:39:33.180 +00:00 [VRB] Connection id "0HLLO7GQRO35H" sending "DATA" frame for stream ID 47 with length 16384 and flags "NONE"
2019-04-04 09:39:33.180 +00:00 [ERR] Unexpected exception in TimingPipeFlusher.TimeFlushAsync.
System.InvalidOperationException: Can't GetResult unless awaiter is completed.
   at System.IO.Pipelines.ThrowHelper.ThrowInvalidOperationException_GetResultNotCompleted()
   at System.IO.Pipelines.Pipe.GetFlushAsyncResult()
   at System.IO.Pipelines.Pipe.DefaultPipeWriter.GetResult(Int16 token)
   at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.TimingPipeFlusher.TimeFlushAsync(MinDataRate minRate, Int64 count, IHttpOutputAborter outputAborter, CancellationToken cancellationToken)
[...]
2019-04-04 09:41:33.185 +00:00 [ERR] Unexpected exception in TimingPipeFlusher.TimeFlushAsync.
System.InvalidOperationException: Can't GetResult unless awaiter is completed.
   at System.IO.Pipelines.ThrowHelper.ThrowInvalidOperationException_GetResultNotCompleted()
   at System.IO.Pipelines.Pipe.GetFlushAsyncResult()
   at System.IO.Pipelines.Pipe.DefaultPipeWriter.GetResult(Int16 token)
   at System.Threading.Tasks.ValueTask`1.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.TimingPipeFlusher.TimeFlushAsync(MinDataRate minRate, Int64 count, IHttpOutputAborter outputAborter, CancellationToken cancellationToken)
[...]
2019-04-04 09:41:34.767 +00:00 [DBG] Connection id "0HLLO7GQRO35H" is closed. The last processed stream ID was 375.
2019-04-04 09:41:34.767 +00:00 [DBG] Connection id "0HLLO7GQRO35H" sending FIN because: "The client closed the connection."
2019-04-04 09:41:34.768 +00:00 [DBG] Connection id "0HLLO7GQRO35H" stopped.

There are no other errors or warnings prior to the exceptions.

What I noticed is that in every case the requested page contained lots of images. The images are not handled by the static file handler in ASP.NET Core but are instead loaded from Azure Blob Storage (via the WindowsAzure.Storage NuGet package) like so (simplified):

static async Task HandleRequest(HttpContext context) {
    string blobName = GetBlobNameFromDatabase();
    await AzureBlobContainer
        .GetBlockBlobReference(blobName)
        .DownloadToStreamAsync(context.Response.Body);
}

The exceptions always happen a couple of requests in, and I can see that each request takes longer and longer. So my suspicion is that DownloadToStreamAsync is running into limits (it's not timing out or anything, just slowing down, as far as I can tell) and doesn't write to the output stream exactly the way Kestrel would like? I'm just guessing.

I'm not familiar with the Kestrel codebase. Any ideas on how to debug this further?

The weird thing is that this only happens on (a small fraction of) HTTP/2 connections; HTTP/1.1 always works fine.

jkotalik commented 5 years ago

@Tratcher @halter73 This seems interesting.

halter73 commented 5 years ago

It's weird that this could happen with the TimingPipeFlusher.

https://github.com/aspnet/AspNetCore/blob/77380fb27b22a0a61a4636a7ef33a27043972015/src/Servers/Kestrel/Core/src/Internal/Infrastructure/TimingPipeFlusher.cs#L73-L76

We know this type of bug can happen if multiple continuations try to simultaneously register with FlushAsync() ValueTasks returned from the same PipeWriter. We thought that verifying that _lastFlushTask == null || _lastFlushTask.IsCompleted was enough to ensure it was safe to call AsTask() again on a new FlushAsync() ValueTask.

This could be bad, because the same TimingPipeFlusher logic is used for all requests, not just HTTP/2.
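To make that concrete, here is a rough, hypothetical sketch of the guarded AsTask() pattern described above. This is not the actual Kestrel source; the FlusherSketch type and AwaitPreviousThenCurrentAsync helper are illustrative names.

using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

internal class FlusherSketch
{
    private readonly PipeWriter _writer;
    private Task _lastFlushTask;

    public FlusherSketch(PipeWriter writer) => _writer = writer;

    // Callers may invoke this concurrently. The assumption under discussion is
    // that checking the previous flush task is enough to make AsTask() safe.
    public Task TimeFlushAsync(CancellationToken cancellationToken)
    {
        ValueTask<FlushResult> flushValueTask = _writer.FlushAsync(cancellationToken);

        if (_lastFlushTask == null || _lastFlushTask.IsCompleted)
        {
            // Convert the new ValueTask into a Task and remember it.
            _lastFlushTask = flushValueTask.AsTask();
        }
        else
        {
            // A previous flush is still in flight: chain behind it so two
            // continuations are never registered on the writer at once.
            _lastFlushTask = AwaitPreviousThenCurrentAsync(_lastFlushTask, flushValueTask);
        }

        return _lastFlushTask;
    }

    private static async Task AwaitPreviousThenCurrentAsync(Task previous, ValueTask<FlushResult> current)
    {
        await previous;
        await current;
    }
}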

@pakrym

halter73 commented 5 years ago

Never mind. I don't think this is a simultaneous write flush continuation registration issue, because the error message should be "Concurrent reads or writes are not supported." in that case.

I don't know how ValueTask.AsTask() could fail otherwise unless there is some bug in DefaultPipeWriter.

@davidfowl

davidfowl commented 5 years ago

Yea, I'm looking at it. It seems as though there's some discrepancy between what GetStatus returns and the state of the awaitable. The exception is coming from here: https://github.com/dotnet/corefx/blob/0c8fa4e730f966ec923d3f623deb00dba8c431b4/src/Common/src/CoreLib/System/Threading/Tasks/ValueTask.cs#L612
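Roughly, the contract involved looks like this (an illustrative sketch, not the corefx code; ValueTaskSourceContract and Complete are made-up names): the Task-wrapping continuation may only call GetResult(token) once GetStatus(token) reports a completed state, and if the two disagree you get exactly this InvalidOperationException.

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

static class ValueTaskSourceContract
{
    // Sketch of what a ValueTaskSourceAsTask-style continuation has to do:
    // consult GetStatus(token) and only then call GetResult(token).
    public static void Complete<T>(IValueTaskSource<T> source, short token, TaskCompletionSource<T> tcs)
    {
        switch (source.GetStatus(token))
        {
            case ValueTaskSourceStatus.Succeeded:
                // Throws "Can't GetResult unless awaiter is completed." if the
                // source is not actually completed despite what GetStatus said.
                tcs.TrySetResult(source.GetResult(token));
                break;
            case ValueTaskSourceStatus.Canceled:
                tcs.TrySetCanceled();
                break;
            case ValueTaskSourceStatus.Faulted:
                try { source.GetResult(token); }
                catch (Exception ex) { tcs.TrySetException(ex); }
                break;
            default:
                // Pending: GetResult must not be called yet.
                break;
        }
    }
}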

analogrelay commented 5 years ago

Tracking in preview 5 for now.

analogrelay commented 5 years ago

@davidfowl I'm guessing this isn't going to land by tomorrow, so I'm going to move this to preview6. Never mind, it's already in preview6 ;)

analogrelay commented 5 years ago

After some discussion with @halter73, it seems like the best fix here (given that this is 3.0) is to no longer permit parallel writes to the response body stream. Thoughts @davidfowl ?

analogrelay commented 5 years ago

@silmon27 there have been a lot of changes to the underlying Pipelines implementation in 3.0. Could you try updating to a 3.0-preview5 image and see if you can repro this issue there? If so, let us know and we can help investigate further.

davidfowl commented 5 years ago

After some discussion with @halter73, it seems like the best fix here (given that this is 3.0) is to no longer permit parallel writes to the response body stream. Thoughts @davidfowl ?

@anurse yes, we should do this.

analogrelay commented 5 years ago

See https://github.com/aspnet/AspNetCore/issues/9876. Getting that behavior consistent with HttpSys is a Big Thing. Do we want this behavior to be different by server? That seems like a pit of failure to me.

davidfowl commented 5 years ago

See #9876. Getting that behavior consistent with HttpSys is a Big Thing. Do we want this behavior to be different by server? That seems like a pit of failure to me.

I'm for making it consistent by default and if we think that there are cases that warrant allowing parallel writes we can expose a flag to disable the behavior in http.sys only. I'm guessing that 99.5% of people do this by mistake, not intentionally. Overlapping asynchronous operations are generally forbidden on the HttpContext.

analogrelay commented 5 years ago

I'm for making it consistent by default and if we think that there are cases that warrant allowing parallel writes we can expose a flag to disable the behavior in http.sys only.

That sounds like a reasonable compromise. @Tratcher?

Tratcher commented 5 years ago

No. Http.Sys isn't broken, it has a real scenario for this behavior, and it isn't worth the effort to add the new checks. It will get filed and backlogged like all other Http.Sys work this milestone. Consistency is useful for common behaviors, but this isn't a common issue; it's a deliberate pattern someone has to go out of their way to use.

davidfowl commented 5 years ago

No. Http.Sys isn't broken, it has a real scenario for this behavior, and it isn't worth the effort to add the new checks. It will get filed and backlogged like all other Http.Sys work this milestone.

So you're arguing we should leave http.sys alone? I actually don't care at all as long as we block it in kestrel.

Consistency is useful for common behaviors, but this isn't a common issue, it's a deliberate pattern someone has to go out of their way to do.

That's not the only way it happens. It's a pattern that people hit by mistake when they forget to await.
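For example (purely illustrative), a single dropped await in an app.Run delegate is enough to produce overlapping writes:

app.Run(async context =>
{
    // The missing await lets the second WriteAsync start while the first
    // write may still be in flight.
    _ = context.Response.WriteAsync("first chunk");
    await context.Response.WriteAsync("second chunk");
});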

analogrelay commented 5 years ago

The concern with being inconsistent here is with a user who is running an app on HttpSys server and switches to Kestrel. If they were doing overlapping WriteAsync calls (which HttpSys server allows), then switching servers would cause their app to start throwing.

That's not the only way it happens. It's a pattern that people hit by mistake when they forget to await.

💯 this. It's not common to do this intentionally, but it is quite common to do it unintentionally.

Regarding the desire to not modify HttpSys, could we do this at a higher level across all servers (by wrapping the body stream)? Or would that wrapper be difficult/undesirable?
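To make the suggestion concrete, here is a rough sketch of the kind of wrapper I have in mind; SingleWriterStream is a hypothetical type, not an existing ASP.NET Core API, and it only guards the array-based WriteAsync overload:

using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public sealed class SingleWriterStream : Stream
{
    private readonly Stream _inner;
    private int _writeInProgress;

    public SingleWriterStream(Stream inner) => _inner = inner;

    public override async Task WriteAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        // Reject a write that starts before the previous one has completed.
        if (Interlocked.Exchange(ref _writeInProgress, 1) == 1)
        {
            throw new InvalidOperationException("Concurrent writes to the response body are not supported.");
        }
        try
        {
            await _inner.WriteAsync(buffer, offset, count, cancellationToken);
        }
        finally
        {
            Volatile.Write(ref _writeInProgress, 0);
        }
    }

    // Remaining members delegate to the inner stream (kept minimal here).
    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => _inner.Length;
    public override long Position { get => _inner.Position; set => throw new NotSupportedException(); }
    public override void Flush() => _inner.Flush();
    public override int Read(byte[] buffer, int offset, int count) => _inner.Read(buffer, offset, count);
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
}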

Tratcher commented 5 years ago

A wrapper would just be more overhead, state machines, etc.

Honestly we don't see many people switch between Http.Sys and Kestrel. They usually switch between IIS and Kestrel. If they're using Http.Sys it's for a specific feature not available elsewhere.

silmon27 commented 5 years ago

I don't think I'm doing parallel writes to the response body stream. Microsoft.WindowsAzure.Storage.Blob.CloudBlob.DownloadToStreamAsync should be the only one writing to the response body stream.

Simplified version:

static async Task Main(string[] args)
{
  ThreadPool.SetMinThreads(100, 100);
  ServicePointManager.DefaultConnectionLimit = 100;

  await BuildWebHost(args).RunAsync();
}

public static IWebHost BuildWebHost(string[] args)
{
  var builder = new WebHostBuilder();

  builder.UseKestrel((builderContext, options) => {
    options.ListenAnyIP(443, listenOptions => {
      listenOptions.Protocols = HttpProtocols.Http1AndHttp2;
      listenOptions.UseHttps(httpsOptions => {
        httpsOptions.ServerCertificateSelector = (connectionContext, name) => { ... };
      });
    });
  });

  builder.UseStartup<Startup>();

  return builder.Build();
}

// In Startup:
public void Configure(IApplicationBuilder app, IHostingEnvironment env)
{
  app.UseCustomExceptionPage(); // Very similar to UseDeveloperExceptionPage
  app.UseExceptionLogging(); // Never writes to the response body

  app.UseResponseCompression();

  app.Run(async (context) => {
    string blobName = GetBlobNameFromDatabase(context.Request.Path);
    if (blobName != null) {
      context.Response.Headers.Add("...", "...");
      context.Response.ContentType = "...";

      await AzureBlobContainer
        .GetBlockBlobReference(blobName)
        .DownloadToStreamAsync(context.Response.Body);
    } else {
      await WritePageNotFoundAsync();
    }
  });
}

I also checked the exception page middleware, and it does not call WriteAsync if Response.HasStarted is true. But that shouldn't matter, because DownloadToStreamAsync is awaited; it really should be the only thing writing at any given time.

I can't reliably reproduce the error yet. I can sometimes trigger it when running a load test, and it seems related to how long DownloadToStreamAsync takes. After a while it becomes harder to trigger the error because Azure Storage starts responding consistently fast (caching?). The Azure Storage client's implementation seems nontrivial, and I don't have time to look into its source right now.

Anything I might be missing?

I'll see if I can do a test with 3.0 preview 5 in a couple of days (maybe weeks).

analogrelay commented 5 years ago

The discussion on parallel writes is actually a (very related) tangent. The issue you are seeing is in TimingPipeFlusher, which we've been having problems with for a while. That component is only necessary because we allow parallel calls to WriteAsync though. If we don't allow that, we can remove the component entirely (which should resolve your bug :)).

silmon27 commented 5 years ago

Oh, I see. That makes sense. I can't imagine there are many use cases for this.

halter73 commented 5 years ago

Reopening since the attempted fix was reverted by #10315

analogrelay commented 5 years ago

We think this is done (again), right @halter73?

analogrelay commented 5 years ago

Let's close this after the IIS PR is merged.

halter73 commented 5 years ago

IIS PR

T3chArmy commented 5 years ago

This is still occurring for me on the latest preview of ASP.NET Core on Ubuntu 18.04. Is this something I'm doing wrong, or is it related to this issue?

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  18.04
 OS Platform: Linux
 RID:         ubuntu.18.04-x64
 Base Path:   /usr/share/dotnet/sdk/3.0.100-preview6-012264/

Host (useful for support):
  Version: 3.0.0-preview6-27804-01
  Commit:  fdf81c6faf

.NET Core SDKs installed:
  3.0.100-preview6-012264 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.App 3.0.0-preview6.19307.2 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.0.0-preview6-27804-01 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

Edit: I just went back and re-read this, and noticed the milestone is now marked as preview7. Disregard.

analogrelay commented 5 years ago

I just went back and re-read this, and noticed the milestone is now marked as preview7.

Correct, the fix is in preview7, which has not yet been released. Daily builds are available, but they are not fully tested and have some quirks: https://github.com/aspnet/AspNetCore/blob/master/docs/DailyBuilds.md