aspnet / KestrelHttpServer

[Archived] A cross platform web server for ASP.NET Core. Project moved to https://github.com/aspnet/AspNetCore
Apache License 2.0

Kestrel stops processing requests for a period #2986

Closed · peppy closed this issue 6 years ago

peppy commented 6 years ago

I have been migrating a .NET Framework 4.x project to .NET Core. In the process I ran into an issue with HttpListener not working as expected, so I migrated my code to use Kestrel.

The server application is a high-performance game server that handles request and response bodies in a completely custom way. It processes around 3,000 req/s. After around 5-30 minutes of correct execution, the server goes into a state where it is no longer processing incoming requests. Eventually it recovers from this state, but by that point clients believe they have lost their connections.

I have been debugging over the last few days and have ruled out my code being the direct issue, but I have also been unable to reproduce under non-production load. I did manage to get a core dump in the hanging state and you can find the log of sos EEstack here. I am able to share any further lldb/sos output should it be relevant.

While I fully intend to continue investigation on my own, I want to post this here for visibility in case someone working on the project can guide me in the right direction based on the thread call stacks above.

The project was built against .NET Core 2.1.4 / Kestrel 2.2.0-preview2-35157 on a macOS host using dotnet publish, and runs on an Ubuntu host (RID linux-x64).

Note that I was seeing similar behaviour before migrating from HttpListener, so it is quite possibly something at a lower level or unrelated to Kestrel directly.

As the project this is being used in is private, I cannot link to the source, but here is an extract showing how I am using Kestrel:


private static void StartHttpListener()
{
    var host = new WebHostBuilder().UseKestrel(options =>
        {
            options.ConfigureEndpointDefaults(o => o.NoDelay = true);
        })
        .Configure(app =>
        {
            app.Run(context =>
            {
                DogStatsd.Increment("http_requests");
                return HandleConnection(context);
            });
        })
        .UseUrls("http://*:80")
        .Build();

    host.Run();
}

private static Task HandleConnection(HttpContext context)
{
    var request = context.Request;
    var response = context.Response;

    var client = UserManager.GetClientFromToken(incomingToken);
    if (client == null)
    {
        response.StatusCode = 403;
        return Task.CompletedTask;
    }

    var tcs = new TaskCompletionSource<bool>();

    client.SetStreams(request.Body, response.Body, (int)(request.ContentLength ?? 0), tcs);
    // the request is read by the client synchronously here.

    // the response is processed and written to by a separate worker pool within one second of this call.
    // after the worker finishes using the response, it will set tcs.SetResult(true) to complete the returned task below.
    return tcs.Task;
}

Thanks in advance for any help with this issue.

davidfowl commented 6 years ago

It looks like you're blocking thread pool threads by calling Task.Wait in a bunch of places:

00007F45C63B6158 00007f45ef0aa192 (MethodDesc 00007f45eeb82790 + 0x292 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)), calling 00007f46689696d0 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
00007F45C63B6250 00007f45ef0aa192 (MethodDesc 00007f45eeb82790 + 0x292 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)), calling 00007f46689696d0 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
00007F45C63B62E0 00007f45ef0609e9 (MethodDesc 00007f45eeb7b470 + 0xb9 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007f45eeb82790 + 0 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken))
00007F45C63B6340 00007f45ef060879 (MethodDesc 00007f45eeb7b460 + 0x159 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007f45eeb7b470 + 0 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken))
00007F45C63B6360 00007f45f21a2d03 (MethodDesc 00007f45f210ee70 + 0xb3 /root/bancho-publish/Microsoft.AspNetCore.Server.Kestrel.Core.dll!Unknown), calling (MethodDesc 00007f45eeb8b7f8 + 0 System.Threading.Tasks.ValueTask`1[[System.Int32, System.Private.CoreLib]].AsTask())

Are you doing synchronous reads/writes anywhere? Is your custom body logic reading from the request body Stream synchronously (Stream.Read/Stream.Write)? If so, it sounds like classic thread pool starvation.
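
For illustration, here is a minimal sketch of the kind of conversion being suggested, assuming a hypothetical ProcessChunk helper standing in for the custom body handling (the buffer size is arbitrary):

private static async Task ReadBodyAsync(Stream body)
{
    var buffer = new byte[8192];
    int bytesRead;

    // Unlike a blocking Stream.Read, ReadAsync returns the thread pool thread to
    // the pool while waiting for more request data to arrive.
    while ((bytesRead = await body.ReadAsync(buffer, 0, buffer.Length)) > 0)
        ProcessChunk(buffer, bytesRead);
}

private static void ProcessChunk(byte[] buffer, int count)
{
    // hypothetical placeholder for the application's custom request body parsing
}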

peppy commented 6 years ago

Yes, there is one instance of a non-async read (I believe there was a reason to do it this way originally). To ensure there is no thread pool starvation, I do have the thread pool set to a very large size.

Thanks for the advice. I will remove the synchronous stream read logic and see if it helps.

davidfowl commented 6 years ago

Maybe look at the number of items queued to the thread pool:

0:016> !dumpheap -stat -type QueueUserWorkItem
Statistics:
              MT    Count    TotalSize Class Name
00007ffd3d76d8b0        1           24 System.Threading.QueueUserWorkItemCallbackDefaultContext+<>c
00007ffd3d76d860     1000        32000 System.Threading.QueueUserWorkItemCallbackDefaultContext
davidfowl commented 6 years ago

Hmm that may not work for queued tasks. Checking something.

peppy commented 6 years ago
(lldb) dumpheap -stat -type QueueUserWorkItem

Statistics:
              MT    Count    TotalSize Class Name
00007f45f20c5bf0        1           24 System.Threading.QueueUserWorkItemCallbackDefaultContext`1+<>c[[System.Object, System.Private.CoreLib]]
00007f45ef38da00     3301       132040 System.Threading.QueueUserWorkItemCallback
00007f45f20c5b28     6250       200000 System.Threading.QueueUserWorkItemCallbackDefaultContext`1[[System.Object, System.Private.CoreLib]]
Total 9552 objects

This is not looking too happy, that's for sure.

davidfowl commented 6 years ago

Starvation it is! It's likely Kestrel can't serve requests because it can't even schedule the callbacks to parse incoming HTTP requests. If you can't change the calling code, you can buffer the incoming body so you won't kill the thread pool.

Option 1: Change the Stream reading to be async.

Option 2: Buffer the incoming Streams so you can keep the synchronous stream reading code.

private static async Task HandleConnection(HttpContext context)
{
    var request = context.Request;
    var response = context.Response;

    var client = UserManager.GetClientFromToken(incomingToken);
    if (client == null)
    {
        response.StatusCode = 403;
        return;
    }

    context.Request.EnableBuffering();
    await context.Request.Body.DrainAsync(CancellationToken.None);

    context.Request.Body.Seek(0L, SeekOrigin.Begin);

    var tcs = new TaskCompletionSource<bool>();

    client.SetStreams(request.Body, response.Body, (int)(request.ContentLength ?? 0), tcs);
    // the request is read by the client synchronously here.

    // the response is processed and written to by a separate worker pool within one second of this call.
    // after the worker finishes using the response, it will set tcs.SetResult(true) to complete the returned task below.
    await tcs.Task;
}
davidfowl commented 6 years ago

DrainAsync is in the Microsoft.AspNetCore.WebUtilities namespace.
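
For reference, a minimal sketch of the using directives the buffering snippet relies on (assuming the Microsoft.AspNetCore.WebUtilities package is referenced by the project):

using System.IO;                          // SeekOrigin
using System.Threading;                   // CancellationToken
using Microsoft.AspNetCore.WebUtilities;  // DrainAsync extension method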

peppy commented 6 years ago

Thanks again for the options. I will make the changes and report back after testing with production workload tomorrow!

peppy commented 6 years ago

On looking back over the data I collected, I noticed that I was graphing AvailableThreads from the worker pool rather than correctly calculating how many threads were free/spare using (Min - (Max - Available)) (used this as a general reference). I also noticed we were setting MaxThreads to increase the pool size, rather than MinThreads (which would only have had a negative effect).
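
In code, that calculation looks roughly like the following sketch (the method name is illustrative, not from the original project):

private static int GetSpareWorkerThreads()
{
    // busy = Max - Available; spare (relative to the configured minimum) = Min - busy
    ThreadPool.GetMaxThreads(out int maxWorkers, out _);
    ThreadPool.GetAvailableThreads(out int availableWorkers, out _);
    ThreadPool.GetMinThreads(out int minWorkers, out _);

    int busyWorkers = maxWorkers - availableWorkers;
    return minWorkers - busyWorkers;
}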

I am also kicking myself for not realising the cause here, as I spent a good month working on a large refactor to async code in another project that encountered similar issues.

I have since re-deployed and things are looking 100% stable (I converted the one .Read to .ReadAsync and called SetMinThreads with a sane 16 * ProcessorCount just to catch any potential lingering issues).
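
For anyone hitting the same thing, the thread pool tweak described above amounts to something like this sketch (the method name is illustrative):

private static void ConfigureThreadPool()
{
    // Keep the default completion-port minimum; raise only the worker minimum so the
    // pool can grow to 16 * ProcessorCount threads without the slow injection delay.
    ThreadPool.GetMinThreads(out _, out int minCompletionPortThreads);
    ThreadPool.SetMinThreads(16 * Environment.ProcessorCount, minCompletionPortThreads);
}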


It's a wonder that in this async world we live in there are no debug-level diagnostics to warn the developer of thread pool starvation, as it's easy to run into and hard to diagnose if you don't know what you're looking for.
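
In the absence of built-in diagnostics, one low-tech workaround is a watchdog on a dedicated (non-pool) thread that periodically queues a trivial work item and warns when it takes too long to start running; the interval, threshold, and method name below are arbitrary illustrative values:

private static void StartThreadPoolWatchdog(TimeSpan warnThreshold)
{
    var thread = new Thread(() =>
    {
        while (true)
        {
            var queuedAt = Stopwatch.StartNew();

            // If the pool is starved, this no-op work item sits in the queue and the
            // measured delay grows well beyond the threshold.
            ThreadPool.QueueUserWorkItem(state =>
            {
                var delay = ((Stopwatch)state).Elapsed;
                if (delay > warnThreshold)
                    Console.WriteLine($"Possible thread pool starvation: work item waited {delay.TotalMilliseconds:F0} ms to start");
            }, queuedAt);

            Thread.Sleep(TimeSpan.FromSeconds(1));
        }
    })
    { IsBackground = true };

    thread.Start();
}

Calling something like StartThreadPoolWatchdog(TimeSpan.FromMilliseconds(500)) at startup would surface stalls like the ones described above as they happen.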

Anyways, all fixed. Happily serving ~2,000 req/s with no signs of struggling 👍.

davidfowl commented 6 years ago

https://twitter.com/davidfowl/status/1048441309696909312 it's coming