Closed peppy closed 6 years ago
It looks like you're blocking thread pool threads by calling Task.Wait in a bunch of places:
00007F45C63B6158 00007f45ef0aa192 (MethodDesc 00007f45eeb82790 + 0x292 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)), calling 00007f46689696d0 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
00007F45C63B6250 00007f45ef0aa192 (MethodDesc 00007f45eeb82790 + 0x292 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)), calling 00007f46689696d0 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
00007F45C63B62E0 00007f45ef0609e9 (MethodDesc 00007f45eeb7b470 + 0xb9 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007f45eeb82790 + 0 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken))
00007F45C63B6340 00007f45ef060879 (MethodDesc 00007f45eeb7b460 + 0x159 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007f45eeb7b470 + 0 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken))
00007F45C63B6360 00007f45f21a2d03 (MethodDesc 00007f45f210ee70 + 0xb3 /root/bancho-publish/Microsoft.AspNetCore.Server.Kestrel.Core.dll!Unknown), calling (MethodDesc 00007f45eeb8b7f8 + 0 System.Threading.Tasks.ValueTask`1[[System.Int32, System.Private.CoreLib]].AsTask())
Are you doing synchronous reads/writes anywhere? Is your custom body logic reading from the request body Stream synchronously (Stream.Read/Stream.Write)? If so, it sounds like classic thread pool starvation.
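For readers hitting the same symptom, the difference between the two read patterns can be sketched like this (illustrative helper names, not from the project's code; the key point is that Stream.Read parks a thread-pool thread for the whole read, while ReadAsync returns it to the pool):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class ReadSketch
{
    // Blocking pattern: the calling thread-pool thread is held for the
    // entire duration of the read. Under load, enough of these pile up
    // and the pool has no threads left to run anything else.
    static int ReadBlocking(Stream body, byte[] buffer)
        => body.Read(buffer, 0, buffer.Length);

    // Async pattern: the thread goes back to the pool while the read is
    // pending, so the pool can keep scheduling other callbacks.
    static Task<int> ReadNonBlocking(Stream body, byte[] buffer)
        => body.ReadAsync(buffer, 0, buffer.Length);

    static async Task Main()
    {
        var body = new MemoryStream(new byte[] { 1, 2, 3 });
        var buffer = new byte[3];
        int n = await ReadNonBlocking(body, buffer);
        Console.WriteLine(n); // 3
    }
}
```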
Yes, there is one instance of a non-async read (I believe there was a reason to do it this way originally). To ensure there is no thread pool starvation I do have the thread pool set to a very large size.
Thanks for the advice. I will remove the synchronous stream read logic and see if it helps.
Maybe look at the number of items queued to the thread pool:
0:016> !dumpheap -stat -type QueueUserWorkItem
Statistics:
MT Count TotalSize Class Name
00007ffd3d76d8b0 1 24 System.Threading.QueueUserWorkItemCallbackDefaultContext+<>c
00007ffd3d76d860 1000 32000 System.Threading.QueueUserWorkItemCallbackDefaultContext
Hmm that may not work for queued tasks. Checking something.
(lldb) dumpheap -stat -type QueueUserWorkItem
Statistics:
MT Count TotalSize Class Name
00007f45f20c5bf0 1 24 System.Threading.QueueUserWorkItemCallbackDefaultContext`1+<>c[[System.Object, System.Private.CoreLib]]
00007f45ef38da00 3301 132040 System.Threading.QueueUserWorkItemCallback
00007f45f20c5b28 6250 200000 System.Threading.QueueUserWorkItemCallbackDefaultContext`1[[System.Object, System.Private.CoreLib]]
Total 9552 objects
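As an aside for anyone debugging this on a newer runtime: since .NET Core 3.0 the queue depth is exposed directly via ThreadPool.PendingWorkItemCount, so a backlog like the one above can be spotted without a debugger. A minimal sketch (not applicable to the 2.1 runtime used in this issue):

```csharp
using System;
using System.Threading;

class PoolStats
{
    static void Main()
    {
        // Queue some slow no-op work items so the counters have something to show.
        for (int i = 0; i < 100; i++)
            ThreadPool.QueueUserWorkItem(_ => Thread.Sleep(10));

        ThreadPool.GetMinThreads(out int minWorker, out _);
        ThreadPool.GetAvailableThreads(out int availableWorker, out _);

        // PendingWorkItemCount requires .NET Core 3.0 or later.
        Console.WriteLine(
            $"min={minWorker} available={availableWorker} queued={ThreadPool.PendingWorkItemCount}");
    }
}
```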
This is not looking too happy, that's for sure.
Starvation it is! Kestrel likely can't serve requests because it can't even schedule the callbacks that parse incoming HTTP requests. If you can't change the calling code, you can buffer the incoming body so you don't starve the thread pool.
Option 1: Change the Stream reading to be async.
Option 2: Buffer the incoming Streams so you can keep the synchronous stream reading code.
private static async Task HandleConnection(HttpContext context)
{
    var request = context.Request;
    var response = context.Response;

    // incomingToken is resolved elsewhere (elided from this extract).
    var client = UserManager.GetClientFromToken(incomingToken);
    if (client == null)
    {
        response.StatusCode = 403;
        return; // async Task method: plain return, not Task.CompletedTask
    }

    // Buffer and fully drain the request body up front, so the later
    // synchronous reads hit memory/disk rather than blocking on the socket.
    context.Request.EnableBuffering();
    await context.Request.Body.DrainAsync(CancellationToken.None);
    context.Request.Body.Seek(0L, SeekOrigin.Begin);

    var tcs = new TaskCompletionSource<bool>();
    client.SetStreams(request.Body, response.Body, (int)(request.ContentLength ?? 0), tcs);

    // The request is read by the client synchronously here.
    // The response is processed and written to by a separate worker pool
    // within one second of this call. After the worker finishes using the
    // response, it calls tcs.SetResult(true) to complete the awaited task.
    await tcs.Task;
}
DrainAsync is in the Microsoft.AspNetCore.WebUtilities namespace.
Thanks again for the options. I will make the changes and report back after testing with production workload tomorrow!
On looking back over the data I collected, I noticed that I was graphing the AvailableThreads from the worker pool rather than correctly calculating how many were free/spare using (Min - (Max - Available)) (used this as a general reference). I also noticed we were setting MaxThreads to increase the pool size, rather than MinThreads (which would have only had a negative effect).
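For reference, a minimal sketch of the corrected calculation (the variable names are mine, not from the project): GetAvailableThreads reports headroom relative to the maximum, so threads in use is Max - Available, and the spare count relative to the minimum is Min - (Max - Available).

```csharp
using System;
using System.Threading;

class FreeThreads
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int min, out _);
        ThreadPool.GetMaxThreads(out int max, out _);
        ThreadPool.GetAvailableThreads(out int available, out _);

        int busy = max - available; // threads currently in use
        int spare = min - busy;     // headroom before the pool must grow past its minimum

        Console.WriteLine($"min={min} busy={busy} spare={spare}");
    }
}
```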
I am also kicking myself for not realising the cause here, as I spent a good month working on a large refactor to async code in another project that encountered similar issues.
I have since re-deployed and things are looking 100% stable (I converted the one .Read to .ReadAsync and called SetMinThreads with a sane 16 * ProcessorCount just to catch any potential lingering issues).
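The SetMinThreads change above can be sketched as follows (a minimal sketch, not the project's actual startup code). Raising the minimum lets the pool inject threads eagerly up to that count, instead of relying on its slow starvation-recovery injection once the minimum is exhausted:

```csharp
using System;
using System.Threading;

class MinThreadsFix
{
    static void Main()
    {
        int workers = 16 * Environment.ProcessorCount;

        // Returns false if the values are out of range (e.g. above the
        // configured maximum), in which case nothing is changed.
        bool ok = ThreadPool.SetMinThreads(workers, workers);

        Console.WriteLine($"SetMinThreads({workers}): {ok}");
    }
}
```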
It's a wonder that in this async world we live in there are no debug-level diagnostics to warn the developer of thread pool starvation, as it's easy to run into and hard to diagnose if you don't know what you're looking for.
Anyways, all fixed. Happily serving ~2,000req/s with no signs of struggling đź‘Ť.
I have been migrating a .NET Framework 4.x project to .NET Core. In the process I ran into an issue with HttpListener not working as expected, so I migrated my code to use Kestrel. The server application is a high-performance game server that handles request and response bodies in a completely custom way. It processes around 3000 req/s. After around 5-30 minutes of correct execution, the server goes into a state where it is no longer processing incoming requests. It eventually recovers from this state, but by that point clients believe they have lost their connections.
I have been debugging over the last few days and have ruled out my code being the direct issue, but I have also been unable to reproduce under non-production load. I did manage to get a core dump in the hanging state and you can find the log of sos EEstack here. I am able to share any further lldb/sos output should it be relevant.
While I fully intend to continue investigation on my own, I want to post this here for visibility in case someone working on the project can guide me in the right direction based on the thread call stacks above.
The project was built against .NET Core 2.1.4 / Kestrel 2.2.0-preview2-35157 on a macOS host using dotnet publish, running on an Ubuntu host (rid linux-x64). Note that I was seeing similar behaviour before migrating from HttpListener, so it is quite possibly something at a lower level or unrelated to Kestrel directly. As the project this is being used in is private, I cannot link to the source, but here is an extract showing how I am using Kestrel:
Thanks in advance for any help with this issue.