ASP.NET Core Thread Starvation in high load

EikeSchwass commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues

Describe the bug

We are migrating a large code base from .NET Framework 4.7.2 ASP.NET to .NET 6 ASP.NET Core (hosted in IIS 10).

Unfortunately, we noticed a regression in high-load scenarios. ASP.NET was able to recover from load peaks, while ASP.NET Core enters thread starvation and only recovers if the load is reduced, and even then it takes minutes. Because Oracle does not provide a true async API, a large part of our code base runs synchronously.

Our current hypothesis for the difference is the missing default request queue that was present in ASP.NET (?). The naive approach would be to introduce a rate limiting middleware, but hard coding the number of concurrent requests that are allowed seems problematic. Is there a way to configure ASP.NET Core so that it starts to throttle/rate limit the number of requests as it approaches thread starvation?

We assume we would need something like this: https://referencesource.microsoft.com/#System.Web/RequestQueue.cs,8

Expected Behavior

ASP.NET Core should not allow itself to be overloaded with requests and should instead buffer them in that case, similar to how ASP.NET + IIS did it.

Steps To Reproduce

Slowly increase load on the ASP.NET WebApi until responses time out. Decrease load and observe how ASP.NET quickly recovers.
Slowly increase load on the ASP.NET Core WebApi until responses time out. Decrease load and observe how ASP.NET Core stays unresponsive due to thread starvation for much longer.

Exceptions (if any)

No response

.NET Version

6.0.403

Anything else?

We gathered some information from https://developercommunity.visualstudio.com/t/on-net-core-timeout-in-large-concurrency/693778#T-N694931

davidfowl commented 2 years ago

> The naive approach would be to introduce a rate limiting middleware, but hard coding the number of concurrent requests that are allowed seems problematic. Is there a way to configure ASP.NET Core so that it starts to throttle/rate limit the number of requests as it approaches thread starvation?

This is the right direction, and yes, hardcoding a number is not great, but it's what ASP.NET did (and HTTP.sys and the layers beneath). .NET 7 has better options for rate limiting: it covers more than just concurrency and can be applied per endpoint (https://devblogs.microsoft.com/dotnet/announcing-rate-limiting-for-dotnet/).

If you want an idea of some of the existing numbers for .NET Framework:

System.Web request queue - 5000 * number of CPUs (setting is called maxConcurrentRequestsPerCPU)
IIS concurrent request limit - 5000 (appConcurrentRequestLimit)
HTTP.sys queue - 1000 (https://learn.microsoft.com/en-us/windows/win32/http/configuring-properties-in-http-version-2-0?redirectedfrom=MSDN)

Throttling incoming requests to blocking endpoints is definitely the way to go here.
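
To make the idea concrete, here is a minimal sketch of such a throttle using only the base class library. The `BlockingGate` type and its parameter names are hypothetical and not part of ASP.NET Core; it only illustrates the mechanism of capping how many blocking calls run at once, queueing a bounded number of callers, and rejecting the rest.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not a framework type: caps concurrent blocking work.
public sealed class BlockingGate
{
    private readonly SemaphoreSlim _slots;
    private readonly int _maxWaiting;
    private int _waiting;

    public BlockingGate(int maxConcurrent, int maxWaiting)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        _maxWaiting = maxWaiting;
    }

    // Returns false when the wait queue is full; a web app would
    // typically answer such a request with 503.
    public async Task<bool> TryRunAsync(Action blockingWork)
    {
        // Fast path: a slot is free, take it without queueing.
        if (!_slots.Wait(0))
        {
            if (Interlocked.Increment(ref _waiting) > _maxWaiting)
            {
                Interlocked.Decrement(ref _waiting);
                return false; // queue full, shed the request
            }
            try
            {
                // Waiting asynchronously does not hold a thread pool thread.
                await _slots.WaitAsync();
            }
            finally
            {
                Interlocked.Decrement(ref _waiting);
            }
        }

        try
        {
            blockingWork();
            return true;
        }
        finally
        {
            _slots.Release();
        }
    }
}
```

The point of the design is that queued callers wait without occupying a thread, so only `maxConcurrent` thread pool threads are ever blocked inside the synchronous code.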

> We assume we would need something like this: https://referencesource.microsoft.com/#System.Web/RequestQueue.cs,8

That isn't being used by ASP.NET. It's an older, less efficient queue that was used prior to it moving to native code.

EikeSchwass commented 2 years ago

@davidfowl thanks for the quick response. What is a good estimate for the total number of concurrent requests? How did ASP.NET decide how many it let through?

davidfowl commented 2 years ago

> What is a good estimate for the total number of concurrent requests?

There's no good number and it's hard to bake a number into the framework. Applications have a much easier time with it because they can optimize for a specific load profile. Doing it in the server or framework means we need to make assumptions about the load profile of any application.

> How did ASP.NET decide how many it let through?

Load testing on some specific scenarios and some guesstimating.

Applications that pick a number usually find the breaking point of the application by driving load to it and then observing metrics. Once you figure out where it breaks, reduce the concurrency number until the performance is reasonable.

I'd recommend driving load and observing metrics with https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-counters.

There's a high-level tutorial here https://learn.microsoft.com/en-us/dotnet/core/diagnostics/event-counter-perf

Here is the list of well-known counters https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters

If you want a more sophisticated load tool then consider https://github.com/dotnet/crank (it's possible to make it work with IIS as well but it's not documented right now). This tool can drive load and also collect counters. Start simple and see if you can look at the counters locally while reproducing the issue on IIS.
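
If attaching dotnet-counters is not convenient, roughly the same thread pool numbers can be read in-process. The sketch below is only an illustration: `ThreadPoolSnapshot` and its method names are made up for this example, while the `ThreadPool` properties it reads are part of the runtime (.NET Core 3.0 and later).

```csharp
using System;
using System.Threading;

// Hypothetical helper for logging thread pool pressure during a load test.
public static class ThreadPoolSnapshot
{
    public static string Describe()
    {
        ThreadPool.GetMinThreads(out int minWorker, out _);
        ThreadPool.GetMaxThreads(out int maxWorker, out _);
        return $"threads={ThreadPool.ThreadCount} " +
               $"queued={ThreadPool.PendingWorkItemCount} " +
               $"completed={ThreadPool.CompletedWorkItemCount} " +
               $"min={minWorker} max={maxWorker}";
    }

    // Samples once per second on a dedicated thread. A timer would run its
    // callback on the thread pool and could itself be delayed by starvation.
    public static Thread StartSampler()
    {
        var sampler = new Thread(() =>
        {
            while (true)
            {
                Console.WriteLine($"{DateTime.UtcNow:O} {Describe()}");
                Thread.Sleep(1000);
            }
        })
        { IsBackground = true, Name = "threadpool-sampler" };
        sampler.Start();
        return sampler;
    }
}
```

A steadily growing `queued` value while `threads` climbs slowly is the typical signature of the starvation described in this issue.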

EikeSchwass commented 2 years ago

@davidfowl The ASP.NET version of our app must have the maximum concurrent requests set somewhere though, right? I assumed the number is baked into ASP.NET somewhere, and for starters I would simply like to copy the limit directly to our Core version. We didn't configure anything in that regard for the Framework version and it does throttle appropriately somehow.

davidfowl commented 2 years ago

> The ASP.NET Version of our app must have the maximum concurrent requests set somewhere though right?

That's what I specified in the last message:

> System.Web request queue - 5000 * number of CPUs

5000 * Environment.ProcessorCount

I believe there's also a concurrent request limit in IIS (appConcurrentRequestLimit) that's 5000 by default (I'm not sure if that's per CPU).

The HTTP.sys queue is 1000 not 5000 (I tweaked it).

EikeSchwass commented 2 years ago

@davidfowl ah sorry I misunderstood. I thought that referred to the maximum queue length and not the maximum concurrent calls. Thanks for clearing that up! This has helped tremendously! <3

davidfowl commented 2 years ago

I updated the issue with the relevant settings in case you want to do more research. Let me know how it turns out.

EikeSchwass commented 2 years ago

@davidfowl completely eliminated the problem, so now only fine tuning is left. Thanks again!

davidfowl commented 2 years ago

@EikeSchwass Can you share your middleware configuration here to help future developers 😄 ?

EikeSchwass commented 2 years ago

@davidfowl sure!

We used Microsoft.AspNetCore.ConcurrencyLimiter

In our Startup.cs:

public void ConfigureServices(IServiceCollection services)
{
    // ...
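    // Stack policy: queued requests are served newest-first (LIFO).
    services.AddStackPolicy(options =>
    {
        // Requests allowed to wait; beyond this the middleware rejects them.
        options.RequestQueueLimit = 5000 * Environment.ProcessorCount;
        // Requests allowed to execute at the same time.
        options.MaxConcurrentRequests = Configuration.MaxConcurrentRequests * Environment.ProcessorCount;
    });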
    // ...
}

public void Configure(IApplicationBuilder app, IWebHostEnvironment env, IHostApplicationLifetime appLifetime)
{
    // ... (no other middlewares)
    app.UseConcurrencyLimiter();
    // ..
}

and our appsettings.config:

{
"..."
"MaxConcurrentRequests": "15" 
"..."
}

However, this value will most likely change as we do more testing. Nevertheless, it did fix the issue for our TEST environment. Note that the value gets multiplied by Environment.ProcessorCount.

davidfowl commented 2 years ago

I want to turn this into guidance.

davidfowl commented 2 years ago

@BrennanConroy this is a great use of the new rate limiting APIs
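
For readers on .NET 7 or later, a rough equivalent of the configuration above using the rate limiting middleware might look like the sketch below. It is an untested illustration, not the configuration used in this thread: the policy name "blocking", the `/report` endpoint, and the hardcoded 15 are assumptions standing in for the app's own values.

```csharp
using System;
using System.Threading;
using System.Threading.RateLimiting;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.RateLimiting;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRateLimiter(options =>
{
    // Status code returned when both the permits and the queue are exhausted.
    options.RejectionStatusCode = StatusCodes.Status503ServiceUnavailable;

    // "blocking" is a hypothetical policy name for synchronous endpoints.
    options.AddConcurrencyLimiter("blocking", limiter =>
    {
        // Assumed values mirroring the ConcurrencyLimiter setup above.
        limiter.PermitLimit = 15 * Environment.ProcessorCount;
        limiter.QueueLimit = 5000 * Environment.ProcessorCount;
        // NewestFirst approximates the LIFO behavior of AddStackPolicy.
        limiter.QueueProcessingOrder = QueueProcessingOrder.NewestFirst;
    });
});

var app = builder.Build();

app.UseRateLimiter();

// Hypothetical endpoint that blocks a thread, e.g. on a synchronous DB call.
app.MapGet("/report", () =>
{
    Thread.Sleep(200);
    return "done";
}).RequireRateLimiting("blocking");

app.Run();
```

Attaching the policy per endpoint keeps fully async endpoints unthrottled, which matches the advice earlier in the thread to limit only the blocking ones.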

ghost commented 1 year ago

Thanks for contacting us.

We're moving this issue to the .NET 8 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s). If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

karlra commented 1 year ago

Since you mentioned Oracle - if you are using MySQL, and using the Oracle driver, do yourself a favor and switch to the open source one. We also experienced random complete process lockups (that did not recover) using the Oracle driver, even when all calls are async. Not a single hang since switching to MySqlConnector with months and months of uptime.

Unfortunately the completely worthless quality of Oracle's drivers is such a problem for the .net community since so many people use MySQL and probably assume that it's dotnet's fault when the entire process just stops working. Oracle's connectors should be blacklisted from Nuget....

mgravell commented 1 year ago

Tangential but very relevant to the genesis of this thread: Oracle.ManagedDataAccess version 23+ allegedly (I'm going by release notes here, not personal usage) has support for async. However, the v23 drivers (at time of writing) do not seem to be fully released, with 21.12.0 the most recent without the -dev suffix.

Poltuu commented 9 months ago

We are facing a very similar issue and are about to test the approach suggested in this thread. We would have appreciated this problem being more prominent in the official documentation/guidance on how to migrate 👍

dotnet / aspnetcore