dotnet / aspnetcore


BadHttpRequestException: Reading the request body timed out due to data arriving too slowly #4707

Closed · clement911 closed this issue 3 years ago

clement911 commented 5 years ago

We have an ASP.NET Core 2.1 app running on .NET 4.7 in an Azure App Service web app.

Recently we started getting A LOT of the following error:

Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Reading the request body timed out due to data arriving too slowly. See MinRequestBodyDataRate.

Our app is under pretty constant heavy load (from incoming webhooks mostly). During big spikes, we tend to see more of these.

We used to run on asp.net core 2.0 / .net 4.6.1 for many months and we never saw that error before. It seems to have started happening following our recent upgrade to asp.net core 2.1 / .net 4.7.

We'd like to get to the bottom of this issue, but we are not even sure where to start. Any thoughts?
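
For reference, the limit the error message points at is configurable on Kestrel's server options. A minimal sketch of adjusting (or disabling) it in a 2.1-style Program.cs; the values are illustrative only, not a recommendation, and Startup is assumed to be the usual startup class:

using System;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Server.Kestrel.Core;

public class Program
{
    public static void Main(string[] args) => BuildWebHost(args).Run();

    public static IWebHost BuildWebHost(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            .UseKestrel(options =>
            {
                // Default is 240 bytes/second with a 5 second grace period.
                options.Limits.MinRequestBodyDataRate =
                    new MinDataRate(bytesPerSecond: 100, gracePeriod: TimeSpan.FromSeconds(10));

                // Setting it to null disables the check entirely (hides the symptom, not the cause):
                // options.Limits.MinRequestBodyDataRate = null;
            })
            .UseStartup<Startup>()
            .Build();
}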

Tratcher commented 5 years ago

@sebastienros have you worked up an FAQ wiki yet?

davidfowl commented 5 years ago

I wonder if it's because of thread pool starvation... Is your application fully async? Do you have any other logs from Kestrel? Can you capture a dump or a profile? Is the application CPU bound or IO bound?

clement911 commented 5 years ago

It just happened a couple of times earlier; usually it happens for several successive requests in a short space of time. I'm monitoring CPU, memory, thread count and a bunch of other metrics in Azure Monitor and it all looks healthy. So even when there is no apparent bottleneck and the app is very responsive, it can still happen occasionally. Yes, it is fully async.

Do you have any other logs from Kestrel?

No other warnings/errors. There are probably info-level entries, but I'm not persisting those, although there is probably a way to get them.

Can you capture a dump or a profile?

I should be able to capture a dump from the Azure portal if you think that would help? Not sure what a profile is?

clement911 commented 5 years ago

Any thoughts on how to get to the bottom of this issue?

We're not sure if the errors are caused by legit requests. Is it possible to find out the URL that was requested when this message gets generated?

We have been getting a couple of hundred of these per day, even though all metrics look healthy.

Any particular log source that I can turn on in Azure/ASP.NET Core maybe?

The stack trace doesn't give us much.

Exception: Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Reading the request body timed out due to data arriving too slowly. See MinRequestBodyDataRate.
   at Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException.Throw(RequestRejectionReason reason)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.d__4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
   at System.IO.Pipelines.Pipe.TryRead(ReadResult& result)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.OnConsumeAsync()
clement911 commented 5 years ago

There was just one log entry with a slightly better call stack including some of our calling code. The stack is pointing to our webhook action. In order to verify the webhook is valid, we must hash the request body and we use the following code. Is there anything wrong with this code?

[HttpPost]
public async Task<OkResult> Webhook()
{
    var memoryStream = new MemoryStream();
    await Request.Body.CopyToAsync(memoryStream);
    bool isValidRequest = await AuthorizationService.IsAuthenticWebhook(Request.Headers, memoryStream, _shopifyLibOptions.SecretKey);
    if (!isValidRequest)
    {
        throw new UnauthorizedAccessException("This request is not an authentic webhook request.");
    }
    memoryStream.Position = 0;
    string body = new StreamReader(memoryStream).ReadToEnd();

    //omitted code here unrelated to asp.net

    return Ok();
}

The call stack points to the line

bool isValidRequest = await AuthorizationService.IsAuthenticWebhook(Request.Headers, memoryStream, _shopifyLibOptions.SecretKey);

although in my experience line numbers in call stacks are sometimes a bit off, and I wonder if the real issue is the line before it:

await Request.Body.CopyToAsync(memoryStream);

Then the stack below is

Exception: Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Reading the request body timed out due to data arriving too slowly. See MinRequestBodyDataRate.
   at Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException.Throw(RequestRejectionReason reason)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.d__4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
   at System.IO.Pipelines.Pipe.GetReadAsyncResult()
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.MessageBody.d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpRequestStream.d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at LINE FROM OUR CODE
Tratcher commented 5 years ago

.ReadToEnd() is a red flag; that's blocking sync IO. This data rate error has often been reported in applications that are having trouble with too many blocked thread pool threads. In this case there's an easy ReadToEndAsync() replacement. Have you checked for thread starvation issues elsewhere?

clement911 commented 5 years ago

But the stream here is a memory stream. That's why it's using the sync version. No I/O right?

The number of threads is low and metrics are healthy. This issue seems quite sporadic but we get these errors in batches of 5 or 10... Could it be caused by the GC causing a pause?

Tratcher commented 5 years ago

Ah, you got me. I wasn't reading closely enough.

davidfowl commented 5 years ago

How big are the posts?

clement911 commented 5 years ago

They are pretty small json payloads. Order, Customer, Product, this kind of thing...

davidfowl commented 5 years ago

What size is pretty small?

clement911 commented 5 years ago

Typically under 10k characters, although there are a few outliers over 100k characters. We get a lot of requests though. Roughly 500k per day.

clement911 commented 5 years ago

Are there any further logs that can be turned on to get to the bottom of this issue? This is really polluting our log files and we're still not sure why it occurs in the first place.

clement911 commented 5 years ago

We appear to have gotten rid of the issue. We are not entirely sure but most likely it was caused by some code accessing the request object from a background thread.

davidfowl commented 5 years ago

We are not entirely sure but most likely it was caused by some code accessing the request object from a background thread.

Using the IHttpContextAccessor? What did that code look like?

clement911 commented 5 years ago

Sorry I had to re-open. I thought it was fixed after our deployment on Sunday evening, but the absence of errors was simply due to the very low activity at this time of the week. Once Monday kicked in, this error started flooding again.

David, that deployment included a fix related to the IHttpContextAccessor that you helped me resolve here. I incorrectly thought this might have fixed this issue as well.

The call stack varies slightly when the error is captured in the log. Usually it contains only ASP.NET frames, but it also sometimes happens in this new piece of code, which is very similar to the webhook handler I posted above:

Simplified a bit for clarity:

using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Http.Extensions;
using Microsoft.AspNetCore.Http.Internal;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Middlewares
{
    public class RequestLogScopeMiddleware
    {
        private readonly RequestDelegate _next;
        private readonly ILogger<RequestLogScopeMiddleware> _logger;

        public RequestLogScopeMiddleware(RequestDelegate next, ILogger<RequestLogScopeMiddleware> logger)
        {
            _next = next;
            _logger = logger;
        }

        public async Task Invoke(HttpContext context)
        {
            using (_logger.BeginScope(await GetHttpInfo(context)))
            {
                await _next(context);
            }
        }

        private async Task<HttpInfo> GetHttpInfo(HttpContext ctx)
        {
            try
            {
                var req = ctx.Request;
                req.EnableRewind(); // buffers the request body so it can be re-read by the rest of the pipeline
                string body = await new StreamReader(req.Body).ReadToEndAsync();
                req.Body.Position = 0;
                try
                {
                    body = JsonConvert.SerializeObject(JsonConvert.DeserializeObject(body), Formatting.Indented, new JsonSerializerSettings { NullValueHandling = NullValueHandling.Ignore });
                }
                catch
                {
                    //body may not be json
                }
                return new HttpInfo(req.Method, req.GetDisplayUrl(), body, string.Join(Environment.NewLine, req.Headers.Select(kv => $"{kv.Key}: {kv.Value}")));
            }
            catch (Exception ex)
            {
                _logger.LogWarning(ex, "Failed to extract http info from request");
                return null;
            }
        }
    }
}

Logged error:

Failed to extract http info from request

    HttpInfo: 

    Exception: Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Reading the request body timed out due to data arriving too slowly. See MinRequestBodyDataRate.
   at Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException.Throw(RequestRejectionReason reason)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.d__4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
   at System.IO.Pipelines.Pipe.TryRead(ReadResult& result)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.OnConsumeAsync()

I'm pretty sure it is related to the await new StreamReader(req.Body).ReadToEndAsync(); line.

clement911 commented 5 years ago

Here is another stack clearly showing that reading the stream is the issue:

Error from Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware:
    An unhandled exception has occurred while executing the request.

    HttpInfo: 

    Exception: Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Reading the request body timed out due to data arriving too slowly. See MinRequestBodyDataRate.
   at Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException.Throw(RequestRejectionReason reason)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.d__4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
   at System.IO.Pipelines.Pipe.ReadAsync(CancellationToken token)
   at System.IO.Pipelines.Pipe.DefaultPipeReader.ReadAsync(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.MessageBody.d__21.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpRequestStream.d__21.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.AspNetCore.WebUtilities.FileBufferingReadStream.d__35.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.IO.StreamReader.d__97.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.IO.StreamReader.d__62.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Middlewares.RequestLogScopeMiddleware.d__5.MoveNext()
davidfowl commented 5 years ago

@halter73 any ideas?

PS: are you using that middleware in production? It looks terribly inefficient.

clement911 commented 5 years ago

Yes we are using this in prod. We haven't suffered any performance issues but I'm listening if you have ideas on how to make this more performant. Having a log of the body is very useful for investigating why a given request had errors...

davidfowl commented 5 years ago

Having a log of the body is very useful for investigating why a given request had errors...

Do you have structured logging in place? You don't need to attach this much information to the scope, you just need enough information attached to every request so that you can find all correlated logs. You would log the body in chunks and then look at the entire payload by looking at all logs for a specific request id.
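
Roughly something along these lines, reusing the _next/_logger fields from the RequestLogScopeMiddleware posted earlier in the thread (a sketch only: the chunk size, field names and Debug level are arbitrary, and splitting UTF-8 characters across chunk boundaries is glossed over):

public async Task Invoke(HttpContext context)
{
    // Attach only correlation data to the scope, not the payload itself.
    using (_logger.BeginScope(new Dictionary<string, object>
    {
        ["RequestId"] = context.TraceIdentifier,
        ["Path"] = context.Request.Path.Value
    }))
    {
        context.Request.EnableRewind(); // buffer so the rest of the pipeline can still read the body

        var buffer = new byte[4096];
        int read, chunkIndex = 0;
        while ((read = await context.Request.Body.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            _logger.LogDebug("Request body chunk {ChunkIndex}: {Chunk}",
                chunkIndex++, Encoding.UTF8.GetString(buffer, 0, read));
        }
        context.Request.Body.Position = 0;

        await _next(context);
    }
}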

We haven't suffered any performance issues but I'm listening if you have ideas on how to make this more performant.

It's a DoS waiting to happen. I can just send a large payload to your application and boom, out of memory (see https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/blob/master/Scenarios/Controllers/BigJsonInputController.cs for an example). It's also one of the ASP.NET Core no-nos (see https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/blob/master/AspNetCoreGuidance.md#avoid-reading-the-entire-request-body-or-response-body-into-memory).

Having a log of the body is very useful for investigating why a given request had errors...

Yes, I've seen it enough times now that I think we should build something more efficient out of the box: https://github.com/aspnet/AspNetCore/issues/3700

clement911 commented 5 years ago

Do you have structured logging in place? You don't need to attach this much information to the scope, you just need enough information attached to every request so that you can find all correlated logs. You would log the body in chunks and then look at the entire payload by looking at all logs for a specific request id.

I guess you could say it is structured, yes. Logging the body in chunks? I'm not too sure what you mean? Does that mean that I have to reconstruct the body from multiple log entries? It certainly would be much simpler to have a single entry with the full body.

Also, the vast majority of requests we get are webhooks, which can take various shapes (but always in JSON format), and we must validate that they are authentic by hashing the body (see the Webhook() method I posted above). So we can't use [FromBody], because the payload can differ a lot from request to request.
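
For what it's worth, the hash itself doesn't strictly require the whole body in memory. Here is a sketch of validating the signature incrementally while reading the stream; the X-Signature header name, the base64 comparison and the use of IncrementalHash (System.Security.Cryptography) are assumptions for illustration, not the actual AuthorizationService contract, and a constant-time comparison is glossed over:

[HttpPost]
public async Task<IActionResult> Webhook()
{
    var secret = Encoding.UTF8.GetBytes(_shopifyLibOptions.SecretKey);
    using (var hmac = IncrementalHash.CreateHMAC(HashAlgorithmName.SHA256, secret))
    {
        var buffer = new byte[8192];
        int read;
        while ((read = await Request.Body.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            hmac.AppendData(buffer, 0, read);
            // The same chunk could also be fed to a streaming JSON reader here
            // instead of concatenating everything into a single string.
        }

        var computed = Convert.ToBase64String(hmac.GetHashAndReset());
        if (computed != Request.Headers["X-Signature"]) // hypothetical header name
        {
            return Unauthorized();
        }
    }

    // handle webhook

    return Ok();
}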

It's a DoS waiting to happen. I can just send a large payload to your application and boom, out of memory.

I see how this could be a problem. Should I use something like BigContentManualGood() from your linked sample? As far as I can tell it still loads the entire JSON into memory, into a JObject, so I fail to see how it is different? I guess there must be some sensible max-request-size limit default in ASP.NET Core as well?
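
There is a default cap (Kestrel's MaxRequestBodySize, which I believe is around 30 MB), and for small webhook payloads it can be tightened much further. A sketch with an illustrative limit; the 100,000-byte figure is a placeholder:

// Per action, via MVC: requests whose body exceeds the limit are rejected by the server.
[HttpPost]
[RequestSizeLimit(100_000)]
public async Task<OkResult> Webhook() { /* ... */ }

// Or globally, on the Kestrel options in Program.cs:
// options.Limits.MaxRequestBodySize = 100_000;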

Going back to the original issue, I just wanted to point out that the pattern we're seeing is that these errors occur in batches. There won't be any issues for a while (5 minutes or even up to a couple of hours), and then all of a sudden we get that error many times in rapid succession (say, between 3 and 10+). That may be a useful piece of information. This got me thinking that it might be caused by a longer-than-usual GC pause... Just an idea...

clement911 commented 5 years ago

Is there a difference between the following two?

await new StreamReader(Request.Body).ReadToEndAsync();
await new HttpRequestStreamReader(Request.Body, Encoding.UTF8).ReadToEndAsync();
clement911 commented 5 years ago

We deployed yet again today to try to fix this. We no longer collect the body when logging, based on @davidfowl's good comments about perf and security. We removed all usages of IHttpContextAccessor and we are left with the only other piece of code that reads the entire body and causes the error, our Webhook() action, which now looks like this:

        [HttpPost]
        public async Task<OkResult> Webhook()
        {
            var body = await new HttpRequestStreamReader(Request.Body, Encoding.UTF8).ReadToEndAsync();
            bool isValidRequest = AuthorizationService.IsAuthenticWebhook(Request.Headers, body, _options.SecretKey);
            if (!isValidRequest)
            {
                throw new UnauthorizedAccessException("This request is not an authentic webhook request.");
            }
            //handle webhook
            return Ok();
        }

Using HttpRequestStreamReader hasn't fixed the problem.

We also wanted to rule out temporary thread pool exhaustion (even though the thread count counter in the Azure portal looks OK), because we thought the thread pool might take time to ramp up threads and a spike of requests might cause temporary starvation. But it looks like there are plenty of threads and the error is still happening...

public override void OnException(ExceptionContext context)
{
    context.ExceptionHandled = true;
    ThreadPool.GetAvailableThreads(out var workerThreads, out var completionPortThreads);
    ThreadPool.GetMaxThreads(out var maxWorkerThreads, out var maxCompletionPortThreads);
    _logger.LogError(context.Exception, $"Error in global mvc exception filter. Available Threads = ({workerThreads:N0}/{maxWorkerThreads:N0}, {completionPortThreads:N0}/{maxCompletionPortThreads:N0}).");
}
Error in global mvc exception filter. Available Threads = (32,757/32,767, 1,000/1,000).

    Exception: Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Reading the request body timed out due to data arriving too slowly. See MinRequestBodyDataRate.
   at Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException.Throw(RequestRejectionReason reason)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.d__4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
   at System.IO.Pipelines.Pipe.TryRead(ReadResult& result)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1MessageBody.OnConsumeAsync()

We are running out of ideas.

tstuts commented 5 years ago

@clement911 What is your App Service scale?

We are having what looks like the same issue. ASP.NET Core 2.1 app in an Azure App Service (Windows) deployment with similar behavior - fine for a few minutes, then a group of connections all throw this exception, then fine again.

All requests are POSTs with a JSON body that is being read and deserialized in a middleware.

clement911 commented 5 years ago

Hi @tstuts, it's a deployment with 3 to 5 S2 instances and we get a ton of requests, mostly webhooks. Several hundred thousand per day.

I'm not sure if it is related to that issue, but we also noticed we have been getting a lot more of the following error. Nowhere near as many, but we used to get it maybe once a day and now we get it a lot more, and I feel like they might be happening around the same time as the original issue.

Warning from Microsoft.AspNetCore.Server.Kestrel:
    Heartbeat took longer than "00:00:01" at "11/01/2018 19:48:02 +00:00".
davidfowl commented 5 years ago

That usually means there's thread pool starvation... Hmm

clement911 commented 5 years ago

@davidfowl I don't understand how this is possible given that ThreadPool.GetAvailableThreads is reporting there are plenty of threads available. We don't create threads manually. We have a few Task.Run() calls and also several periodic background jobs scheduled via Observable.Timer, which uses the ThreadPool under the hood.

Here is a screenshot from Azure Portal showing our thread count over the last 12 hours. I don't believe 200+ threads is excessive?

[screenshot: thread count over the last 12 hours]

Tratcher commented 5 years ago

200 threads is a lot more than you have cores. If those threads are all active then some of them may not be getting enough CPU time (similar to thread pool starvation). We're making some changes in the data rate timers to mitigate these scenarios. https://github.com/aspnet/KestrelHttpServer/pull/3070. That said, your system is likely still overtaxed.

clement911 commented 5 years ago

If that is the case then I guess increasing the number of instances might fix the issue.

Pardon my ignorance here, but I don't understand. CPU and memory seem healthy to me (see screenshots below). We are running S2 instances but the Azure portal is recommending S1. I tried to find out how many cores S2 has, but all I can tell is that they have 200 ACU (Azure compute units), whatever that means.

I went through the most recent MinRequestBodyDataRate errors and the difference between the available threads and the maximum threads is always less than 100, so there are fewer than 100 worker threads in use out of a supposedly sensible (?) default maximum of 32,767! 100/32,767 is less than a third of 1%.

I suppose I could increase the number of instances and maybe that will fix the issue, but I would like to understand what the bottleneck is, as it is not obvious to me at all (rather than throwing money at the problem).

[screenshots: CPU and memory metrics]

davidfowl commented 5 years ago

See https://blogs.msdn.microsoft.com/vancem/2018/10/16/diagnosing-net-core-threadpool-starvation-with-perfview-why-my-service-is-not-saturating-all-cores-or-seems-to-stall/ for details on how to diagnose starvation.

davidfowl commented 5 years ago

Are you using Task.Wait or Task.Result anywhere?

clement911 commented 5 years ago

Thank you @davidfowl that was a good refresher.

Are you using Task.Wait or Task.Result anywhere?

No. I will keep looking for any blocking calls I might have missed.

By the way, we don't use .NET Core; we use .NET 4.7.1, but I assume the thread pool works in a similar way.

sebastienros commented 5 years ago

https://github.com/benaadams/Ben.BlockingDetector

davidfowl commented 5 years ago

@clement911 yes, it's fundamentally the same (Core is a bit more optimized)

clement911 commented 5 years ago

The .NET Threadpool DOES try to inject more threads, but does it at a modest rate (e.g 1 or 2 a second), so it makes little difference in the short term (you need many minutes to get up to 1000).

I wonder if this is a problem. That doesn't sound like a great strategy to handle temporary spikes.

Thanks, I will check Ben.BlockingDetector.

All our webhook method does is validate that the request is authentic (CPU-bound) and save it in the db with EF Core (async IO).

I noticed that when the MinRequestBodyDataRate error occurs, it tends (but not always) to coincide with warnings for multiple other webhooks saying the save to the database took an absurd amount of time (over 40 seconds and even up to 180 seconds!). I thought it might be the EF Core retry strategy doing multiple retries, but as far as I can tell the database is very responsive, and I haven't found an event in EF Core to log when a retry occurs... It seems like too much of a coincidence, so I'll assume it might be due to the same thread exhaustion.

davidfowl commented 5 years ago

Yes those symptoms sound exactly like there's some starvation going on. Taking a memory dump when the problem occurs can help verify. If you have that many threads, you can also look at what those threads are doing and it may hint at the problematic area.

clement911 commented 5 years ago

Will do thanks guys, I really appreciate the help

clement911 commented 5 years ago

What HTTP status code will be returned to the client when a MinRequestBodyDataRate / BadHttpRequestException occurs? I would think a 4xx? (even though it is likely an internal issue in my case)

We get a lot of these errors in our logs, but the Azure portal metrics claim there are hardly any HTTP 4xx or 5xx responses. I don't get it.

clement911 commented 5 years ago

I got a memory dump from the Azure portal (twice), hoping it would shed some light, but nothing obvious is showing up in the Parallel Stacks window. There are very few threads (7) with a managed stack and nothing blocking besides Program.Main and a couple of other usual-suspect threads (such as the ConsoleLoggerProcessor queue). All the other threads show 'Not available' in the Threads window.

I tried another way of getting stacks from all threads, which worked, but that showed the same thing.

I also tried to replicate locally by creating a little program sending fake webhooks, and I wasn't able to reproduce the problem, but at least it showed all the threads (though async call stacks are hardly readable, as usual).

Out of desperation I (reluctantly) enabled Application Insights, which I read could collect metrics without redeploying. It didn't work. None of the metrics show any data. I might have to look into the build-time Application Insights configuration.

I don't know if I can trust the azure metrics anyway. As I mentioned above, Azure monitor reports no http errors...

The issue seems to be getting more and more severe with the increasing activity that we are getting over time. :weary: :confused:

clement911 commented 5 years ago

The metric below from the dump caught my attention, though the finalizer thread stack has no user code.

[screenshot: dump metrics]

clement911 commented 5 years ago

By the way, I was surprised to see that 3 of the threads were taken by the ConsoleLogger MessageProcessor Queue. That got me wondering why it is enabled in prod by default...

clement911 commented 5 years ago

So I've been battling with this and wanted to report my findings.

We had to act quickly, so we decided to deploy an exact copy of our app into a new App Service plan. Then we moved all our webhook subscriptions to point to the new instance. So the original app service only handles user-facing requests, and that has made it very stable. No errors. The new app service handles webhooks and DOES NOTHING ELSE. It is not even exposed to users, and the only requests hitting it go to a simple Webhook() action that hashes the body and stores it in the db via an async SqlCommand call, await cmd.ExecuteNonQueryAsync(). Well, the MinRequestBodyDataRate error still occurs (on the new instance, of course)!

Given that there is very little code being exercised in the new instance, I'm confident that we don't accidentally call any sync API or block any threads. In the logs we can clearly see that the errors occur when there is a spike of requests. I'm thinking the issue may be in either the SQL connection pool or the thread pool.

With regular user traffic one doesn't expect huge sudden peaks of requests; it usually ramps up slowly, and the thread pool seems pretty good at handling that. On the other hand, with webhooks we can sometimes get extremely sudden peaks because some system somewhere generated a ton of events. But the thread pool spawns threads very slowly.

I'm thinking of experimenting with ThreadPool.SetMinThreads and the ADO.NET 'Min Pool Size' setting. It might be a good strategy to keep a lot of threads and/or connections warm to handle peaks. If anyone has experience in that area I'm listening. Those settings are easy to set, but there is very little documentation on how they work behind the scenes and it feels very black-boxy. Probably for a good reason, except when you run into a scenario like this...
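
A minimal sketch of both knobs; the numbers are placeholders that would need to be measured against the real workload, not copied:

// Early in startup (e.g. the first lines of Main): raise the floor so a burst doesn't
// have to wait on the thread pool's slow (~1-2 threads/second) injection rate.
ThreadPool.GetMinThreads(out int minWorker, out int minIo);
ThreadPool.SetMinThreads(workerThreads: Math.Max(minWorker, 200), completionPortThreads: minIo);

// ADO.NET connection string: keep a floor of pooled connections warm (also placeholders).
// "Server=...;Database=...;Min Pool Size=20;Max Pool Size=200;..."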

Tratcher commented 5 years ago

Sounds like you're on the right track. One alternative we've considered for scenarios like these is request throttling. This would reduce the demand on the threadpool.

clement911 commented 5 years ago

I'm not sure exactly how you'd implement that throttling, or how it would solve the problem?

One issue is that most webhook providers (the systems/APIs sending the requests) typically require listeners (like us) to respond with an HTTP 200 pretty quickly or they consider it a failure. In our case it's 5 seconds.

Tratcher commented 5 years ago

It's only theoretical at this point, but ASP.NET 4 avoided this class of issues using a similar approach. Your requests are not extremely expensive to process; there's just a lot of them and they compete for resources like threads, CPU time, memory, etc. If you put those requests in a queue and only process a few at a time then there's no longer contention and they can complete quickly. A naive version of this would be a middleware that uses a mechanism like SemaphoreSlim to limit the number of concurrently processed requests. As long as everything is flowing smoothly there's no reason you shouldn't be able to handle large bursts in a timely fashion. You just have to keep the burst from crushing you.
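
A naive sketch of that idea (the concurrency limit of 32 is a placeholder, and a real version would also want a timeout or load-shedding path so queued requests can't pile up forever):

using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class ConcurrencyLimitMiddleware
{
    // Allow N requests to execute concurrently; the rest queue on the async waiter.
    private static readonly SemaphoreSlim _gate = new SemaphoreSlim(initialCount: 32);
    private readonly RequestDelegate _next;

    public ConcurrencyLimitMiddleware(RequestDelegate next) => _next = next;

    public async Task Invoke(HttpContext context)
    {
        await _gate.WaitAsync(); // async wait: no thread is blocked while the request is queued
        try
        {
            await _next(context);
        }
        finally
        {
            _gate.Release();
        }
    }
}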

davidfowl commented 5 years ago

@clement911 can you reproduce the issue on the site with webhooks if you hammer it with similar traffic?

clement911 commented 5 years ago

@Tratcher that's an interesting suggestion. But if a request waits on a SemaphoreSlim, will that count towards the ThreadPool's IO threads? I think the hard part will be finding the sweet spot for the number of threads in the pool to achieve maximum throughput.

@davidfowl both app services host exactly the same code. I tried to reproduce the issue locally with a naive local program issuing requests, but that didn't work. It's probably possible to reproduce it with a proper load-testing tool, if that is what you are asking.

Tratcher commented 5 years ago

SemaphoreSlim has an async waiter so it doesn't block threads.

mmitche commented 5 years ago

@davidfowl I think I've got a very similar issue with 2.1.5 handling Azure DevOps webhooks in a test web service. The number of requests is low (bursts of 6, only when queuing builds). In my case it repros on maybe 5-10% of bursts. Again, I'm reading/hashing the body to validate against a signature.

tstuts commented 5 years ago

@clement911 @mmitche Are you guys seeing a lot of this exception too?

Microsoft.AspNetCore.Connections.ConnectionResetException: An existing connection was forcibly closed by the remote host