aspnet / IISIntegration

[Archived] ASP.NET Core IIS integration. Project has moved to https://github.com/aspnet/AspNetCore
Apache License 2.0
149 stars 59 forks source link

Kestrel + IIS is falling over with a small load in production with RTM #245

Closed jruckert closed 7 years ago

jruckert commented 8 years ago

We are currently experiencing an issue in production (after upgrading from DNX to RTM) when our Web API is

It seems that sometimes when the larger API get queued whilst processing the smaller requests it overwhelms Kestrel.

We have a setup as follows:

What we find is that we start eventually getting 502.3 errors and then Kestrel become unresponsive or takes a very long time to respond (e.g. > 30 seconds for a specific API when usually the API would respond in less than a second).

The only way to resolve this issue is to recycle the app pool and kill the process thread (currently we are having to do this at least once a day in production).

Do we need to tune IIS to handle this scenario and how can we diagnose, provide log files to get to the bottom of this, as prior to RC2 (we upgraded from DNX through to RTM) the website has never crashed.

jruckert commented 8 years ago

We are also planning on running the load tool wrk for a few larger API's > 2 seconds and large JSON response payload to see if we can replicate this issue if this would help.

jruckert commented 8 years ago

From the IIS Performance Counters:

5:20pm - thread count started growing dramatically 5:40pm - access issues to website first reported 5:50pm - IIS Application Pool stopped and started, and thread count drops

Process counters for w… The this is IIS w3wp worker process.

iisw3wp-threadcount

Process counters for P… This is the NetCore process.

netcore-threadcount

wpostma commented 8 years ago

I am having a similar problem. Nowhere near that level of load. 3 tiny dotnet apps, on a server that also has an SQL Server 2016 instance.

I get about 6 seconds of 404 errors, followed by a permanent 502.5 condition, that lasts until I do a stop on w3svc and then start it again. Restarting app pools is not sufficient.

http://stackoverflow.com/questions/38613615/dotnet-process-inside-iis-crashes-how-can-i-determine-cause-and-remediate/38745424#38745424

Do you get anything interesting if you enable stdout logging?

What do you see in the iis text log files (c:\inetpub\logs)? Is it similar to the log output in my answer in stackoverflow link above?

KLuuKer commented 8 years ago

Looks like i may also have the same issue, we finished converting our multi-tenant framework 4.6 mvc 5 to core mvc 1 our traffic monthly is about:

Our setup:

dotnet --version

Microsoft .NET Core Shared Framework Host

  Version  : 1.0.1
  Build    : cee57bf6c981237d80aa1631cfe83cb9ba329f12

stdout

Hosting environment: Production
Content root path: C:\inetpub\project.name-mvccore
Now listening on: http://localhost:21704
Application started. Press Ctrl+C to shut down.

and then all loads of variations of this that never get finished (all from real users hitting our real e-commerce site)

info: Microsoft.AspNetCore.Hosting.Internal.WebHost[1]
      Request starting HTTP/1.1 GET http://example.com/variouspaths/that-should-work/

i also ran it with debug logging, no real difference (i can give it to the microsoft team if they want it)

{
  "Logging": {
    "IncludeScopes": false,
    "LogLevel": {
      "Default": "Debug",
      "System": "Debug",
      "Microsoft": "Debug"
    }
  }
}
KLuuKer commented 8 years ago

@shanselman @DamianEdwards sorry to @ you like this, but we are letting people hit our old mvc 5 app until this can be fixed and so we cant deploy new code in the mean time (saas e-commerce app and our customers are getting angry)

muratg commented 8 years ago

cc @SajayAntony @sivagms

muratg commented 8 years ago

@KLuuKer Could you share the logs?

pan-wang commented 8 years ago

Could anyone provide a repro so that I can take a look at it? 404 error is generated by Krestral. 502 means IIS could not start the Krestral process. For this error, please take a look at windows event log. It usually gives some error code indicating the root cause of starting failure. Please also try enabling IIS FREB log which might help to identify the root cause.

KLuuKer commented 8 years ago

@muratg can you give me your email here or dm via twitter to https://twitter.com/KLuuKer

wpostma commented 8 years ago

I have set up an indentical system to the one that is falling over on my W2012R2 vm, on an azure vm, but it's using Windows Server 2016 TP5 on the azure VM, and if it reproduces for me, I'll get access to any MS employee who needs it. Right now I'm running exactly the same setup as the local VM that failed, except without using the IIS URL Rewrite extension. If no reproduce, I'll add back in IIS URL rewrite. I suspect that the problem is exacerbated or made more likely by me using three apps in a single pool.

wpostma commented 8 years ago

@pan-wang How do I enable IIS FREB for this context? I don't know what to enter for "Specify Content to Trace" prompt in iismgr. All contents? Then 502 error? freb

KLuuKer commented 8 years ago

Just making this clear: i am running kestrel\mvc core 1 using the full framework 4.6 because of legacy bits

jruckert commented 8 years ago

Also the spin up time for our process due to the large dependency injection requirement is approx 5 seconds before IIS can access it. Would this affect things?

pan-wang commented 8 years ago

IIS will log the start and failure of Kestrel process to windows event log. In case of 502 error, this log will be most important source to help identifying the root cause of the failure. If possible, please do take a look at it and share it.

@jruckert IIS has a timeout setting on Kestrel process starting, default is 2 minutes. @wpostma Yes, you can choose all contents and your current FREB setting looks good. For your reference, current hosting mode is one Kestrel application per IIS apppool (site). There will be some issues especially about recycling/file change if you host multiple apps together. I am working on changing it to allow user to host multiple applications inside one apppool.

guardrex commented 8 years ago

Ping @tuespetre ... and I ... have an interest in your (@pan-wang) comment on multiple apps per pool, as @tuespetre is doing it in staging, and the green dino here is :chicken: too chicken :chicken: to try it at all. :smile:

I am working on changing it to allow user to host multiple applications inside one apppool.

@pan-wang, there is no GH issue we can track on for that, correct?

KLuuKer commented 8 years ago

From what i have seen on my server, sometimes it starts up correctly getting hit with real user traffic and after a while it starts getting slower, later it just slows to a crawl and then just stops and i get timeouts

If at that moment I recycle the apppool or manually stop iis (iisreset /stop) and start iis (iisreset /start) i still keep getting timeouts

when it starts giving timeouts (before and after restart) i also notice kestrel using almost no cpu (1% and less) and the entire cpu also running at idle doing almost nothing

in my views i have code like this that calls into a class that holds adhoc query's using the HttpClient

return Task.Run(async () =>
{
    var response = await this.httpclient.GetAsync("get-something-from-my-seperate-backend-machine/");
    response.EnsureSuccessStatusCode();

    return await response.Content.ReadAsAsync<IList<MySomethingPoco>>();
}).Result;

this is because of legacy reasons when views (mvc 5) were not able to run async code

KLuuKer commented 8 years ago

@jruckert @wpostma do you also some code like mine above? or just normal async await code with allot of HttpClient calls to a different machine?

jruckert commented 8 years ago

I'm having a similar experience where it seems to gets slower but it also sometimes recovers. As we are only hosting Web api controllers I'm not 100% sure if all client calls are using the approach you have outlined.

KLuuKer commented 8 years ago

TL;DR its not the end users\browsers that are using that code i am

that is the code that i have in my frontend server (kestrel\iis) that is getting hit with browser requests and since i have a n-tier system the frontend (my code) is using that piece of code to hit a different machine my backend server (web api 2\mvc5\iis\EF6) and that grabs it from the database and returns it

KLuuKer commented 8 years ago

Is there some way of limiting the amount of concurrent request in iis? I would like to try to see if limiting the amount of simultaneous request can temporarily fix the problem.

KLuuKer commented 8 years ago

so i removed that old return Task.Run(async () => { // work here }.Result; code and...... it did NOT help as soon as the users started hitting the site again hard it began to crawl to a halt again so no luck there

guardrex commented 8 years ago

@KLuuKer You might get some good info if you PerfView the app before and after it goes into :snail: mode. https://blogs.msdn.microsoft.com/dotnet/2012/10/09/improving-your-apps-performance-with-perfview/

davidfowl commented 8 years ago

@KLuuKer IIS limits concurrent requests to 5000 by default.

Has anyone collected a profile or process dump? Where is the CPU time spent? Also is everyone using IIS + Kestrel?

muratg commented 8 years ago

Folks, could someone also try to use a single app per app pool to see if this is the culprit? @pan-wang is working on better supporting this in a future release of AspNetCoreModule, but perhaps this could be the work-around for now.

KLuuKer commented 8 years ago

@davidfowl yes i am using IIS (win 2012r2 fully patched) + kestrel anyway to lower that (dont mind machine wide) but keeping the queue big?

@muratg i am using single app per app pool

also the strange thing is i dont see big cpu usage (or 25% because its a quadcore box) more like below 10% total cpu usage

Tratcher commented 8 years ago

Can anyone share an app that reproduces the issue for us to investigate directly?

KLuuKer commented 8 years ago

i tried making a repro app but failed (will try again)

@wpostma how is your repro going?

pan-wang commented 8 years ago

This post lists a way to viewing long running IIS request inside InetMgr https://technet.microsoft.com/en-us/library/cc732518(v=ws.10).aspx. You can also use perfmon to monitor IIS perf counters such as Active Requests under W3SVC_W3WP. The max number of active requests is 5000. Once this max is reached, IIS will not server new quest. Could you please open perfmon to take a look at this IIS perf counter?

jruckert commented 8 years ago

I'm still working on a repro and using wrk as the load simulator.

KLuuKer commented 8 years ago

@jruckert are you able to repro it using wrk on the full app? and are you now just cutting down the code to where the problem is?

jruckert commented 8 years ago

I think it is a combo of a couple of simultaneous heavy calls with a couple of short calls. I'm in the process of rerunning the IIS logs as this is our best chance of reproducing. We are hosting approx 1200 api's and the consumer base is well over 20 million so the sooner we get to the root cause the better.

jruckert commented 8 years ago

BTW I'm unfortunately unable to simulate this in production every day due to the existing load failing over every day. I have no roll back plan as we have committed to dotnet cli 12 months ago.. So I'm reliant on the IIS logs to replay load. As we are publishing records back to 1800 this is a major issue.

davidfowl commented 8 years ago

Does anyone have a process dump? That might help as well.

KLuuKer commented 8 years ago

I took another try at running our production traffic trough kestrel\mvc core but it crashed (got really slow) again here are some things i noticed

cpu usage on sql, backend, frontend (this runs mvc core) machines is below 10% tcp connections on the frontend are below 2000 total (including those waiting for timeouts) so no ephemeral port are exhausted

it seems to be a combination of request, just allot of GET requests to / (the frontpage) is not enough to replicate the problem

DamianEdwards commented 8 years ago

@KLuuKer can you paste your Startup.cs, project.json', &web.configfiles here? Pls ensure you're not using the Console logger in production, as that takes a global lock on every log message. Configure something like Serilog to log to a text file for in-app logging and just have startup errors go through console out to the configure txt log viaweb.config`.

KLuuKer commented 8 years ago

@DamianEdwards you are not going to like this, i don't either and i want to clean it up eventually (you know when i will get "around" to do it).

i should be on gitter when you read this message i will check gitter for any random questions

web.config - https://gist.github.com/KLuuKer/2ddc14ee76c259efdd020a9e76cb204e program.cs - https://gist.github.com/KLuuKer/d5da65927244174a70083eb8a4334b58 project.json - https://gist.github.com/KLuuKer/d1f5106f941ef80ec472d2f8f012e08b

and please make sure your seat is in the upright position the: "omg what the hell are your doing please do this in a different way for the love of programming" startup.cs - https://gist.github.com/KLuuKer/95d0b76f56ea29bd56caeeb585726929 (feedback welcome)

jruckert commented 8 years ago

@davidfowl We have a process dump of the issue how can we get this to you??

KLuuKer commented 8 years ago

@DamianEdwards i switched to Serilog and removed console log, its still has the same 0% cpu and a ton of request pouring in getting timed out

muratg commented 8 years ago

@jruckert Could you upload it to onedrive or similar and share a link?

jruckert commented 8 years ago

@muratg I am assuming this process dump has the potential to contain confidential customer information. I can put up on a secure download site. Can someone DM me at twitter.com/jruckert

pan-wang commented 8 years ago

@KLuuKer did you check IIS active request counter when the hanging happened?

KLuuKer commented 8 years ago

@DamianEdwards tested again this time also removed\disabled the timers and the websocketclient i have in there, same problem. If you have special preview\beta builds that you want me to try, i can also try that because its the only kestrel site we have and because it crashes from 1 hour to 10-5 minutes to while kestrel is still starting up

@pan-wang request monitoring looked like this request-monitoring-privacy-stripped

jruckert commented 8 years ago

I have been able to reproduce the issue consistently within our test environment and using wrk as the load tester. As it's a large code base what would you recommend the next step is to provide a public repo?

jruckert commented 8 years ago

It is definitely a load issue and related to a long running query. We see that the dotnet process hovers around 100% CPU after about 5 seconds although SQL is only running at 5% after an initial burst. It seems that it's still trying to process something but it stops calling SQL after a while.

KLuuKer commented 8 years ago

@jruckert so with your problem the process is stuck at 100%? because mine is that its stuck at 0% (w3wp, kestrel and the rest of the cpu) and it looks like its waiting for some lock somewhere

yeah public repo would be very helpful because Microsoft can then do all the profiling and dumping they want, so start stripping out some of the code that is not touched, repo again, make a backup of the stripped version, and keep stripping\repoing\backuping until you cant repo

KLuuKer commented 8 years ago

today we tried running on azure windows server 2016 TP 5 that also didn't help we tried with a quad core and a single core machine

on the quad core (DS3_V2) it looked good for a hour or so and the single core (DS1_V2) machine was good for 2 hours

the both ended up taking really long for each request and they started to timeout again with next to nothing cpu usage (total usage below 5%)

jruckert commented 8 years ago

Also what should i expect when running the following:

wrk -t5 -c50 -d60s https://webapithatisquick

wrk -t5 -c50 -d60s https://webapithatisslow

When i run the top command, the site works fine

When i run the bottom command, the site hangs and does not recover unless i recycle the app pool.

Would this be expected behaviour?

jruckert commented 8 years ago

@muratg Is it best that I log a premier support ticket on this as the customer is not the happiest in the world at the moment.

davidfowl commented 8 years ago

@jruckert You should do that as well. Github isn't the same as customer support. We're looking into it but you should use as many channels as you have at your disposal.

davidfowl commented 8 years ago

It is definitely a load issue and related to a long running query. We see that the dotnet process hovers around 100% CPU after about 5 seconds although SQL is only running at 5% after an initial burst. It seems that it's still trying to process something but it stops calling SQL after a while.

Are you using EF?