dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License

reopen issue #10117 - IIS app pool recycle throws 503 errors #41340

Closed alex-jitbit closed 6 months ago

alex-jitbit commented 2 years ago

Is there an existing issue for this?

Describe the bug

The IIS app pool throws 503 errors during recycles. This is a known issue with the ANCM module that was reported previously in #10117, which has 33 likes and a 3-year discussion. It was never fixed, but it was automatically closed by a bot "as a clean-up due to lack of discussion".

P.S. This is not a deployment problem. There are many scenarios when IIS app pool is being recycled outside of our control (adding/removing SSL certificates, changing IP addresses to listen to, etc... basically, touching any IIS setting causes a recycle - and 503 errors are unacceptable for high-availability scenarios).

.NET Framework was free of this bug.

Expected Behavior

No errors during recycles.

Steps To Reproduce

see the issue linked #10117

Exceptions (if any)

No response

.NET Version

5.0, 6.0, 7.0, 8.0

Anything else?

No response

Tratcher commented 2 years ago

https://github.com/dotnet/aspnetcore/issues/10117#issuecomment-498865444

Using separate app pools and a load balancer is our recommended approach for high-availability as it allows you full flexibility over deployment process and the ability to easily revert versions.

Trying to achieve high-availability with a single instance is not recommended.

alex-jitbit commented 2 years ago

@Tratcher like I indicated above, this is not about deployments. There are a lot of scenarios when IIS recycles the pool (see above)

Tratcher commented 2 years ago

Deployments are just one example that disrupt availability. A single instance is not advised for high-availability for many reasons.

ghost commented 2 years ago

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

benjamin-stern commented 2 years ago

@Tratcher Even having more than a single instance this would still strongly affect a service, as all the requests going to the server that's recycling would be returned the 503 error.

c0shea commented 2 years ago

We have two instances running behind a load balancer and have still experienced this issue intermittently. When the app pool inevitably recycles (due to deployment, config change, etc), it starts returning 503 instead of queuing up the requests.

The load balancer doesn't immediately treat the 503s as the server being down and take it out of the rotation. Instead, it uses a polling mechanism that calls an endpoint (i.e. /status) on each instance and checks for a successful response. While that status endpoint is monitored frequently enough, there is obviously plenty of time where a bunch of requests will fail with 503 while the recycle is happening. We can't have the load balancer take the instance out of the rotation if it sees 503s being returned because (1) it's not an available option in NetScaler and (2) if both happen to recycle at the same time, both servers would be taken out of the rotation and the service would be completely down until manual intervention told the load balancer that the requests aren't failing anymore.

RomBrz commented 2 years ago

In this scenario (two or more instances behind a load balancer), since you're using "/status" to check availability, I suggest adding a routine to your deployment or maintenance process that, before doing anything else, forces "/status" to report an "unhealthy" status, so the load balancer can remove the node from rotation before you make the changes.
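A minimal sketch of that drain idea, assuming the app uses ASP.NET Core's built-in health checks. `DrainState` and `DrainHealthCheck` are hypothetical names, not something from this thread, and how the flag gets flipped (admin endpoint, file watch, deployment script) is left open.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical shared flag: flip it before planned maintenance.
public sealed class DrainState
{
    private int _draining; // 0 = serving, 1 = draining

    public bool IsDraining => Volatile.Read(ref _draining) == 1;
    public void BeginDrain() => Interlocked.Exchange(ref _draining, 1);
    public void EndDrain() => Interlocked.Exchange(ref _draining, 0);
}

// Reports Unhealthy while draining, so the load balancer's probe starts failing.
public sealed class DrainHealthCheck : IHealthCheck
{
    private readonly DrainState _state;

    public DrainHealthCheck(DrainState state) => _state = state;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(_state.IsDraining
            ? HealthCheckResult.Unhealthy("Draining for maintenance")
            : HealthCheckResult.Healthy());
}
```

To wire it up, register `DrainState` as a singleton, call `AddHealthChecks().AddCheck<DrainHealthCheck>("drain")`, and map the endpoint with `app.MapHealthChecks("/status")`. With the default options an Unhealthy result is returned as HTTP 503. This only helps with planned maintenance; it does nothing for recycles that IIS triggers on its own.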

As for the issue itself, the default IIS behavior on a recycle is to first start a new application pool process, route new requests to it, wait up to the configured time for the current requests to finish, and then shut down the old process, keeping only the new one as the application pool.

Recycling an application pool could be ASP.NET Core "expected behavior", but from the IIS side, throwing 503s during a recycle isn't "normal behavior".

c0shea commented 2 years ago

The problem is that ASP.NET Core doesn't use the overlapping recycle behavior that .NET Framework did. While the old worker process is being shut down (especially if it was handling a lot of in-flight requests), the new process isn't started yet, and the requests arriving in between get the 503.

alex-jitbit commented 2 years ago

TIL that StackOverflow also runs ASP.NET Core under IIS

[screenshot]

luizfbicalho commented 2 years ago

Is there any way to minimize this problem, or at least to detect what is causing it? Any configuration in the application pool to minimize it?

peter-bertok commented 2 years ago

Deployments are just one example that disrupt availability. A single instance is not advised for high-availability for many reasons.

High availability and throwing 503s from otherwise "perfectly fine servers" are separate concerns.

Most load balancers do not hide HTTP errors! If the IIS process responds to a HTTP request with 503, then that's what the user will see. In particular, none of the Azure load balancer offerings hide errors from the users. They pass them on faithfully.

If a previously working server throws 503 errors then it will take significant time for the load balancers to detect this. Minutes even, or 10+ minutes if using CDN-type solutions such as Azure Front Door.
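A rough way to put numbers on that detection delay, as a sketch only. It assumes a probe-based load balancer that ejects a node after a fixed number of consecutive failed probes; `ProbeMath` and its parameters are hypothetical, and real products add their own timeouts and jitter.

```csharp
using System;

public static class ProbeMath
{
    // Approximate time before a probe-based load balancer ejects a node:
    // it has to observe `unhealthyThreshold` consecutive failed probes.
    public static TimeSpan DetectionWindow(TimeSpan probeInterval, int unhealthyThreshold)
        => probeInterval * unhealthyThreshold;

    // Requests that still reach the failing node during that window.
    public static double RequestsExposed(double requestsPerSecond, TimeSpan window)
        => requestsPerSecond * window.TotalSeconds;
}
```

With made-up values of a 5-second probe interval, a threshold of 3, and 50 requests per second to the node, the window is 15 seconds and about 750 requests can receive a 503 before the node is taken out of rotation.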

This behavior is triggered by many actions, not all of which are resolved via hosting on multiple server instances. Scheduled recycling, for example, has been mentioned by many people as a common trigger.

Similarly, many people have pointed out in the previous thread that it's not just 503 errors that are seen, but slow uploads are also unceremoniously terminated.

luizfbicalho commented 2 years ago

My problem seems to occur when the IIS recycle takes too long. It's not related to CPU or memory. I asked the IT infrastructure team to add more performance counters to Grafana, but they haven't added them yet.

I'm inclined to think that the problem is with network connections. What do you think I should monitor in Grafana?

alex-jitbit commented 2 years ago

TIL that fuget.org also uses IIS + ASP.NET Core, just caught them during a recycle. [screenshot]

How many more examples do we need?

AndreasJilvero commented 2 years ago

In my experience, this happens also when simply setting the physical path of a website. The response is somewhat different though - a recycle renders the text "The service is unavailable" and setting the physical path just gives an empty 503 result.

https://stackoverflow.com/questions/74326315/iis-setting-physical-path-gives-503-status-for-a-few-seconds

luizfbicalho commented 2 years ago

In my case it's not for seconds, after the first 503 only iisreset solves the problem

kadamgreene commented 1 year ago

We are also experiencing this issue. Is there any expectation of a fix? This is going to keep us from moving forward with .NET migration / any new work in .NET 6+. It's not the deployments, blue/green can handle that, it's the "unexpected" recycles in the run of a day that are the issue.

cun-dp commented 1 year ago

We are also seeing this with our nightly apppool recycles in the production environment, and with all our aspnetcore microservices.

Example of IIS Logs (anonymized):

    2023-04-23 22:47:03 W3SVC5 HOSTNAME1 [IP censored] GET /app/health 443 - HTTP/1.1 - - my.fqdn.example.com 503 0 1255 202 130 15

HTTP 503 with Win32 status 1255. According to https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--1000-1299- code 1255 (0x4E7) is ERROR_SERVER_SHUTDOWN_IN_PROGRESS: "The server machine is shutting down."
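To find these entries in bulk, a small parser can pull the status triple from each log line. This is a sketch that assumes every line ends with the fields `sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken`, which is how the sample line above reads; check the `#Fields` header of your own logs, since the field set is configurable. `IisLogLine` and `IisStatus` are hypothetical names.

```csharp
using System;

public readonly record struct IisStatus(int Status, int SubStatus, int Win32Status);

public static class IisLogLine
{
    // Reads the status fields by counting from the end of the line, so that
    // spaces inside earlier fields (such as a redacted IP) do not matter.
    public static IisStatus ParseTrailingStatus(string line)
    {
        var parts = line.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 6) throw new FormatException("Too few fields.");

        int n = parts.Length;
        return new IisStatus(
            int.Parse(parts[n - 6]),   // sc-status
            int.Parse(parts[n - 5]),   // sc-substatus
            int.Parse(parts[n - 4]));  // sc-win32-status
    }

    // 503 with Win32 error 1255 (ERROR_SERVER_SHUTDOWN_IN_PROGRESS).
    public static bool IsShutdownInProgress(IisStatus s)
        => s.Status == 503 && s.Win32Status == 1255;
}
```

Counting lines where `IsShutdownInProgress` is true, grouped by timestamp, gives a per-recycle tally that can be compared against the recycle schedule.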

This is a critical issue for us, since we are running a 24/7 platform.

cun-dp commented 1 year ago

I did a few tests with an almost empty aspnetcore 6 application and with some of our production applications on IIS10. Conclusion: the problem always occurs, no matter how simple the application is. Sync or Async controller actions as well as InProcess or OutOfProcess hosting make no difference in my tests.

Of all my tests the problem seemed to be exacerbated the most if the application is doing things during the Application Stopped event:

app.Lifetime.ApplicationStopped.Register(() => { Thread.Sleep(TimeSpan.FromMilliseconds(1000)); });

Using this snippet in my almost empty aspnetcore 6 application leads to 20-50 times the amount of HTTP503 responses during apppool recycle compared to using no Application Stopped event handler.
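For anyone who wants to repeat this, here is a sketch of what such an almost empty app could look like. It is a guess at the shape of the test app, not the commenter's actual code; `SlowStopApp` and the `stopDelay` parameter are hypothetical.

```csharp
using System;
using System.Threading;
using Microsoft.AspNetCore.Builder;

public static class SlowStopApp
{
    public static WebApplication Build(string[] args, TimeSpan stopDelay)
    {
        var builder = WebApplication.CreateBuilder(args);
        var app = builder.Build();

        // Single trivial endpoint to send requests to during the recycle.
        app.MapGet("/", () => "ok");

        // Simulates cleanup work that keeps the old process alive
        // after it has already stopped serving requests.
        app.Lifetime.ApplicationStopped.Register(() => Thread.Sleep(stopDelay));

        return app;
    }
}
```

In `Program.cs` this would be started with `SlowStopApp.Build(args, TimeSpan.FromMilliseconds(1000)).Run();`. Setting the delay to zero gives the baseline to compare against.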

Tratcher commented 1 year ago

@cun-dp that makes sense, the application has stopped serving traffic and can't re-start until the current process exits.

cun-dp commented 1 year ago

@Tratcher It makes sense only in the way that my test further confirms the bug with aspnetcore app pool recycling: I can see that a second W3SVC instance gets spawned the moment the app pool is asked to recycle, so the application is restarting. But routing of new requests to the new application instance simply does not work. Instead, the requests get routed to the old application, which is shutting down and is therefore rejecting requests.

This supports the fact that IIS (or the aspnet core v2 module?) does not handle app pool recycles correctly by overlapping both application instances and routing the new requests to the new application instance the moment the application pool is asked to recycle, like IIS does with aspdotnet framework 4.x (and previous) applications.

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot: This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.

The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending request to the app running in "MyAppPool".
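The "continuously sending requests" half of that repro can be sketched as a small probe that tallies responses by status code. `RecycleProbe` is a hypothetical helper and the target URL is whatever your app listens on; run it in one console and `Restart-WebAppPool MyAppPool` in another.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

public static class RecycleProbe
{
    // Sends sequential GETs for `duration` and tallies responses by status code.
    // Transport failures and timeouts are tallied under key 0.
    public static async Task<SortedDictionary<int, int>> RunAsync(
        HttpClient client, Uri url, TimeSpan duration)
    {
        var counts = new SortedDictionary<int, int>();
        var clock = Stopwatch.StartNew();

        while (clock.Elapsed < duration)
        {
            int code;
            try
            {
                using var response = await client.GetAsync(url);
                code = (int)response.StatusCode;
            }
            catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
            {
                code = 0;
            }

            counts[code] = counts.TryGetValue(code, out var seen) ? seen + 1 : 1;
        }

        return counts;
    }
}
```

A run across a recycle that reports only 200s means no request was rejected; any entries under 503 (or 0) are the failures this issue is about. The probe is sequential, so it understates what a concurrent load test such as K6 would see.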

luizfbicalho commented 1 year ago

@cun-dp that makes sense, the application has stopped serving traffic and can't re-start until the current process exits.

But wouldn't the correct approach be to start a new process, redirect new connections to that process, and let the old process die in peace for as long as it takes?

Is there a way for me to see what's blocking the old process from dying?

Is there a way to force all the threads in the old process to exit?

Is there any workaround to help with this problem?

gcbenjamin commented 1 year ago

And to make this abundantly clear, because it has been misinterpreted in #10117 a lot: This is not about deployments of applications. This bug is triggered by only recycling an app pool running an existing, unchanged, aspnetcore application.

The behaviour can be reproduced by just running Restart-WebAppPool MyAppPool in PowerShell while continuously sending request to the app running in "MyAppPool".

We see the exact same thing and easy to reproduce as explained by @cun-dp , just recycle while under load and requests will fail with 503's. This behaviour is not seen in any of our framework api's, only .net core. I've tried all different combinations of IIS/App pool settings and nothing has worked. Ran my .net core api continuously using K6 and always get hit with 503's when recycling under load (this api is a port from .net framework which never has this issue running same load tests).

While this doesn't fix the problem, it has helped, in the app pools advanced settings, setting Disable Overlapped Recycle to TRUE I get around 90% less 503's (in my tests from 60 to 7).

luizfbicalho commented 1 year ago

Nice, I'll try that solution. Is there a way to find what is locked in the old process, whether it is a file lock or another resource?

estebanorellana commented 1 year ago

At our company, the exact same thing happens when we install a .NET 6 application on a server with IIS: after the first 503 error, it does not recover without an iisreset.

But if we run the same site with Kestrel, we have no problem.

alex-jitbit commented 1 year ago

Looking at the ANCM commit history, I see it had 2 commits in 1 year. I'm thinking MS priorities are elsewhere.

(or the C++ guru who wrote it has left the company and now everyone's just afraid to touch it)

luizfbicalho commented 1 year ago

I saw that @BrennanConroy is committing code. It would be great if we could get a better error message showing what is preventing the shutdown of the ASP.NET Core app.

petersladek commented 1 year ago

This clearly is deterioration vs ASP.NET behavior in IIS and it catches most people by surprise in production (doesn't matter if one is using dedicated load balancer or not as typically there will be http 503 responses passed through LB to client before LB removes the traffic to unhealthy node). Can we at least get the documentation updated to state this difference in behavior of ASP.NET Core apps in IIS vs ASP.NET Framework apps in IIS? And maybe suggesting that when running ASP.NET Core apps in IIS one should consider disabling periodic recycling of app pool (as default is to recycle every 1740 minutes) to avoid http 503 errors.
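For reference, the periodic recycle timer mentioned here can be turned off in code through the IIS management API. This is a sketch, assuming the `Microsoft.Web.Administration` assembly is available and the process runs elevated on the IIS machine; `PoolSettings` is a hypothetical helper and the pool name is a placeholder.

```csharp
using System;
using Microsoft.Web.Administration; // IIS management API; Windows only, needs admin rights

public static class PoolSettings
{
    public static void DisablePeriodicRecycle(string poolName)
    {
        using var manager = new ServerManager();

        var pool = manager.ApplicationPools[poolName]
            ?? throw new ArgumentException($"App pool '{poolName}' not found.");

        // The default is 1740 minutes (29 hours); TimeSpan.Zero turns the timer off.
        pool.Recycling.PeriodicRestart.Time = TimeSpan.Zero;

        manager.CommitChanges();
    }
}
```

Calling `PoolSettings.DisablePeriodicRecycle("MyAppPool")` only removes the time-based trigger. Recycles caused by configuration changes, memory limits, or manual action still happen, so this reduces how often the 503s occur rather than removing them.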

divil5000 commented 1 year ago

We are running into this issue in production too. Our service is constantly serving requests, and changing the physical path (to update the binaries) leads to many seconds of 503 responses before it starts serving successfully again. I cannot believe Microsoft are ignoring this issue.

luizfbicalho commented 1 year ago

We are running into this issue in production too. Our service is constantly serving requests, and changing the physical path (to update the binaries) leads to many seconds of 503 responses before it starts serving successfully again. I cannot believe Microsoft are ignoring this issue.

My problem is worse: when I receive a 503, the app pool stops and doesn't restart.

TRMack commented 1 year ago

We are in the same boat; we run behind a load balancer but a scheduled app pool recycle on an instance -- it recycles every 29 hours -- returned 503 to a customer which disrupted an automated process of theirs. I always trusted IIS to handle the recycle gracefully and am frustrated to see that that is no longer the case.

cun-dp commented 1 year ago

[...]

While this doesn't fix the problem, it has helped, in the app pools advanced settings, setting Disable Overlapped Recycle to TRUE I get around 90% less 503's (in my tests from 60 to 7).

I could not reproduce this mitigation.

In my tests, the differences were well within the variance I would expect. I did 7 test runs, each once with OR ON and once with OR OFF (with IIS resets in between for the last 2 runs, which didn't make a difference either).

My test: Send GET requests to the application as fast as possible. Recycle the apppool 10 times, waiting 5 seconds between each recycle.

Results (# of HTTP503 responses):

    Overlapped Recycle ON:  591, 691, 741, 712, 736, 631
    Overlapped Recycle OFF: 561, 677, 748, 696, 621, 598
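The "within the variance" reading can be checked with a quick mean and sample standard deviation over the two series. `RunStats` is a hypothetical helper; the inputs are the counts listed above.

```csharp
using System;
using System.Linq;

public static class RunStats
{
    public static double Mean(int[] xs) => xs.Average();

    // Sample standard deviation (n - 1 in the denominator).
    public static double SampleStdDev(int[] xs)
    {
        double mean = xs.Average();
        double sumSq = xs.Sum(x => (x - mean) * (x - mean));
        return Math.Sqrt(sumSq / (xs.Length - 1));
    }
}
```

For the ON series the mean is about 683.7 with a standard deviation of about 60; for the OFF series it is about 650.2 with a standard deviation of about 69. The gap between the means (roughly 33 responses, or 5%) is smaller than the run-to-run spread of either series, which is consistent with the setting making no measurable difference in this test.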

luizfbicalho commented 1 year ago

I just want an official Microsoft statement: if you use this, do that, switch to Docker, or something else.

Did anyone get any documentation about this? I need to show something to my customer.

MV10 commented 1 year ago

We run hundreds of thousands of IIS servers, and some of our load balanced farms span 120+ servers in multiple global data centers, and this is becoming a very serious problem as apps (finally) migrate away from .NET Framework.

As many others have stated clearly, this is almost entirely unrelated to deployment. With a couple tens of thousands of applications in production usage, our operations people do several hundred overlapped recycles of unhealthy applications every day, and most of our applications are on staggered nightly recycle schedules.

We haven't had much luck getting our very-expensive Microsoft DSEs to help prioritize anything in .NET from a dev/issue perspective (I'll spare you my rants about Agile perma-backlogged deferral), but I'll be running this one up the chain with management again. There's always a first time, I guess.

Also, the area-networking tag is incorrect in my opinion. This is runtime process behavior. Or maybe something to do with IIS interaction, if such a tag exists.

Whatever the tag, this is extremely critical for real-world production business usage.

alex-jitbit commented 1 year ago

This problem has become way worse in .NET 8!

Before net8 (we were on net6) - the app becomes sluggish and throws 5-10 errors 503.

After moving from net6 to net8 - it seems like all requests throw 503, until the recycled process spawns up.

Not only you haven't fixed this, you even made it worse

MV10 commented 1 year ago

We were told this is "by design" and won't be fixed.

Insane.

luizfbicalho commented 1 year ago

We were told this is "by design" and won't be fixed.

Insane.

Where did they tell you this? If you can find the source, it would help me convince my client that this won't work anymore.

MV10 commented 1 year ago

@luizfbicalho Our DSEs (Dedicated Support Engineers) quoting whatever internal sources they have.

luizfbicalho commented 1 year ago

@luizfbicalho Our DSEs (Dedicated Support Engineers) quoting whatever internal sources they have.

Thanks, but I needed an official answer.

MV10 commented 1 year ago

DSEs are MS employees, to be clear -- but I know you can't just quote "some random guy on the Internet" (me).

luizfbicalho commented 1 year ago

Yeah, it would be nice if Microsoft had an official statement about this. This client could use Linux, because the app doesn't use anything from IIS.

Hieu-Nguyen-1 commented 1 year ago

I think that it will be fixed in .NET 8! But.....!!!! Please fix it @Microsoft @BrennanConroy

alex-jitbit commented 1 year ago

OP here.

Welp.

It's been more than 4 years of Microsoft ignoring this issue (if you count the original one). Even though the 503 errors can be seen on major .NET-powered websites like StackOverflow (see example screenshots above in this issue). StackOverflow, by the way, is proudly showcased as Microsoft's number one customer on their ".NET Customers Showcase" page https://dotnet.microsoft.com/en-us/platform/customers/aspnet

I still love .NET and C# too much to abandon it, so I will continue to use it.

However as of this weekend we have moved all our production servers away from Windows to Linux. We're not paying for Windows Server licenses any more. If Microsoft can't make their own two products work together - well, you just lost a loyal paying customer.

💔

luizfbicalho commented 1 year ago

Can you detail more what you are going to use? what linux? what webserver? configuration that you changed from the default

alex-jitbit commented 1 year ago

Can you detail more what you are going to use? what linux? what webserver? configuration that you changed from the default

@luizfbicalho Our config is very basic and follows the official MS docs

  1. Ubuntu 22.04 with Nginx as a reverse proxy in front of the app (that runs on port 5002)
  2. A systemd "service" described by one file /etc/systemd/system/MyApp.service, which simply runs dotnet MyApplicaion.dll and sets some basic settings.

That's it. Very simple. Just make sure you call UseForwardedHeaders in your app to make it "reverse-proxy-friendly".
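A minimal sketch of the `UseForwardedHeaders` part, assuming Nginx runs on the same host and sets `X-Forwarded-For` and `X-Forwarded-Proto`. `ProxyFriendlyApp` is a hypothetical wrapper, not the commenter's actual setup.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.HttpOverrides;
using Microsoft.Extensions.DependencyInjection;

public static class ProxyFriendlyApp
{
    public static WebApplication Build(string[] args)
    {
        var builder = WebApplication.CreateBuilder(args);

        // Accept the client IP and original scheme reported by the reverse proxy.
        // By default only loopback proxies are trusted, which fits Nginx on the same host.
        builder.Services.Configure<ForwardedHeadersOptions>(options =>
        {
            options.ForwardedHeaders =
                ForwardedHeaders.XForwardedFor | ForwardedHeaders.XForwardedProto;
        });

        var app = builder.Build();

        // Must run before any middleware that reads the scheme or the client IP.
        app.UseForwardedHeaders();

        app.MapGet("/", () => "ok");
        return app;
    }
}
```

If the proxy runs on a different machine, its address has to be added to `KnownProxies` or `KnownNetworks`, otherwise the forwarded headers are ignored.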

Whenever nginx configuration needs to change (a new IP to listen to, a new SSL certificate installed, a new hostname to listen to etc etc) you just change nginx config files and run sudo service nginx reload with zero downtime.

P.S. We've actually implemented two systemd-services for blue-green deployment that run on different ports, but that's another story.

BrennanConroy commented 11 months ago

Hey folks, I looked into this and made some improvements, at least with local testing. I'd love for people on this thread who are seeing 503s during recycles to voluntarily test a change.

At the bottom of this comment is a dll with some changes to try and improve 503s during overlapped recycles. The dll has been signed by Microsoft, you can verify this by running signtool verify /v /pa aspnetcorev2.dll. Signtool is located in C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64\signtool.exe (note: 10.0.xxxxx.0 version may vary depending on your machine). The SHA256 for the dll is BECDEC71EE95A2F117A75A5F2BF961EED3EEC1A641D8A4573EC53AEE25696AE2
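If signtool is not at hand, the published SHA256 can also be compared in a few lines of C#. This is a sketch; `FileHash` is a hypothetical helper, and it checks only the hash, not the Microsoft signature.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHash
{
    // Returns the SHA256 of the file as uppercase hex.
    public static string Sha256Hex(string path)
    {
        using var stream = File.OpenRead(path);
        using var sha = SHA256.Create();
        return Convert.ToHexString(sha.ComputeHash(stream));
    }

    // Case-insensitive comparison against a published hash string.
    public static bool Matches(string path, string expectedHex)
        => string.Equals(Sha256Hex(path), expectedHex.Trim(), StringComparison.OrdinalIgnoreCase);
}
```

For the dll from this comment the call would be `FileHash.Matches("aspnetcorev2.dll", "BECDEC71EE95A2F117A75A5F2BF961EED3EEC1A641D8A4573EC53AEE25696AE2")`, run from the folder the zip was extracted to.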

For those of you who would prefer building your own dll, or want to preview the changes, you can check out https://github.com/dotnet/aspnetcore/pull/52807 and build via build.cmd in /src/Servers/IIS/.

To use the dll below (assuming x64), replace the dll in C:\Program Files\IIS\Asp.Net Core Module\V2\ (keep the original around so you can switch back to the supported dll later).

There is also an optional config you can change to influence how slow (or fast) shutdown occurs if you're running in a slower environment and need more time for shutdown to avoid the 503s. The option is shutdownDelay which can be set in web.config. The default is 1000ms (1 second).

    <aspNetCore processPath="dotnet" arguments="myapp.dll" stdoutLogEnabled="false" stdoutLogFile=".\logs\stdout">
      <handlerSettings>
        <!-- Milliseconds to delay shutdown by, this doesn't mean incoming requests will be delayed by this amount, but the old app instance will shutdown after this timeout occurs -->
        <handlerSetting name="shutdownDelay" value="5000" />
      </handlerSettings>
    </aspNetCore>

Note: Use of this dll is purely experimental and at your own risk, if problems like crashes occur please let us know, so we can improve the final code. aspnetcorev2.zip (github doesn't allow uploading .dll so the dll is in a zip file)

luizfbicalho commented 11 months ago

I think that my problem is related to the https://github.com/HangfireIO/Hangfire/issues/1345

JuergenAuer commented 11 months ago

There is a crazy workaround (sorry, doesn't work, see edit).

I'm running a web application, 17 years, .NET-Framework, code updates or nightly recycling - never a problem. After a restart - the next call is slower, but it worked.

Now switched to .NET 8, only on the test system. Checking one page with a command line tool, recycle - 23 http status 503. If a user posts data, he has to do it again, that's not a solution. Slow is ok, but a 503 is deadly.

Found this topic.

Buggy configuration:

Root: E:\net-www + web.config, E:\net-www\bin with a lot of dll, web.config:

    <aspNetCore processPath="dotnet" arguments=".\bin\this_webserver.dll" stdoutLogEnabled="true" stdoutLogFile=".\logs\stdout" hostingModel="InProcess">
      <handlerSettings>
        <handlerSetting name="EnableShadowCopy" value="true" />
        <handlerSetting name="shadowCopyDirectory" value="E:/temp/ShadowCopyDirectory/" />
      </handlerSettings>
    </aspNetCore>

Now:

  1. Change the hostingModel to OutOfProcess. Result: Started command line fetch, recycle - no 503. First call is slow, then it's ok, no significant difference to InProcess.

  2. The dll are blocked, updates are not possible. No shadow copy. But:

  3. Created two subdirectories E:\net-www\bin1, E:\net-www\bin2, copy all files from E:\net-www\bin into bin1/bin2.

  4. If required, switch arguments=".\bin\this_webserver.dll" to arguments=".\bin1\this_webserver.dll" or arguments=".\bin2\this_webserver.dll". Or save these files somewhere and copy the file, so the process is recycled.

Conclusion: OutOfProcess waits instead of throwing a 503, bin1/bin2 is like a green-blue-configuration. IIS selects a new port (random).

Edit (some hours later): Sorry, that was a false positive. I rechecked and there are 503s again. Only 10, but still 503s. So this "workaround" doesn't really work.

ckobelski commented 10 months ago

@BrennanConroy we tried out the updated DLL in one of our testing environments and got really promising results. We set up a few constant streams of requests to one of our .Net 6 services hosted in IIS, and observed that when we recycled the app pool, the requests were queued by IIS and eventually served, with no 503s. This behavior exactly matches our .Net Framework services, both with Overlapped Recycle enabled or disabled (disabling Overlapped Recycle simply causes a longer queue time since the old process has to fully exit before the new one can start up). We didn't touch the shutdown delay setting, and IIS reliably queued the incoming requests without any 503s, so that default seems sufficient.

I don't think we would roll a DLL tweak like this out to production to really test the stability like you asked for, but I hope the other devs in this thread and the previous thread give it a try to see if it resolves the issue for them. I'm also hoping this fix can land as a patch for .Net 6 in addition to 8, so we can take advantage of it without having to rush our upgrades.

divil5000 commented 10 months ago

That's really encouraging to hear. Like you though I cannot afford to deploy this into production, unless there's a straightforward (and easily reversible) way of doing so. Brennan?

MV10 commented 10 months ago

@ckobelski thanks for posting that. If you don't mind, could you elaborate on your testing? Did you watch perf counters and see requests queued? Was the test designed to simulate an unhealthy old pool, for example, hung and no longer processing requests? Could you test other recycle scenarios like WAS automated recycles due to exceeding the private bytes threshold?

Unfortunately my org couldn't touch this (even for testing) unless it was officially supported. And per-app settings in web.config is a deal-breaker given that we support tens of thousands of apps, many of which are quite large and on busy shared servers, and could easily take more than 1000ms to spin up -- but it's nice to know a solution may be possible.

I assume a true fix would require IIS changes, I'm guessing this behavior is tied to managed-pool support within IIS itself, and perhaps this is the real reason MS has been unwilling to fix this properly.