aspnet / AspNetCoreModule

ASP.NET Core Module for IIS and IIS Express

App Warmup failed under stress if backend process failed to start #260

Open pan-wang opened 6 years ago

pan-wang commented 6 years ago

Reported from App Service that App Warmup sometimes fails for ASP.NET Core apps. So far we cannot repro on a local server. One possibility is how ANCM handles a dotnet.exe start failure, i.e., the child request it creates for the retry. Will change the retry mechanism to work without a child request.

nforysinski commented 6 years ago

I am seeing issues very consistent with this behavior: an ASP.NET Core 2.0 app hosted on Azure App Service. Occasionally, any app condition that results in a restart or recycle causes the app to stop responding and return 502.3 errors consistent with Kestrel timeouts. The app does not recover unless we manually restart the App Service, change the instance count, or perform a slot swap.

olitomlinson commented 6 years ago

We get 500s after a platform event has occurred, such as a restart or a new worker being assigned, and definitely not when the app is under load. The only way to resolve it is with a restart of the App Service.

jkotalik commented 6 years ago

@nforysinski @dotdestroyer if you guys have any logs (application or ETW), feel free to share with me at my personal email. We are investigating this issue.

nforysinski commented 6 years ago

@jkotalik I sent you some info, but wanted to ask whether you thought rolling back to a previous version of .NET Core could help. We are running on 2.0.1 and don't mind rolling back to 1.x until there is a fix. This problem is super disruptive to us (made worse by it being random and unexpected), so we wanted to see if there is any kind of workaround to get us through in the meantime.

Also wanted to share that all of the error logging in Azure is not helpful because it appears nothing is making it to our app pipeline. There are no event logs for the 502 that occurs, and there is nothing in App Insights for it. It's as if the requests never make it to our app at all, and the w3wp process keeps restarting over and over. This is often accompanied by extremely high thread counts.

pan-wang commented 6 years ago

It is not related to the ASP.NET Core version. The issue is that the AppWarmup module sends a fake request to warm up the application and waits synchronously for its I/O completion. For some reason the I/O callback never happens (it could be a leaked request in the ASP.NET Core Module, which is a global singleton not controlled by the app, or a bug in the I/O layer), and this causes the IIS worker process to hang. So far we cannot repro the issue on a local IIS server. The simple workaround is to remove the warmup module from your site (if your app does not use the AppWarmup feature) by using a site extension. We are working on a change to the ASP.NET Core Module to simplify the process start logic; it should reduce the chance of process start failure.

nforysinski commented 6 years ago

@pan-wang we are hosting on Azure App Service, so we don't have that level of control over IIS, unless I'm missing something.

pan-wang commented 6 years ago

App Service does allow users to tune IIS settings via a site extension. One of our CCS engineers did it for another customer. Let me ping him to get the whole solution and then update you. Please expect some delay due to the holiday.

pan-wang commented 6 years ago

How.to.disable.preload.docx

@nforysinski could you please try the steps in the attached doc to disable app warmup?

nforysinski commented 6 years ago

@pan-wang we set this up a few days ago and everything seems to be working much better. Several app restarts have come through fine, and the application hasn't gone into the 502 state we were seeing previously.

rollsch commented 6 years ago

Could you explain what preload is and why disabling it will likely fix this issue? What are the cons of disabling preload?

Also what is the warmup module. How would I know if I am using it?

nphmuller commented 6 years ago

@pan-wang : When I try to create the file applicationHost.xdt I get the following error:

409 Conflict: Could not write to local resource 'D:\local\Config\applicationhost.xdt' due to error 'Access to the path 'D:\local\Config\applicationhost.xdt' is denied.'.

Am I creating the file in the wrong folder? I assumed the file had to live next to the original (applicationhost.config).

Also, what is the easiest way to test whether preloading is disabled, so I can see if the workaround was implemented successfully?

veatcha commented 6 years ago

@nphmuller - The .xdt file needs to go at the site root - d:\home\site. If you look at the applicationhost.config after a restart, you should see the change reflected in that file.
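
If the transform applied, the application element for your site in applicationhost.config should carry the attribute. A hypothetical fragment of the transformed file (site name and paths are illustrative):

<site name="mysite" id="1">
  <application path="/" applicationPool="mysite" preloadEnabled="false">
    <virtualDirectory path="/" physicalPath="D:\home\site\wwwroot" />
  </application>
</site>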

davidebbo commented 6 years ago

@nphmuller See also https://github.com/projectkudu/kudu/wiki/Azure-Site-Extensions#debugging-private-extensions for more general info on using/debugging this xdt file.

@pan-wang can you update your Word doc to indicate where the xdt should be created?

pan-wang commented 6 years ago

@davidebbo I have updated the doc as you suggested. @rolandh The AppWarmup module (aka Application Initialization) is a native IIS module. It is useful when your managed application needs a long time to warm up (e.g., loading assemblies, setting up SQL connections, etc.) and you want it to serve a custom page during the warmup instead of just letting the client wait/spin. If you don't know about this module, most likely you are not using it.

sergei66666 commented 6 years ago

I'm facing a similar problem. I noticed that it happens after the server is restarted: not a restart of the application, but a restart of the server itself. If you look at the "System up time" field in Kudu (scm), it correlates perfectly with the period the site was inaccessible. To reproduce this I tried to restart the server, but I did not find a way to do it. Is there some way to do this, or maybe to simulate it?

Maybe it is linked with #226.

vdevappa commented 6 years ago

I see this too, but with a ~30 second delay for some calls. All method calls are definitely affected; I had to write a script that calls all my APIs once after a deployment so my users do not notice the slowness.

> We get 500s after a platform event has occurred, such as a restart or a new worker being assigned, and definitely not when the app is under load.

In my case, under Service Fabric, the issue goes away after the first call.

Azure ticket: 117121117302812

Note that for me this problem was not there with the 1.1 or 2.0 release. It started happening out of the blue around the time the ticket was opened; I may have updated all my NuGet packages to the latest versions around that time.

prejeanb commented 6 years ago

@davidebbo and @pan-wang Is there an easy way to script this applicationhost.xdt file addition that appears to resolve this issue? We have numerous ASP.NET Core APIs that can benefit from this, and it seems like this could be generalized and scripted.

davidebbo commented 6 years ago

@prejeanb you should be able to upload the file via the Kudu vfs API.
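
A minimal sketch of that upload in C# (the site name and credential handling are hypothetical; the VFS endpoint and the If-Match header are per the Kudu wiki, and the credentials are the site's deployment credentials). The site then needs a restart for the transform to take effect:

using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class UploadXdt
{
    static async Task Main()
    {
        var site = "mysite";                                       // hypothetical site name
        var user = "$mysite";                                      // deployment user
        var pass = Environment.GetEnvironmentVariable("KUDU_PWD"); // deployment password

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Basic", Convert.ToBase64String(Encoding.ASCII.GetBytes($"{user}:{pass}")));
        // Kudu's VFS API requires an If-Match header to overwrite an existing file;
        // "*" matches any ETag.
        client.DefaultRequestHeaders.TryAddWithoutValidation("If-Match", "*");

        // PUT the transform to d:\home\site\applicationHost.xdt.
        var response = await client.PutAsync(
            $"https://{site}.scm.azurewebsites.net/api/vfs/site/applicationHost.xdt",
            new StringContent(File.ReadAllText("applicationHost.xdt")));
        Console.WriteLine(response.StatusCode); // expect 201 Created or 204 No Content
    }
}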

prejeanb commented 6 years ago

Thanks @davidebbo. I created a PowerShell script that automates this using the Kudu vfs API. Note: it does require Azure PowerShell to automate the restarting of the web app.
aspnetcorefix.zip

xt0rted commented 6 years ago

This wasn't mentioned in the post above, but you can create a generic transform that can be applied to any site like so:

<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <system.applicationHost>
    <sites>
      <site name="%XDT_SITENAME%" xdt:Locator="Match(name)">
        <application preloadEnabled="false" xdt:Transform="SetAttributes(preloadEnabled)">
        </application>
      </site>
    </sites>
  </system.applicationHost>
</configuration>

The %XDT_SITENAME% value will be pulled from Kudu. There are a few other variables you can use as well, listed in the Kudu wiki.

prejeanb commented 6 years ago

@xt0rted - Thanks for sharing! That simplifies the PowerShell script some. Here is the simplified script... aspdotnetcorefixsimplified (2).zip

joshmouch commented 6 years ago

I just wanted to add my experience in case it helps in this. I'm working with a Microsoft Support employee who's helping me with the following exception that occurs periodically during slot swapping and during auto-scaling operations of my Azure App Service:

Message: Failed to bind to address http://127.0.0.1:6629: address already in use. Exception: Microsoft.AspNetCore.Server.Kestrel.Transport.Libuv.Internal.Networking.UvException

He suggested I try this solution. I haven't yet since I'm trying to understand how it applies, but maybe it'll help your investigation.

halter73 commented 6 years ago

@pan-wang I don't understand this issue, particularly the address-in-use case. Here's why:

  1. Any time Kestrel logs "Failed to bind to address http://: address already in use.", it should always throw an exception from KestrelServer.Start and in turn WebHost.Run.
  2. This exception should bubble up through Program.Main and crash the dotnet process.
  3. ANCM should observe this process crash and attempt to restart the dotnet process with a new random port which should have a very good chance at succeeding in most scenarios.
  4. If it doesn't succeed a second time, ANCM should retry up to 10 times before ultimately failing.

Do we have any idea at which of the above steps things are falling apart? Based on other reports I've seen, the logs don't support the idea that there are 10 process start failures in a row. Do we even know if the dotnet process is crashing?

I'm starting to wonder if a bunch of applications out there are swallowing exceptions from WebHost.Run/Start, which is leading to this strange behavior (a sketch of that anti-pattern follows at the end of this comment).

@JosephTremoulet Do you have any insight at which stage in the process crash/recovery cycle things are stalling?
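
A minimal sketch of the swallowing anti-pattern described above (hypothetical app code, assuming a typical 2.0-era Program.Main and the usual Startup class; not taken from any report in this thread):

using System;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public class Program
{
    public static void Main(string[] args)
    {
        var host = WebHost.CreateDefaultBuilder(args)
            .UseStartup<Startup>()
            .Build();

        // Anti-pattern: catching here swallows the "address already in use"
        // exception that KestrelServer.Start surfaces through host.Run(), so
        // the dotnet process stays alive with no listener. ANCM never observes
        // a crash (step 3 above) and never retries with a new port; every
        // request then times out behind IIS.
        try
        {
            host.Run();
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }

        // Letting host.Run() throw instead crashes the process, which is
        // exactly the signal ANCM needs to restart it on a fresh port.
    }
}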

rollsch commented 6 years ago

> I just wanted to add my experience in case it helps in this. I'm working with a Microsoft Support employee who's helping me with the following exception that occurs periodically during slot swapping and during auto-scaling operations of my Azure App Service:
>
> Message: Failed to bind to address http://127.0.0.1:6629: address already in use. Exception: Microsoft.AspNetCore.Server.Kestrel.Transport.Libuv.Internal.Networking.UvException
>
> He suggested I try this solution. I haven't yet since I'm trying to understand how it applies, but maybe it'll help your investigation.

I was also sent here by a Microsoft support employee. My problem occurs during instance changes of my Web App's hardware, which result in random "Access denied" errors when accessing all of my DLLs. This locks the app, since none of the DLLs can be accessed until the entire Web App is restarted (not just the process; the actual Web App environment must be stopped, then started again).

I've since done this and not seen the error again (it was sporadic and would happen twice a week, once a month, etc.), but I can't be sure it is fixed.

Can anyone comment on how this issue would relate to all of my .NET DLLs returning access denied when they are loaded? The case number is 117120117249053 if any other employees want to have a look.

pan-wang commented 6 years ago

The root cause is that the AppWarmUp module sends a fake request to start the IIS pipeline and waits for the I/O completion of that fake request. ANCM receives the fake request and tries to start the backend process. If the start fails the first time, ANCM creates a child request to try again. Somehow, the I/O completion for the parent/child request gets lost, and the AppWarmUp module holds the IIS pipeline forever. This error happens with the parent/child request pattern. I will simplify the start logic to abandon the child request in the new code.

joshmouch commented 6 years ago

@halter73 It sounds like you are saying this exception can be ignored. Although I would wonder why a random port would have a 99% chance of already being in use.

pan-wang commented 6 years ago

@rolandh if you hit access denied while loading your assemblies, you definitely have a different problem. Do you have any log showing this access denied error? @joshmouch ANCM will retry starting the application, as @halter73 mentioned. In a multitenant scenario, the chance of a port conflict gets higher as the number of hosted applications increases, especially since some applications may consume more ports, e.g., for SQL connections.

rollsch commented 6 years ago

@pan-wang yes, I have a massive amount of logs, memory dumps, and diagnostic information attached to my support case. Are you an MS employee who can access it? REQ: 117120117249053

pan-wang commented 6 years ago

@rolandh I don't have the permission to access your case. Could you please share some logs at OneDrive or shot me email at panwang"AT"microsoft"dot"com. I can take a look at it.

pan-wang commented 6 years ago

I now understand the root cause of this warmup issue. It seems ANCM did not handle the async return of the child request (which is created for retry purposes in case starting the Kestrel process fails). Will fix it soon.

FinVamp1 commented 6 years ago

@rolandh I've been analysing your data and discussed it with @pan-wang last night. The access-denied issue is different from the issue discussed in this thread, and as I understand it, it is not reproducing at the moment. If it happens again, dump files of both the worker process and the Kestrel process will help us get further. I'll work with the support engineer to loop me into your incident.

rollsch commented 6 years ago

I attached dump files of those processes to the case. Can you see them? Or do you mean if it occurs again?

Thanks for following this up, I appreciate it.

joshmouch commented 6 years ago

@pan-wang So this async fix will fix the port-in-use error? I still have a Microsoft ticket open on this case, and they're just giving me random suggestions at this point, so I'd like to close it if you think you have this figured out.

nforysinski commented 6 years ago

@pan-wang not sure if this is related, but we've started seeing another uptick of this issue even with preloadEnabled=false. For the past two days, some event has caused our API to become inaccessible, and the only fix we can reliably apply is a release and slot swap. Otherwise the app simply never responds to a request, and we get 502s (presumably the App Service is ending the request after the 2-minute threshold is reached).

pan-wang commented 6 years ago

@nforysinski do you have a FREB log?

sergei66666 commented 6 years ago

Guys, correct me if I'm wrong. The first thing we need to do is make the errors reproducible. In my experience (in 100% of cases), and the experience of some people in this discussion, one of the triggers is a hard reboot of the server. Maybe we should try to find a way to reboot it manually, so we don't have to wait for the system events that lead to the problem.

P.S. I'm not sure that my problem is related to warmup, but the behavior of my application looks the same.

pan-wang commented 6 years ago

@rolandh I looked at #2700. In your case, the Kestrel server hit an assembly loading issue. It is unrelated to the AppWarmUp issue, in which requests get blocked by the AppWarmUp module. @Tratcher and @davidfowl, do you have any idea about the possible root cause of "Could not load DiscountRules.CustomerRoles.dll Access is Denied #2700"?

pan-wang commented 6 years ago

@joshmouch in most cases the port-in-use error is ignorable, as the backend Kestrel exits on this error and ANCM retries on the next request. To fix the port-in-use issue, the new code will retry while serving the first request instead of waiting for a new request.

pan-wang commented 6 years ago

@sergei66666, could you please provide more info about your scenario? Is your application hosted on App Service?

sergei66666 commented 6 years ago

@pan-wang yes, my app is on App Service. After it rebooted (after an update, for example), I got a 502.5 error. I have logs, but they contain no errors. I can see my app in the Kudu panel, but I still get 502.5. Sometimes it starts working again after a plan or slot restart (and sometimes it doesn't). If I remember correctly, this started in November.

P.S. As I said, I do not know if this is due to warmup. Maybe it is another issue, but the symptoms look similar as far as I can judge.

joshmouch commented 6 years ago

@nforysinski and @pan-wang: I have about three different 502s from yesterday. Yesterday, after one restart, all requests returned 502 until a second restart. That doesn't sound exactly like what you're talking about, though. I have FREB logs, but I didn't want to throw too many things at pan-wang at once until I had this port issue fixed, and I didn't want to confuse this thread too much. ;) Also, I wanted to see if I could find the cause myself before posting here. I'll open up some tickets on these different issues.

joshmouch commented 6 years ago

@nforysinski I have two FREB logs here from yesterday showing your issue if I can post them somewhere.

nforysinski commented 6 years ago

@joshmouch @pan-wang shared his email earlier in the thread; I assume if you have FREB logs showing issues related to what I posted, you can send them over to him with a quick description that they are for an issue both of us are seeing.

I believe they are related to the issue that preloadEnabled=false is supposed to fix, as it has all the same symptoms: instance count changes do not affect the outcome, restarts are usually unsuccessful, it happens at random times throughout the day, and it is accompanied by a log showing a sudden, unexpected app shutdown. Only a slot swap with a deploy seems to resolve it this time.

The preload fix was working fine until two days ago, when at ~2pm our app stopped responding to requests, and again one day ago at ~1pm. Only slot swaps after a deploy could get it back working.

pan-wang commented 6 years ago

@nforysinski slot swaps will lead to a w3wp recycle. What do you mean by "stopped responding": 500 errors, or a hang? Did you notice anything abnormal about the memory usage of w3wp.exe or dotnet.exe?

nforysinski commented 6 years ago

@pan-wang after two minutes, a 502, which seems to indicate an App Service timeout rather than a .NET Core timeout. Looking back at our logs, there were no abnormal or unexpected rises in memory or CPU usage on the App Service at all.

joshmouch commented 6 years ago

@nforysinski @pan-wang I'm still not sure whether my problem is the same, but I just thought I'd chime in. When I see this timeout after two minutes, the FREB log shows a "2147954402" error in the AspNetCore module. For example:

| # | Event | Details | Time |
| -- | -- | -- | -- |
| 93 | MODULE_SET_RESPONSE_ERROR_STATUS (Warning) | ModuleName="AspNetCoreModule", Notification="EXECUTE_REQUEST_HANDLER", HttpStatus="502", HttpReason="Bad Gateway", HttpSubStatus="3", ErrorCode="2147954402", ConfigExceptionInfo="" | 06:48:58.315 |
| 94 | NOTIFY_MODULE_COMPLETION | ModuleName="AspNetCoreModule", Notification="EXECUTE_REQUEST_HANDLER", fIsPostNotificationEvent="false", CompletionBytes="0", ErrorCode="The operation completed successfully. (0x0)" | 06:48:58.315 |

pan-wang commented 6 years ago

@joshmouch based on the log and the error code 2147954402 (WININET_E_TIMEOUT), I would say you hit a different issue than warmup. In your case, ANCM forwarded the request to the backend, but the backend never responded. You may want to turn on the stdout log in web.config to see what happened at the backend; in the past I saw one similar case where there was a deadlock in the backend during startup.
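
For reference, turning on the stdout log means setting stdoutLogEnabled on the aspNetCore element in web.config. A sketch with illustrative paths and app name (ANCM appends a timestamp and the process ID to the stdoutLogFile prefix):

<configuration>
  <system.webServer>
    <handlers>
      <add name="aspNetCore" path="*" verb="*" modules="AspNetCoreModule" resourceType="Unspecified" />
    </handlers>
    <!-- On App Service, \\?\%home%\LogFiles\stdout writes the logs under d:\home\LogFiles -->
    <aspNetCore processPath="dotnet" arguments=".\MyApp.dll"
                stdoutLogEnabled="true" stdoutLogFile="\\?\%home%\LogFiles\stdout" />
  </system.webServer>
</configuration>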

nforysinski commented 6 years ago

@pan-wang I just emailed you an example of a failed request that occurred during this time. Spoiler alert: I am seeing the exact same error code as @joshmouch, but it occurred this morning at two different times about 10 minutes apart. After several interventions the site was able to recover. Again, this bears much resemblance to the warmup issues we were seeing before. I have turned on stdout logging to see whether any errors are reported at the moment they occur.

nforysinski commented 6 years ago

A little more info: in our event logs, every time this occurs, we see several events like this:


<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:25Z"/>
    <EventRecordID>32770765</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D85A22C</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>
<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:25Z"/>
    <EventRecordID>32770781</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D85A22C</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>
<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:26Z"/>
    <EventRecordID>28674375</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D858F1B</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>
<Event>
  <System>
    <Provider Name="W3SVC-WP"/>
    <EventID>2303</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>Keywords</Keywords>
    <TimeCreated SystemTime="2018-02-06T22:30:26Z"/>
    <EventRecordID>28674390</EventRecordID>
    <Channel>Application</Channel>
    <Computer>RD00155D858F1B</Computer>
    <Security/>
  </System>
  <EventData>
    <Data>api__ed8f</Data>
    <Binary>02000780</Binary>
  </EventData>
</Event>

pan-wang commented 6 years ago

This event is W3_EVENT_APPLICATION_PRELOAD_ERROR_GENERIC, which is emitted by the warmup module. You definitely hit some warmup issue. I am not sure whether this is because the warmup request timed out, since ANCM never got a response from the backend. The stdout log will help.

