App fails to start when Azure App Service changes underlying VM.

hallatore commented 7 years ago

We have multiple app services on the same service plan. When the underlying VM is switched out for updates the apps sometimes fail to start.

The website will then timeout if you try to access it.

I see this in the eventlog.

Application 'MACHINE/WEBROOT/APPHOST/APP...' with physical root 'D:\home\site\wwwroot\' 
created process with commandline 'D:\home\site\wwwroot\App.exe ' 
but either crashed or did not reponse or did not listen on the given port '20236', 
ErrorCode = '0x800705b4'

We see two problems here.

When Azure App Services are switched to a new VM (Computer name just changed in eventlog) they sometimes hang. Only way to recover is to click Stop and Start the App Service in the Azure Portal.
When the app fails to start, IIS AspNetCore Module does not recover/retry. Causing the app to stay broken until manually stopped and started.

Extra info

We are using latest .net core 2.0.
We have multiple apps per App Service Plan.
We only see this problem when Azure switches out the machine hosting the app service plan.

shirhatti commented 6 years ago

@davidebbo Have you seen this before?

davidebbo commented 6 years ago

@shirhatti I don't think so. In theory, switching to a new VM should not be much different from doing a full site restart on the current VM, which cleans up all the files and starts new (unlike Stop then Start, which is lighter weight).

shirhatti commented 6 years ago

Do you have any hypothesis on what could be happening here? Based on what you said it should be no different than a restart as seen by us.

davidebbo commented 6 years ago

I can't say that I do. But to investigate, we should try to find out whether the app.exe process is even starting at all. And maybe add more logging to app.exe so we can understand how far it got?

The message "either crashed or did not reponse or did not listen" leaves a lot of possibilities :)

hallatore commented 6 years ago

I have a hypothesis. Don't have much proof though..

We have quite a few apps (22) on the service plan. Two of them are heavy that each use 30% of the memory. The rest are small. In total the app service plan idles around 70-80% memory usage.

When the underlying VM is swapped out all the apps need to start up at about the same time on the new VM. This again causes a memory spike and some of the apps fail to properly start.

This would answer why we only see this when the underlying VM is swapped. Since that is the only time all the app services are started at the same time.

Even if that is the cause I'm not sure what exactly causes the app service to get stuck and not recover. When an app gets broken it will respond with a 503 after about 116 seconds, and I need to manually stop and start to fix it.

davidebbo commented 6 years ago

@hallatore that is plausible. And as far as you know, it never recovers on its own, however long you wait?

shirhatti commented 6 years ago

Building on @hallatore's hypothesis, it could be a port contention issue. Thoughts @pan-wang?

davidebbo commented 6 years ago

There are really two questions here:

Why does it fail?
Why does it not recover?

We have some theories for 1, but ANCM should be built to be resilient to this and retry clean on subsequent requests, allowing it to recover.

hallatore commented 6 years ago

I was thinking it had something to do with the ports too. But there are only a few of the 22 apps that use core. And I didn't find any ports getting reused in the other eventlogs. Might still be a possibility though.

Memory makes sense since stuff has tendency to behave strange when you ran out of it.

I think there is another question here as well.

Why does it fail?
Why does it not recover?
Why does it not know that it failed? (There is no indication that the app is down, other than the timeout if you try to access it)

hallatore commented 6 years ago

@davidebbo When an app get stuck it never recovers until you stop and start the app service. The app service seems to think everything is running just fine.

davidebbo commented 6 years ago

Now the question for @pan-wang is whether it keeps trying to launch app.exe and keeps failing, or does it get into a state where it doesn't even try.

pan-wang commented 6 years ago

For each request, ANCM does try to start the application twice. It also try again on following requests until it reaches the threshold which is configurable. It then enter fast failure protection mode and will not try to start the application for certain time.

On 2017-10-30 17:05 , David Ebbomailto:notifications@github.com Wrote:

Now the question for @pan-wanghttps://github.com/pan-wang is whether it keeps trying to launch app.exe and keeps failing, or does it get into a state where it doesn't even try.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/aspnet/AspNetCoreModule/issues/226#issuecomment-340620863, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ARVK23r71XLjeKkD2jeP7Vc0EJyvC9cMks5sxmRbgaJpZM4QK4Az.

hallatore commented 6 years ago

@davidebbo @pan-wang

I managed to reproduce it by making a core project that uses about 500MB of memory at startup, and then adding clones of the same project to the app service plan.

I started them all att once, and all the apps failed to load with a 503 error. When I stopped 5 out of 6 the last one didn't recover until I manually stopped and started it.

What logging should I turn on, and what other stuff should I check?

hallatore commented 6 years ago

Added 5 app servivces with the same project that uses some memory at startup.
Started each app individually. All loaded up fine.
Marked all 5 app services, stopped and then started them at the same time.
3 out of 5 apps came up. The two others give me HTTP Error 503. The service is unavailable. after about 2 minutes.
I see no indication that the affected app service is not working. On the dashboard everything looks fine.
Stopped and started app 1. This failed to start.
Stopped both affected apps.
Started app 1. Started just fine.
Started app 2. Started just fine.

On step 6 I have a feeling that the other stuck app was using enough memory to keep the other app from starting. Had to stop both to get the first one to start.

I don't find it strange that I can't use more memory than what is available. The problem is that it doesn't recover in time, and that I get no indication that anything is wrong. The affected app service seems to think the underlying app.exe is running just fine.

hallatore commented 6 years ago

Further notes from testing. Each app holds on to the large chuck of memory for about 40 seconds, but GC should reduce it after that.

The sum of the apps startup memory is more than what the app service plan has available. Starting one and one app works fine since the sum is then below the threshold.

When I start all 5 apps at the same time:

The results are not consistent. Sometimes 2 fail, sometimes 4, and one time all booted fine after a while.
When 4 out of 5 got stuck, clicking restart on one app caused that one to boot up just fine. (3 still stuck)
Of the 3 apps still stuck 2 suddenly worked after a while (10 minutes maybe), but one remains stuck still after 30 minutes.
There is no indication in the portal that anything is wrong. And the app service plan shows only 40% memory usage.

davidebbo commented 6 years ago

@hallatore nice job with these experiments!

Could you get a test site in that bad state and share its name? Also, make sure it has debug turned on in web.config (should look like stdoutLogEnabled="true" stdoutLogFile="\\?\%home%\LogFiles\stdout").

hallatore commented 6 years ago

@davidebbo Hmm, maybe it's still port related.

https://gist.github.com/hallatore/f3b021727e4e29ca05f67f1fe42b4fc7

Notice these lines. Seems it doesn't understand it failed to start properly.

Unable to bind to http://localhost:23649 on the IPv6 loopback interface: 'Error -4090 EADDRNOTAVAIL address not available'.
Now listening on: http://localhost:23649
Application started. Press Ctrl+C to shut down.

I left the app in a stuck state here: http://appservicetest-2.azurewebsites.net/ (Should time out after 120 seconds)

davidebbo commented 6 years ago

I also see this:

System.IO.IOException: Failed to bind to address http://127.0.0.1:23896: address already in use. --->

@pan-wang could it be that ANCM picks the port early on, and never tries a different one? So if starting many apps at once causes a conflict, it will never recover since another app will forever use the conflicting port.

How are you picking an available port anyway?

pan-wang commented 6 years ago

ANCM first call Windows API to get and free port. If this Windows API is blocked (e.g., on App Service), ANCM will pick a random port using random number generator. If application failed to start, ANCM will retry with different port. ANCM will log the start failure to Windows event log with details showing the app path, assigned port, commandline and hresult.

pan-wang commented 6 years ago

it has the possibility that concurrent applications may be assigned same port number as the seed of random number generator is current time. Windows event log should show whether there were retries

hallatore commented 6 years ago

I tried making the apps not use to much memory and I'm having a hard time making them freeze. On first try one froze, but the last 4 tries none of the apps failed to load.

hallatore commented 6 years ago

@pan-wang I see a lot of Failed to bind to address http://127.0.0.1:*: address already in use. by using stdoutLogFile. But the apps seem to recover from this after a while. It only seem to cause a longer startup time for the app.

I don't see much of that in the eventlogs though. So it seems to be a collision when all start at the same time, but it doesn't seem to be whats causing them to get stuck.

Not sure if the port issue is related, or just a bug that is somewhat unrelated to the root problem.

It still seems to only happen when the memory is used up while booting up all the apps.

hallatore commented 6 years ago

@pan-wang @davidebbo I think I got fooled by the log.

2017-10-31 23:12:00.323 +00:00 [Warning] Microsoft.AspNetCore.Server.Kestrel: Unable to bind to http://localhost:3836 on the IPv6 loopback interface: 'Error -4090 EADDRNOTAVAIL address not available'.
warn: Microsoft.AspNetCore.Server.Kestrel[0]
      Unable to bind to http://localhost:3836 on the IPv6 loopback interface: 'Error -4090 EADDRNOTAVAIL address not available'.
Now listening on: http://localhost:3836

This is from when I started up just one app. I think it only says it failed to bind that port on IPv6 loopback interface, and not that the port was taken in any way. This error will come up in the log stream every time.

If that is true then it's probably not a problem with ports. So I think we are back to the original idea that something gets stuck in limbo if the VM runs out of memory.

Any ideas how I can see what gets stuck?

shirhatti commented 6 years ago

@hallatore Unable to bind IPv6 is a benign warning. Kestrel still bound on IPv4. Azure App Service does not allow you to bind to IPv6 so you will see that warning even during normal operation

pan-wang commented 6 years ago

@hallatore "HTTP Error 503. The service is unavailable". ANCM will not generate except app_offline scenario which is not the case here. Could you please turn on IIS FREB log so that we can figure out the source of the 503 error.

hallatore commented 6 years ago

@pan-wang @davidebbo I got this internally.

HTTP Error 502.3 - Bad Gateway There was a connection error while trying to route the request.

Module: AspNetCoreModule Notification: ExecuteRequestHandler Handler: aspNetCore Error code: 0x80070032

Here is the trace: https://i.imgur.com/76ow2yw.png

I also started a profiling and got a profile_679e7a_IISProfiling_dotnet_13136.diagsession file. Where should I send this?

hallatore commented 6 years ago

I simplified the test case and manage to reproduce it every time.

You only need a single app that uses more memory than available at startup. It will cause the same problem as before.

The request will load for about 2 minutes and result in a 502 - Web server received an invalid response while acting as a gateway or proxy server.
Eventlog will show this: Application 'MACHINE/WEBROOT/APPHOST/APPSERVICETEST-1' with physical root 'D:\home\site\wwwroot\' created process with commandline 'dotnet .\AppServiceTest.dll' but either crashed or did not reponse or did not listen on the given port '25143', ErrorCode = '0x800705b4'
No signs in the portal that anything is wrong.

Here is the code I added to crash it on a small instance.

public class Program
{
    public static void Main(string[] args)
    {
        var memory = new List<string>();

        for (var i = 0; i < 10000000; i++)
        {
            memory.Add("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890" + i);
        }

        BuildWebHost(args).Run();
    }

    public static IWebHost BuildWebHost(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            .UseStartup<Startup>()
            .Build();
}

Earlier questions:

Why does it fail? It runs out of memory
Why does it not recover? (most likely one of the following) a. It doesn't have enough memory b. It doesn't know that it failed
Why does it not know that it failed? (There is no indication that the app is down, other than the timeout if you try to access it)

Expected behavior would be for the app to get stuck in a loop starting over and over again, but with logging that says it ran out of memory. The portal should also display this and create an alert that the app crashed with out of memory exception.

hallatore commented 6 years ago

Azure App Services is happy to report everything is running normally. :)

pan-wang commented 6 years ago

Based on the event log "Application 'MACHINE/WEBROOT/APPHOST/APPSERVICETEST-1' with physical root 'D:\home\site\wwwroot\' created process with commandline 'dotnet .\AppServiceTest.dll' but either crashed or did not reponse or did not listen on the given port '25143', ErrorCode = '0x800705b4'", it seems that ANCM started backend process but the later did not response to the ping (i.e., ANCM try set up a tcp connection with the backend) with default process start timeout (2 minutes). Port conflict is not the root cause as it will fail much early. ANCM should terminate the backend process and try restart the application later which may still hit run-out-of-memory and then timeout error. You should be able to see more error events later. If it is not the case, please let me know. I think since IIS worker process still runs correctly, Portal reporting may be fooled by it.

hallatore commented 6 years ago

@pan-wang The error above are the only ones I see. stdoutLogFile seems to fail when running out of memory, so that one is empty. And only one entry in the eventlog.

pan-wang commented 6 years ago

Interesting. It seems both ANCM and backend application ran abnormally. Just curious, have you try the same test on a local server? The behavior may be related with how the resource control is enforced. On local server, once the memory threshold is reached, a worker process recycle may be triggered. I will do some investigation and add a test case for new ANCM release. As a workaround (you may have already taken), you can change you application and deployment to avoid such out-of-memory situation.

hallatore commented 6 years ago

@pan-wang The problem is reproducible on any Azure App Service. I haven't tested it on a local server since I don't have any available atm.

We moved some applications to make sure we should't run out of memory in the future.

hallatore commented 6 years ago

@pan-wang: I tested what would happen if I ran out of memory while the app was running.

I just made it so that each requests takes up more and more memory until it's full by filling a list. From the user perspective the last requests takes a bit longer to complete, because internally the app is restarting.

I only see these two in the eventlog:

Sent shutdown HTTP message to process '532' and received http status '202'.
Application 'MACHINE/WEBROOT/APPHOST/APPSERVICETEST20171103043004' started process '1900' successfully and is listening on port '25897'.

I can't find any indication anywhere in the logs that the app crashed from a memory problem.

My biggest concern is that if a core app runs out of memory there will be no logs telling the owner what actually happened. The app will just restart, or worse get stuck in limbo.

Here is the trick I used to test it.

public void Configure(IApplicationBuilder app, IHostingEnvironment env)
{
    var memory = new List<string>();

    app.Run(async (context) =>
    {
        for (var i = 0; i < 500000; i++)
        {
            memory.Add("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890" + i);
        }

        await context.Response.WriteAsync("Hello World!");
    });
}

hallatore commented 6 years ago

This might be a contributing factor. Not sure if it creates a notification in the portal if it's invoked. I tried disabling it with the app setting. Logs didn't change but the timeout took longer than before and I got a timeout instead of a 502.

https://blogs.msdn.microsoft.com/appserviceteam/2017/08/17/proactive-auto-heal/

With and without there are nothing telling me that memory is a problem, or that the app is having problems.

pan-wang commented 6 years ago

It's hard for ANCM to capture and report such memory issue as ANCM itself is also a victim of out of memory and capturing the error info in another process is challenging. I agree that the user experience should be improved. Two things may be improved 1) IIS/AppService should recycle the worker process and log an event with recycle reason. IIS does it on full server. Not sure about AppService environment. 2) ANCM should enable stderr log by default to capture the error thrown by the backend service. For 2), I will try to push it into ANCM. For 1), @davidebbo could you please help to check whether AppService has this support?

hallatore commented 6 years ago

As a user I would at least like to know that something went wrong. And it would be nice to get a hint that it was an OutOfMemoryException.

Is there something that can be done with dotnet to properly log/pipe an OutOfMemoryException back to ANCM?

Maybe @davidfowl has some input?

davidebbo commented 6 years ago

The 'Diagnose and solve problems' page on the portal does have a number of tools to help diagnose memory and other issues.

hallatore commented 6 years ago

@davidebbo A lot of useful information there! :)

I see that the memory was high. But App Restart Analysis was rather empty. Might be since ANCM never crashed, just the dotnet process?

If ANCM can report the dotnet crash then the analysis could probably correlate it with the high memory usage.

aspnet / AspNetCoreModule

App fails to start when Azure App Service changes underlying VM. #226