Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License

Random "The service is unavailable." and "Azure Functions runtime is unreachable" errors #8583

Open Arjan321 opened 2 years ago

Arjan321 commented 2 years ago

We have been running a couple of Azure Functions in various subscriptions, and every once in a while (about once a week), the entire Azure Function goes down and reports "The service is unavailable." when accessing the Function-app via HTTP and reports "Azure Functions runtime is unreachable" in the Azure Portal.

HTTP response: image

Portal: image

The issue appears randomly, without any changes in our end (no deployment, etc.) and also resolves randomly without any interaction from our side.

In the "Activity log" an error is written for the job "Sync Web Apps Function Triggers" with status "Failed":

        "statusCode": "BadRequest",
        "statusMessage": "{\"Code\":\"BadRequest\",\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\",\"Target\":null,\"Details\":[{\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\"},{\"Code\":\"BadRequest\"},{\"ErrorEntity\":{\"Code\":\"BadRequest\",\"Message\":\"Encountered an error (ServiceUnavailable) from host runtime.\"}}],\"Innererror\":null}",
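
The `statusMessage` value in that log entry is a JSON document stored inside a JSON string, so it has to be decoded a second time before the inner code and message can be read. A minimal sketch of that decoding, assuming the string has already been extracted from the Activity log entry; the helper name `summarize_status_message` and the trimmed sample string are illustrative, not part of any Azure SDK:

```python
import json

# Trimmed copy of the "statusMessage" value from the Activity log entry above.
# The value is itself JSON stored as a string, so it is decoded separately.
status_message = (
    '{"Code":"BadRequest",'
    '"Message":"Encountered an error (ServiceUnavailable) from host runtime.",'
    '"Target":null,'
    '"Details":[{"Message":"Encountered an error (ServiceUnavailable) from host runtime."},'
    '{"Code":"BadRequest"}],'
    '"Innererror":null}'
)


def summarize_status_message(raw: str) -> dict:
    """Decode the nested JSON string and pull out the top-level code and message.

    Hypothetical helper for illustration only.
    """
    inner = json.loads(raw)
    message = inner.get("Message") or ""
    return {
        "code": inner.get("Code"),
        "message": message,
        # True when the message says the host runtime reported ServiceUnavailable
        "host_unavailable": "ServiceUnavailable" in message,
    }


summary = summarize_status_message(status_message)
print(summary["code"], "|", summary["message"])
```

The outer `BadRequest` only describes the failed sync job; the `ServiceUnavailable` in the inner message is what matches the HTTP response seen by clients.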

The issue seems quite similar to #8519, however we are running Linux.

This is causing quite some problems, since we are no longer able to provide reliable service to our end-users.

Investigative information

Repro steps

None that we can find

Expected behavior

Always work

Known workarounds

None

Related information

Hosting Model: Consumption Plan
OS: Linux
Version: V4
Hosting Model: In-Process
Language: C#/dotnet6
Configuration:

wbail commented 2 years ago

Hi Arjan321,

Normally this error message is related to a misconfiguration.

Please take a look at the bullets below:

Arjan321 commented 2 years ago

As mentioned in the original issue, those settings point to valid values.

Given the extreme randomness of both the issue appearing and resolving itself, I highly doubt any setting on our end is responsible.

gilesmatthews commented 1 year ago

Hi, is there any update on this issue? I am seeing the same random service unavailable issues. Thanks

rlucassen commented 1 year ago

Same issue here, completely random. Waiting on a fix for a while now.

LuckyLub commented 1 year ago

Sorry for you guys, but I'm happy to read this... thought I was going crazy. What region do you guys have your Azure Functions running? Ours are in West Europe.

rlucassen commented 1 year ago

> Sorry for you guys, but I'm happy to read this... thought I was going crazy. What region do you guys have your Azure Functions running? Ours are in West Europe.

Mine are running in West Europe as well, same for @Arjan321

tomabg commented 1 year ago

same region West Europe

tomabg commented 1 year ago

this also breaks terraform deployment...as we are on test only we will try different region soon

rlucassen commented 1 year ago

> this also breaks terraform deployment...as we are on test only we will try different region soon

Curious to know if other regions do work.

Heard that Microsoft is doing updates in the West Europe region around next week

LuckyLub commented 1 year ago

We also deploy with Terraform BTW. Changing the location to Central US seems to do the trick.

jnekrasov commented 1 year ago

Any news on this?! We are experiencing the same problems in West Europe region

Ralle1986 commented 1 year ago

Also seeing issue in West Europe region

balag0 commented 1 year ago

Sorry for the delay in responding. Yes, there was a regression due to a recent update in the region which caused intermittent errors. The fix rollout has already started and is currently in progress. If any apps are still experiencing failures, could you please share the details so we can double-check them? Thanks

LuckyLub commented 1 year ago

Before we move back, can you please confirm when the roll-out is completed?

lightwaver commented 1 year ago

Any news on this topic, or a link to the issue to see if it's solved?

LuckyLub commented 1 year ago

@balag0 any update?

LuckyLub commented 1 year ago

@surgupta-msft, @balag0, any updates? Just tried to redeploy to West-Europe, still running into random "The service is unavailable." errors. Running the same functions in North-Europe works just fine.

LuckyLub commented 1 year ago

Just had contact with Azure's Help + Support. I presented the problem, it seemed to be known. However, they will collect some logs from my Azure Functions and look further into it. They are still working on upgrading the services in West-Europe. Currently it was advised to use another region.

jeroenvermunt commented 1 year ago

I can confirm that it is still occurring

Rutix commented 1 year ago

We have had these problems too. We have been in contact with Azure Support and they are saying the following:

" Conclusion: The 503s were detected ONLY on Azure Front End, the Front End instances encountered some unexpected error at that moment and weren’t able to handle the http requests and distributed them to specific workers that were hosting your function app.

.....

Let’s take a closer look at these 2 Front End instances at that time, Front End instance 24 encountered an error when trying to get a worker from the data role, and the same situation for the instance 5.

image image

Unfortunately, this is an underlying platform issue as the Front End is an important component inside of Azure platform, and both the user side and our side cannot have action to interact with it, we apologize for all the inconvenience caused, but please rest assured that I’ve already reflected this to the Microsoft Azure Team and they already know this, we met this kind of issue before. "

^ This was several months ago now. The fact that these problems are still popping up is disappointing. Telling us to move to a different region is also disappointing. We are bound by compliance requirements, so we can't leave our region that easily.

rlucassen commented 1 year ago

Last Wednesday we got a message from Microsoft that this issue should be fixed. We had already transferred to a premium plan because we got tired of it after 4 months. I'm curious to know if anybody is still encountering this issue in the West-Europe region?

LuckyLub commented 1 year ago

So using premium is a valid workaround?

rlucassen commented 1 year ago

> So using premium is a valid workaround?

Yes upgrading to premium worked for us

Arjan321 commented 1 year ago

> So using premium is a valid workaround?

If "Throwing money at the problem" counts as a workaround, then yes. This ticket is specifically about the Consumption Plan.

LuckyLub commented 1 year ago

Just good to know what options are out there.

LuckyLub commented 1 year ago

Message from MS:

Product Group is still working on improvement on the west Europe region. Will keep you updated on any progress.

Rutix commented 1 year ago

I got some messages that we got hit again. We are not entirely sure if it was the same cause as this issue but we had "Service unavailable" a couple of times today in west europe.

rdvansloten commented 1 year ago

@Rutix I happened upon this thread as well this afternoon, also from The Netherlands, having Functions (Consumption tier) in West Europe. They were unavailable and/or throwing SSL errors. I also could not deploy from VS Code (unavailable)

What fixed it for me is going into the Portal and Restarting my Functions manually.

LuckyLub commented 1 year ago

That’s a temporary fix.

Rutix commented 1 year ago

@rdvansloten sadly that is only a temp fix and also doesn't always work (or takes a long time to recover). The best way would be for the Azure team to fix this asap.

rdvansloten commented 1 year ago

@Rutix yeah it is, but it's better than "nothing" if you're using this in Prod. It keeps happening for me and a reboot is a fast fix for now. Did you put in a support ticket?

I get this diagnostic when creating a ticket:


Host Runtime instance (Dynamic Plan) was not available for a long time period (> 15 minutes)
Description 
Function was running on 0 worker instance for more than 825 minutes between 10/26/2022 9:20:00 PM and 10/27/2022 11:10:00 AM.

Possible Cause: 
Function App was offline due to previous deployment

Please restart the Function site manually or redeploy the function site to get the issue mitigated or use the [AppOffline History](https://applens.azurewebsites.net/subscriptions/865f86e6-0a9a-4c2f-8742-ce207e509dad/resourceGroups/myRG/sites/myApp/detectors/FunctionAppOfflineHistory?startTime=2022-10-26T11:25&endTime=2022-10-27T11:10) detector to check if the function was offline during this period.

Please visit [Azure App Service Deploy task](https://docs.microsoft.com/en-us/azure/devops/pipelines/tasks/deploy/azure-rm-web-app-deployment?view=azure-devops) for more information. 

Function app site was Stopped or disabled

Please use the [Web App Restarted](https://applens.azurewebsites.net/subscriptions/865f86e6-0a9a-4c2f-8742-ce207e509dad/resourceGroups/myRG/myApp/detectors/webapprestart?startTime=2022-10-26T11:25&endTime=2022-10-27T11:10) detector to check if the function was stopped or disabled during this period.

Apparently it was running on 0 worker instances for 825 minutes? Odd.
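
As a quick sanity check on that diagnostic, the window between the two quoted timestamps can be computed directly. This is only a sketch: the timestamps are copied from the diagnostic text, and the US-style month/day/year format string is an assumption about how the detector prints them.

```python
from datetime import datetime

# Timestamps copied from the diagnostic text above.
# Assumption: they are printed as month/day/year with a 12-hour clock.
FMT = "%m/%d/%Y %I:%M:%S %p"
start = datetime.strptime("10/26/2022 9:20:00 PM", FMT)
end = datetime.strptime("10/27/2022 11:10:00 AM", FMT)

# Length of the window during which 0 worker instances were reported
outage_minutes = int((end - start).total_seconds() // 60)
hours, minutes = divmod(outage_minutes, 60)
print(f"{outage_minutes} minutes ({hours} h {minutes} min)")  # 830 minutes (13 h 50 min)
```

That is consistent with the detector's "more than 825 minutes" wording: the figure describes how long the app had no worker instance at all, which is close to 14 hours.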

Rutix commented 1 year ago

> @Rutix yeah it is, but it's better than "nothing" if you're using this in Prod. It keeps happening for me and a reboot is a fast fix for now. Did you put in a support ticket?

Indeed it's better than nothing :) but would be better if Microsoft just fixes the problem. And yea we have had support tickets open. You can see the answer in one of my earlier comments. They know it's a known issue

LuckyLub commented 1 year ago

Another update from MS, they claim it has been fixed!

The Microsoft Azure Team has investigated the issue you reported on your devOps pipeline that resulted in errors of your consumption apps becoming unavailable. This issue was found to be related to an issue within the capacity restriction in the region.

Your consumption app was placed in a region with not enough capacity to fulfill all of the traffic. Because the West Europe region has one of the highest traffic for our product, we had to request an increase in the capacity through our partner team. After increasing the capacity for the region, the region was able to handle the high traffic and the availability issue was resolved.

We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to) ensuring high availability throughout the globe.

We apologize for any inconvenience.

stooone commented 1 year ago

Happened to me yesterday and today too.

rdvansloten commented 1 year ago

We moved our Functions to North Europe. EUW is a damned mess right now. I couldn't deploy a new Function in EUW because capacity was mysteriously gone.

hegde89 commented 1 year ago

We are also experiencing the same with EUW. Is there any update?

Rutix commented 1 year ago

@rdvansloten been a couple of days, has North Europe treated you better? If so we will also make the efforts to move to North Europe

jeroenvermunt commented 1 year ago

@Rutix Our function runs at low volume (10 times a day), but since we moved to NE a few weeks ago I have yet to run into this issue

rdvansloten commented 1 year ago

> @rdvansloten been a couple of days, has North Europe treated you better? If so we will also make the efforts to move to North Europe

Not a single hitch since I put everything in North Europe.

Rutix commented 1 year ago

We have experienced the problem again in West Europe. We bit the bullet and moved the function to North Europe, and that seems to help for now. So it really seems to be a region issue, as suggested earlier in the thread. We opened an Azure Support issue again because it's becoming kinda ridiculous how long this is taking Microsoft to solve.

akselikap commented 1 year ago

Also banging my head against the wall with this issue.

stevengaaa commented 1 year ago

We experienced the same problem in Australia East today, for 4 hours starting at 4 am UTC. This was on Linux, Consumption tier; Windows-based functions are running OK, and apps on the Premium plan are OK too.

Rutix commented 1 year ago

We contacted Azure Support and this is the message they gave back:

Upon further checking, there is an emerging issue ongoing which is impacting a lot of Function apps on Linux Consumption plan in West Europe region. Unfortunately, the stamp that your app is hosted on is impacted.

Our Product Group team is actively working on this issue right now and once there is any further update, I will let you know as soon as possible.

Please accept our sincere apologies on the inconvenience this issue has caused. Typically, when similar issue occurs, as we have backend algorithm to detect and auto heal such issues, your app should be recovered within 0.5 hour. That is why normally issue like this should be a "one-time issue".

However, this time it is an emergent issue on the platform side which cannot be mitigated by our backend automatically, and the corresponding team is needed to be engaged to fix the issue. We will keep you posted over the progress.

If the issue is business critical and you will need the app to be working immediately, we will recommend temporarily moving to a dedicated plan as the apps hosted in dedicated plans are using a different set of stamps (worker machines) and are not impacted by this issue.

Hope the above could be helpful to you. If you have any further question or concern, please feel free to let me know anytime. We are always here to help.

bcdunbar commented 1 year ago

I haven't read through all the above but dropping a note here in case it is useful for others.

We experienced this random error yesterday (Australia Southeast): the runtime was unreachable, but the Function App was running and reporting as healthy. We checked storage accounts, rotated keys, and restarted the app, but none of that worked. In the end, we migrated v3 to v4 locally, confirmed it was working, and then used a forced deployment to resolve the issue (as regular deployment was failing).

Migration: https://learn.microsoft.com/en-us/azure/azure-functions/migrate-version-3-version-4?tabs=net6-in-proc%2Cazure-cli%2Cwindows&pivots=programming-language-python

Force deployment:

        func azure functionapp publish <FUNCTION-APP-NAME> --force --verbose

Perhaps the migration is not needed and a forced redeployment alone would resolve the issue. A caveat: this is likely a temporary fix, and there is no indication of what the root cause was, given there had been no changes to that service for a month.

Rutix commented 1 year ago

We received another update from Azure support yesterday:

Summary of Impact: Between 01:19 UTC and 10:31 UTC on 14 Dec 2022, you were identified as a customer using App Service in West Europe who may have received intermittent HTTP 500-level response codes while creating the resource, experienced timeouts or high latency when accessing App Service (Web, Mobile, and API Apps), App Service (Linux), or Function deployments hosted in this region.

Preliminary Root Cause: We determined a backend service experienced failures due to a disabled component resulting in the above failures.

Mitigation: We manually enabled the component of the backend service to mitigate the issue.

Next Steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

Once a full root cause analysis is available, I will let you know as soon as possible. Thank you again for your patience and understanding.

apoorvmintri commented 1 year ago

I've just started to experience this in East US; unable to resolve for the past 2 hours. Any help would be really appreciated!

Edit: Forgot to mention that this was on P1 Plan (.NET 6, Linux, Code deployment i.e. no docker)

Update: Went for lunch, came back and it works now - no changes made. Is Functions v4 production ready?

vgool commented 1 year ago

Good to hear I'm not the only one suffering this issue..

My Functions (windows, consumption plan, West Europe) deployed to our Development environment run very slow and fail randomly, however the Functions on our Test environment run fast and all succeed..

Unbelievable this issue is open for over half a year!

markleavesley commented 1 year ago

We are also experiencing intermittent "HTTP Error 503. The service is unavailable" for v1 Functions (Windows, consumption) in UK South and Australia Southeast, with more than 40 occurrences in the last week. They are v1 because we're stuck on Framework SDKs for these two, and we are waiting for the v4 Framework tooling to settle down before giving that a go (last time I tried it, the template HTTP function wouldn't even compile!)

bdorplatt commented 1 year ago

We are seeing this exact issue as well in the North Central US region.

markleavesley commented 1 year ago

I raised this with MS who basically said upgrading to v4 Isolated Functions fixes it, which is something we were already looking into anyway as v1 is no longer being maintained. We didn't feel inclined to push for any actual explanation as to what was going on as the v4 upgrade seems fairly quick and painless tbh.

https://learn.microsoft.com/en-us/azure/azure-functions/migrate-version-1-version-4?tabs=v4%2Cazure-cli%2Cwindows&pivots=programming-language-csharp

I found creating a v4 Isolated Function (http trigger, Framework 4.8) was a useful support to the instructions as they seem to be slightly out of date.