Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License

EasyAuth middleware hang when restarting azure function app under load #7829

Open dhtek opened 3 years ago

dhtek commented 3 years ago

When the authentication feature (EasyAuth) is activated on a function app and set up with Azure Active Directory, and the function app restarts while HTTP requests are still incoming, the function app stays unhealthy.

Azure Function Setup:

Authentication feature setup:

Repro steps

Provide the steps required to reproduce the problem:

1) Create a function app with an HTTP trigger (it can just return 200 OK), hosted on an App Service plan (P3v2) using a Docker image.
2) Set up the app registration in Active Directory for this API with OAuth 2.0 tokens.
3) Activate the authentication feature on this API with the Microsoft identity provider.
4) Run a constant HTTP request load (roughly 100 requests per second) using artillery, loader.io, or another load testing tool, with a valid bearer token.
5) Stop the function app during the load.
6) Wait for some execution errors.
7) Start the function app.

At this stage, the function app no longer replies until it is restarted after the load has ended.
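For illustration, the trivial HTTP trigger from step 1 could be as simple as the following (a sketch only; the original repro uses a custom Docker image, and the language, route, and function names here are assumptions):

# function_app.py - minimal HTTP trigger that just returns 200 OK
# (Python v2 programming model; all names are placeholders, not taken from the original repro)
import azure.functions as func

app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)

@app.route(route="ping")
def ping(req: func.HttpRequest) -> func.HttpResponse:
    return func.HttpResponse("ok", status_code=200)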

Expected behavior

The function app should restart even under constant HTTP request load and resume processing incoming requests.

Actual behavior

The function app stays unhealthy.

Known workarounds

Block the incoming requests, then restart the function app.

Additional information

If we run the same test with the authentication feature deactivated, the function app restarts correctly and the load test continues without any issue.

When the function app is unhealthy, we don't see any logs in the Kudu log stream for appsvc/middleware or msiProxy explaining the situation, even though we have activated the filesystem application logging feature in the portal.

v-anvari commented 2 years ago

Hi @dhtek, can you share the invocation ID, region name, and timestamp?

dhtek commented 2 years ago

Hi @v-anvari,

I can give you the region: West Europe. The invocationId will be difficult to provide, since after the stop/start the incoming requests are not received by our function app, so they are never actually invoked or ingested into Application Insights.

After investigating on my side, I discovered that roughly 30 minutes after the load test I got these errors in the _easyauth_docker.log:

2021-11-08T15:44:44.602139065Z fail: Middleware[0] 2021-11-08T15:44:44.602200067Z Failed to forward request to application. Encountered a System.Net.Http.HttpRequestException exception after 2250710.892ms with message: An error occurred while sending the request.. Check application logs to verify the application is properly handling HTTP traffic.

2021-11-08T15:44:44.943409704Z fail: Middleware[0] 2021-11-08T15:44:44.943558107Z Failed to forward request to application. Encountered a System.Threading.Tasks.TaskCanceledException exception after 1302139.557ms with message: The operation was canceled.. Check application logs to verify the application is properly handling HTTP traffic.

It seems this log is generated by the RequestForwarder in Microsoft.Azure.AppService.Middleware.Forwarding:

catch (Exception e)
{
    string message = $"Failed to forward request to application. Encountered a {e.GetType()} exception after {requestTime.Elapsed.TotalMilliseconds:F3}ms with message: {e.Message}." + " Check application logs to verify the application is properly handling HTTP traffic.";
    logger.LogError(message);
    string internalTelemetryMessage = $"HTTP proxy request encountered exception.\nType: {e.GetType()}\nMessage: {e.Message}\n StackTrace: {e.StackTrace}";
    if (e.InnerException != null)
    {
        internalTelemetryMessage += $"\n\nInnerException\nType: {e.InnerException!.GetType()}\nMessage: {e.InnerException!.Message}\n StackTrace: {e.InnerException!.StackTrace}";
    }
    MiddlewareTrace.TraceError(internalTelemetryMessage);
    context.Response.StatusCode = 500;
}

which is weird, because the HTTP client is configured with a default timeout of 300 seconds, while the exceptions above were only raised after 2250710.892 ms and 1302139.557 ms (roughly 37 and 22 minutes):

RequestForwarder requestForwarder = new RequestForwarder(targetHostUri, rewrite, logger, TimeSpan.FromSeconds(300.0));

Unfortunately, I haven't found a way to activate trace-level logging for the middleware, which could be helpful. Do you have any idea how I can do this? There are two loggers: one using TraceSource in the MiddlewareTrace class, and the standard .NET Core logger.

It looks like the middleware starts forwarding requests before the destination container (my function app) is up and running.

Thanks for the help

v-anvari commented 2 years ago

Thank you for the details. We are following up with the concerned team for further insights.

mandar-lad commented 2 years ago

We faced a similar issue with an Azure App Service (4 instances, autoscaling enabled) with the CORS setting '*'. Below are log snippets from the EasyAuth Docker container log files "…easyauth_docker.log":

2022-06-10T00:03:45.552868700Z fail: Middleware[0] 2022-06-10T00:03:45.552922505Z Failed to forward request to {IP address based URL removed}. Encountered a System.Net.Http.HttpRequestException exception after 0.774ms with message: Error while copying content to a stream.. Check application logs to verify the application is properly handling HTTP traffic.

2022-06-10T00:00:00.706314821Z fail: Middleware[0] 2022-06-10T00:00:00.706365525Z Failed to forward request to {IP address based URL removed}. Encountered a System.Net.Http.HttpRequestException exception after 1.802ms with message: Error while copying content to a stream.. Check application logs to verify the application is properly handling HTTP traffic.

2022-06-09T00:33:54.096796203Z fail: Middleware[0] 2022-06-09T00:33:54.096842003Z Failed to forward request to {IP address based URL removed}. Encountered a System.Net.Http.HttpRequestException exception after 2.363ms with message: Error while copying content to a stream.. Check application logs to verify the application is properly handling HTTP traffic.

@v-anvari - I'd appreciate it if you could share the relevant product group's findings on these similar issues. Thanks

kwasilka commented 2 years ago

We're facing the same issue. We're running Uvicorn with FastAPI in an Azure App Service and experiencing the exact same behavior logged here. Most often, we see the error and the end user receives a non-descriptive 500 error when our requests go above 100.

One interesting thing we're seeing: with info logging enabled in Uvicorn, requests are being answered from a different IP than the one reported in the error message. At the time of writing, the errors being reported are on .7, but the 200 OK requests are being answered by .8.

Any help would be greatly appreciated!

kwasilka commented 2 years ago

From further testing and research, I found some other reports from people who had autoscale-out enabled, though they didn't go into much detail.

Today I tested this a little further. Logged requests to the service were being answered from IP .4, but when the middleware error showed up, it was reported from .3. After setting the scale-out manually to 2 instances, we were no longer able to reproduce the error: we pushed the requests beyond 600, sustained for 10-15 minutes, without another error occurring. With the second instance running, requests were being answered from .4 and .7.

We're unable to identify what .3 is within the container. Is it a load balancer for the instances? I can't confirm this, but it looks like .3 is the load balancer for the instances and forwards requests on to the instance running on another IP. When only one instance is running and the requests reach a tipping point (in our case and in the posts above, around 100 requests at a time), the balancer might be attempting to send requests to another instance that isn't running, resulting in the error about no response to HTTP traffic.

kwasilka commented 2 years ago

This appears to still be an issue for us. This morning we saw the error show up on an IP that is answering requests, and that even answered requests 50 ms after the error:

2022-07-08T15:10:12.357743322Z fail: Middleware[0]

2022-07-08T15:10:12.357781122Z Failed to forward request to http://xxx.xxx.xxx.7:8000. Encountered a System.Net.Http.HttpRequestException exception after 7474.419ms with message: An error occurred while sending the request.. Check application logs to verify the application is properly handling HTTP traffic.

2022-07-08T15:10:12.359916350Z fail: Middleware[0]

2022-07-08T15:10:12.359940351Z Failed to forward request to http://xxx.xxx.xxx.7:8000. Encountered a System.Net.Http.HttpRequestException exception after 7251.366ms with message: An error occurred while sending the request.. Check application logs to verify the application is properly handling HTTP traffic.

2022-07-08T15:10:43.551433545Z INFO: xxx.xxx.xxx.7:58172 - "GET /xxx HTTP/1.1" 200 OK

2022-07-08T15:10:46.869278659Z INFO: xxx.xxx.xxx.7:58152 - "GET xxx HTTP/1.1" 200 OK

ltrain777 commented 2 years ago

We're hitting the same issue with a NodeJS web app with EasyAuth enabled. I notice that the port usage goes up when this error occurs. The image shows a restart at about 08-23 23:00, then the error happens again a bit later.

(image: port usage over time, showing the restart and the subsequent error)

kwasilka commented 1 year ago

Following up on what I've found, simplified: the root cause was using synchronous subroutines within an async function. All our endpoints are async, but when they have to make database calls, those calls block and the async functions end up behaving synchronously. Under load, we would then see the 500 errors show up because there were no worker processes available to receive the request from the app service layer, as they were all blocked by other requests. The database calls the API needs to make go over an express route, and the database is not currently in Azure, which amplified the issue, since those calls take longer than they would if the database were in the same datacenter.
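To make the pattern concrete, here is a minimal FastAPI sketch of the problem and the fix (the endpoint and database helper names are made up for illustration):

# Sketch: a blocking call inside an "async def" endpoint ties up the event loop,
# so no other requests can be accepted and the app service layer eventually
# reports "Failed to forward request to application".
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

def fetch_rows_blocking(query: str):
    # Placeholder for a synchronous database call made with a blocking driver.
    ...

@app.get("/items")
async def get_items():
    # Problematic: rows = fetch_rows_blocking("SELECT ...")  # blocks the event loop
    # Better: run the blocking call in the threadpool so the loop stays responsive.
    rows = await run_in_threadpool(fetch_rows_blocking, "SELECT ...")
    return {"rows": rows}

Declaring the endpoint with a plain def instead of async def has a similar effect, since FastAPI then runs it in its threadpool automatically.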

The app service should be reporting back a different error; a 408, 429, or even a 404 would be more helpful for troubleshooting these kinds of issues than an empty 500.

bchr02 commented 1 year ago

I too am having a similar issue. I have an Azure Node.js app deployed on Linux with Easy Auth enabled. After about an hour of just casual app usage, the application stops working; below is what gets logged. Restarting the app causes it to start working again for another hour or less before the same issue recurs. The app is a web server. Some pages pull data from an Oracle database (using the oracledb driver) that is located on another Azure server on the same VNET. Everything works fine, including accessing data from the DB, for at least 10-15 minutes before it randomly starts erroring out.

2023-02-17T03:44:57.056Z INFO  - Container xxxxxxxxx_18_40_middleware for site xxxxxxxxx initialized successfully and is ready to serve requests.

/home/LogFiles/2023_02_17_pl0k000LOT_easyauth_docker.log  (https://xxxxxxxxx.scm.azurewebsites.net/api/vfs/LogFiles/2023_02_17_pl0s0LOT_easyauth_docker.log)

2023-02-17T03:08:43.131273977Z fail: Middleware[0]
2023-02-17T03:08:43.132546971Z       Failed to forward request to http://169.x.x.x:3000. Encountered a System.Threading.Tasks.TaskCanceledException exception after 300009.219ms with message: The operation was canceled.. Check application logs to verify the application is properly handling HTTP traffic.
jameel-shorosh-silk commented 1 year ago

Any update on this? We are getting the same issue with an Azure web app running a gunicorn application.

hiniestic commented 9 months ago

Any news on this issue? I'm having the same issue on FastAPI, and I'm not using Easy Auth, but I'm getting these logs:

2024-01-11T14:57:46.653855613Z fail: Middleware[0] 2024-01-11T14:57:46.653911917Z Failed to forward request to http://169.254.129.2:8000. Encountered a System.Threading.Tasks.TaskCanceledException exception after 300003.436ms with message: The request was canceled due to the configured HttpClient.Timeout of 300 seconds elapsing.. Check application logs to verify the application is properly handling HTTP traffic.

which blocks the app. @kwasilka did you find any solution for FastAPI?

kwasilka commented 9 months ago

@hiniestic, no, I'm sorry, I don't have anything that really corrected our problem. Initially, adding more resources to the API (increasing the app service plan) reduced how often the errors showed up, but after a recent update that pushes a lot more traffic through the API, we're now seeing the errors return. From my research, others have abandoned the app service and simply run their API containerized elsewhere. I haven't gone that route yet; I've containerized our API into a Docker container but still host it in an app service, since we rely heavily on MSAL and the app service makes that easier to manage. We've also adjusted our front-end HTTP requests to the API to retry when they get a 500 with an empty message, since that is what comes back when this issue occurs; it normally responds on retry. Not ideal, but it's how we're managing at the moment.
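Roughly, the retry looks like this (a Python sketch for illustration only; our actual front end uses a different HTTP client, and the names here are placeholders):

# Retry a request when the app service returns the empty-body 500 described above.
import time

import requests

def get_with_retry(url: str, token: str, attempts: int = 3, backoff: float = 0.5):
    resp = None
    for attempt in range(attempts):
        resp = requests.get(
            url,
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        # The middleware failure surfaces as a 500 with an empty body; retry only that case.
        if resp.status_code == 500 and not resp.content:
            time.sleep(backoff * (attempt + 1))
            continue
        break
    return resp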

patnir411 commented 9 months ago

I solved a variant of this issue. For context, my setup involved an Azure Web App service serving my FastAPI backend with Uvicorn, which I deployed as a containerized instance. My solution involved the following; since it took me >48 hours of manual labor to track this down, I thought I'd leave it here to add to the body of knowledge in case it helps anyone else:

Within my gunicorn.conf.py file, I needed to specify a timeout = value greater than 30 (the default). My API was expected to take more than a minute on certain requests, and I had to account for that.
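For example, a minimal gunicorn.conf.py along these lines (a sketch; the values are illustrative placeholders rather than my exact settings):

# gunicorn.conf.py - minimal sketch; values are illustrative placeholders
bind = "0.0.0.0:8000"
workers = 2
worker_class = "uvicorn.workers.UvicornWorker"  # ASGI worker class for FastAPI/Uvicorn
timeout = 120  # must exceed the default 30 s when requests can run longer than that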

Additionally, I needed to set the startup command in my Dockerfile to the following so that gunicorn actually picks up this config file:

CMD ["gunicorn", "-c", "gunicorn.conf.py", "main:app"]

kwasilka commented 9 months ago

For me, the timeout doesn't help. The errors reported on the app service show the API failing to respond within milliseconds; yesterday it errored because it didn't reply within 3.5 ms, yet my timeout is already set to 120 seconds in my gunicorn config file.

patnir411 commented 9 months ago

Hmm. It seems to have let me use my API now, but I do notice that about 1 in 10 times it will fail and I see this error:

2024-02-02T21:21:23.192034572Z fail: Middleware[0]
2024-02-02T21:21:23.192079473Z       Failed to forward request to http://169.254.130.7. Encountered a System.Net.Http.HttpRequestException exception after 29623.230ms with message: An error occurred while sending the request.. Check application logs to verify the application is properly handling HTTP traffic.

I'm unable to see any logs that add more detail, which makes this even more difficult to understand. I may just have to go ahead and implement retries, similar to you.

ccueto36 commented 7 months ago

Any news on this issue? I'm having the same issue on FastAPI, and I'm not using Easy Auth, but I'm getting these logs:

2024-01-11T14:57:46.653855613Z fail: Middleware[0] 2024-01-11T14:57:46.653911917Z Failed to forward request to http://169.254.129.2:8000. Encountered a System.Threading.Tasks.TaskCanceledException exception after 300003.436ms with message: The request was canceled due to the configured HttpClient.Timeout of 300 seconds elapsing.. Check application logs to verify the application is properly handling HTTP traffic.

which blocks the app. @kwasilka did you find any solution for FastAPI?

Did you ever figure this out? I'm also running a uvicorn/gunicorn FastAPI app and running into random 60-second response times on requests that should take less than a second. I'm seeing these same errors in my logs, and I don't have EasyAuth enabled either. I've been researching this issue for many hours and am running out of ideas.

slarionoff commented 7 months ago

Hello everyone, I started getting this issue this Thursday. I have a function app that worked fine before; now it hangs and then fails with a timeout. At the beginning of the log it shows this combination 3-4 times:

2024-04-03T07:02:31.408663644Z: [INFO] fail: Middleware[0] 2024-04-03T07:02:31.409223053Z: [INFO] Failed to forward request to http://169.254.N.P. Encountered a System.Threading.Tasks.TaskCanceledException exception after 300032.528ms with message: The request was canceled due to the configured HttpClient.Timeout of 300 seconds elapsing.. Check application logs to verify the application is properly handling HTTP traffic.

Not sure if this is a valid point, but when I run the command

hostname -I

on the non-working Prod and the working Dev, I get slightly different results. On Prod I receive, let's say,

hostname -I
169.254.254.M 169.254.N.P

but on Dev it is

hostname -I
169.254.N.P 169.254.254.M

so they are kind of switched. I'm not sure how those addresses are prioritized, etc.

Rebooting the app or the server itself didn't help.

Edit #2: On the other hand, the command

hostname -i
169.254.N.P

returns the same pattern for both Dev and Prod, so maybe it is not relevant.

P.S. Edited: I had mixed up the Dev and Prod command output.