aws / apprunner-roadmap

This is the public roadmap for AWS App Runner.
https://aws.amazon.com/apprunner/

Apprunner hangs on long running requests with error message "upstream connect error or disconnect/reset before headers. reset reason: connection termination" #92

Open avivio opened 2 years ago

avivio commented 2 years ago

Community Note

Tell us about your request
What do you want us to build?

When running an app with a simple frontend but a long-running backend, the web client will suddenly hang with the error message "upstream connect error or disconnect/reset before headers. reset reason: connection termination". This doesn't seem to affect the app itself, since I can see in CloudWatch that the logs keep behaving as if the request is still being processed. This is probably a simple timeout configuration in the load balancer or the API gateway (if there is one). Could you add the option to configure this timeout, or at least provide visibility into what the configuration is?

Describe alternatives you've considered
Use ECS, where you can configure these parameters directly.

Additional context
Anything else we should know?

Attachments If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)
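For anyone trying to reproduce the symptom, here is a minimal stdlib-only sketch of a backend with a deliberately slow endpoint. The 150-second delay and the port are illustrative assumptions, not values from this issue; the point is only that the handler outlasts whatever upstream timeout App Runner applies.

```python
# Minimal sketch (assumptions: delay and port are illustrative) of a
# long-running request handler. Deployed behind App Runner, a request to
# this endpoint eventually fails with "upstream connect error or
# disconnect/reset before headers" even though the app keeps working.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

PROCESSING_SECONDS = 150  # assumed to exceed the (unconfigurable) upstream timeout


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Simulate a long-running backend job; CloudWatch logs would keep
        # flowing while the client-facing connection is reset upstream.
        time.sleep(PROCESSING_SECONDS)
        body = b"done"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep request logging quiet


if __name__ == "__main__":
    HTTPServer(("", 8080), SlowHandler).serve_forever()
```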

jparksecurity commented 2 years ago

I was able to fix this issue by pausing and resuming the service.

jparksecurity commented 2 years ago

I'm seeing this issue again today. Does anyone know who we can tag here to get some attention from AWS?

jparksecurity commented 2 years ago

Tag #104

leocorelli commented 2 years ago

happening to me right now. after pausing and resuming still no fix. :/

Mugane commented 2 years ago

Same issue. No clear indication of what is going on anywhere. Is this a resource issue? Is there a problem with App Runner? None of this happens anywhere else we're running these containers, what's going on?

eeshan-dx commented 2 years ago

I am experiencing the same issue! Even if I reduce the processing time, the 503 gets hit.

francoisvdv commented 2 years ago

I have contacted AWS support about this, but so far the issue has been "Work in progress" for 8 days. I hope they get back to me about this soon.

eeshan-dx commented 2 years ago

> I have contacted AWS support about this, but so far the issue has been "Work in progress" for 8 days. I hope they get back to me about this soon.

@francoisvdv May I know if you've got any replies back from AWS Support regarding this issue?

francoisvdv commented 2 years ago

Sadly, only a "we are working on it and we have escalated it", but no solution or anything...

fracampit commented 1 year ago

@francoisvdv still nothing?

francoisvdv commented 1 year ago

After various back-and-forths the conclusion of AWS support was that it was a problem in the application. We did not agree with that conclusion and instead migrated away from App Runner to ECS. So sadly no solution other than not using App Runner.

mikaelcabot commented 1 year ago

Had this same issue about a year ago when testing out AWS App Runner, and now I'm experiencing it again. Not quite as often as a year ago, but still 😞 ... Will have to move back to ECS Fargate again.

Below is the reply I got from AWS Support on July 3, 2021.

Hello,

After further investigation from the service team,

App Runner uses Fargate tasks in the backend to spin up the application instances. When the application is not receiving any requests, Fargate automatically reduces the CPU allocated to the task (idle state). Once there are new active requests, the task's CPU allocation increases so it can respond to incoming requests (active state).

The issue is related to the Fargate task not getting allocated CPU even after receiving new requests.

Backend Fargate tasks are put to sleep when they have not received any active requests for an extended period of time. New incoming requests may then face network timeout issues leading to 503s, since the Fargate task is not able to re-allocate CPU to serve the new incoming requests.

Unfortunately, there is no way to mitigate this issue at that point.

The internal service team are working to find a solution. However, it may take some time.

You can track any new release information at either of the following locations[1][2].

References: [1] https://aws.amazon.com/new/ [2] https://github.com/aws/apprunner-roadmap/issues
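Given that explanation, one workaround people use while waiting for a service-side fix (my assumption, not something AWS recommended in this thread) is a keep-warm pinger that hits the service on a schedule so the backing task never idles long enough to be put to sleep. The URL and interval below are placeholders.

```python
# Hypothetical keep-warm pinger (stdlib only). SERVICE_URL and the
# 60-second interval are illustrative placeholders, not from AWS.
import time
import urllib.request

SERVICE_URL = "https://example.com/health"  # placeholder endpoint


def ping(url, timeout=10):
    """Return the HTTP status of a GET to `url`, or None on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except OSError:  # URLError/HTTPError both subclass OSError
        return None


def keep_warm(url, interval=60, max_pings=None):
    """Ping `url` every `interval` seconds; max_pings=None runs forever."""
    sent = 0
    while max_pings is None or sent < max_pings:
        print(f"keep-warm ping -> {ping(url)}")
        sent += 1
        if max_pings is None or sent < max_pings:
            time.sleep(interval)


if __name__ == "__main__":
    keep_warm(SERVICE_URL)
```

A scheduled job (cron, EventBridge, etc.) calling the endpoint achieves the same thing without a long-running process; note this only masks the idle-state behavior and adds request volume/cost.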

The active auto scaling policy for the App Runner service (screenshot from the AWS console, dated 2022-08-15, not reproduced here):

I think this scaling policy ☝️ with a minimum size configured, taken together with the AWS Support response

> Backend Fargate tasks are put to sleep since they are not receiving any active requests

is contradictory/misleading (if true), as I would expect to have 5 instances ready/active at all times (with CPU allocated) to handle incoming requests.

Mugane commented 1 year ago

Still getting this. When is a solution expected?? This is rendering AppRunner completely unusable. The whole point is to abstract scaling, but then the thing is incapable of scaling altogether?! What is the point? Why did you even launch this service?

SJANAKIVENKATA commented 1 year ago

Use an nginx proxy; it will solve the issue. For me it was solved by using nginx.

amitgupta85 commented 1 year ago

Sorry for no response here for a long time on this issue. I would like to help on this. @Mugane @mikaelcabot @francoisvdv Is it possible that you can share service ARN for an affected service?

francoisvdv commented 1 year ago

> Sorry for no response here for a long time on this issue. I would like to help on this. @Mugane @mikaelcabot @francoisvdv Is it possible that you can share service ARN for an affected service?

We moved away from AppRunner because of this issue so we no longer have ARNs available.

Mugane commented 1 year ago

Yes, but I'm not at my workstation; I'll update tomorrow.


mstoyanovv commented 1 year ago

I am experiencing the same issue when running a Next.js app on App Runner. The issue happens when I run Google PageSpeed Insights against the app. It is a shame, really, because it turns out that App Runner is not scaling well enough and there is nothing I can do as a user. No matter what scaling policy I use or what vCPU/RAM configuration, the app crashes from a simple Google PageSpeed test...

SJANAKIVENKATA commented 1 year ago

hi @mstoyanovv, try using nginx as a proxy; it will resolve the issue.

mstoyanovv commented 1 year ago

hi @SJANAKIVENKATA, how did you use nginx with AppRunner?

SJANAKIVENKATA commented 1 year ago

hi @mstoyanovv just use it for proxying and static files only; there is no need to configure a certificate, because App Runner will provide HTTPS.

Mugane commented 1 year ago

> hi @mstoyanovv try to use nginx as a proxy it will resolve the issue

How would this possibly make any difference? Wouldn't it just offload the error from the initial request to the internal proxy request? That doesn't solve App Runner hanging.

mstoyanovv commented 1 year ago

@Mugane I created another App Runner instance that hosts nginx configured as a proxy and cache for static files. It solved the issue that I had with Google PageSpeed Insights. Also, when stress testing the app with Ddosify, it handles the traffic better.
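For readers wanting to try this pattern, a minimal nginx fragment might look like the sketch below. The upstream host, port, cache sizes, and timeouts are all illustrative assumptions; the thread does not include the actual configuration used.

```nginx
# Hypothetical nginx.conf fragment: nginx in its own App Runner service,
# proxying to the app and caching static assets. All names/values are
# placeholders, not taken from this thread.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=static:10m max_size=100m;

server {
    listen 8080;  # the port App Runner forwards to the container

    # Cache static assets (Next.js build output, for example)
    location /_next/static/ {
        proxy_cache static;
        proxy_cache_valid 200 1h;
        proxy_pass http://app-backend.example.internal:3000;
    }

    # Everything else is proxied through, with generous upstream timeouts
    location / {
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
        proxy_pass http://app-backend.example.internal:3000;
    }
}
```

Note this adds a second service (and cost), and per @Mugane's objection above it does not change App Runner's own upstream behavior; it mainly absorbs bursts of static-asset traffic before they reach the app.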

smeera381 commented 1 year ago

Hello @mstoyanovv, could you provide the service arn so that we can take a look?

mikaelcabot commented 1 year ago

> Sorry for no response here for a long time on this issue. I would like to help on this. @Mugane @mikaelcabot @francoisvdv Is it possible that you can share service ARN for an affected service?

We have also moved away from AppRunner because of this issue.

But going back to the response I got from AWS Support:

> The issue is related to Fargate task not getting allocated CPU even after receiving new requests.

> Unfortunately, there is no way to mitigate this issue at that point. The internal service team are working to find a solution. However, it may take some time.

... So has a fix been applied targeting this issue? (Asking so I know whether it's worth spending time testing this again.)

mstoyanovv commented 1 year ago

Hello @mstoyanovv, could you provide the service arn so that we can take a look?

Where can I contact you, @smeera381?

msumithr commented 1 year ago

Hello @mstoyanovv If you could share the service arn here, I can take a look.

atrope commented 1 year ago

we are having the same issue on a prod app with NextJS.

With API Gateway it works; with App Runner it does not.

We're not testing with Google but with our own Next.js website. Sometimes it hangs and we can't do anything about it.

Request latency went up at 18:30 UTC, and we started seeing the same errors (latency graph and error screenshots from 2023-05-30 at 23:00, not reproduced here).

msumithr commented 1 year ago

Thank you @mstoyanovv. Taking a look. Hello @atrope, please feel free to share your service arn details here and we will take a look.

atrope commented 1 year ago

arn:aws:apprunner:us-east-1:384537834093:service/genuine-project-ffub8-app/e0d044541e0c43b38894b06e88c3b36c

msumithr commented 1 year ago

Hello @mstoyanovv On checking the past history, I see a noticeable spike in the 5xx count in the last few days, but not in the past 3 weeks or so (when you informed us about the issue). Would you be able to provide a specific timeline that we can dive deeper into? Approximately 3 weeks ago (05-16-2023 08:45 PDT), we do see a scaling activity initiated. Is this the timeline when you faced scaling issues?

xinrfeng commented 1 year ago

Hey @atrope Thank you for using the App Runner service. We analyzed your application and identified some performance issues related to high CPU/memory utilization. Around the time in your screenshot, memory utilization reached its limit. This resulted in increased latency and eventually caused the backend instance to crash, which in turn caused the connection failures. To investigate further and address these issues, we recommend checking the metrics of your App Runner service; this will provide insight into your application's performance. To optimize your service, we suggest adjusting either the maximum concurrency setting or the instance memory/CPU configuration based on your application's performance requirements. This will help ensure that the service scales appropriately to meet your needs.
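For reference, the tuning AWS suggests here maps to App Runner CLI calls along these lines. This is a sketch under stated assumptions: the configuration name, sizes, concurrency, CPU/memory values, and ARN variables are illustrative placeholders, and a minimum size above 1 keeps instances warm at extra cost.

```shell
# Sketch only: all names and values below are placeholders.
# 1. Create an auto scaling configuration with more headroom.
aws apprunner create-auto-scaling-configuration \
  --auto-scaling-configuration-name keep-capacity \
  --max-concurrency 50 \
  --min-size 2 \
  --max-size 10

# 2. Attach it to the service and bump the instance size.
#    SERVICE_ARN / ASC_ARN come from your own account.
aws apprunner update-service \
  --service-arn "$SERVICE_ARN" \
  --auto-scaling-configuration-arn "$ASC_ARN" \
  --instance-configuration Cpu="1 vCPU",Memory="2 GB"
```

Lowering `--max-concurrency` makes App Runner scale out sooner under load, which addresses the "slow to react" behavior described below at the cost of running more instances.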

mstoyanovv commented 1 year ago

> Hello @mstoyanovv On checking the past history I see a noticeable spike in the 5xx count in the last few days, but not in the past 3 weeks or so (when you had informed us about the issue). Would you be able to provide a specific timeline that we can further dive deep into? Approximately 3 weeks ago (05-16-2023 08:45 PDT), we do see a scaling activity initiated. Is this the timeline when you faced scaling issues?

Thanks for investigating! Yes, that is when I was experiencing the issue, with the same message as @atrope. Adding nginx in front of the instance fixed it for now, but it is an additional cost for the product. I tried adjusting the concurrency setting, but it seems App Runner is slow to react to high usage and crashes until a new instance is created/started.