aws-amplify / amplify-hosting

AWS Amplify Hosting provides a Git-based workflow for deploying and hosting fullstack serverless web applications.
https://aws.amazon.com/amplify/hosting/
Apache License 2.0

Cold start taking 10-12 seconds to render static page #3211

Closed: mr-rpl closed this issue 1 year ago

mr-rpl commented 1 year ago

App Id

d15y9mlar87m44

AWS Region

us-east-1

Amplify Hosting feature

Not Applicable

Describe the bug

Upon new deployment and/or after the app goes idle, we are experiencing a 10-12 second delay on any page of our Next 13 application. We are on the new Web Compute platform.

Mere speculation wants to blame Lambda Cold Start -- but I am not 100% sure if that is the case -- either way, a public facing UI should never be faced with a 10-12 second cold start.

Expected behavior

Loads in reasonable time

Reproduction steps

  1. Fresh deploy and visit the site
  2. Let sit idle for a period of time and visit the site

Build Settings

No response

Log output

No response

Additional information

We have a sample build up here

Screenshot shows a 10-second request (screenshot: 2022-12-21 at 11 17 15 AM)

mr-rpl commented 1 year ago

this looks to be identical to: https://github.com/aws-amplify/amplify-hosting/issues/2647

it was closed as part of launching support for Next 12/13 -- but the behavior still exists

mr-rpl commented 1 year ago

bump

LucasLemanowicz commented 1 year ago

Related to the high TTFB issue here: https://github.com/aws-amplify/amplify-hosting/issues/3122 which is tagged as "investigating" (but with no new update in nearly a month)

yuyokk commented 1 year ago

I was about to submit the bug as well but came across this one.

We are getting a time_starttransfer of ~13-14 seconds on our latest Next.js project (measured as described here: https://stackoverflow.com/questions/18215389/how-do-i-measure-request-and-response-times-at-once-using-curl), e.g.

time_namelookup:     0.152133
time_connect:        0.168539
time_appconnect:     0.210377
time_pretransfer:    0.210509
time_redirect:       0.000000
time_starttransfer:  14.235495
----------
time_total:          14.236349

14 seconds is way too much :(
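For anyone who wants to reproduce this measurement, the curl timing format from the linked Stack Overflow answer looks roughly like this. This is a sketch: the URL is a placeholder, so substitute your Amplify app's domain.

```shell
# Sketch of the curl timing measurement from the linked Stack Overflow answer.
# Writes each libcurl timing variable on its own line, matching the output above.
cat > curl-format.txt <<'EOF'
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
          ----------\n
         time_total:  %{time_total}\n
EOF

# Placeholder URL; the "|| echo" keeps the sketch usable offline.
curl -w "@curl-format.txt" -o /dev/null -s https://example.com/ || echo "request failed"
```

A high time_starttransfer alongside low time_connect/time_pretransfer points at server-side work (e.g. a cold start) rather than network overhead.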

Amplify app id = d3jjevefis395x Region = us-east-1

(screenshot: Screen Shot 2023-01-13 at 12 59 23 PM)

stefanzier commented 1 year ago

I also have a Next13 app and enabling "performance mode" seemed to do nothing for me as TTFB was still quite bad (~5-10s). Upon inspection of the headers, I noticed the cache would always report Miss from CloudFront and I could not see s-maxage set.

Not sure if this will work for you, but setting the header explicitly in next.config.js seems to have worked for me, and now my site loads much, much faster. It would be great if someone from the Amplify team could confirm this is a valid workaround.

const nextConfig = {
    headers: async () => {
        return [
            {
                source: '/(.*)',
                headers: [
                    {
                        key: 'Cache-Control',
                        value: 'public, s-maxage=86400'
                    }
                ]
            }
        ];
    }
};
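Not an official check, but one way to verify whether a workaround like this is taking effect is to inspect the response headers directly. The domain below is a placeholder:

```shell
# Inspect the caching headers on two consecutive requests (placeholder domain).
# On the first request CloudFront typically reports a miss; if s-maxage is
# being honored, a quick second request should report a hit.
for i in 1 2; do
  curl -sI https://example.com/ | grep -iE '^(cache-control|x-cache)' || echo "request $i failed"
done
```

Look for `Cache-Control: public, s-maxage=86400` and, on the repeat request, `X-Cache: Hit from cloudfront`; a persistent `Miss from cloudfront` suggests the header is not reaching the CDN.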
ghost commented 1 year ago

Hi @mr-rpl @stefanzier @yuyokk 👋🏽 apologies for the delay here. Are you leveraging BugSnag in your package.json? We have seen instances where packages like BugSnag have contributed ~12 seconds to the app initialization time which accounts for the high TTFB.

We are continuing to investigate which other dependencies can also have this effect on TTFB and will update this issue accordingly.

yuyokk commented 1 year ago

@hloriana we don't use BugSnag for our project.

jdpst commented 1 year ago

@hloriana Same issue on our project. We do not use BugSnag. Here are the dependencies we use (I've omitted a few internal config packages):

Dependencies:

  "dependencies": {
    "@apollo/client": "^3.7.5",
    "@emotion/react": "^11.10.5",
    "@emotion/server": "^11.10.0",
    "@emotion/styled": "^11.10.5",
    "@mui/icons-material": "^5.11.0",
    "@mui/lab": "^5.0.0-alpha.117",
    "@mui/material": "^5.11.6",
    "@mui/x-data-grid": "^5.17.20",
    "@mui/x-date-pickers": "^5.0.15",
    "@react-pdf/renderer": "^3.0.2",
    "@sentry/integrations": "^7.33.0",
    "@sentry/nextjs": "^7.33.0",
    "@sentry/replay": "^7.33.0",
    "@sentry/tracing": "^7.33.0",
    "@turf/helpers": "^6.5.0",
    "async-retry": "^1.3.3",
    "aws-amplify": "^5.0.11",
    "cheerio": "^1.0.0-rc.5",
    "country-code-lookup": "^0.0.22",
    "docx": "^7.8.2",
    "file-saver": "^2.0.5",
    "flat": "^5.0.2",
    "graphql": "^16.6.0",
    "heic2any": "^0.0.3",
    "jszip": "^3.10.1",
    "lodash": "^4.17.21",
    "luxon": "^3.2.1",
    "mapbox-gl": "^2.12.0",
    "next": "^13.1.5",
    "pdfjs-dist": "^2.16.105",
    "prop-types": "^15.8.1",
    "react": "^18.2.0",
    "react-csv": "^2.2.2",
    "react-dom": "^18.2.0",
    "react-google-button": "^0.7.2",
    "use-debounce": "^8.0.4",
    "yup": "^0.32.11"
  },
  "devDependencies": {
    "@graphql-codegen/cli": "^2.16.4",
    "@graphql-codegen/introspection": "^2.2.3",
    "@graphql-codegen/near-operation-file-preset": "^2.5.0",
    "@graphql-codegen/typescript": "^2.8.7",
    "@graphql-codegen/typescript-apollo-client-helpers": "^2.2.6",
    "@graphql-codegen/typescript-operations": "^2.5.12",
    "@types/async-retry": "^1.4.5",
    "@types/cheerio": "^0.22.31",
    "@types/file-saver": "^2.0.5",
    "@types/flat": "^5.0.2",
    "@types/google-map-react": "^2.1.7",
    "@types/luxon": "^3.2.0",
    "@types/mapbox-gl": "^2.7.10",
    "@types/react": "^18.0.27",
    "@types/react-csv": "^1.1.3",
    "@types/react-dom": "^18.0.10",
    "eslint": "^8.32.0",
    "get-graphql-schema": "^2.1.2",
    "standard-version": "^9.5.0",
    "typescript": "^4.9.4",
    "unimported": "^1.24.0"
  }
stefanzier commented 1 year ago

@hloriana same, no bugsnag. I have a very simple nextjs application. Unfortunately, I was tired of +10min build times and these cache issues, so I just moved to Vercel but kept Amplify as my backend. Now things are great but I hope to use Amplify hosting in the future if these issues are addressed 🙏

mr-rpl commented 1 year ago

> Hi @mr-rpl @stefanzier @yuyokk 👋🏽 apologies for the delay here. Are you leveraging BugSnag in your package.json? We have seen instances where packages like BugSnag have contributed ~12 seconds to the app initialization time which accounts for the high TTFB.
>
> We are continuing to investigate which other dependencies can also have this effect on TTFB and will update this issue accordingly.

@hloriana we are not, our dep tree is:

  "dependencies": {
    "@apollo/client": "^3.7.0",
    "@datadog/browser-rum": "^4.21.2",
    "@emotion/react": "^11.10.0",
    "@emotion/styled": "^11.10.0",
    "@mui/icons-material": "^5.10.9",
    "@mui/lab": "^5.0.0-alpha.95",
    "@mui/material": "^5.10.1",
    "@optimizely/react-sdk": "^2.9.1",
    "formik": "^2.2.9",
    "graphql": "^16.6.0",
    "next": "^13.0.7",
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "use-is-in-viewport": "^1.0.9",
    "uuid": "^9.0.0"
  },

fwiw, we are now running on Vercel and have no issues

yuyokk commented 1 year ago

@hloriana not sure if it helps, but we use MUI in our deps as well (a package we have in common with @mr-rpl and @jdpst)

jdpst commented 1 year ago

@yuyokk @mr-rpl @hloriana I have set up a CloudWatch synthetic canary to ping the relevant URL every few minutes, which seems to be good enough: most warm requests take <10 ms, so it's rare that we need more than one hot Lambda. Far from perfect, but sufficient as a workaround until this issue can be resolved.
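For anyone without CloudWatch Synthetics set up, a cruder version of the same keep-warm idea is a scheduled curl. The URL and interval below are placeholders, not values from this thread:

```shell
# Hypothetical keep-warm job: ping the app every 5 minutes so at least one
# compute Lambda stays hot. As a cron entry it would look like:
#
#   */5 * * * * curl -s -o /dev/null https://example.com/
#
# One-off version that also reports TTFB, so you can see whether the
# pings are keeping the function warm:
curl -s -o /dev/null -w 'TTFB: %{time_starttransfer}s\n' https://example.com/ || echo "request failed"
```

Note jdpst's caveat applies: this only keeps a single instance warm, so a burst of concurrent traffic can still hit cold starts.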

mr-rpl commented 1 year ago

I wanted to pull the MUI thread -- I deployed a fresh shell Next 13 app to Amplify and still experienced the 10-12 second load. Zero additional dependencies.

@jdpst - I seem to get the lag even without waiting, from time to time -- and especially after a deploy (first load) -- so for that, I personally wouldn't put anything into production

mr-rpl commented 1 year ago

another finding whilst triaging: I found that this only happens when using the pages directory. If using the new Next.js 13 app dir, the 10-12 second hang time goes away.

of course, it brought up a new bug: revalidate seems to happen at exactly 3 minutes no matter what I set the param to 😂

yuyokk commented 1 year ago

We set up a CloudWatch heartbeat monitor to ping the page every minute. We don't see 14 s time to first byte anymore, but still see ~3-4 s occasionally.

(screenshot: Screen Shot 2023-02-01 at 10 15 54 AM)

mstoyanovv commented 1 year ago

> another finding whilst triaging: I found that this only happens when using the pages directory. If using the new Next.js 13 app dir, the 10-12 second hang time goes away.
>
> of course, it brought up a new bug: revalidate seems to happen at exactly 3 minutes no matter what I set the param to 😂

I switched from the 'pages' to the 'app' directory and there is still ~3 seconds of TTFB, which is unacceptable for a production application. The same app deployed to Vercel gets under 100 ms TTFB ...

rapgodnpm commented 1 year ago

I get between 6-14 seconds TTFB. I use MUI. Is there any update on this? I've switched to Netlify but would love to use Amplify instead.

talaikis commented 1 year ago

People, if you're paying for Amplify hosting, why not enable a Route 53 health check? That $1 solves the problem.
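For reference, if you want to try this suggestion, a health check along these lines can be created with the AWS CLI. The domain and caller reference below are placeholders; Route 53 checkers then hit the endpoint roughly every RequestInterval seconds from multiple locations, which is the steady traffic that keeps a Lambda warm:

```shell
# Sketch: create a Route 53 health check that pings the app over HTTPS.
# Domain and caller-reference are placeholders, not values from this thread.
aws route53 create-health-check \
  --caller-reference "keep-warm-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "www.example.com",
    "ResourcePath": "/",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'
```

30 seconds is the standard interval; Route 53 also offers a 10-second fast interval at extra cost.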

rapgodnpm commented 1 year ago

> People, if you're paying for Amplify hosting, why not enable a Route 53 health check? That $1 solves the problem.

Did you actually try? I've made a Lambda that calls my website every 9 minutes and it didn't seem to solve anything.

talaikis commented 1 year ago

> Did you actually try? I've made a Lambda that calls my website every 9 minutes and it didn't seem to solve anything.

Sure, it's been working since I subscribed here, a month+ on one site, and another doesn't even require it. Both have service workers though; I'm checking in a clean, uncached browser.

rapgodnpm commented 1 year ago

> Did you actually try? I've made a Lambda that calls my website every 9 minutes and it didn't seem to solve anything.
>
> Sure, it's been working since I subscribed here, a month+ on one site, and another doesn't even require it. Both have service workers though; I'm checking in a clean, uncached browser.

Hmm, thank you for your suggestion, I will try

mstoyanovv commented 1 year ago

> People, if you're paying for Amplify hosting, why not enable a Route 53 health check? That $1 solves the problem.

I'm not exactly sure what you mean, but enabling health checks with a frequency of 30 seconds does not fix the issue for me. There is still 4-5 seconds of TTFB.

talaikis commented 1 year ago

> I'm not exactly sure what you mean, but enabling health checks with a frequency of 30 seconds does not fix the issue for me. There is still 4-5 seconds of TTFB.

How does the same site work outside of Amplify?

mstoyanovv commented 1 year ago

> I'm not exactly sure what you mean, but enabling health checks with a frequency of 30 seconds does not fix the issue for me. There is still 4-5 seconds of TTFB.
>
> How does the same site work outside of Amplify?

It works perfectly fine on Vercel without any changes to the site.

sig003 commented 1 year ago

I found a case that I was able to solve. I use Next.js with Amplify Hosting, and the index of my service redirects to another page:

https://foo.com => https://foo.com/abcd

With the redirect in place, TTFB takes more than 3 seconds, but the page is fast when accessing the destination directly:

https://foo.com/abcd

Before that, I applied the following settings:

1) Enabled Amplify's performance mode.
2) Set Amplify custom headers:

customHeaders:

Maybe there's a problem between Amplify and Next.js redirects?

talaikis commented 1 year ago

I've experienced problems with some of the redirects when using Next 13. Downgrading to Next 12 solved the problem for me. And it was not just 3-13 seconds, but a full browser freeze. I wasn't able to reproduce it on a smaller app.

IvanCaceres commented 1 year ago

We're getting extremely high TTFB (~15 second response times) for SSR routes in our Next.js 13 Amplify Hosting compute app.

abenzick commented 1 year ago

This thread details my exact problems. I'm trying to launch a site with Next.js + Amplify, but this >10 second cold start issue is a show stopper.

sami-bt commented 1 year ago

But this doesn't seem to solve the issue. Yes, after it loads for the first time, subsequent requests are instant.

A 15-second cold start for a small web app is unacceptable.

my dependencies:

{
  "dependencies": {
    "@emotion/react": "^11.4.1",
    "@emotion/styled": "^11.3.0",
    "@mui/icons-material": "^5.0.3",
    "@mui/material": "^5.0.3",
    "@mui/system": "^5.11.1",
    "@mui/x-date-pickers": "^5.0.0-alpha.7",
    "@sentry/nextjs": "^7.25.0",
    "@stripe/react-stripe-js": "^1.4.1",
    "@stripe/stripe-js": "^1.15.1",
    "@twilio/conversations": "^2.0.0",
    "@twilio/video-processors": "^1.0.2",
    "@typeform/embed": "^1.6.1",
    "axios": "^0.21.1",
    "chart.js": "^3.7.0",
    "core-js": "^3.9.1",
    "date-fns": "^2.29.3",
    "emoji-mart-next": "^2.11.2",
    "form-data": "^4.0.0",
    "jest-canvas-mock": "^2.4.0",
    "jsonwebtoken": "^9.0.0",
    "lottie-web": "^5.9.2",
    "material-ui-phone-number": "^3.0.0",
    "md5": "^2.3.0",
    "moment": "^2.29.1",
    "moment-timezone": "^0.5.34",
    "next": "12.3.4",
    "next-redux-wrapper": "^6.0.2",
    "nylas": "^6.4.2",
    "query-string": "^7.0.0",
    "rc-time-picker": "^3.7.3",
    "react": "^18.2.0",
    "react-calendar": "^3.3.1",
    "react-chartjs-2": "^4.0.1",
    "react-countup": "^6.0.0",
    "react-csv-reader": "^3.5.0",
    "react-data-table-component": "^7.0.0-alpha-5",
    "react-dom": "^18.2.0",
    "react-draggable": "^4.4.4",
    "react-google-login": "^5.2.2",
    "react-image-crop": "^10.0.4",
    "react-infinite-scroll-component": "^6.0.0",
    "react-linkedin-login-oauth2": "^1.0.9",
    "react-material-ui-form-validator": "^3.0.1",
    "react-modal": "^3.13.1",
    "react-phone-number-input": "^3.1.47",
    "react-quill": "^1.3.5",
    "react-redux": "^7.2.3",
    "react-responsive-carousel": "^3.2.16",
    "react-slick": "^0.28.1",
    "react-thunk": "^1.0.0",
    "react-time-picker": "^4.2.1",
    "react-visibility-sensor": "^5.1.1",
    "redux": "^4.0.5",
    "redux-thunk": "^2.3.0",
    "request": "^2.88.2",
    "slick-carousel": "^1.8.1",
    "stripe": "^8.157.0",
    "styled-components": "^5.2.1",
    "tslib": "2.5.0",
    "twilio": "^3.71.1",
    "twilio-video": "^2.18.1",
    "universal-emoji-parser": "^0.5.28",
    "winston-papertrail-transport": "^1.0.9",
    "wow.js": "^1.2.2"
  },
  "devDependencies": {
    "@testing-library/jest-dom": "^5.14.1",
    "@testing-library/react": "11.2.5",
    "@types/cron": "^2.0.0",
    "@types/express": "4.17.11",
    "@types/jest": "^27.0.1",
    "@types/node": "16.11.7",
    "@types/react": "17.0.3",
    "@types/react-dom": "17.0.3",
    "@types/react-material-ui-form-validator": "^2.1.1",
    "@types/react-modal": "^3.12.0",
    "@types/react-slick": "^0.23.4",
    "@types/supertest": "^2.0.12",
    "@typescript-eslint/eslint-plugin": "^5.46.1",
    "@typescript-eslint/parser": "^5.46.1",
    "babel-jest": "26.6.3",
    "cypress": "^6.8.0",
    "eslint": "^8.2.0",
    "eslint-config-airbnb": "19.0.4",
    "eslint-config-airbnb-typescript": "^17.0.0",
    "eslint-config-prettier": "^8.5.0",
    "eslint-import-resolver-typescript": "^3.5.2",
    "eslint-plugin-import": "^2.26.0",
    "eslint-plugin-jest": "^27.1.7",
    "eslint-plugin-jsx-a11y": "^6.5.1",
    "eslint-plugin-prettier": "^4.2.1",
    "eslint-plugin-react": "^7.28.0",
    "eslint-plugin-react-hooks": "^4.3.0",
    "husky": "^8.0.2",
    "jest": "26.6.3",
    "jest-canvas-mock": "^2.4.0",
    "lint-staged": "^13.0.2",
    "node-sass": "^8.0.0",
    "prettier": "^2.8.1",
    "supertest": "^6.2.3",
    "ts-jest": "26.5.4",
    "ts-node": "~9.1.1",
    "tslint": "~6.1.3",
    "typescript": "^4.3.2"
  },
  "husky": {
    "hooks": {
      "pre-commit": "lint-staged && echo '!! Husky is DONE Reviewing !!'"
    }
  },
  "lint-staged": {
    "*.{scss,css,md}": "prettier --write",
    "*.{ts,tsx}": [
      "yarn format",
      "yarn lint"
    ]
  }
}

rapgodnpm commented 1 year ago

The problem is that the resources allocated for SSR are insufficient. Checking the compute logs shows a 1024 MB Lambda function. We should be free to increase it, or have a bigger one by default. The only time the page is fast is after being cached in CloudFront at the edge. But we cannot rely on caching for good speed; there may be pages you always want to be fresh (account page, checkout page, payment, etc.).

sami-bt commented 1 year ago

(screenshot: Screenshot 2023-02-16 at 2 10 33 PM)

As mentioned by @yuyokk, my time_starttransfer is under 1 s, which is fine, but the cold start sucks!

jwang-lilly commented 1 year ago

@mstoyanovv, when you use the Next 13 app folder, where do you specify Amplify.configure({ ...config, ssr: true })? I put it in layout.jsx, but my app seems to have a hard time finding it; from time to time it complains about missing credentials. I also tried Amplify.configure({ ...config }) without much success. There is no documentation about this. Any insight would be greatly appreciated.

sami-bt commented 1 year ago

@rapgodnpm I observed in the build settings that you can enable performance mode for a specific branch. It gives a Docker image on a host with 4 vCPU and 7 GB memory.

rapgodnpm commented 1 year ago

> @rapgodnpm I observed in the build settings that you can enable performance mode for a specific branch. It gives a Docker image on a host with 4 vCPU and 7 GB memory.

Already tried that, had the same issue. From what I've read in the docs, performance mode just increases the max cache time from a few minutes to a day. Still, a 1024 MB Lambda function was used. I think the Docker image you mention is for the build environment; I do remember the build mentioning that kind of configuration while it was in progress.

IvanCaceres commented 1 year ago

Is there any progress on this issue? Cold starts are a serious performance issue for Next.js SSR users on Amplify Hosting. Is there any confirmation that using the Next.js 13 app folder improves performance versus the older pages directory?

mr-rpl commented 1 year ago

@IvanCaceres - app dir definitely helps with cold start -- but introduces a whole new set of issues :)

jdpst commented 1 year ago

@rapgodnpm what memory would you suggest?

@mr-rpl We ended up using open-next to package the Next.js build output and deploying it on Lambda ourselves. Cold starts remain (so it's probably not an Amplify issue specifically), but with more control over the infra we can do things like enable provisioned concurrency. It still works out substantially less expensive than Fargate, which is where we're coming from.
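For anyone following the open-next route, the provisioned-concurrency knob mentioned here is a standard Lambda setting. The function name and alias below are placeholders:

```shell
# Sketch: keep two warm instances of a self-managed Next.js server Lambda.
# Provisioned concurrency must target a published version or an alias,
# not $LATEST; "my-next-server" and "live" are placeholder names.
aws lambda put-provisioned-concurrency-config \
  --function-name my-next-server \
  --qualifier live \
  --provisioned-concurrent-executions 2
```

Unlike the health-check workarounds above, this eliminates cold starts up to the provisioned count, but you pay for the warm instances whether or not they serve traffic.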

rapgodnpm commented 1 year ago

> @rapgodnpm what memory would you suggest?
>
> @mr-rpl We ended up using open-next to package the Next.js build output and deploying it on Lambda ourselves. Cold starts remain (so it's probably not an Amplify issue specifically), but with more control over the infra we can do things like enable provisioned concurrency. It still works out substantially less expensive than Fargate, which is where we're coming from.

I don't know. I would need to test it and see which produces a smaller cold start, but I can't really do that since Amplify has no setting for this.

jbreemhaar commented 1 year ago

Experiencing this for three Amplify Hosting apps I migrated from WEB_DYNAMIC to WEB_COMPUTE two days ago. TTFB has been spiking all over the place since then. The apps still on WEB_DYNAMIC are having no problems with server response times.


soplan commented 1 year ago

Same issue for us. Will switch to Vercel because this is unacceptable.

jbreemhaar commented 1 year ago

@hloriii Any update on this? This has now been an issue for a couple of months.

AdminHipoo commented 1 year ago

Same thing here. On our dev environment we don't have a problem, we can wait, but our users on prod can't. Waiting for this issue to be resolved, please update us @hloriii

danshev commented 1 year ago

Can we get confirmation that this is being worked on, or a status update? We will need to investigate other hosting providers if there is no fix around the corner.

choskas commented 1 year ago

A little update on this issue: I tested enabling a Route 53 health check (as posted by @talaikis) and it works! I will wait for an update on this issue before removing it, but in the meantime I recommend it.

Nguyen-Huu-Huan commented 1 year ago

I enabled the health check but still got the cold start.

IvanCaceres commented 1 year ago

I can confirm that cold starts and very slow TTFB still exist even with a health check running every minute. It takes an enormously long time to load an SSR route. I would advise against expecting a high-quality production experience for your Next.js app on Amplify Hosting until Amplify/AWS can give us a solution to SSR cold starts. This issue has persisted for months since the release of Amplify Hosting Compute. I will also add that the Amplify Hosting compute logs are lacking and opaque: I have experienced scenarios where they don't surface errors that occur during the server-side rendering phase of a Next.js /pages route. These logs don't surface much of anything at all.

danshev commented 1 year ago

My experience has been:

Note: my Amplify "Domain management" config is set to redirect https://myapp.co ==> https://www.myapp.co ... I suspect the health check requesting the non-www URL wasn't actually spinning up the Lambdas that serve the app (or something).

Jay2113 commented 1 year ago

Hey everyone 👋 , thank you for your continued patience.

We are actively investigating and are working on narrowing down the root cause of the elevated latencies (high TTFB) with Compute apps. Please rest assured that this is our highest priority and we will keep you posted with any updates.

Apologies for the inconvenience caused due to this behavior with Compute apps.

afern247 commented 1 year ago

Waiting for this as well... my web app feels like it's loading the whole internet at startup.

colin-chadwick commented 1 year ago

Need an update on this as well. Otherwise, we’ll have to switch to Vercel again. Such a high TTFB is unacceptable.