ApiStack deploy gets stuck on unhealthy ECS ApiService

gabrii commented 1 year ago

Describe the bug

The ApiStack deploy gets stuck while creating the ECS service: the service never becomes healthy, so the cluster never finishes creation.

This is my 4th day trying to get the SaaS boilerplate running on AWS. I managed to work around all the previous problems (and my own mistakes) and am now stuck on this one. Today I did a completely clean setup, apart from some minor fixes I had to carry over from previous runs, such as an -entrypoint typo that should have been --entrypoint and a few other typos that were causing errors on Windows.

Steps to reproduce

Follow the steps on Getting Started > Run new project (locally everything works fine out of the box)
Follow steps on Deploy project to AWS, until pnpm saas deploy, where it gets stuck.

System Info

System:
    OS: Windows 11 10.0.22621
    CPU: (20) x64 12th Gen Intel(R) Core(TM) i7-12700H
    Memory: 34.46 GB / 63.69 GB
  Binaries:
    Node: 18.13.0 - C:\Program Files\nodejs\node.EXE
    Yarn: 1.22.19 - C:\Program Files\nodejs\yarn.CMD
    npm: 8.19.3 - C:\Program Files\nodejs\npm.CMD
    pnpm: 8.9.2 - ~\AppData\Local\pnpm\pnpm.EXE
  Browsers:
    Edge: Chromium (118.0.2088.61)
    Internet Explorer: 11.0.22621.1
  npmPackages:
    @apollo/client: ^3.8.4 => 3.8.4
    @apollo/rover: ^0.19.0 => 0.19.0
    @aws-sdk/client-cloudformation: ^3.414.0 => 3.414.0
    @aws-sdk/client-codebuild: ^3.414.0 => 3.414.0
    @aws-sdk/client-ecr: ^3.414.0 => 3.414.0
    @aws-sdk/client-ecs: ^3.414.0 => 3.414.0
    @aws-sdk/client-iam: ^3.414.0 => 3.414.0
    @aws-sdk/client-s3: ^3.417.0 => 3.417.0
    @aws-sdk/client-ses: ^3.414.0 => 3.414.0
    @aws-sdk/client-sfn: ^3.414.0 => 3.414.0
    @aws-sdk/client-sts: ^3.414.0 => 3.414.0
    @babel/preset-react: ^7.22.15 => 7.22.15
    @graphql-codegen/cli: ^5.0.0 => 5.0.0
    @graphql-typed-document-node/core: ^3.2.0 => 3.2.0
    @iconify-icons/ion: ^1.2.10 => 1.2.10
    @iconify/react: ^4.1.1 => 4.1.1
    @nx/devkit: 16.8.1 => 16.8.1
    @nx/eslint-plugin: 16.8.1 => 16.8.1
    @nx/jest: 16.8.1 => 16.8.1
    @nx/js: 16.8.1 => 16.8.1
    @nx/linter: 16.8.1 => 16.8.1
    @nx/node: 16.8.1 => 16.8.1
    @nx/plugin: 16.8.1 => 16.8.1
    @nx/react: 16.8.1 => 16.8.1
    @nx/web: 16.8.1 => 16.8.1
    @nx/webpack: 16.8.1 => 16.8.1
    @sb/cli: * => 2.3.0
    @sb/core: * => 2.3.0
    @sentry/react: ^7.70.0 => 7.70.0
    @storybook/addon-actions: ^7.4.3 => 7.4.3
    @storybook/react: ^7.4.3 => 7.4.3
    @supercharge/strings: ^2.0.0 => 2.0.0
    @svgr/webpack: ^8.1.0 => 8.1.0
    @tailwindcss/typography: ^0.5.10 => 0.5.10
    @testing-library/dom: ^9.3.3 => 9.3.3
    @testing-library/jest-dom: ^6.1.3 => 6.1.3
    @testing-library/react: 14.0.0 => 14.0.0
    @testing-library/react-hooks: ^8.0.1 => 8.0.1
    @testing-library/user-event: ^14.5.1 => 14.5.1
    @trivago/prettier-plugin-sort-imports: ^4.2.0 => 4.2.0
    @types/gtag.js: ^0.0.14 => 0.0.14
    @types/jest: ^29.5.5 => 29.5.5
    @types/node: ^18.15.3 => 18.17.17
    @types/ramda: ^0.28.25 => 0.28.25
    @types/react: 18.2.22 => 18.2.22
    @types/react-dom: 18.2.7 => 18.2.7
    @types/react-router: ^5.1.20 => 5.1.20
    @types/react-router-dom: 5.3.3 => 5.3.3
    @types/react-test-renderer: ^18.0.2 => 18.0.2
    @typescript-eslint/eslint-plugin: 5.62.0 => 5.62.0
    @typescript-eslint/parser: 5.62.0 => 5.62.0
    @typescript-eslint/scope-manager: 5.62.0 => 5.62.0
    @vitejs/plugin-react: ^4.0.4 => 4.0.4
    aws-cdk: ^2.96.2 => 2.96.2
    aws-cdk-lib: ^2.96.2 => 2.96.2
    babel-jest: 29.7.0 => 29.7.0
    constructs: ^10.2.70 => 10.2.70
    eslint: ^8.49.0 => 8.49.0
    eslint-config-prettier: ^9.0.0 => 9.0.0
    eslint-import-resolver-typescript: ^3.6.0 => 3.6.0
    eslint-plugin-formatjs: ^4.10.5 => 4.10.5
    eslint-plugin-import: 2.28.1 => 2.28.1
    eslint-plugin-jsx-a11y: ^6.7.1 => 6.7.1
    eslint-plugin-react: 7.33.2 => 7.33.2
    eslint-plugin-react-hooks: 4.6.0 => 4.6.0
    eslint-plugin-testing-library: ^6.0.1 => 6.0.1
    graphql: ^16.8.1 => 16.8.1
    husky: ^8.0.3 => 8.0.3
    jest: 29.7.0 => 29.7.0
    jest-environment-jsdom: 29.7.0 => 29.7.0
    jest-matcher-utils: ^29.7.0 => 29.7.0
    jest-watch-typeahead: ^2.2.2 => 2.2.2
    lint-staged: ^14.0.1 => 14.0.1
    nx: 16.8.1 => 16.8.1
    nx-cloud: 16.4.0 => 16.4.0
    plop: ^4.0.0 => 4.0.0
    prettier: ^3.0.3 => 3.0.3
    prettier-plugin-tailwindcss: ^0.5.4 => 0.5.4
    ramda: ^0.28.0 => 0.28.0
    react: 18.2.0 => 18.2.0
    react-dom: 18.2.0 => 18.2.0
    react-helmet-async: ^1.3.0 => 1.3.0
    react-hook-form: ^7.46.1 => 7.46.1
    react-intl: ^6.4.7 => 6.4.7
    react-loading-skeleton: ^3.3.1 => 3.3.1
    react-markdown: ^8.0.7 => 8.0.7
    react-router: ^6.16.0 => 6.16.0
    react-router-dom: 6.16.0 => 6.16.0
    regenerator-runtime: ^0.14.0 => 0.14.0
    styled-components: 6.0.8 => 6.0.8
    tailwindcss: ^3.3.3 => 3.3.3
    tailwindcss-animate: ^1.0.7 => 1.0.7
    ts-jest: 29.1.1 => 29.1.1
    ts-node: 10.9.1 => 10.9.1
    tsconfig-paths: ^4.2.0 => 4.2.0
    tslib: ^2.6.2 => 2.6.2
    typescript: 5.2.2 => 5.2.2
    vite: ^4.4.9 => 4.4.9
    vite-plugin-eslint: ^1.8.1 => 1.8.1
    vite-plugin-svgr: ^3.3.0 => 3.3.0
    vite-tsconfig-paths: ^4.2.1 => 4.2.1

Logs

pnpm saas deploy, where it gets stuck (old to new)

Here are the logs from ECS (new to old):

```shell
November 02, 2023 at 16:31 (UTC+1:00) Service Unavailable: /lbcheck a0bb0d81c5e64bc9bf291ef10fb9aaee backend
November 02, 2023 at 16:31 (UTC+1:00) Service Unavailable: /lbcheck a0bb0d81c5e64bc9bf291ef10fb9aaee backend
November 02, 2023 at 16:31 (UTC+1:00) Service Unavailable: /lbcheck a0bb0d81c5e64bc9bf291ef10fb9aaee backend
November 02, 2023 at 16:31 (UTC+1:00) Service Unavailable: /lbcheck a0bb0d81c5e64bc9bf291ef10fb9aaee backend
November 02, 2023 at 16:31 (UTC+1:00) Encountered an issue while polling targets. ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:31 (UTC+1:00) Traceback (most recent call last): ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:31 (UTC+1:00) File "/pkgs/__pypackages__/3.11/lib/urllib3/connection.py", line 174, in _new_conn ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:31 (UTC+1:00) conn = connection.create_connection( ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:31 (UTC+1:00) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.... full stack trace ...
November 02, 2023 at 16:31 (UTC+1:00) raise EndpointConnectionError(endpoint_url=request.url, error=e) ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:31 (UTC+1:00) botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:2000/SamplingTargets" ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:31 (UTC+1:00) Service Unavailable: /lbcheck a0bb0d81c5e64bc9bf291ef10fb9aaee backend
November 02, 2023 at 16:31 (UTC+1:00) Service Unavailable: /lbcheck
... many of these ....
November 02, 2023 at 16:28 (UTC+1:00) Service Unavailable: /lbcheck ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:28 (UTC+1:00) Service Unavailable: /lbcheck ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:28 (UTC+1:00) No effective centralized sampling rule match. Fallback to local rules. ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:28 (UTC+1:00) No effective centralized sampling rule match. Fallback to local rules. ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:27 (UTC+1:00) [2023-11-02 15:27:44 +0000] [40] [INFO] Booting worker with pid: 40 ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:27 (UTC+1:00) [2023-11-02 15:27:44 +0000] [39] [INFO] Booting worker with pid: 39 ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:27 (UTC+1:00) [2023-11-02 15:27:44 +0000] [36] [INFO] Starting gunicorn 21.2.0 ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:27 (UTC+1:00) [2023-11-02 15:27:44 +0000] [36] [INFO] Listening at: http://0.0.0.0:80 (36) ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:27 (UTC+1:00) [2023-11-02 15:27:44 +0000] [36] [INFO] Using worker: gevent ab577365f6b24b6aa71d07d72d551ae0 backend
November 02, 2023 at 16:27 (UTC+1:00) Starting app server...
```
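Each record in the log above ends with a 32-character task ID and the container name, so it can help to count the failed /lbcheck responses per task and see whether more than one task is affected. A minimal sketch of that grouping follows. The record layout (message, task ID, container) is inferred from the lines pasted here, and the helper name `count_by_task` is made up for illustration.

```python
import re
from collections import Counter

# Assumed record layout, inferred from the pasted log lines:
# "<timestamp and message> <32-hex task ID> <container name>"
RECORD = re.compile(
    r"^(?P<message>.*?)\s+(?P<task>[0-9a-f]{32})\s+(?P<container>\S+)$"
)


def count_by_task(lines, needle="Service Unavailable: /lbcheck"):
    """Count records whose message contains `needle`, grouped by task ID."""
    counts = Counter()
    for line in lines:
        match = RECORD.match(line.strip())
        if match and needle in match.group("message"):
            counts[match.group("task")] += 1
    return counts


# Three records copied from the log above.
sample = [
    "November 02, 2023 at 16:31 (UTC+1:00) Service Unavailable: /lbcheck a0bb0d81c5e64bc9bf291ef10fb9aaee backend",
    "November 02, 2023 at 16:31 (UTC+1:00) Encountered an issue while polling targets. ab577365f6b24b6aa71d07d72d551ae0 backend",
    "November 02, 2023 at 16:28 (UTC+1:00) Service Unavailable: /lbcheck ab577365f6b24b6aa71d07d72d551ae0 backend",
]

for task, count in sorted(count_by_task(sample).items()):
    print(task, count)
```

On the three sample records this reports one failed health check for each of the two task IDs, which suggests the 503 responses are not limited to a single task.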

What's interesting is that the worker is running and responding to requests (from the ECS service task logs)...

```shell
November 02, 2023 at 16:39 (UTC+1:00) Service Unavailable: /lbcheck a9ccfe57d4f4416eb096ab0625a009da backend
November 02, 2023 at 16:39 (UTC+1:00) 10.0.1.175 - - [02/Nov/2023:15:39:39 +0000] "POST /api/graphql/ HTTP/1.1" 200 298 "-" "Amazon CloudFront" a9ccfe57d4f4416eb096ab0625a009da backend
November 02, 2023 at 16:39 (UTC+1:00) [2023-11-02 15:39:38 +0000] [40] [INFO] Booting worker with pid: 40 a3d652eae7464c9eb4954b7a35dba52f backend
November 02, 2023 at 16:39 (UTC+1:00) [2023-11-02 15:39:38 +0000] [39] [INFO] Booting worker with pid: 39 a3d652eae7464c9eb4954b7a35dba52f backend
November 02, 2023 at 16:39 (UTC+1:00) [2023-11-02 15:39:38 +0000] [38] [INFO] Starting gunicorn 21.2.0 a3d652eae7464c9eb4954b7a35dba52f backend
November 02, 2023 at 16:39 (UTC+1:00) [2023-11-02 15:39:38 +0000] [38] [INFO] Listening at: http://0.0.0.0:80 (38) a3d652eae7464c9eb4954b7a35dba52f backend
November 02, 2023 at 16:39 (UTC+1:00) [2023-11-02 15:39:38 +0000] [38] [INFO] Using worker: gevent
```

Even though it returns 200, login fails because the database migrations apparently have not been run (error message from the webapp login):

```shell
relation "users_user" does not exist
LINE 1: ...r"."otp_base32", "users_user"."otp_auth_url" FROM "users_use...
                                                             ^
```

I'm guessing the migrations run in a later step of `pnpm saas deploy`, but please let me know if that's not the case.

The only other issue I had in this clean run of setting up a new SaaS is the following error from the workers, which suspiciously are being deployed to the local stage (???). Logs are old to new:

```bash
workers: > workers@2.3.0 sls /app/packages/workers
workers: > sls "--version"
workers: Framework Core: 3.35.2 (local)
workers: Plugin: 7.0.3
workers: SDK: 4.4.0
workers: > workers@2.3.0 sls /app/packages/workers
workers: > sls "deploy" "--stage" "local"
workers: Warning: Invalid configuration encountered
workers: at 'functions.ExportUsers.vpc': must have required property 'securityGroupIds'
workers: at 'functions.ExportUsers.vpc': must have required property 'subnetIds'
workers: at 'functions.SynchronizeContentfulContent.vpc': must have required property 'securityGroupIds'
workers: at 'functions.SynchronizeContentfulContent.vpc': must have required property 'subnetIds'
workers: at 'functions.WebSocketsConnectHandler.environment': must be object
workers: at 'functions.WebSocketsConnectHandler.vpc': must have required property 'securityGroupIds'
workers: at 'functions.WebSocketsConnectHandler.vpc': must have required property 'subnetIds'
workers: at 'functions.WebSocketsMessageHandler.environment': must be object
workers: at 'functions.WebSocketsMessageHandler.vpc': must have required property 'securityGroupIds'
workers: at 'functions.WebSocketsMessageHandler.vpc': must have required property 'subnetIds'
workers: at 'functions.WebSocketsDisconnectHandler.environment': must be object
workers: at 'functions.WebSocketsDisconnectHandler.vpc': must have required property 'securityGroupIds'
workers: at 'functions.WebSocketsDisconnectHandler.vpc': must have required property 'subnetIds'
workers: Learn more about configuration validation here: http://slss.io/configuration-validation
workers: Deploying coherent-workers to stage local (us-east-1)
workers: Using serverless-localstack
workers: serverless-localstack: Reconfigured endpoints
workers: Error:
workers: Inaccessible host: `localstack' at port `undefined'. This service may not be available in the `us-east-1' region.
workers: × Stack coherent-workers failed to deploy (211s)
workers: Environment: linux, node 18.18.2, framework 3.35.2 (local), plugin 7.0.3, SDK 4.4.0
workers: Credentials: Local, environment variables
workers: Docs: docs.serverless.com
workers: Support: forum.serverless.com
workers: Bugs: github.com/serverless/serverless/issues
workers: 3 deprecations found: run 'serverless doctor' for more details
workers: ELIFECYCLE Command failed with exit code 1.
workers: Warning: run-commands command "docker-compose run --rm --entrypoint /bin/bash workers /app/packages/workers/scripts/runtime/run_deploy.sh" exited with non-zero status code
```

Validations

[X] Follow our Code of Conduct.
[X] Read the Contributing Guidelines.
[X] Read the docs.
[X] Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
[X] Check that this is a concrete bug. For Q&A open a GitHub Discussion or join our Discord Chat Server.

gabrii commented 1 year ago

To add more details: the task keeps restarting, each time stopping with the error "Task failed ELB health checks".

Eventually the CLI token expires and the CLI crashes, but the stack deployment continues for a long time until it is automatically rolled back.
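To see roughly what the load balancer sees, the health-check path can be requested directly and the status code inspected. The sketch below uses only the Python standard library. The `/lbcheck` path comes from the logs earlier in this issue; the base URL in the usage comment is a placeholder, and treating any 2xx status as healthy is an assumption, since the success codes the target group actually accepts depend on its configuration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for `url`, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx, but the status code is still available.
        return err.code


def is_healthy(status):
    """Assumed success criterion: any 2xx status counts as healthy."""
    return 200 <= status < 300


# Usage (placeholder host, replace with a reachable address of the service):
#   status = probe("http://<service-host>/lbcheck")
#   print(status, "healthy" if is_healthy(status) else "unhealthy")
```

A 503 here would match the "Service Unavailable: /lbcheck" lines in the task logs.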

mkleszcz commented 12 months ago

There are two issues:

The database is probably not initialized properly (missing migrations), which is why the error says the relation does not exist.
The local sls environment is read by docker-compose from the .env file in the root of the repository.

Thank you for finding this out! We need to fix both of them.

For now, I suggest using CI to deploy the environment.
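For anyone who wants to check the second point before deploying from their own machine, here is a rough sketch that scans .env-style text for values that look local-only. The variable names and values in `sample` are hypothetical and not taken from the repository, and the parser only handles plain KEY=VALUE lines.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def local_looking(values, marker="local"):
    """Return the sorted keys whose values contain `marker` (case-insensitive)."""
    return sorted(key for key, value in values.items() if marker in value.lower())


# Hypothetical .env contents, for illustration only.
sample = """
# root .env
ENV_STAGE=local
AWS_ENDPOINT_URL=http://localstack:4566
PROJECT_NAME=coherent
"""

print(local_looking(parse_env(sample)))
```

To check a real file, pass the contents of the root .env file to `parse_env` instead of `sample`. Any key it reports is a candidate for the setting that made sls deploy to stage local.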

apptension / saas-boilerplate