coherenceplatform / cnc

CNC is the first framework for precision platform engineering
https://cncframework.com
GNU General Public License v3.0

`cnc deploy perform` reports failure when deploy succeeds #160

Open aheitzmann opened 1 month ago

aheitzmann commented 1 month ago

Description

`cnc deploy perform` reports "Deploy failed... check events and stopped tasks in the ea8a1c02af-backe-main-demo-app ecs service for more information." However, in the AWS console the deploy shows as successful.

Repro Details

Machine: Apple M1 Pro

OS: macOS Sonoma 14.6.1 (23G93)

cnc.yml

services:
  app:
    ports:
      - "8080:8080"
    build:
      context: .
      dockerfile: Dockerfile
    deploy:
      resources:
        limits:
          cpus: 0.5
          memory: 2g
    x-cnc:
      type: backend
      system:
        health_check: /health
        platform_settings:
          min_scale: 1
          max_scale: 1
  db:
    x-cnc:
      type: database
      version: 16
    image: postgres

environments.yml

name: backend
provider: aws
flavor: ecs
version: 1

collections:
  - name: main
    region: us-east-2
    base_domain: app.groundedft.com
    account_id: "009160027816"
    environments:
      - name: demo
        environment_variables:
          - name: LOG_LEVEL
            value: INFO
          - name: FRAGMENT_API_URL
            value: https://api.us-east-1.fragment.dev/graphql
          - name: FRAGMENT_AUTH_URL
            value: https://auth.us-east-1.fragment.dev/oauth2/token
          - name: FRAGMENT_AUTH_SCOPE
            value: https://api.us-east-1.fragment.dev/*
          - name: FRAGMENT_LEDGER_ID
            value: 4cd9af29-23c2-4bff-94d7-db3ea0d63462
          - name: FRAGMENT_API_KEY
            value: 3cqj3rmep2btb6g51bett11t63
          - name: DB_MIGRATIONS_STARTUP_CHECK
            # TODO: change to "verify"
            value: skip
          # Aliases
          # Note: DATABASE_URL is automatically provided by cnc but uses "postgres"
          # instead of "postgresql" as the scheme. Rather than correct this in code for
          # sqlalchemy compat, we'll alias the other DB-related env vars to match our expectations.
          - name: DATABASE_HOST
            alias: DB_HOST
          - name: DATABASE_PORT
            alias: DB_PORT
          - name: DATABASE_NAME
            alias: DB_NAME
          - name: DATABASE_USERNAME
            alias: DB_USER
          - name: DATABASE_PASSWORD
            alias: DB_PASSWORD
          # Manually added secrets:
          - name: FRAGMENT_API_SECRET
            secret_id: "main/backend/demo/3p_access:main-backend-demo-fragment_api_secret::"
          - name: ADMIN_API_SECRET
            secret_id: "main/backend/demo/3p_access:main-backend-demo-admin_api_secret::"

Command: `cnc deploy perform demo --service-tag app=v1`

Output:

Sending {'name': 'deploy.perform'} to RS for 79CC224E-1B37-536A-8659-88B23899344A
DEBUG (cnc.models.application:125) no default provided - returning first collection as default for <Application (name: backend | provider: aws (ecs/1))>
DEBUG (cnc.models.deployer:69) Performing deploy for <DeployStageManager:<Environment (name: demo | collection: main | ['app', 'db'])> @ {'app': 'v1'}> @ /tmp/.cnc_tmp_backend/app_backend/aws_ecs/1/d572e3f7765cc827262d96eafd8e756fff32090a02573d02df757b977be73e55/deploy
DEBUG (cnc.models.environment_collection:332) Going to get outputs for <EnvironmentCollection (main | 009160027816) [1 envs]>: {}
DEBUG (cnc.models.provisioner:58) Cleaning up & setting up at start for <ProvisionStageManager: <EnvironmentCollection (main | 009160027816) [1 envs]> | output_only: True>
DEBUG (cnc.models.provisioner:67) Writing main.tf.j2 for <ProvisionStageManager: <EnvironmentCollection (main | 009160027816) [1 envs]> | output_only: True>
DEBUG (cnc.models.provisioner:147) Installing TF modules/providers for <ProvisionStageManager: <EnvironmentCollection (main | 009160027816) [1 envs]> | output_only: True>
INFO (cnc.models.provisioner:398) TF RUN (<ProvisionStageManager: <EnvironmentCollection (main | 009160027816) [1 envs]> | output_only: True>): [['terraform', 'init']] 0 in 6 seconds
INFO (cnc.models.provisioner:398) TF RUN (<ProvisionStageManager: <EnvironmentCollection (main | 009160027816) [1 envs]> | output_only: True>): [['terraform', 'output', '-json']] 0 in 0 seconds
DEBUG (cnc.models.deployer:75) Done rendering deploy for <DeployStageManager:<Environment (name: demo | collection: main | ['app', 'db'])> @ {'app': 'v1'}> (svcs: dict_keys(['app'])), going to execute...
Updating task definition for ea8a1c02af-backe-main-demo-app...
Task definition for ea8a1c02af-backe-main-demo-app updated successfully.
INFO (cnc.models.deployer:99) All done with perform for <DeployStageManager:<Environment (name: demo | collection: main | ['app', 'db'])> @ {'app': 'v1'}>

Sending deploy status webhook...

===== app deploy status =====
{
  "token": "None",
  "status": "working",
  "stage": "deploy",
  "service": "app",
  "revision_id": "4"
}
===== app deploy status =====
/usr/bin/curl

No webhook URL provided. Skipping deploy status webhook...

Deploying ea8a1c02af-backe-main-demo-app to Amazon ECS...

Deploy failed... check events and stopped tasks in the ea8a1c02af-backe-main-demo-app ecs service for more information.

Sending deploy status webhook...

===== app deploy status =====
{
  "token": "None",
  "status": "failed",
  "stage": "deploy",
  "service": "app",
  "revision_id": "4"
}
===== app deploy status =====
/usr/bin/curl

No webhook URL provided. Skipping deploy status webhook...
DEBUG (cnc.commands.deploy:72) All set deploying for /tmp/.cnc_tmp_backend/app_backend/aws_ecs/1/d572e3f7765cc827262d96eafd8e756fff32090a02573d02df757b977be73e55/deploy/_cnc_output in 10 seconds

Service events from the AWS Console: [screenshot: CleanShot 2024-09-03 at 16 43 24@2x]

zach-withcoherence commented 1 month ago

@aheitzmann thanks so much for the details here

Is it possible that the deploy did not actually work? Did you verify that the service was actually updated? When a deploy fails (usually because the container crashes or health checks fail), ECS rolls the service back to the previous version, which still produces the steady-state message in the events. My guess is that the deploy was unsuccessful and the cnc CLI is reporting that accurately.

aheitzmann commented 1 month ago

@zach-withcoherence Yes, I verified that the deployment was successful. The event logs don't show any rollback, and the task details of the currently running healthy task show that it was launched at the expected time, and has an updated revision number.
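
The manual checks described here (an updated revision on the running task, no rollback) can also be scripted against the ECS API. The following is a minimal sketch, not part of cnc: it assumes the service uses the rolling-update deployment type, which is what populates `rolloutState`, and it reads fields from boto3's `describe_services` response. The names `primary_deployment`, `deploy_outcome`, `fetch_service`, and the `cluster` argument are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Optional


def primary_deployment(service: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the PRIMARY deployment from one describe_services() service entry."""
    for deployment in service.get("deployments", []):
        if deployment.get("status") == "PRIMARY":
            return deployment
    return None


def deploy_outcome(service: Dict[str, Any], expected_revision: str) -> str:
    """Classify the service as 'succeeded', 'in_progress', 'not_primary', or 'failed'."""
    primary = primary_deployment(service)
    if primary is None:
        return "failed"
    # Task definition ARNs end in "<family>:<revision>".
    revision = primary["taskDefinition"].rsplit(":", 1)[-1]
    if revision != expected_revision:
        # A different revision is primary, e.g. after a rollback.
        return "not_primary"
    state = primary.get("rolloutState")
    if state == "FAILED":
        return "failed"
    if state == "COMPLETED" and primary.get("runningCount") == primary.get("desiredCount"):
        return "succeeded"
    return "in_progress"


def fetch_service(cluster: str, service_name: str) -> Dict[str, Any]:
    """Fetch the live service description (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    ecs = boto3.client("ecs")
    response = ecs.describe_services(cluster=cluster, services=[service_name])
    return response["services"][0]
```

For the run in this issue, the check would be `deploy_outcome(fetch_service(cluster, "ea8a1c02af-backe-main-demo-app"), "4")`, with `cluster` set to whichever cluster hosts the service (the cluster name does not appear in this thread).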

zach-withcoherence commented 1 month ago

Interesting, thanks so much for confirming that.

Will investigate. One possibility is that the build succeeded after the allowed timeout. That can be configured as per the docs https://docs.withcoherence.com/configuration/cnc-yml/:

x-cnc:
  type: backend
  # Need to define the URL path to route to the service
  # Unless you customize IaC to subdomain routing
  url_path: /api
  # For DB migrations, what command to run?
  # Runs in the container
  migrate: ["prisma", "migrate"]
  # For DB seeding, what command to run?
  # MUST be idempotent!!
  # Runs in the container
  seed: ["prisma", "seed"]

  # CI pipeline timeout limits before failing
  timeouts:
    # default is 20 for both
    deploy: 10
    build: 10

E.g., setting a value to 20 would allow 20 minutes to pass before failing.

aheitzmann commented 1 month ago

To be clear, I'm not using a CI pipeline. I run the build and deploy commands with CNC from my local machine. The build and publish to ECR succeeds. The deploy command outputs the failure message mentioned within about 15 seconds. It does take several minutes for the new deployment to reach a stable state in ECS.
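
The gap described here (a failure reported after about 15 seconds, while ECS needs several minutes to stabilize) is the pattern a too-short wait would produce. The sketch below is an illustration only and is not cnc's actual implementation. It shows a polling loop that distinguishes "timed out" from "failed". `wait_for_deploy` and `get_outcome` are hypothetical names. `get_outcome` is assumed to be any callable that returns `"succeeded"`, `"failed"`, or `"in_progress"`.

```python
import time
from typing import Callable


def wait_for_deploy(
    get_outcome: Callable[[], str],
    timeout_seconds: float,
    poll_seconds: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_outcome() until it is terminal or the timeout elapses.

    Returns 'succeeded', 'failed', or 'timed_out'.
    """
    deadline = clock() + timeout_seconds
    while True:
        outcome = get_outcome()
        if outcome in ("succeeded", "failed"):
            return outcome
        # Stop if the next poll would land after the deadline.
        if clock() + poll_seconds > deadline:
            return "timed_out"
        sleep(poll_seconds)
```

With a deploy that stabilizes after a few minutes, a 15-second budget yields `"timed_out"`, while a budget of 20 minutes yields `"succeeded"`. Reporting the first case as a failure would match the symptom in this issue. boto3 also ships an ECS `services_stable` waiter that could play the same role.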
