aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release and operate production ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0

Pipeline deployment of service with EFS gets stuck #5083

Closed vicpara closed 1 year ago

vicpara commented 1 year ago

In a Copilot app I have two services: one stateless backend in Node.js and one database service that uses EFS for persistent storage. I successfully set up two envs, rc and prod, using copilot. I also used to be able to deploy the services to both envs using the copilot svc deploy... command. Everything worked fine.

I run a monorepo where fe+db+backend+copilot is all in one git repo. So any merges to 'main' trigger a redeployment. That's fine.

Enter copilot pipeline. Vanilla pipeline auto created by the CLI v1.28.0 on MacOS. Runs successfully first time. Second time gets stuck at deploying the db service even though nothing changed at DB service definition or code.

How do I get this 'DB' service to deploy successfully? (Most of the time there are no changes to the DB and yet it fails.)

Between deployment, when the only change is related to code deployed in backend, the DB TaskDefinition gets recreated and stuck in this 'Update in progress' stage for a long time then it fails:

Screenshot 2023-07-14 at 13 03 49

Interestingly some tasks from the new revision (that failed to update) are being recreated and keep failing. The prev good running task doesn't shut down. I don't even know where the problem is or how it should work given AWS primitives:

The db service manifest:

# The manifest for the "db" service.
name: db
type: Backend Service

# Configuration for your containers and service.
image:
  port: 8529
  build:
    dockerfile: ./images/arango.db.Dockerfile
    context: .

  healthcheck:
    command: ["CMD-SHELL", "curl -f http://localhost:8529/_db/_system/_admin/aardvark/favicon.ico || exit 1"]
    interval: 30s
    retries: 3
    timeout: 5s
    start_period: 30s

entrypoint: ["/bin/bash", "-c", "/etc/arangodb3/start.fargate.sh"]

network:
  connect: true

cpu: 1024 # Number of CPU units for the task.
memory: 2048 # Amount of memory in MiB used by the task.
platform: linux/x86_64 # See https://aws.github.io/copilot-cli/docs/manifest/backend-service/#platform
count: 1 # Number of tasks that should be running in your service.
exec: true # Enable running commands in your container.

variables: # Pass environment variables as key value pairs.
  LOG_LEVEL: debug

environments:
  rc:
    env_file: ./conf/arangodb.dev.env
    storage:
      volumes:
        dbData:
          path: /db-data
          read_only: false
          efs: true
          # efs:
          #   id: fs-0264ce94945a8695f
          #   auth:
          #     access_point_id: fsap-0cac715eaf397b0ed

dannyrandall commented 1 year ago

Hey @vicpara!

Is there a way to only update the docker images and restart the services? `copilot svc package --diff --name be --env rc` never returns an empty diff

Could you share an example of the diff that's getting created? And if it's the image tag that's changing, could you share your Dockerfile? My guess is that your Dockerfile is COPYing more data than it actually needs, and is thus causing a new image tag to be created every time you run package/deploy.

Interestingly some tasks from the new revision (that failed to update) are being recreated and keep failing

Interesting! Is there any indication in the new task's logs or the current task's logs as to why it's failing to start up? You should be able to find those logs in the ECS console during the deployment or the CloudWatch console after a deployment has failed.

vicpara commented 1 year ago

Thanks for your quick reply. Yes, each build has a new image tag. It looks like regardless of what happens in the code, the docker images are built again independently of the previous build. On the local machine most of the docker builds are skipped. I included the buildspec.yaml from the pipeline, as it was generated, at the end of this message.

I triggered a new build by changing a github action. In principle both services are identical.

The Task logs during deployment are not showing any error. The task looks healthy and then gets shut down with a failure. The previous revision's Task stays online regardless of how many attempts are made. For some reason it doesn't shut down the way it does when using copilot svc deploy --name db --env rc

The ECR

Screenshot 2023-07-15 at 00 10 13

CloudFormation

Screenshot 2023-07-15 at 00 06 23

The diff below was produced after updating the repo with a no-op change in a GitHub action: copilot svc package --name db --env rc --diff

No changes.

# Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
AWSTemplateFormatVersion: 2010-09-09
Description: CloudFormation template that represents a backend service on Amazon ECS.
Metadata:
  Manifest: |
    # The manifest for the "db" service.
    # Read the full specification for the "Backend Service" type at:
    #  https://aws.github.io/copilot-cli/docs/manifest/backend-service/

    # Your service name will be used in naming your resources like log groups, ECS services, etc.
    name: db
    type: Backend Service
    # EnvFileARN: arn:aws:s3:::broadn-dev-works/config/atfirst.arangodb.dev.env
    # Your service does not allow any traffic.

    # Configuration for your containers and service.
    image:
      # Docker build arguments. For additional overrides: https://aws.github.io/copilot-cli/docs/manifest/backend-service/#image-build
      # build: 655551053286.dkr.ecr.eu-west-1.amazonaws.com/atfirst.arangodb
      # build: images/arango.db.Dockerfile
      # platform: linux/x86_64
      port: 8529
      build:
        dockerfile: ./images/arango.db.Dockerfile
        context: .

      healthcheck:
        command: ["CMD-SHELL", "curl -f http://localhost:8529/_db/_system/_admin/aardvark/favicon.ico || exit 1"]
        interval: 30s
        retries: 3
        timeout: 5s
        start_period: 30s

    entrypoint: ["/bin/bash", "-c", "/etc/arangodb3/start.fargate.sh"]

    network:
      connect: true

    cpu: 1024 # Number of CPU units for the task.
    memory: 2048 # Amount of memory in MiB used by the task.
    platform: linux/x86_64 # See https://aws.github.io/copilot-cli/docs/manifest/backend-service/#platform
    count: 1 # Number of tasks that should be running in your service.
    exec: true # Enable running commands in your container.

    variables: # Pass environment variables as key value pairs.
      LOG_LEVEL: debug

    #secrets:                      # Pass secrets from AWS Systems Manager (SSM) Parameter Store.
    #  GITHUB_TOKEN: GITHUB_TOKEN  # The key is the name of the environment variable, the value is the name of the SSM parameter.

    # You can override any of the values defined above by environment.
    environments:
      rc:
        env_file: ./conf/arangodb.dev.env
        storage:
          volumes:
            dbData:
              path: /db-data
              read_only: false
              efs: true

      prod:
        env_file: ./conf/arangodb.prod.env
        variables: # Pass environment variables as key value pairs.
          LOG_LEVEL: info

Parameters:
  AppName:
    Type: String
  EnvName:
    Type: String
  WorkloadName:
    Type: String
  ContainerImage:
    Type: String
  ContainerPort:
    Type: Number
  TaskCPU:
    Type: String
  TaskMemory:
    Type: String
  TaskCount:
    Type: Number
  AddonsTemplateURL:
    Description: 'URL of the addons nested stack template within the S3 bucket.'
    Type: String
    Default: ""
  EnvFileARN:
    Description: 'URL of the environment file.'
    Type: String
    Default: ""
  LogRetention:
    Type: Number
    Default: 30
  TargetContainer:
    Type: String
  TargetPort:
    Type: Number
Conditions:
  IsGovCloud: !Equals [!Ref "AWS::Partition", "aws-us-gov"]
  HasAddons: !Not [!Equals [!Ref AddonsTemplateURL, ""]]
  HasEnvFile: !Not [!Equals [!Ref EnvFileARN, ""]]
  ExposePort: !Not [!Equals [!Ref TargetPort, -1]]
Resources:
  LogGroup:
    Metadata:
      'aws:copilot:description': 'A CloudWatch log group to hold your service logs'
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Join ['', [/copilot/, !Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
      RetentionInDays: !Ref LogRetention
  TaskDefinition:
    Metadata:
      'aws:copilot:description': 'An ECS task definition to group your containers and run them on ECS'
    Type: AWS::ECS::TaskDefinition
    DependsOn: LogGroup
    Properties:
      Family: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: !Ref TaskCPU
      Memory: !Ref TaskMemory
      ExecutionRoleArn: !GetAtt ExecutionRole.Arn
      TaskRoleArn: !GetAtt TaskRole.Arn
      ContainerDefinitions:
        - Name: !Ref WorkloadName
          Image: !Ref ContainerImage
          Environment:
            - Name: COPILOT_APPLICATION_NAME
              Value: !Sub '${AppName}'
            - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
              Value: rc.atfirst.local
            - Name: COPILOT_ENVIRONMENT_NAME
              Value: !Sub '${EnvName}'
            - Name: COPILOT_SERVICE_NAME
              Value: !Sub '${WorkloadName}'
            - Name: LOG_LEVEL
              Value: "debug"
            - Name: COPILOT_MOUNT_POINTS
              Value: '{"dbData":"/db-data"}'
          EnvironmentFiles:
            - !If
              - HasEnvFile
              - Type: s3
                Value: !Ref EnvFileARN
              - !Ref AWS::NoValue
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-region: !Ref AWS::Region
              awslogs-group: !Ref LogGroup
              awslogs-stream-prefix: copilot
          EntryPoint:
            - /bin/bash
            - -c
            - /etc/arangodb3/start.fargate.sh
          MountPoints:
            - ContainerPath: '/db-data'
              ReadOnly: false
              SourceVolume: dbData
          PortMappings:
            - ContainerPort: 8529
              Protocol: tcp
              Name: target
          HealthCheck:
            Command: ["CMD-SHELL", "curl -f http://localhost:8529/_db/_system/_admin/aardvark/favicon.ico || exit 1"]
            Interval: 30
            Retries: 3
            StartPeriod: 30
            Timeout: 5
      Volumes:
        - Name: dbData
          EFSVolumeConfiguration:
            FilesystemId: !GetAtt EnvControllerAction.ManagedFileSystemID
            RootDirectory: "/"
            TransitEncryption: ENABLED
            AuthorizationConfig:
              AccessPointId: !Ref AccessPoint
              IAM: ENABLED
  ExecutionRole:
    Metadata:
      'aws:copilot:description': 'An IAM Role for the Fargate agent to make AWS API calls on your behalf'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, SecretsPolicy]]
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Allow'
                Action:
                  - 'ssm:GetParameters'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/*'
                Condition:
                  StringEquals:
                    'ssm:ResourceTag/copilot-application': !Sub '${AppName}'
                    'ssm:ResourceTag/copilot-environment': !Sub '${EnvName}'
              - Effect: 'Allow'
                Action:
                  - 'secretsmanager:GetSecretValue'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:*'
                Condition:
                  StringEquals:
                    'secretsmanager:ResourceTag/copilot-application': !Sub '${AppName}'
                    'secretsmanager:ResourceTag/copilot-environment': !Sub '${EnvName}'
              - Effect: 'Allow'
                Action:
                  - 'kms:Decrypt'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:key/*'
        - !If
          # Optional IAM permission required by ECS task def env file
          # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html#taskdef-envfiles-iam
          # Example EnvFileARN: arn:aws:s3:::stackset-demo-infrastruc-pipelinebuiltartifactbuc-11dj7ctf52wyf/manual/1638391936/env
          - HasEnvFile
          - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicy]]
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: 'Allow'
                  Action:
                    - 's3:GetObject'
                  Resource:
                    - !Ref EnvFileARN
                - Effect: 'Allow'
                  Action:
                    - 's3:GetBucketLocation'
                  Resource:
                    - !Join
                      - ''
                      - - 'arn:'
                        - !Ref AWS::Partition
                        - ':s3:::'
                        - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARN]]]]
          - !Ref AWS::NoValue
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
  TaskRole:
    Metadata:
      'aws:copilot:description': 'An IAM role to control permissions for the containers in your tasks'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: 'DenyIAMExceptTaggedRoles'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Deny'
                Action: 'iam:*'
                Resource: '*'
              - Effect: 'Allow'
                Action: 'sts:AssumeRole'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/*'
                Condition:
                  StringEquals:
                    'iam:ResourceTag/copilot-application': !Sub '${AppName}'
                    'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
        - PolicyName: 'ExecuteCommand'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Allow'
                Action: ["ssmmessages:CreateControlChannel", "ssmmessages:OpenControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenDataChannel"]
                Resource: "*"
              - Effect: 'Allow'
                Action: ["logs:CreateLogStream", "logs:DescribeLogGroups", "logs:DescribeLogStreams", "logs:PutLogEvents"]
                Resource: "*"
        - PolicyName: 'GrantAccessCopilotManagedEFS'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Allow'
                Action:
                  - 'elasticfilesystem:ClientMount'
                  - 'elasticfilesystem:ClientWrite'
                Condition:
                  StringEquals:
                    'elasticfilesystem:AccessPointArn': !GetAtt AccessPoint.Arn
                Resource:
                  - Fn::Sub:
                      - 'arn:${partition}:elasticfilesystem:${region}:${account}:file-system/${fsid}'
                      - partition: !Ref AWS::Partition
                        region: !Ref AWS::Region
                        account: !Ref AWS::AccountId
                        fsid: !GetAtt EnvControllerAction.ManagedFileSystemID
  DiscoveryService:
    Metadata:
      'aws:copilot:description': 'Service discovery for your services to communicate within the VPC'
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Description: Discovery Service for the Copilot services
      DnsConfig:
        RoutingPolicy: MULTIVALUE
        DnsRecords:
          - TTL: 10
            Type: A
          - TTL: 10
            Type: SRV
      HealthCheckCustomConfig:
        FailureThreshold: 1
      Name: !Ref WorkloadName
      NamespaceId:
        Fn::ImportValue: !Sub '${AppName}-${EnvName}-ServiceDiscoveryNamespaceID'
  Service:
    Metadata:
      'aws:copilot:description': 'An ECS service to run and maintain your tasks in the environment cluster'
    Type: AWS::ECS::Service
    DependsOn:
      - EnvControllerAction
    Properties:
      PlatformVersion: LATEST
      Cluster:
        Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: !Ref TaskCount
      DeploymentConfiguration:
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
        MinimumHealthyPercent: 100
        MaximumPercent: 200
        Alarms: !If
          - IsGovCloud
          - !Ref AWS::NoValue
          - Enable: false
            AlarmNames: []
            Rollback: true
      PropagateTags: SERVICE
      EnableExecuteCommand: true
      LaunchType: FARGATE
      ServiceConnectConfiguration:
        Enabled: True
        Namespace: rc.atfirst.local
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-region: !Ref AWS::Region
            awslogs-group: !Ref LogGroup
            awslogs-stream-prefix: copilot
        Services:
          - PortName: target
            # Avoid using the same service with Service Discovery in a namespace.
            DiscoveryName: !Join ["-", [!Ref WorkloadName, "sc"]]
            ClientAliases:
              - Port: !Ref TargetPort
                DnsName: !Ref WorkloadName
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          Subnets:
            Fn::Split:
              - ','
              - Fn::ImportValue: !Sub '${AppName}-${EnvName}-PublicSubnets'
          SecurityGroups:
            - Fn::ImportValue: !Sub '${AppName}-${EnvName}-EnvironmentSecurityGroup'
      ServiceRegistries: !If [ExposePort, [{RegistryArn: !GetAtt DiscoveryService.Arn, Port: !Ref TargetPort}], !Ref "AWS::NoValue"]
  AccessPoint:
    Metadata:
      'aws:copilot:description': 'An EFS access point to handle POSIX permissions'
    Type: AWS::EFS::AccessPoint
    Properties:
      ClientToken: !Sub ${AppName}-${EnvName}-${WorkloadName}
      FileSystemId: !GetAtt EnvControllerAction.ManagedFileSystemID
      PosixUser:
        Uid: 3824466984
        Gid: 3824466984
      RootDirectory:
        Path: !Sub '/db'
        CreationInfo:
          OwnerUid: 3824466984
          OwnerGid: 3824466984
          Permissions: '0755'
  AddonsStack:
    Metadata:
      'aws:copilot:description': 'An Addons CloudFormation Stack for your additional AWS resources'
    Type: AWS::CloudFormation::Stack
    DependsOn: EnvControllerAction
    Condition: HasAddons
    Properties:
      Parameters:
        App: !Ref AppName
        Env: !Ref EnvName
        Name: !Ref WorkloadName
      TemplateURL: !Ref AddonsTemplateURL
  EnvControllerAction:
    Metadata:
      'aws:copilot:description': "Update your environment's shared resources"
    Type: Custom::EnvControllerFunction
    Properties:
      ServiceToken: !GetAtt EnvControllerFunction.Arn
      Workload: !Ref WorkloadName
      EnvStack: !Sub '${AppName}-${EnvName}'
      Parameters: [EFSWorkloads]
      EnvVersion: v1.13.0
  EnvControllerFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket: stackset-atfirst-infrast-pipelinebuiltartifactbuc-jozarb4yzjqj
        S3Key: manual/scripts/custom-resources/envcontrollerfunction/3ffcf03598029891816b7ce2d1ff14fdd8079af4406a0cfeff1d4aa0109dcd7d.zip
      Handler: "index.handler"
      Timeout: 900
      MemorySize: 512
      Role: !GetAtt 'EnvControllerRole.Arn'
      Runtime: nodejs16.x
  EnvControllerRole:
    Metadata:
      'aws:copilot:description': "An IAM role to update your environment stack"
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "EnvControllerStackUpdate"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - cloudformation:DescribeStacks
                  - cloudformation:UpdateStack
                Resource: !Sub 'arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${AppName}-${EnvName}/*'
                Condition:
                  StringEquals:
                    'cloudformation:ResourceTag/copilot-application': !Sub '${AppName}'
                    'cloudformation:ResourceTag/copilot-environment': !Sub '${EnvName}'
        - PolicyName: "EnvControllerRolePass"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - iam:PassRole
                Resource: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${AppName}-${EnvName}-CFNExecutionRole'
                Condition:
                  StringEquals:
                    'iam:ResourceTag/copilot-application': !Sub '${AppName}'
                    'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Outputs:
  DiscoveryServiceARN:
    Description: ARN of the Discovery Service.
    Value: !GetAtt DiscoveryService.Arn
    Export:
      Name: !Sub ${AWS::StackName}-DiscoveryServiceARN

pipeline-name/buildspec.yaml

# Buildspec runs in the build stage of your pipeline.
version: 0.2
phases:
  install:
    runtime-versions:
      ruby: 3.1
      nodejs: 16
    commands:
      - echo "cd into $CODEBUILD_SRC_DIR"
      - cd $CODEBUILD_SRC_DIR
      # Download the copilot linux binary.
      - wget -q https://ecs-cli-v2-release.s3.amazonaws.com/copilot-linux-v1.28.0
      - mv ./copilot-linux-v1.28.0 ./copilot-linux
      - chmod +x ./copilot-linux
  build:
    commands:
      - echo "Run your tests"
      # - make test
  post_build:
    commands:
      - ls -l
      - export COLOR="false"
      - pipeline=$(cat $CODEBUILD_SRC_DIR/copilot/pipelines/deploy-services-pipeline/manifest.yml | ruby -ryaml -rjson -e 'puts JSON.pretty_generate(YAML.load(ARGF))')
      - pl_envs=$(echo $pipeline | jq -r '.stages[].name')
      # Find all the local services in the workspace.
      - svc_ls_result=$(./copilot-linux svc ls --local --json)
      - svc_list=$(echo $svc_ls_result | jq '.services')
      - >
        if [ ! "$svc_list" = null ]; then
          svcs=$(echo $svc_ls_result | jq -r '.services[].name');
        fi
      # Find all the local jobs in the workspace.
      - job_ls_result=$(./copilot-linux job ls --local --json)
      - job_list=$(echo $job_ls_result | jq '.jobs')
      - >
        if [ ! "$job_list" = null ]; then
          jobs=$(echo $job_ls_result | jq -r '.jobs[].name');
        fi
      # Raise error if no services or jobs are found.
      - >
        if [ "$svc_list" = null ] && [ "$job_list" = null ]; then
          echo "No services or jobs found for the pipeline to deploy. Please create at least one service or job and push the manifest to the remote." 1>&2;
          exit 1;
        fi
      # Generate the cloudformation templates.
      # The tag is the build ID but we replaced the colon ':' with a dash '-'.
      # We truncate the tag (from the front) to 128 characters, the limit for Docker tags
      # (https://docs.docker.com/engine/reference/commandline/tag/)
      # Check if the `svc package` commanded exited with a non-zero status. If so, echo error msg and exit.
      - >
        for env in $pl_envs; do
          tag=$(echo ${CODEBUILD_BUILD_ID##*:}-$env | sed 's/:/-/g' | rev | cut -c 1-128 | rev)
          for svc in $svcs; do
          ./copilot-linux svc package -n $svc -e $env --output-dir './infrastructure' --tag $tag --upload-assets;
          if [ $? -ne 0 ]; then
            echo "Cloudformation stack and config files were not generated. Please check build logs to see if there was a manifest validation error." 1>&2;
            exit 1;
          fi
          done;
          for job in $jobs; do
          ./copilot-linux job package -n $job -e $env --output-dir './infrastructure' --tag $tag --upload-assets;
          if [ $? -ne 0 ]; then
            echo "Cloudformation stack and config files were not generated. Please check build logs to see if there was a manifest validation error." 1>&2;
            exit 1;
          fi
          done;
        done;
      - ls -lah ./infrastructure
artifacts:
  files:
    - "infrastructure/*"

iamhopaul123 commented 1 year ago

Hello @vicpara.

Yes, each build has a new image tag. It looks like regardless of what happens in the code, the docker images are built again independently of the previous build. On the local machine most of the docker builds are skipped. I included last in this message the buildspec.yaml from the pipeline as it was generated.

I think it is because the pipeline by default always uses a different tag: "./copilot-linux svc package -n $svc -e $env --output-dir './infrastructure' --tag $tag --upload-assets;". You can remove --tag $tag if you don't want the task definition to change because of that.
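A sketch of what that change might look like in the generated buildspec. This is the service loop from the post_build phase quoted above with only `--tag $tag` dropped (the `tag=` line goes with it). The shortened error message is an edit for brevity, the job loop is omitted, and whether an untagged package fits a given pipeline is an assumption to verify before relying on it:

```yaml
# Hypothetical variant of the generated post_build step: same loop as above,
# but without the per-build `--tag $tag` argument.
post_build:
  commands:
    - >
      for env in $pl_envs; do
        for svc in $svcs; do
        ./copilot-linux svc package -n $svc -e $env --output-dir './infrastructure' --upload-assets;
        if [ $? -ne 0 ]; then
          echo "Cloudformation stack and config files were not generated." 1>&2;
          exit 1;
        fi
        done;
      done;
```

Note that this only removes the per-build tag. If the image itself is rebuilt with different content on every run, the task definition may still change.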

Secondly, can the EFS endpoint be mounted by multiple instances (the prev version that runs in the current task and the new task that is being deployed)? It feels that the prev revision task fails to shut down somehow?

I don't believe this is the case. However, the update failed because of the container health check, which could be related to EFS. A more helpful failure message is available at deployment time: go to the ECS console -> service -> events, or click into any failed task and check the stopped reason. (Terribly sorry, we don't have native support for this, but we are actively working on it now.)

vicpara commented 1 year ago

What happens if I replace the copilot svc package --name xxx --env yyy with copilot svc deploy --name xxx --env yyy ?

Isn't this post-build pipeline doing too much in one command?

And then, at the end, we have to put up with an avalanche of messages, slow CloudFormation updates, roll-backs that are sometimes impossible due to an error (=> delete the stack and start over), big delays in service deployments, etc.

To me it makes more sense to break down the entire process into smaller, transparent steps where the user actually has more control over what happens, when and why.

iamhopaul123 commented 1 year ago

What happens if I replace the copilot svc package --name xxx --env yyy with copilot svc deploy --name xxx --env yyy ?

CodePipeline is supposed to take the cfn template generated by svc package and perform the workload/environment deployment for you. To me, svc package --upload-assets and svc deploy are essentially the same, and you don't have control over the post-build steps you referred to.

And then, at the end, we have to put up with an avalanche of messages, slow CloudFormation messages, sometimes the roll-back is impossible due to an error => delete the stack and start over, big delays in service deployments etc.

I agree these are problems, but they are controlled by CodePipeline and CloudFormation (they are not related to the post-build step running everything in one command).

To me it makes more sense to break down the entire process into smaller, transparent steps where the user actually has more control over what happens, when and why.

It makes sense to me. Unfortunately we don't generate our buildspec in this way. Actually our svc package only does "upsert env and services stack definition etc". And the rest of your steps are included in the buildspec as separate items. Created https://github.com/aws/copilot-cli/issues/5102 to track this issue.

Thank you for bringing this issue up! Also, is the original question on EFS solved?

vicpara commented 1 year ago

Thanks for your answer.

No, the EFS problem is still ongoing. I don't even know if the task of the previous revision should automatically relinquish the EFS on its own before the next revision comes up. Can the next revision come up if the EFS Access Point is held by a different service? I did expect the magic hidden in the box to be able to come up with a plan when one svc is using the EFS Access Point and another revision that needs the same Access Point is coming up.

iamhopaul123 commented 1 year ago

Can the next revision come up if the EFS Access Point is held by a different service?

The service will always be the same. Only different tasks are spun up, which should be OK. One thing I would recommend is going to the ECS console to see why those tasks stopped (the deployment failed because ECS was trying to roll out a new revision but kept failing). I think that'll help us triage the problem.

Never mind! I saw your task log in the other issue. Let's continue there.

vicpara commented 1 year ago

It turns out the new task was waiting for the previous task to relinquish the EFS Access Point, while the previous task was minding its own business because the new task never reached the healthy status needed to kick out the previous task.

Setting the db service manifest in the environment to include rolling: "recreate" fixed the deployment issues. The previous revision DB shuts down, and a new task from the new revision is spun up. So it works.
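For anyone looking for the concrete setting: in the Copilot manifest the rolling field sits under deployment. A minimal sketch of the override described above, assuming it is scoped to the rc environment and added alongside the existing storage block (the exact placement in the real manifest is an assumption, not copied from the actual fix):

```yaml
# Sketch of the db service manifest override (placement under `rc` is assumed).
environments:
  rc:
    env_file: ./conf/arangodb.dev.env
    deployment:
      rolling: 'recreate' # Stop the running task before starting the new revision.
    storage:
      volumes:
        dbData:
          path: /db-data
          read_only: false
          efs: true
```

As far as the Copilot documentation describes it, recreate deploys with a minimum healthy percent of 0 and a maximum percent of 100, instead of the 100/200 values visible in the generated template above. The old task is therefore stopped first, at the cost of a short downtime for the DB during each deployment.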

Thanks!

iamhopaul123 commented 1 year ago

That's really good to know and glad you solved the problem! Thank you for your patience.