Second (and following) deployments of services fail after copilot upgrade

schm commented 1 year ago

Hey,

Last week we upgraded our main copilot app by running app upgrade. Since then we've been running into strange issues when redeploying any kind of service in that app.

We created the app with copilot 1.22.0 and have used the latest release of the copilot-cli for each deployment. We only ran app upgrade last week, in order to use static sites.

For App Runner based services we see the following error:

deploy service retro to environment staging: deploy service: determine image repository type: image is not supported by App Runner: @sha256:96b7d5824ba87ef965f74db9a4f7babd95832852d3ce9b3b27219b7aa308a2ef

For ECS based services we see this error:

- Updating the infrastructure for stack tech-staging-oreo-dl                  [update rollback complete]  [15.3s]
  The following resource(s) failed to update: [TaskDefinition].                                           
  - An ECS service to run and maintain your tasks in the environment cluster  [not started]               
  - An ECS task definition to group your containers and run them on ECS       [delete complete]           [0.0s]
    Resource handler returned message: "Invalid request provided: Create T                                
    askDefinition: Container.image repository should not be null or empty.                                
     (Service: AmazonECS; Status Code: 400; Error Code: ClientException; R                                
    equest ID: abc12f74-a49a-42f9-ac85-418debf2f7b2; Proxy: null)" (Reques                                
    tToken: 1d5e884d-bb98-74b6-fddb-8f2bc2265329, HandlerErrorCode: Invali                                
    dRequest)

So both errors seem to be related to ECR.

What we found out already:

- We can create new services and deploy them once. The second and all following attempts to deploy result in the same error.
- It's hard to verify now, but we think that for some services we were able to deploy once after the app upgrade, while for others the very next deployment failed.

I would love to get any feedback on how we can further debug this issue as this is blocking our teams. I'll happily provide more information, if you tell me which.

bvtujo commented 1 year ago

Uh oh, this looks like a new bug for us. I'll look into reproducing; we're hearing from multiple customers with ECR-related problems.

In the meantime, could you share a few things? In particular, what would be useful are:

- The template of the CloudFormation stack called StackSet-${APPNAME}-infrastructure-${UUID}/${UUID}. This is where we actually store the ECR repositories; they're deployed once per region in the tools account (the account where you ran copilot app init or copilot init originally).
- The manifest of a problematic service, and the failed CloudFormation template.

This will help greatly in debugging for us.

schm commented 1 year ago

This would be the template of the stack set

Stack set template

```yml
# Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
AWSTemplateFormatVersion: '2010-09-09'
# Cross-regional resources deployed via a stackset in the tools account
# to support the CodePipeline for a workspace
Description: Cross-regional resources to support the CodePipeline for a workspace
Metadata:
  TemplateVersion: 'v1.1.0'
  Version: 89
  Services:
    - routing-import
  Accounts:
    - 068640317972
Resources:
  KMSKey:
    Metadata:
      'aws:copilot:description': 'KMS key to encrypt pipeline artifacts between stages'
    # Used by the CodePipeline in the tools account to en/decrypt the
    # artifacts between stages
    Type: AWS::KMS::Key
    Properties:
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Id: !Ref AWS::StackName
        Statement:
          - # Allows the key to be administered in the tools account
            Effect: Allow
            Principal:
              AWS: !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root
            Action:
              - "kms:Create*"
              - "kms:Describe*"
              - "kms:Enable*"
              - "kms:List*"
              - "kms:Put*"
              - "kms:Update*"
              - "kms:Revoke*"
              - "kms:Disable*"
              - "kms:Get*"
              - "kms:Delete*"
              - "kms:ScheduleKeyDeletion"
              - "kms:CancelKeyDeletion"
              - "kms:Tag*"
              - "kms:UntagResource"
            Resource: "*"
          - # Allow use of the key in the tools account and all environment accounts
            Effect: Allow
            Principal:
              AWS:
                - !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root
                - !Sub arn:${AWS::Partition}:iam::068640317972:root
            Action:
              - kms:Encrypt
              - kms:Decrypt
              - kms:ReEncrypt*
              - kms:GenerateDataKey*
              - kms:DescribeKey
            Resource: "*"
  PipelineBuiltArtifactBucketPolicy:
    Metadata:
      'aws:copilot:description': 'S3 Bucket to store local artifacts'
    Type: AWS::S3::BucketPolicy
    DependsOn: PipelineBuiltArtifactBucket
    Properties:
      Bucket: !Ref PipelineBuiltArtifactBucket
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Action:
              - s3:*
            Effect: Allow
            Resource:
              - !Sub arn:${AWS::Partition}:s3:::${PipelineBuiltArtifactBucket}
              - !Sub arn:${AWS::Partition}:s3:::${PipelineBuiltArtifactBucket}/*
            Principal:
              AWS:
                - !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root
                - !Sub arn:${AWS::Partition}:iam::068640317972:root
  PipelineBuiltArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      OwnershipControls:
        Rules:
          - ObjectOwnership: BucketOwnerEnforced
  ECRReporoutingDASHimport:
    Metadata:
      'aws:copilot:description': 'ECR container image repository for "routing-import"'
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: tech/routing-import
      Tags:
        - Key: copilot-service
          Value: routing-import
      RepositoryPolicyText:
        Version: '2012-10-17'
        Statement:
          - Sid: AllowPushPull
            Effect: Allow
            Principal:
              AWS:
                - !Sub arn:${AWS::Partition}:iam::${AWS::AccountId}:root
                - !Sub arn:${AWS::Partition}:iam::068640317972:root
            Action:
              - ecr:GetDownloadUrlForLayer
              - ecr:BatchGetImage
              - ecr:BatchCheckLayerAvailability
              - ecr:PutImage
              - ecr:InitiateLayerUpload
              - ecr:UploadLayerPart
              - ecr:CompleteLayerUpload
Outputs:
  KMSKeyARN:
    Description: KMS Key used by CodePipeline for encrypting artifacts.
    Value: !GetAtt KMSKey.Arn
    Export:
      Name: tech-ArtifactKey
  PipelineBucket:
    Description: "A bucket used for any Copilot artifacts that must be stored in S3 (pipelines, env files, etc)."
    Value: !Ref PipelineBuiltArtifactBucket
  ECRReporoutingDASHimport:
    Description: ECR Repo used to store images of the routing-import service.
    Value: !GetAtt ECRReporoutingDASHimport.Arn
  TemplateVersion:
    Description: Required output to force the stackset to update if mutating version.
    Value: v1.1.0
```

App Runner

This is the manifest of the AppRunner based service mentioned above:

```
# The manifest for the "retro" service.
# Read the full specification for the "Request-Driven Web Service" type at:
# https://aws.github.io/copilot-cli/docs/manifest/rd-web-service/

# Your service name will be used in naming your resources like log groups, App Runner services, etc.
name: retro
# The "architecture" of the service you're running.
type: Request-Driven Web Service

image:
  # Docker build arguments.
  # For additional overrides: https://aws.github.io/copilot-cli/docs/manifest/rd-web-service/#image-build
  build: Dockerfile
  # Port exposed through your container to route traffic to it.
  port: 3000

http:
  alias: retro.tlservers.com
  healthcheck: '/check'

# Number of CPU units for the task.
cpu: 1024
# Amount of memory in MiB used by the task.
memory: 2048

# # Connect your App Runner service to your environment's VPC.
# network:
#   vpc:
#     placement: private

# Enable tracing for the service.
# observability:
#   tracing: awsxray

# Optional fields for more advanced use-cases.
#
# variables:                    # Pass environment variables as key value pairs.
#   LOG_LEVEL: info
#
# tags:                         # Pass tags as key value pairs.
#   project: project-name

# You can override any of the values defined above by environment.
# environments:
#   test:
#     variables:
#       LOG_LEVEL: debug        # Log level for the "test" environment.
```

I cannot share a CloudFormation template in this case, as this deployment fails before the template is even created.

ECS

And this would be the manifest for the failing ECS service

```yml
# The manifest for the "oreo-dl" service.
# Read the full specification for the "Load Balanced Web Service" type at:
# https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/

# Your service name will be used in naming your resources like log groups, ECS services, etc.
name: oreo-dl
type: Load Balanced Web Service

# Distribute traffic to your service.
http:
  # Requests to this path will be forwarded to your service.
  # To match all requests you can use the "/" path.
  path: '/'
  # You can specify a custom health check path. The default is "/".
  healthcheck: '/check'
  alias: 'oreo-dl.tlservers.com'
  hosted_zone: Z2YCD3NGL5278X

# Configuration for your containers and service.
image:
  # Docker build arguments. For additional overrides: https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#image-build
  build: Dockerfile
  # Port exposed through your container to route traffic to it.
  port: 80

cpu: 256       # Number of CPU units for the task.
memory: 512    # Amount of memory in MiB used by the task.
platform: linux/x86_64  # See https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#platform
count: 1       # Number of tasks that should be running in your service.
exec: true     # Enable running commands in your container.

# Optional fields for more advanced use-cases.
#
variables:                    # Pass environment variables as key value pairs.
  PORT: 80
  REPOSITORIES: coral,estimate,gecko-api,gecko-api-doc,goaliath,trips.lionprint,retro,trips.routing,salesforce-components,suite,suitedashboardapi,suiteproxy,suitesfproxy,wallaby
  REPOSITORIES_OWNER: oreo
  APP_ORIGIN: https://oreo-dl.tlservers.com
  AUTH0_CALLBACK: https://oreo-dl.tlservers.com/

secrets:
  # ... lots of secrets referencing entries in the parameter store ...

# You can override any of the values defined above by environment.
#environments:
#  test:
#    count: 2               # Number of tasks to run for the "test" environment.
#    deployment:            # The deployment strategy for the "test" environment.
#       rolling: 'recreate' # Stops existing tasks before new ones are started for faster deployments.
```

Stack template for that same service

```yml # Copyright Amazon.com Inc. or its affiliates. All Rights Reserved. # SPDX-License-Identifier: MIT-0 AWSTemplateFormatVersion: 2010-09-09 Description: CloudFormation template that represents a load balanced web service on Amazon ECS. Metadata: Manifest: | # The manifest for the "oreo-dl" service. # Read the full specification for the "Load Balanced Web Service" type at: # https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/ # Your service name will be used in naming your resources like log groups, ECS services, etc. name: oreo-dl type: Load Balanced Web Service # Distribute traffic to your service. http: # Requests to this path will be forwarded to your service. # To match all requests you can use the "/" path. path: '/' # You can specify a custom health check path. The default is "/". healthcheck: '/check' alias: 'oreo-dl.tlservers.com' hosted_zone: Z2YCD3NGL5278X # Configuration for your containers and service. image: # Docker build arguments. For additional overrides: https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#image-build build: Dockerfile # Port exposed through your container to route traffic to it. port: 80 cpu: 256 # Number of CPU units for the task. memory: 512 # Amount of memory in MiB used by the task. platform: linux/x86_64 # See https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#platform count: 1 # Number of tasks that should be running in your service. exec: true # Enable running commands in your container. # Optional fields for more advanced use-cases. # variables: # Pass environment variables as key value pairs. PORT: 80 REPOSITORIES: coral,estimate,gecko-api,gecko-api-doc,goaliath,trips.lionprint,retro,trips.routing,salesforce-components,suite,suitedashboardapi,suiteproxy,suitesfproxy,wallaby REPOSITORIES_OWNER: oreo APP_ORIGIN: https://oreo-dl.tlservers.com AUTH0_CALLBACK: https://oreo-dl.tlservers.com/ secrets: # ... lots of secrets referencing entries in the parameter store ... 
# You can override any of the values defined above by environment. #environments: # test: # count: 2 # Number of tasks to run for the "test" environment. # deployment: # The deployment strategy for the "test" environment. # rolling: 'recreate' # Stops existing tasks before new ones are started for faster deployments. Parameters: AppName: Type: String EnvName: Type: String WorkloadName: Type: String ContainerImage: Type: String ContainerPort: Type: Number TaskCPU: Type: String TaskMemory: Type: String TaskCount: Type: Number DNSDelegated: Type: String AllowedValues: [true, false] LogRetention: Type: Number AddonsTemplateURL: Description: 'URL of the addons nested stack template within the S3 bucket.' Type: String Default: "" EnvFileARN: Description: 'URL of the environment file.' Type: String Default: "" TargetContainer: Type: String TargetPort: Type: Number HTTPSEnabled: Type: String AllowedValues: [true, false] RulePath: Type: String Conditions: IsGovCloud: !Equals [!Ref "AWS::Partition", "aws-us-gov"] HasAssociatedDomain: !Equals [!Ref DNSDelegated, true] HasAddons: !Not [!Equals [!Ref AddonsTemplateURL, ""]] HasEnvFile: !Not [!Equals [!Ref EnvFileARN, ""]] Resources: # If a bucket URL is specified, that means the template exists. 
LogGroup: Metadata: 'aws:copilot:description': 'A CloudWatch log group to hold your service logs' Type: AWS::Logs::LogGroup Properties: LogGroupName: !Join ['', [/copilot/, !Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]] RetentionInDays: !Ref LogRetention TaskDefinition: Metadata: 'aws:copilot:description': 'An ECS task definition to group your containers and run them on ECS' Type: AWS::ECS::TaskDefinition DependsOn: LogGroup Properties: Family: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]] NetworkMode: awsvpc RequiresCompatibilities: - FARGATE Cpu: !Ref TaskCPU Memory: !Ref TaskMemory ExecutionRoleArn: !GetAtt ExecutionRole.Arn TaskRoleArn: !GetAtt TaskRole.Arn ContainerDefinitions: - Name: !Ref WorkloadName Image: !Ref ContainerImage Secrets: - Name: APP_SECRET ValueFrom: /copilot/tech/staging/secrets/dl/APP_SECRET - Name: AUTH0_CLIENT_ID ValueFrom: /copilot/tech/staging/secrets/dl/AUTH0_CLIENT_ID - Name: AUTH0_CLIENT_SECRET ValueFrom: /copilot/tech/staging/secrets/dl/AUTH0_CLIENT_SECRET - Name: AUTH0_SUBDOMAIN ValueFrom: /copilot/tech/staging/secrets/dl/AUTH0_SUBDOMAIN - Name: GITHUB_APP_ID ValueFrom: /copilot/tech/staging/secrets/dl/GITHUB_APP_ID - Name: GITHUB_APP_INSTALLATION_ID ValueFrom: /copilot/tech/staging/secrets/dl/GITHUB_APP_INSTALLATION_ID - Name: GITHUB_APP_INSTALLATION_OWNER ValueFrom: /copilot/tech/staging/secrets/dl/GITHUB_APP_INSTALLATION_OWNER - Name: GITHUB_APP_PRIVATE_KEY ValueFrom: /copilot/tech/staging/secrets/dl/GITHUB_APP_PRIVATE_KEY Environment: - Name: COPILOT_APPLICATION_NAME Value: !Sub '${AppName}' - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT Value: staging.tech.local - Name: COPILOT_ENVIRONMENT_NAME Value: !Sub '${EnvName}' - Name: COPILOT_SERVICE_NAME Value: !Sub '${WorkloadName}' - Name: COPILOT_LB_DNS Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName - Name: APP_ORIGIN Value: "https://oreo-dl.tlservers.com" - Name: AUTH0_CALLBACK Value: "https://oreo-dl.tlservers.com/" - Name: PORT 
Value: "80" - Name: REPOSITORIES Value: "coral,estimate,gecko-api,gecko-api-doc,goaliath,trips.lionprint,retro,trips.routing,salesforce-components,suite,suitedashboardapi,suiteproxy,suitesfproxy,wallaby" - Name: REPOSITORIES_OWNER Value: "oreo" EnvironmentFiles: - !If - HasEnvFile - Type: s3 Value: !Ref EnvFileARN - !Ref AWS::NoValue LogConfiguration: LogDriver: awslogs Options: awslogs-region: !Ref AWS::Region awslogs-group: !Ref LogGroup awslogs-stream-prefix: copilot PortMappings: - ContainerPort: 80 Protocol: tcp Name: target ExecutionRole: Metadata: 'aws:copilot:description': 'An IAM Role for the Fargate agent to make AWS API calls on your behalf' Type: AWS::IAM::Role Properties: AssumeRolePolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Principal: Service: ecs-tasks.amazonaws.com Action: 'sts:AssumeRole' Policies: - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, SecretsPolicy]] PolicyDocument: Version: '2012-10-17' Statement: - Effect: 'Allow' Action: - 'ssm:GetParameters' Resource: - !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/*' Condition: StringEquals: 'ssm:ResourceTag/copilot-application': !Sub '${AppName}' 'ssm:ResourceTag/copilot-environment': !Sub '${EnvName}' - Effect: 'Allow' Action: - 'secretsmanager:GetSecretValue' Resource: - !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:*' Condition: StringEquals: 'secretsmanager:ResourceTag/copilot-application': !Sub '${AppName}' 'secretsmanager:ResourceTag/copilot-environment': !Sub '${EnvName}' - Effect: 'Allow' Action: - 'kms:Decrypt' Resource: - !Sub 'arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:key/*' - !If # Optional IAM permission required by ECS task def env file # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html#taskdef-envfiles-iam # Example EnvFileARN: 
arn:aws:s3:::stackset-demo-infrastruc-pipelinebuiltartifactbuc-11dj7ctf52wyf/manual/1638391936/env - HasEnvFile - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicy]] PolicyDocument: Version: '2012-10-17' Statement: - Effect: 'Allow' Action: - 's3:GetObject' Resource: - !Ref EnvFileARN - Effect: 'Allow' Action: - 's3:GetBucketLocation' Resource: - !Join - '' - - 'arn:' - !Ref AWS::Partition - ':s3:::' - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARN]]]] - !Ref AWS::NoValue ManagedPolicyArns: - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy' TaskRole: Metadata: 'aws:copilot:description': 'An IAM role to control permissions for the containers in your tasks' Type: AWS::IAM::Role Properties: AssumeRolePolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Principal: Service: ecs-tasks.amazonaws.com Action: 'sts:AssumeRole' Policies: - PolicyName: 'DenyIAMExceptTaggedRoles' PolicyDocument: Version: '2012-10-17' Statement: - Effect: 'Deny' Action: 'iam:*' Resource: '*' - Effect: 'Allow' Action: 'sts:AssumeRole' Resource: - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/*' Condition: StringEquals: 'iam:ResourceTag/copilot-application': !Sub '${AppName}' 'iam:ResourceTag/copilot-environment': !Sub '${EnvName}' - PolicyName: 'ExecuteCommand' PolicyDocument: Version: '2012-10-17' Statement: - Effect: 'Allow' Action: ["ssmmessages:CreateControlChannel", "ssmmessages:OpenControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenDataChannel"] Resource: "*" - Effect: 'Allow' Action: ["logs:CreateLogStream", "logs:DescribeLogGroups", "logs:DescribeLogStreams", "logs:PutLogEvents"] Resource: "*" DiscoveryService: Metadata: 'aws:copilot:description': 'Service discovery for your services to communicate within the VPC' Type: AWS::ServiceDiscovery::Service Properties: Description: Discovery Service for the Copilot services DnsConfig: RoutingPolicy: 
MULTIVALUE DnsRecords: - TTL: 10 Type: A - TTL: 10 Type: SRV HealthCheckCustomConfig: FailureThreshold: 1 Name: !Ref WorkloadName NamespaceId: Fn::ImportValue: !Sub '${AppName}-${EnvName}-ServiceDiscoveryNamespaceID' EnvControllerAction: Metadata: 'aws:copilot:description': "Update your environment's shared resources" Type: Custom::EnvControllerFunction Properties: ServiceToken: !GetAtt EnvControllerFunction.Arn Workload: !Ref WorkloadName Aliases: ["oreo-dl.tlservers.com"] EnvStack: !Sub '${AppName}-${EnvName}' Parameters: [ALBWorkloads, Aliases] EnvVersion: v1.13.0 EnvControllerFunction: Type: AWS::Lambda::Function Properties: Code: S3Bucket: stackset-tech-infrastruc-pipelinebuiltartifactbuc-14u7t4eswywca S3Key: manual/scripts/custom-resources/envcontrollerfunction/3ffcf03598029891816b7ce2d1ff14fdd8079af4406a0cfeff1d4aa0109dcd7d.zip Handler: "index.handler" Timeout: 900 MemorySize: 512 Role: !GetAtt 'EnvControllerRole.Arn' Runtime: nodejs16.x EnvControllerRole: Metadata: 'aws:copilot:description': "An IAM role to update your environment stack" Type: AWS::IAM::Role Properties: AssumeRolePolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Principal: Service: - lambda.amazonaws.com Action: - sts:AssumeRole Path: / Policies: - PolicyName: "EnvControllerStackUpdate" PolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Action: - cloudformation:DescribeStacks - cloudformation:UpdateStack Resource: !Sub 'arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${AppName}-${EnvName}/*' Condition: StringEquals: 'cloudformation:ResourceTag/copilot-application': !Sub '${AppName}' 'cloudformation:ResourceTag/copilot-environment': !Sub '${EnvName}' - PolicyName: "EnvControllerRolePass" PolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Action: - iam:PassRole Resource: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${AppName}-${EnvName}-CFNExecutionRole' Condition: StringEquals: 
'iam:ResourceTag/copilot-application': !Sub '${AppName}' 'iam:ResourceTag/copilot-environment': !Sub '${EnvName}' ManagedPolicyArns: - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole Service: Metadata: 'aws:copilot:description': 'An ECS service to run and maintain your tasks in the environment cluster' Type: AWS::ECS::Service DependsOn: - HTTPListenerRuleWithDomain - HTTPSListenerRule Properties: PlatformVersion: LATEST Cluster: Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId' TaskDefinition: !Ref TaskDefinition DesiredCount: !Ref TaskCount DeploymentConfiguration: DeploymentCircuitBreaker: Enable: true Rollback: true MinimumHealthyPercent: 100 MaximumPercent: 200 Alarms: !If - IsGovCloud - !Ref AWS::NoValue - Enable: false AlarmNames: [] Rollback: true PropagateTags: SERVICE EnableExecuteCommand: true LaunchType: FARGATE ServiceConnectConfiguration: !If - IsGovCloud - !Ref AWS::NoValue - Enabled: False NetworkConfiguration: AwsvpcConfiguration: AssignPublicIp: ENABLED Subnets: Fn::Split: - ',' - Fn::ImportValue: !Sub '${AppName}-${EnvName}-PublicSubnets' SecurityGroups: - Fn::ImportValue: !Sub '${AppName}-${EnvName}-EnvironmentSecurityGroup' # This may need to be adjusted if the container takes a while to start up HealthCheckGracePeriodSeconds: 60 LoadBalancers: - ContainerName: oreo-dl ContainerPort: 80 TargetGroupArn: !Ref TargetGroup ServiceRegistries: - RegistryArn: !GetAtt DiscoveryService.Arn Port: !Ref TargetPort TargetGroup: Metadata: 'aws:copilot:description': "A target group to connect the load balancer to your service on port 80" Type: AWS::ElasticLoadBalancingV2::TargetGroup Properties: HealthCheckPath: /check # Default is '/'. Port: 80 Protocol: HTTP TargetGroupAttributes: - Key: deregistration_delay.timeout_seconds Value: 60 # ECS Default is 300; Copilot default is 60. 
- Key: stickiness.enabled Value: false TargetType: ip VpcId: Fn::ImportValue: !Sub "${AppName}-${EnvName}-VpcId" RulePriorityFunction: Type: AWS::Lambda::Function Properties: Code: S3Bucket: stackset-tech-infrastruc-pipelinebuiltartifactbuc-14u7t4eswywca S3Key: manual/scripts/custom-resources/rulepriorityfunction/ac6830d3d4de8167bed1ce48eaf073ccbffe41076a1f88ea5c09b7b0ad71cb14.zip Handler: "index.nextAvailableRulePriorityHandler" Timeout: 600 MemorySize: 512 Role: !GetAtt "RulePriorityFunctionRole.Arn" Runtime: nodejs16.x RulePriorityFunctionRole: Metadata: 'aws:copilot:description': "An IAM Role to describe load balancer rules for assigning a priority" Type: AWS::IAM::Role Properties: AssumeRolePolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Principal: Service: - lambda.amazonaws.com Action: - sts:AssumeRole Path: / ManagedPolicyArns: - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole Policies: - PolicyName: "RulePriorityGeneratorAccess" PolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Action: - elasticloadbalancing:DescribeRules Resource: "*" LoadBalancerDNSAliasZ2YCD3NGL5278X: Metadata: 'aws:copilot:description': 'Alias records for the application load balancer in hosted zone Z2YCD3NGL5278X' Type: AWS::Route53::RecordSetGroup Properties: HostedZoneId: Z2YCD3NGL5278X Comment: !Sub "LoadBalancer aliases for service ${WorkloadName} in hosted zone Z2YCD3NGL5278X" RecordSets: - Name: "oreo-dl.tlservers.com" Type: A AliasTarget: HostedZoneId: !GetAtt EnvControllerAction.PublicLoadBalancerHostedZone DNSName: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName HTTPSRulePriorityAction: Metadata: 'aws:copilot:description': 'A custom resource assigning priority for HTTPS listener rules' Type: Custom::RulePriorityFunction Properties: ServiceToken: !GetAtt RulePriorityFunction.Arn RulePath: ["/"] ListenerArn: !GetAtt EnvControllerAction.HTTPSListenerArn HTTPRuleWithDomainPriorityAction: Metadata: 
'aws:copilot:description': 'A custom resource assigning priority for HTTP listener rules' Type: Custom::RulePriorityFunction Properties: ServiceToken: !GetAtt RulePriorityFunction.Arn RulePath: ["/"] ListenerArn: !GetAtt EnvControllerAction.HTTPListenerArn HTTPListenerRuleWithDomain: Metadata: 'aws:copilot:description': 'An HTTP listener rule for path `/` that redirects HTTP to HTTPS' Type: AWS::ElasticLoadBalancingV2::ListenerRule Properties: Actions: - Type: redirect RedirectConfig: Protocol: HTTPS Port: 443 Host: "#{host}" Path: "/#{path}" Query: "#{query}" StatusCode: HTTP_301 Conditions: - Field: 'host-header' HostHeaderConfig: Values: ["oreo-dl.tlservers.com"] - Field: 'path-pattern' PathPatternConfig: Values: - /* ListenerArn: !GetAtt EnvControllerAction.HTTPListenerArn Priority: !GetAtt HTTPRuleWithDomainPriorityAction.Priority HTTPSListenerRule: Metadata: 'aws:copilot:description': 'An HTTPS listener rule for path `/` that forwards HTTPS traffic to your tasks' Type: AWS::ElasticLoadBalancingV2::ListenerRule Properties: Actions: - TargetGroupArn: !Ref TargetGroup Type: forward Conditions: - Field: 'host-header' HostHeaderConfig: Values: ["oreo-dl.tlservers.com"] - Field: 'path-pattern' PathPatternConfig: Values: - /* ListenerArn: !GetAtt EnvControllerAction.HTTPSListenerArn Priority: !GetAtt HTTPSRulePriorityAction.Priority AddonsStack: Metadata: 'aws:copilot:description': 'An Addons CloudFormation Stack for your additional AWS resources' Type: AWS::CloudFormation::Stack DependsOn: EnvControllerAction Condition: HasAddons Properties: Parameters: App: !Ref AppName Env: !Ref EnvName Name: !Ref WorkloadName TemplateURL: !Ref AddonsTemplateURL Outputs: DiscoveryServiceARN: Description: ARN of the Discovery Service. Value: !GetAtt DiscoveryService.Arn Export: Name: !Sub ${AWS::StackName}-DiscoveryServiceARN ```

Thanks a lot for looking into this.

bvtujo commented 1 year ago

I think I see what's happening here. It looks like there's only one ECR repo that's being created after app upgrade. I think this comes from a bug in our code which opts out of ECR repo creation for static site patterns. But I will need to narrow things down to be sure.
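One way to confirm this from the stack set template is to compare the services you expect with the ECR repository resources it defines. The sketch below is only an illustration: it assumes the template's `Resources` section has already been loaded into a dict, and that repository logical IDs follow the pattern visible in the posted template (`ECRRepo` plus the service name with `-` written as `DASH`). That naming rule is inferred from a single example and may not hold for every name; the function names are hypothetical.

```python
def ecr_logical_id(service: str) -> str:
    """Logical ID as seen in the posted template: routing-import -> ECRReporoutingDASHimport.

    Assumption: '-' is the only character that needs rewriting.
    """
    return "ECRRepo" + service.replace("-", "DASH")


def services_missing_repo(resources: dict, expected_services: list[str]) -> list[str]:
    """Return the expected services that have no AWS::ECR::Repository resource."""
    repo_ids = {
        logical_id
        for logical_id, resource in resources.items()
        if resource.get("Type") == "AWS::ECR::Repository"
    }
    return [s for s in expected_services if ecr_logical_id(s) not in repo_ids]


# Resource names and types taken from the stack set template posted in this thread.
resources = {
    "KMSKey": {"Type": "AWS::KMS::Key"},
    "PipelineBuiltArtifactBucketPolicy": {"Type": "AWS::S3::BucketPolicy"},
    "PipelineBuiltArtifactBucket": {"Type": "AWS::S3::Bucket"},
    "ECRReporoutingDASHimport": {"Type": "AWS::ECR::Repository"},
}

# prints ['retro', 'oreo-dl']
print(services_missing_repo(resources, ["routing-import", "retro", "oreo-dl"]))
```

For the template in this thread, both failing services come back as missing a repository, which matches the "repository should not be null or empty" error.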

In the meantime, while we work to fix it, is it possible for you to work around this by creating an ECR repo outside of copilot management, building and pushing to it manually, and specifying it in image.location in your manifests? I realize this is a lot of work and it's totally our fault that you're blocked, but I don't know if we have a good story for downgrading an app.

I will update this issue with details as I work through it.

tsogbadrakh-ch commented 1 year ago

@bvtujo Hey, I wonder how you are prioritizing this issue. Are there any rules for that?

bvtujo commented 1 year ago

@schm In the meantime you can try this customer's workaround; they seem to have the same problem as you where the ECR repos got deleted improperly.

acamb commented 1 year ago

Same issue here without running app upgrade: we are currently unable to deploy new versions of our applications in production without deleting the copilot job and recreating it. We tried with both version 1.28.0 and 1.26.0. @bvtujo this is a major issue for us, and we are really worried that this issue is 2 weeks old with no significant updates.

Edit: just to clarify, the ECR repository exists when the error appears. You have to delete it manually after copilot job delete. The workaround posted is not applicable to Scheduled jobs.

iamhopaul123 commented 1 year ago

Hello @acamb.

> we are really worried that this issue is 2 weeks old with no significant updates

This only happens when an older version of Copilot is used to update a Copilot application that was last updated by a newer version of Copilot. We are working on an enhancement that prevents this and so avoids any template downgrade. Sorry again for the inconvenience, and please let us know if you are still worried about this.

acamb commented 1 year ago

Hi @iamhopaul123, I'm not sure this happens only when you use an older version of Copilot: I use v1.28 and haven't switched back to any older version (I used 1.26 only to test while reporting the issue). Maybe the issue is also triggered by deploying a new version when the previous one was created with an older version of Copilot?

Please let me know if there is a better workaround than deleting the job with copilot delete.

Thanks, Andrea

iamhopaul123 commented 1 year ago

> Maybe the issue is also triggered by deploying a new version when the previous one was created with an older version of Copilot?

I tested creating and deploying something with v1.26, then switched to v1.28 to create and deploy a new job, and then did job run, but there doesn't seem to be any backward-incompatibility issue.

> Please let me know if there is a better workaround than deleting the job with copilot delete.

One workaround, I think, would be:

1. In the AWS console, go to the SSM Parameter Store and delete the record for the job you had issues with (ECR repo missing).
2. Run copilot job init again to add the job back to the application (this should recreate the ECR repo).
3. Rerun copilot job deploy/run.

acamb commented 1 year ago

@iamhopaul123 it's strange, because I'm using v1.28 and today I got this problem 3 times (without touching the manifest). In one case the deploy failed, and in the other two the state machine failed with the error "failed to normalize image reference ..." when running the job (issue #5032). These were jobs that I hadn't touched for weeks or months, and the previous task version was likely deployed with an older Copilot version. On the jobs where I did copilot job delete etc., the following deploys and runs are going fine.

Monday morning I will try the workaround you suggested.

acamb commented 1 year ago

@iamhopaul123 the workaround only avoids the error while running copilot job init after copilot job delete, but after deploying and running the ECS task I'm still getting the error:

InternalError: failed to create container model: failed to normalize image reference [...]

If I instead manually delete the ECR repository, the job runs fine after another cycle of delete-init-deploy.

acamb commented 1 year ago

@iamhopaul123 I can confirm that the issue also happens with scheduled jobs created with Copilot v1.28.0 and re-deployed with the same version.

Lou1415926 commented 1 year ago

@acamb Hello! I'm sorry that you are still facing the issues :(

> it's strange, because I'm using v1.28 and today I got this problem 3 times (without touching the manifest). In one case the deploy failed, and in the other two the state machine failed with the error "failed to normalize image reference ..." when running the job (issue https://github.com/aws/copilot-cli/issues/5032).

You mentioned that "in one case the deploy failed", do you happen to know what the error message was? I think you could still find the record in the CloudFormation console (or aws cli, whichever you prefer) - go to the stack's "Events" tab and locate the event with an UPDATE_FAILED state. I'm hoping to get more clues by knowing this error message.
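The event lookup described above can also be scripted. The sketch below filters a list of stack events, shaped like the `StackEvents` list returned by CloudFormation's DescribeStackEvents API, down to the failed updates. The function name and the sample events are made up for illustration; fetching the real list (via the AWS CLI or an SDK) is left out.

```python
from typing import Iterable


def failed_update_events(events: Iterable[dict]) -> list[dict]:
    """Keep only events whose ResourceStatus is UPDATE_FAILED, with their reason."""
    return [
        {
            "resource": event.get("LogicalResourceId"),
            "reason": event.get("ResourceStatusReason", ""),
        }
        for event in events
        if event.get("ResourceStatus") == "UPDATE_FAILED"
    ]


# Hypothetical sample shaped like DescribeStackEvents output.
sample = [
    {
        "LogicalResourceId": "TaskDefinition",
        "ResourceStatus": "UPDATE_FAILED",
        "ResourceStatusReason": "Container.image repository should not be null or empty.",
    },
    {"LogicalResourceId": "Service", "ResourceStatus": "UPDATE_COMPLETE"},
]

for event in failed_update_events(sample):
    print(event["resource"], "->", event["reason"])
```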

In addition, can you confirm the value of the ContainerImage parameter in your job's stack? Is it something like ": fae9f246" instead of ".dkr.ecr..amazonaws.com/:fae9f246"?
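To make the difference between the two shapes concrete, here is a minimal Python sketch that checks whether an image reference carries a registry and repository part. The function name and the regular expression are illustrative assumptions, not Copilot's own validation logic. The pattern only encodes the usual ECR URI layout (12-digit account ID, `.dkr.ecr.`, region, `.amazonaws.com/`, repository name) followed by a tag and/or a sha256 digest; the account ID, region, and repository in the usage example are hypothetical.

```python
import re

# Assumed layout of an ECR image URI. Illustrative only, not Copilot's code;
# it deliberately ignores other partitions such as amazonaws.com.cn.
ECR_IMAGE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repo>[a-z0-9][a-z0-9._/-]*)"
    r"(?::(?P<tag>[\w.-]+))?"
    r"(?:@(?P<digest>sha256:[0-9a-f]{64}))?$"
)


def has_ecr_repository(image: str) -> bool:
    """Return True if the reference includes an ECR registry and repository."""
    return ECR_IMAGE.match(image.strip()) is not None


# A complete reference (hypothetical account, region, and repository).
print(has_ecr_repository("123456789012.dkr.ecr.eu-west-1.amazonaws.com/tech/oreo-dl:fae9f246"))
# A reference with only a tag, as in the failing ContainerImage parameter.
print(has_ecr_repository(":fae9f246"))
```

The digest-only value from the App Runner error at the top of this issue (`@sha256:...` with nothing before the `@`) fails the same check, which is consistent with both errors sharing a missing repository.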

acamb commented 1 year ago

Hello @Lou1415926 When the deploy fails we see an error like

- Updating the infrastructure for stack tech-staging-oreo-dl                  [update rollback complete]  [15.3s]
  The following resource(s) failed to update: [TaskDefinition].                                           
  - An ECS service to run and maintain your tasks in the environment cluster  [not started]               
  - An ECS task definition to group your containers and run them on ECS       [delete complete]           [0.0s]
    Resource handler returned message: "Invalid request provided: Create T                                
    askDefinition: Container.image repository should not be null or empty.                                
     (Service: AmazonECS; Status Code: 400; Error Code: ClientException; R                                
    equest ID: abc12f74-a49a-42f9-ac85-418debf2f7b2; Proxy: null)" (Reques                                
    tToken: 1d5e884d-bb98-74b6-fddb-8f2bc2265329, HandlerErrorCode: Invali                                
    dRequest)

For the other case (issue #5032) I can confirm that the ContainerImage in the task definition is in the format ":xxxx", without the ".dkr.ecr...." prefix.

huanjani commented 1 year ago

The enhancement that prevents version downgrades has been released in v1.29.0: https://github.com/aws/copilot-cli/releases/tag/v1.29.0

schm commented 1 year ago

@huanjani Thanks for the update.

Is this a server-side check, or is it built into the CLI? I.e., will this now block clients < 1.29 from interacting with my updated app? Or will this check only work in the future for all clients >= 1.29 (e.g. blocking a 1.29 client from accessing a 1.30 app)?

iamhopaul123 commented 1 year ago

Hello @schm.

> Is this a server side check or is this built into the CLI.

It is built into the CLI.

> will this now block clients < 1.29 from interacting with my updated app? Or will this check only work in the future for all clients >= 1.29 (e.g. blocking a 1.29 client from accessing a 1.30 app)

I think "blocking a 1.29 client from accessing a 1.30 app" is the correct statement (if by "client" you mean the Copilot CLI): your 1.29 client won't be able to accidentally downgrade your 1.30 app. However, this can be overridden by passing the --allow-downgrade flag.
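As a rough illustration of what a client-side guard of this kind looks like, the sketch below compares the CLI version with the version that last updated a template and refuses to proceed when the CLI is older. This is not Copilot's actual implementation: the function names, the version parsing, and the error text are hypothetical, and only the `--allow-downgrade` override comes from this thread.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v1.29.0' or '1.29.0' into (1, 29, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def check_downgrade(cli_version: str, template_version: str,
                    allow_downgrade: bool = False) -> None:
    """Raise if the CLI is older than the deployed template, unless overridden."""
    if allow_downgrade:
        return
    if parse_version(cli_version) < parse_version(template_version):
        raise RuntimeError(
            f"CLI {cli_version} is older than deployed template {template_version}; "
            "upgrade the CLI or pass --allow-downgrade"
        )


# A 1.30 CLI may update a 1.29 app; the reverse is refused unless explicitly allowed.
check_downgrade("v1.30.0", "v1.29.0")
check_downgrade("v1.29.0", "v1.30.0", allow_downgrade=True)
```

Because the check lives in the client, it can only protect an app from CLIs that already contain it, which is why versions older than 1.29 are not blocked.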

schm commented 1 year ago

That's very good to know. Thanks for addressing this issue.

For me this issue is resolved for now, as we know what was causing the problems and how we can avoid them in the future. Therefore I'm going to close it, even though we didn't find a good way to fix services affected by this problem other than completely deleting and recreating them.

Thanks again for your support. That's much appreciated.

aws / copilot-cli

Second (and following) deployments of services fail after copilot upgrade #4963