aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release and operate production ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/
Apache License 2.0
3.53k stars 417 forks

Worker Service getting permission errors when reading from its own queue #4770

Closed: marytal closed this issue 1 year ago

marytal commented 1 year ago

Hi! I used copilot to add a scheduled job and a worker service. The scheduled job manifest looks something like this:

name: scheduled-job
type: Scheduled Job

on:
  schedule: "@every 1m"

image:
  build: scheduler-service/Dockerfile

cpu: 256 
memory: 512 
platform: linux/x86_64

publish:
  topics:
    - name: jobs

The worker service manifest looks something like this:

name: worker-service
type: Worker Service

image:
  build: worker-service/Dockerfile

count: 1
exec: true

# storage:
# readonly_fs: true   

# The events can be received from an SQS queue via the env var $COPILOT_QUEUE_URI.
subscribe:
  topics:
    - name: jobs
      service: scheduled-job

In my worker-service/Dockerfile I run my node script. It uses the AWS SDK to interact with the SQS queue created for the worker service. Here is a simplified version of the server:

import { prisma } from "./prisma/prisma-client";
import AWS from "aws-sdk";

const queueUrl = process.env.COPILOT_QUEUE_URI;
const sqs = new AWS.SQS();

const receiveAndProcessMessage = async () => {
  const params = {
    MaxNumberOfMessages: 1,
    QueueUrl: queueUrl,
    VisibilityTimeout: 20,
    WaitTimeSeconds: 0,
  };

  try {
    const data = await sqs.receiveMessage(params).promise();
    if (data.Messages) {
      const message = data.Messages[0];
      await processMessage(message);
      await deleteMessage(message);
    } else {
      console.log("No messages in queue.");
    }
  } catch (err) {
    console.error(err);
  }
};

async function main() {
  await receiveAndProcessMessage();
}

main()

I am seeing a permission error saying:

AccessDenied: Access to the resource https://sqs.us-east-2.amazonaws.com/ is denied.

I checked, and it appears that the access policy on the SQS queue (the default one that got created for this worker service) references the correct IAM role (the same one that the created ECS service has).

I haven't messed with any permissions outside of copilot. Is my setup incorrect, or am I missing something? It seems the service should have access to this queue without any additional trouble.

Thanks!

bvtujo commented 1 year ago

Hey @marytal, you're right that this shouldn't be an issue. Let me look into this and see if there's a quick fix.

marytal commented 1 year ago

Thanks Austin! Is there any additional information that I could provide you that would be helpful? Here is the request ID for one of the failed attempts to access the SQS queue: 94c9f3be-9766-5612-94a2-af05d04d245b

Also, I don't know if this is relevant, but the deployment for the worker service has been flaky. I just tried to re-deploy and in CloudFormation I saw the following events:

CREATE_FAILED: Resource handler returned message: "Error occurred during operation 'ECS Deployment Circuit Breaker was triggered'." (RequestToken: d23c1f89-7317-7bf7-70f7-8d4f5b68b775, HandlerErrorCode: GeneralServiceException)

ROLLBACK_IN_PROGRESS: The following resource(s) failed to create: [Service]. Rollback requested by user.

To be clear, it deployed properly the first time, so it rolls back to that. The code changes since have been minimal, basically just some logs added, so I'm not sure why that keeps happening.

bvtujo commented 1 year ago

Sorry for the wait; I've been trying to repro this issue. From the way our permissions are set up, it does look like worker services "shouldn't" be able to read from their queues, but I've seen it work in a golang service as recently as yesterday. I'm still investigating to see if I can't figure this out.

I say "shouldn't" because the queue policy is set up to allow the task role to read messages, but the task role doesn't include explicit allow permissions for sqs:ReceiveMessage.

Is there any way you could check the logs of the failed tasks? If the circuit breaker is triggered, a lot of tasks are failing to start up. You can find them in the ECS console under the "configuration and tasks" section of the service page, if you scroll down to "Tasks" and select "Stopped Tasks" from the dropdown.
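One possible explanation for the "shouldn't, but does" behavior: within a single AWS account, a request is generally allowed when either the identity-based policy (here, the task role) or the resource-based policy (here, the queue policy) allows it, as long as nothing explicitly denies it. The sketch below is a deliberately simplified model of that rule, not the real IAM engine; it ignores permissions boundaries, session policies, conditions, and wildcard prefixes, and the policy contents are hypothetical stand-ins for what the thread describes.

```javascript
// Simplified model of same-account IAM evaluation (illustration only).
// A statement here is { effect: "Allow" | "Deny", actions: string[] }.
function matches(statement, action) {
  return statement.actions.includes(action) || statement.actions.includes("*");
}

function isAllowed(action, identityStatements, resourceStatements) {
  const all = [...identityStatements, ...resourceStatements];
  // An explicit Deny in any policy always wins.
  if (all.some((s) => s.effect === "Deny" && matches(s, action))) {
    return false;
  }
  // Otherwise a single Allow from either policy type is enough.
  return all.some((s) => s.effect === "Allow" && matches(s, action));
}

// Hypothetical policies: the task role says nothing about SQS,
// while the queue policy allows the task role to read and delete.
const taskRoleStatements = [
  { effect: "Allow", actions: ["logs:PutLogEvents"] },
];
const queuePolicyStatements = [
  { effect: "Allow", actions: ["sqs:ReceiveMessage", "sqs:DeleteMessage"] },
];

console.log(isAllowed("sqs:ReceiveMessage", taskRoleStatements, queuePolicyStatements)); // true
console.log(isAllowed("sqs:SendMessage", taskRoleStatements, queuePolicyStatements)); // false
```

Under this model the queue policy alone would be enough for the worker to read its queue, which matches what was observed in the golang service.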

marytal commented 1 year ago

Hmm thanks! I did as you said and saw all these tasks:

image

But when I clicked into them and went to "Logs" there were no logs to display in any time range. Odd.

As for the sqs:ReceiveMessage permissions, is there a recommended way for me to go about doing that through the copilot-cli?

marytal commented 1 year ago

Oh actually, now that I've tried to redeploy again, I'm seeing that the logs for the stopped tasks contain the "AccessDenied: Access to the resource https://sqs.us-east-2.amazonaws.com/ is denied." errors, so likely I'm not rescuing them properly and they are causing the server to crash. 🤷 Okay, won't worry about that until I get the permissions sorted.

bvtujo commented 1 year ago

Yes! You'd use an addon resource. You may have to hardcode the queue ARN in this policy template, but you should be able to look it up either in the console or via copilot svc show --resources.

Then you'd add the following template into copilot/svcname/addons/sqs-iam.yml.

Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: ${REPLACE_ME_YOUR_QUEUE_ARN}

Outputs:
  # 1. You need to output the IAM ManagedPolicy so that Copilot can add it as a managed policy to your ECS task role.
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy

bvtujo commented 1 year ago

To thicken the plot here, I've just deployed a worker service using copilot 1.27 and been able to see it process messages from its own SQS queue. I'm not sure exactly what's going on with your service. Is there any way that perhaps the worker service is in private subnets? You didn't specify that in your manifest, but I'm trying to find other reasons that the AWS SDK would fail to reach SQS. In this case, it'd be a literal lack of connectivity between the ECS tasks and the SQS endpoint.

Is there any way that the credentials that the node SDK for AWS is using aren't those of the task role? I don't see any way that this is the case in the code you've provided, but this could also be the problem.
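One way to check which credentials the SDK is actually using is to call STS GetCallerIdentity from inside the container (in the AWS SDK for JavaScript v2, `new AWS.STS().getCallerIdentity().promise()`) and look at the returned `Arn`. For a role, that ARN has the form `arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>`. The helper below is a small sketch that pulls the role name out of such an ARN so it can be compared with the task role; the function name and the sample ARN are hypothetical.

```javascript
// Extract the role name from an STS assumed-role ARN, or return null
// when the ARN belongs to something else (for example an IAM user).
function roleNameFromCallerArn(arn) {
  const match = /^arn:[^:]+:sts::\d{12}:assumed-role\/([^/]+)\/.+$/.exec(arn);
  return match ? match[1] : null;
}

// Hypothetical ARN as GetCallerIdentity might return it inside a task.
const sampleArn =
  "arn:aws:sts::123456789012:assumed-role/example-task-role/0123456789abcdef";
console.log(roleNameFromCallerArn(sampleArn)); // example-task-role
```

If the role name printed in the task logs is not the service's task role, the SDK is picking up credentials from somewhere else, such as environment variables baked into the image.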

marytal commented 1 year ago

Hmm I don't think so.. especially since the scheduled job is able to connect via the SDK just fine, and it's all configured in the same way as the worker service.

I see now that there are actually two SQS queues getting created:

image

In my scheduled-job execution, I publish messages to an SNS topic that gets created with my scheduled job:

const topicArn = process.env.TOPIC_ARN;
const sns = new AWS.SNS();

const sendMessage = async (messageBody) => {
  var params = {
    Message: messageBody,
    TopicArn: topicArn,
  };

  try {
    const data = await sns.publish(params).promise();
    console.log(`Message sent to ${topicArn}: ${data.MessageId}`);
  } catch (err) {
    console.error(err);
  }
};

async function main() {
  await sendMessage(
    JSON.stringify({ jobType: "scheduler", someData: "run stuff" })
  );
}

Is that possibly why there are permission issues? Why would there be two SQS queues? Does a new (non-default one) get created because I subscribed to the SNS topic created by my scheduled job?

bvtujo commented 1 year ago

I don't understand why there are two sqs queues; that behavior is supposed to be opt-in like so:

subscribe:
  topics:
    - name: jobs
      service: scheduled-job
      queue: true

I hesitate to tell you to set queue: false explicitly since it seems like you may lose some messages from the topic-specific queue, but I wonder if the queue policy on that queue is correct.

Can you share the resource definition in the CFN template (accessible with copilot svc package or via the CFN console) for the portal-staging-queue-worker-scheduledjobsjobEventsQueue-wHOmtq8dvoDi queue? And the TaskRolePolicy?

marytal commented 1 year ago

The messages are all test, so they can be reset.

So I didn't create this portal-staging-queue-worker so I assumed it was auto-created, but given the date I'm thinking maybe someone else on the team was playing around and created it.

image

I couldn't find a resource definition.. is this the information you wanted?

Side note: I've been trying to deploy with the new sqs-iam.yml and I saw that there was briefly a new queue with 6 messages in it, but now it's gone.

image

The deploys still fail with the tasks showing the same permission error.

I added the addOns and they are getting deployed without issue.

Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: arn:aws:sqs:us-east-2:014491063547:portal-staging-queue-worker-EventsQueue-kjpVYuQKgALp
            # arn:aws:sqs:us-east-2:014491063547:portal-staging-queue-worker-scheduledjobjobsEventsQueue-wHOmtq8dvoDi // full of messages
            # arn:aws:sqs:us-east-2:014491063547:portal-staging-queue-worker-EventsQueue-kjpVYuQKgALp // empty but default one?

Outputs:
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy

marytal commented 1 year ago

Thanks for all of your help by the way, really appreciate it!

marytal commented 1 year ago

Oh.. I think I see what's going on. When I ran the deploy, the "worker-service" queue got briefly created, but then it got deleted because the deploy was rolled back. The other two queues I think just belong to a separate worker service (queue-worker) that someone else created.

So what I'm going to try to do is deploy without referencing the SQS queue. Then hopefully the deploy will succeed and I can copy the ARN of the created queue and add it to my addon. Then deploy again with the code accessing the queue.

🧐🧐🧐
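As an aside on copying the ARN: a standard SQS queue URL of the form `https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>` contains everything needed to build the queue ARN, so the value in `COPILOT_QUEUE_URI` can be converted directly. The sketch below assumes that URL shape and the standard `aws` partition by default; the function name and sample URL are hypothetical.

```javascript
// Build an SQS queue ARN from a queue URL of the form
// https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>.
function queueArnFromUrl(queueUrl, partition = "aws") {
  const { hostname, pathname } = new URL(queueUrl);
  const hostMatch = /^sqs\.([a-z0-9-]+)\.amazonaws\.com$/.exec(hostname);
  // pathname looks like "/<account-id>/<queue-name>".
  const [, accountId, queueName] = pathname.split("/");
  if (!hostMatch || !accountId || !queueName) {
    throw new Error(`Unrecognized SQS queue URL: ${queueUrl}`);
  }
  return `arn:${partition}:sqs:${hostMatch[1]}:${accountId}:${queueName}`;
}

console.log(
  queueArnFromUrl("https://sqs.us-east-2.amazonaws.com/123456789012/example-queue")
); // arn:aws:sqs:us-east-2:123456789012:example-queue
```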

bvtujo commented 1 year ago

If the deployment succeeds the first time, you may not have to update the IAM addon policy at all. If those queues you saw were left over from another worker service, that explains all the permissions problems you're experiencing.

marytal commented 1 year ago

The deployment keeps failing :'( I tried to deploy and it failed (logs in the stopped tasks all seem to be fine, just logging "deployed successfully" and exiting)

So I deleted the service completely and then re-init and re-deployed it, and it's still failing.. Do you have any other tips for figuring out why the ECS deployment fails? The stopped task logs aren't telling me much.

bvtujo commented 1 year ago

I don't have much node experience--is there any chance these tasks are spinning up, looking for one message to consume, then exiting? If they are not staying alive, that could explain some of your issues. Are they supposed to emit any logs when they consume a message?
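For illustration, a worker that stays alive would typically poll in a loop instead of calling receive once. The following is a minimal sketch, assuming an SQS client with the AWS SDK for JavaScript v2 call shape (`receiveMessage(...).promise()` and `deleteMessage(...).promise()`). The names `pollQueue`, `handleMessage`, and `shouldStop` are hypothetical, and the 20-second long poll and 5-second backoff are arbitrary choices.

```javascript
// Keep polling until shouldStop() returns true, so the task does not
// exit after a single receive attempt.
async function pollQueue(sqs, queueUrl, handleMessage, shouldStop = () => false) {
  while (!shouldStop()) {
    try {
      const data = await sqs
        .receiveMessage({
          QueueUrl: queueUrl,
          MaxNumberOfMessages: 1,
          WaitTimeSeconds: 20, // long polling: wait for a message instead of returning at once
        })
        .promise();
      for (const message of data.Messages ?? []) {
        await handleMessage(message);
        // Delete only after the handler succeeded.
        await sqs
          .deleteMessage({ QueueUrl: queueUrl, ReceiptHandle: message.ReceiptHandle })
          .promise();
      }
    } catch (err) {
      // Log and back off briefly instead of crashing the task.
      console.error(err);
      await new Promise((resolve) => setTimeout(resolve, 5000));
    }
  }
}

// Usage with the SDK already used in this thread (not executed here):
//   pollQueue(new AWS.SQS(), process.env.COPILOT_QUEUE_URI, processMessage);
```

With a loop like this the container keeps running between messages, so ECS should no longer see the task exit right after startup.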

bvtujo commented 1 year ago

My only hint here is to check the CloudFormation deployment events to see if there's a reason that the service can't be created, then check to make sure that the service can deploy without subscribing to any SNS topics.

bvtujo commented 1 year ago

@marytal I am still confused by your error; the permissions issue you're describing shouldn't be happening. We grant the task role access to each queue in the QueuePolicy object. Can you by any chance share the full CloudFormation template for this service? It'd be the output of copilot svc package, or you can grab it from the Template tab of the CloudFormation Stack page. Sorry for so many requests!

marytal commented 1 year ago

Oh don't be sorry, I'm grateful for the help. I'm having some issues deploying the service at all right now (unrelated, that was working fine before), so I don't know if that will affect any of your research but here it is:

Service name: worker-service
Environment: staging
# Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
AWSTemplateFormatVersion: 2010-09-09
Description: CloudFormation template that represents a worker service on Amazon ECS.
Metadata:
  Manifest: |
    # The manifest for the "worker-service" service.
    # Read the full specification for the "Worker Service" type at:
    # https://aws.github.io/copilot-cli/docs/manifest/worker-service/

    # Your service name will be used in naming your resources like log groups, ECS services, etc.
    name: worker-service
    type: Worker Service

    # Configuration for your containers and service.
    image:
      # Docker build arguments.
      build: worker-service/Dockerfile

    platform: linux/x86_64 # See https://aws.github.io/copilot-cli/docs/manifest/worker-service/#platform
    count:
      range:
        min: 1
        max: 1
        spot_from: 1
      queue_delay:
        acceptable_latency: 10m
        msg_processing_time: 60s
    exec: true # Enable running commands in your container.

    # storage:
    # readonly_fs: true       # Limit to read-only access to mounted root filesystems.

    # You can register to topics from other services.
    # The events can be received from an SQS queue via the env var $COPILOT_QUEUE_URI.

    subscribe:
      topics:
        - name: jobs
          service: scheduled-job
          queue: false

    # Optional fields for more advanced use-cases.
    #
    #variables:                    # Pass environment variables as key value pairs.
    #  LOG_LEVEL: info

    #secrets:                      # Pass secrets from AWS Systems Manager (SSM) Parameter Store.
    #  GITHUB_TOKEN: GITHUB_TOKEN  # The key is the name of the environment variable, the value is the name of the SSM parameter.

    # You can override any of the values defined above by environment.
    environments:
      dev:
        secrets:
          DOPPLER_TOKEN: /copilot/portal/dev/secrets/DOPPLER_TOKEN_GRAPHQL
        cpu: 256 # Number of CPU units for the task.
        memory: 512 # Amount of memory in MiB used by the task.

      staging:
        secrets:
          DOPPLER_TOKEN: /copilot/portal/staging/secrets/DOPPLER_TOKEN_GRAPHQL
        cpu: 256 # Number of CPU units for the task.
        memory: 512 # Amount of memory in MiB used by the task.

      prod:
        secrets:
          DOPPLER_TOKEN: /copilot/portal/prod/secrets/DOPPLER_TOKEN_GRAPHQL
        cpu: 512 # Number of CPU units for the task.
        memory: 1024 # Amount of memory in MiB used by the task.
Parameters:
  AppName:
    Type: String
  EnvName:
    Type: String
  WorkloadName:
    Type: String
  ContainerImage:
    Type: String
  TaskCPU:
    Type: String
  TaskMemory:
    Type: String
  TaskCount:
    Type: Number
  AddonsTemplateURL:
    Description: 'URL of the addons nested stack template within the S3 bucket.'
    Type: String
    Default: ""
  EnvFileARN:
    Description: 'URL of the environment file.'
    Type: String
    Default: ""
  LogRetention:
    Type: Number
    Default: 30
Conditions:
  IsGovCloud: !Equals [!Ref "AWS::Partition", "aws-us-gov"]
  HasAddons: !Not [!Equals [!Ref AddonsTemplateURL, ""]]
  HasEnvFile: !Not [!Equals [!Ref EnvFileARN, ""]]
Resources:
  LogGroup:
    Metadata:
      'aws:copilot:description': 'A CloudWatch log group to hold your service logs'
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Join ['', [/copilot/, !Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
      RetentionInDays: !Ref LogRetention
  TaskDefinition:
    Metadata:
      'aws:copilot:description': 'An ECS task definition to group your containers and run them on ECS'
    Type: AWS::ECS::TaskDefinition
    DependsOn: LogGroup
    Properties:
      Family: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: !Ref TaskCPU
      Memory: !Ref TaskMemory
      ExecutionRoleArn: !GetAtt ExecutionRole.Arn
      TaskRoleArn: !GetAtt TaskRole.Arn
      ContainerDefinitions:
        - Name: !Ref WorkloadName
          Image: !Ref ContainerImage
          Secrets:
            - Name: DOPPLER_TOKEN
              ValueFrom: /copilot/portal/staging/secrets/DOPPLER_TOKEN_GRAPHQL
          Environment:
            - Name: COPILOT_APPLICATION_NAME
              Value: !Sub '${AppName}'
            - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
              Value: staging.portal.local
            - Name: COPILOT_ENVIRONMENT_NAME
              Value: !Sub '${EnvName}'
            - Name: COPILOT_SERVICE_NAME
              Value: !Sub '${WorkloadName}'
            - Name: COPILOT_QUEUE_URI
              Value: !Ref EventsQueue
          EnvironmentFiles:
            - !If
              - HasEnvFile
              - Type: s3
                Value: !Ref EnvFileARN
              - !Ref AWS::NoValue
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-region: !Ref AWS::Region
              awslogs-group: !Ref LogGroup
              awslogs-stream-prefix: copilot
  ExecutionRole:
    Metadata:
      'aws:copilot:description': 'An IAM Role for the Fargate agent to make AWS API calls on your behalf'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, SecretsPolicy]]
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Allow'
                Action:
                  - 'ssm:GetParameters'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/*'
                Condition:
                  StringEquals:
                    'ssm:ResourceTag/copilot-application': !Sub '${AppName}'
                    'ssm:ResourceTag/copilot-environment': !Sub '${EnvName}'
              - Effect: 'Allow'
                Action:
                  - 'secretsmanager:GetSecretValue'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:*'
                Condition:
                  StringEquals:
                    'secretsmanager:ResourceTag/copilot-application': !Sub '${AppName}'
                    'secretsmanager:ResourceTag/copilot-environment': !Sub '${EnvName}'
              - Effect: 'Allow'
                Action:
                  - 'kms:Decrypt'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:key/*'
        - !If
          # Optional IAM permission required by ECS task def env file
          # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html#taskdef-envfiles-iam
          # Example EnvFileARN: arn:aws:s3:::stackset-demo-infrastruc-pipelinebuiltartifactbuc-11dj7ctf52wyf/manual/1638391936/env
          - HasEnvFile
          - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicy]]
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: 'Allow'
                  Action:
                    - 's3:GetObject'
                  Resource:
                    - !Ref EnvFileARN
                - Effect: 'Allow'
                  Action:
                    - 's3:GetBucketLocation'
                  Resource:
                    - !Join
                      - ''
                      - - 'arn:'
                        - !Ref AWS::Partition
                        - ':s3:::'
                        - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARN]]]]
          - !Ref AWS::NoValue
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
  TaskRole:
    Metadata:
      'aws:copilot:description': 'An IAM role to control permissions for the containers in your tasks'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: 'DenyIAMExceptTaggedRoles'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Deny'
                Action: 'iam:*'
                Resource: '*'
              - Effect: 'Allow'
                Action: 'sts:AssumeRole'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/*'
                Condition:
                  StringEquals:
                    'iam:ResourceTag/copilot-application': !Sub '${AppName}'
                    'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
        - PolicyName: 'ExecuteCommand'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Allow'
                Action: ["ssmmessages:CreateControlChannel", "ssmmessages:OpenControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenDataChannel"]
                Resource: "*"
              - Effect: 'Allow'
                Action: ["logs:CreateLogStream", "logs:DescribeLogGroups", "logs:DescribeLogStreams", "logs:PutLogEvents"]
                Resource: "*"
  DynamicDesiredCountAction:
    Metadata:
      'aws:copilot:description': "A custom resource returning the ECS service's running task count"
    Type: Custom::DynamicDesiredCountFunction
    Properties:
      ServiceToken: !GetAtt DynamicDesiredCountFunction.Arn
      Cluster:
        Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
      App: !Ref AppName
      Env: !Ref EnvName
      Svc: !Ref WorkloadName
      DefaultDesiredCount: !Ref TaskCount
      # We need to force trigger this lambda function on all deployments, so we give it a random ID as input on all event types.
      UpdateID: 73191530-9589-4126-8a33-90dbd274c63f
  DynamicDesiredCountFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket: stackset-portal-infrastr-pipelinebuiltartifactbuc-xv6y38wdclzb
        S3Key: manual/scripts/custom-resources/dynamicdesiredcountfunction/acd1f00a18ceccc32a780fb208be61f3f62274d775f987fd9feec37493d9173c.zip
      Handler: "index.handler"
      Timeout: 600
      MemorySize: 512
      Role: !GetAtt 'DynamicDesiredCountFunctionRole.Arn'
      Runtime: nodejs16.x
  DynamicDesiredCountFunctionRole:
    Metadata:
      'aws:copilot:description': "An IAM Role for describing number of running tasks in your ECS service"
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: "DelegateDesiredCountAccess"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: ECS
                Effect: Allow
                Action:
                  - ecs:DescribeServices
                Resource: "*"
                Condition:
                  ArnEquals:
                    'ecs:cluster':
                      Fn::Sub:
                        - arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}
                        - ClusterName:
                            Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
              - Sid: ResourceGroups
                Effect: Allow
                Action:
                  - resource-groups:GetResources
                Resource: "*"
              - Sid: Tags
                Effect: Allow
                Action:
                  - "tag:GetResources"
                Resource: "*"
  AutoScalingRole:
    Metadata:
      'aws:copilot:description': 'An IAM role for container auto scaling'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonEC2ContainerServiceAutoscaleRole'
  AutoScalingTarget:
    Metadata:
      'aws:copilot:description': "An autoscaling target to scale your service's desired count"
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MinCapacity: 1
      MaxCapacity: 1
      ResourceId:
        Fn::Join:
          - '/'
          - - 'service'
            - Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
            - !GetAtt Service.Name
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt AutoScalingRole.Arn
  BacklogPerTaskCalculatorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName:
        Fn::Join:
          - '/'
          - - '/aws'
            - 'lambda'
            - Fn::Sub: "${BacklogPerTaskCalculatorFunction}"
      RetentionInDays: 3
  BacklogPerTaskCalculatorFunction:
    Metadata:
      'aws:copilot:description': "A Lambda function to emit BacklogPerTask metrics to CloudWatch"
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket: stackset-portal-infrastr-pipelinebuiltartifactbuc-xv6y38wdclzb
        S3Key: manual/scripts/custom-resources/backlogpertaskcalculatorfunction/bf3100e33cd3034c18d5085d79928ebca40a6ef289ce6a36bf3934e59c528275.zip
      Handler: "index.handler"
      Timeout: 600
      MemorySize: 512
      Role: !GetAtt BacklogPerTaskCalculatorRole.Arn
      Runtime: nodejs16.x
      Environment:
        Variables:
          CLUSTER_NAME:
            Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
          SERVICE_NAME: !Ref Service
          NAMESPACE: !Sub '${AppName}-${EnvName}-${WorkloadName}'
          QUEUE_NAMES:
            Fn::Join:
              - ','
              - - !GetAtt EventsQueue.QueueName
  BacklogPerTaskCalculatorRole:
    Metadata:
      'aws:copilot:description': 'An IAM role for BacklogPerTaskCalculatorFunction'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "BacklogPerTaskCalculatorAccess"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: ECS
                Effect: Allow
                Action:
                  - ecs:DescribeServices
                Resource: "*"
                Condition:
                  ArnEquals:
                    'ecs:cluster':
                      Fn::Sub:
                        - arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}
                        - ClusterName:
                            Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
              - Sid: SQS
                Effect: Allow
                Action:
                  - sqs:GetQueueAttributes
                  - sqs:GetQueueUrl
                Resource:
                  - !GetAtt EventsQueue.Arn
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  BacklogPerTaskScheduledRule:
    Metadata:
      'aws:copilot:description': "A trigger to invoke the BacklogPerTaskCalculator Lambda function every minute"
    DependsOn:
      - BacklogPerTaskCalculatorLogGroup # Ensure log group is created before invoking.
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: "rate(1 minute)"
      State: "ENABLED"
      Targets:
        - Arn: !GetAtt BacklogPerTaskCalculatorFunction.Arn
          Id: "BacklogPerTaskCalculatorFunctionTrigger"
  PermissionToInvokeBacklogPerTaskCalculatorLambda:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref BacklogPerTaskCalculatorFunction
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt BacklogPerTaskScheduledRule.Arn
  AutoScalingPolicyEventsQueue:
    Metadata:
      'aws:copilot:description': "An autoscaling policy to maintain 10 messages/task for EventsQueue"
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: !Join ['-', [!Ref WorkloadName, BacklogPerTask, !GetAtt EventsQueue.QueueName]]
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref AutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        ScaleInCooldown: 120
        ScaleOutCooldown: 60
        CustomizedMetricSpecification:
          Namespace: !Sub '${AppName}-${EnvName}-${WorkloadName}'
          MetricName: BacklogPerTask
          Statistic: Average
          Dimensions:
            - Name: QueueName
              Value: !GetAtt EventsQueue.QueueName
          Unit: Count
        TargetValue: 10
  Service:
    DependsOn:
      - EnvControllerAction
    Metadata:
      'aws:copilot:description': 'An ECS service to run and maintain your tasks in the environment cluster'
    Type: AWS::ECS::Service
    Properties:
      PlatformVersion: LATEST
      Cluster:
        Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: !GetAtt DynamicDesiredCountAction.DesiredCount
      DeploymentConfiguration:
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
        MinimumHealthyPercent: 100
        MaximumPercent: 200
        Alarms:
          AlarmNames: []
          Enable: false
          Rollback: true
      PropagateTags: SERVICE
      EnableExecuteCommand: true
      CapacityProviderStrategy:
        - CapacityProvider: FARGATE_SPOT
          Weight: 1
        - CapacityProvider: FARGATE
          Weight: 0
          Base: 0
      ServiceConnectConfiguration: !If
        - IsGovCloud
        - !Ref AWS::NoValue
        - Enabled: False
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          Subnets:
            Fn::Split:
              - ','
              - Fn::ImportValue: !Sub '${AppName}-${EnvName}-PublicSubnets'
          SecurityGroups:
            - Fn::ImportValue: !Sub '${AppName}-${EnvName}-EnvironmentSecurityGroup'
      ServiceRegistries: !Ref 'AWS::NoValue'
  EventsKMSKey:
    Metadata:
      'aws:copilot:description': 'A KMS key to encrypt messages in your queues'
    Type: AWS::KMS::Key
    Properties:
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: "Allow key use"
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:root'
            Action:
              - "kms:Create*"
              - "kms:Describe*"
              - "kms:Enable*"
              - "kms:List*"
              - "kms:Put*"
              - "kms:Update*"
              - "kms:Revoke*"
              - "kms:Disable*"
              - "kms:Get*"
              - "kms:Delete*"
              - "kms:ScheduleKeyDeletion"
              - "kms:CancelKeyDeletion"
              - "kms:Tag*"
              - "kms:UntagResource"
              - "kms:Encrypt"
              - "kms:Decrypt"
              - "kms:ReEncrypt*"
              - "kms:GenerateDataKey*"
            Resource: '*'
          - Sid: "Allow SNS encryption"
            Effect: "Allow"
            Principal:
              Service: sns.amazonaws.com
            Action:
              - "kms:Decrypt"
              - "kms:GenerateDataKey*"
            Resource: '*'
          - Sid: "Allow SQS encryption"
            Effect: "Allow"
            Principal:
              Service: sqs.amazonaws.com
            Action:
              - "kms:Encrypt"
              - "kms:Decrypt"
              - "kms:ReEncrypt*"
              - "kms:GenerateDataKey*"
            Resource: '*'
          - Sid: "Allow task role encrypt/decrypt"
            Effect: "Allow"
            Principal:
              AWS:
                - !GetAtt TaskRole.Arn
            Action:
              - "kms:Encrypt"
              - "kms:Decrypt"
            Resource: '*'
  EventsQueue:
    Metadata:
      'aws:copilot:description': 'An events SQS queue to buffer messages'
    Type: AWS::SQS::Queue
    Properties:
      KmsMasterKeyId: !Ref EventsKMSKey
  QueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues: [!Ref 'EventsQueue']
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS:
                - !GetAtt TaskRole.Arn
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: !GetAtt EventsQueue.Arn
          - Effect: Allow
            Principal:
              Service: sns.amazonaws.com
            Action:
              - sqs:SendMessage
            Resource: !GetAtt EventsQueue.Arn
            Condition:
              ArnEquals:
                aws:SourceArn: !Join ['', [!Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:', !Ref AppName, '-', !Ref EnvName, '-scheduled-job-jobs']]
  scheduledjobjobsSNSTopicSubscription:
    Metadata:
      'aws:copilot:description': 'A SNS subscription to topic jobs from service scheduled-job'
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Join ['', [!Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:', !Ref AppName, '-', !Ref EnvName, '-scheduled-job-jobs']]
      Protocol: 'sqs'
      Endpoint: !GetAtt EventsQueue.Arn
  AddonsStack:
    Metadata:
      'aws:copilot:description': 'An Addons CloudFormation Stack for your additional AWS resources'
    Type: AWS::CloudFormation::Stack
    Condition: HasAddons
    Properties:
      Parameters:
        App: !Ref AppName
        Env: !Ref EnvName
        Name: !Ref WorkloadName
      TemplateURL: !Ref AddonsTemplateURL
  EnvControllerAction:
    Metadata:
      'aws:copilot:description': "Update your environment's shared resources"
    Type: Custom::EnvControllerFunction
    Properties:
      ServiceToken: !GetAtt EnvControllerFunction.Arn
      Workload: !Ref WorkloadName
      EnvStack: !Sub '${AppName}-${EnvName}'
      Parameters: []
      EnvVersion: v1.13.0
  EnvControllerFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket: stackset-portal-infrastr-pipelinebuiltartifactbuc-xv6y38wdclzb
        S3Key: manual/scripts/custom-resources/envcontrollerfunction/3ffcf03598029891816b7ce2d1ff14fdd8079af4406a0cfeff1d4aa0109dcd7d.zip
      Handler: "index.handler"
      Timeout: 900
      MemorySize: 512
      Role: !GetAtt 'EnvControllerRole.Arn'
      Runtime: nodejs16.x
  EnvControllerRole:
    Metadata:
      'aws:copilot:description': "An IAM role to update your environment stack"
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "EnvControllerStackUpdate"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - cloudformation:DescribeStacks
                  - cloudformation:UpdateStack
                Resource: !Sub 'arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${AppName}-${EnvName}/*'
                Condition:
                  StringEquals:
                    'cloudformation:ResourceTag/copilot-application': !Sub '${AppName}'
                    'cloudformation:ResourceTag/copilot-environment': !Sub '${EnvName}'
        - PolicyName: "EnvControllerRolePass"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - iam:PassRole
                Resource: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${AppName}-${EnvName}-CFNExecutionRole'
                Condition:
                  StringEquals:
                    'iam:ResourceTag/copilot-application': !Sub '${AppName}'
                    'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Lou1415926 commented 1 year ago

Heyy @marytal! Thanks for sharing the CFN snippets! Very helpful ❤️

I read through the thread, and found this particularly interesting:

I tried to deploy and it failed (logs in the stopped tasks all seem to be fine, just logging "deployed successfully" and exiting)

It looks like the service was successfully deployed & was running briefly; and then it exited (peacefully). However, because it was running for such a short period of time, the health check was never able to tell that the service was stable. After a few "unhealthy" checks, the health check decided that the deployment was unsuccessful.

For example, if you deploy a service with the Python application code below ⬇️, what I described above would happen: the program takes less than a second to run, so the health check never sees the service as stable.

print("yo!")

Then I looked at the application code you posted in the main post.

async function main() {
  await receiveAndProcessMessage();
}

main()

Perhaps the program called the async function main() without waiting for it and then just exited immediately. I wonder if it would help to add await in front of main()?

Lou1415926 commented 1 year ago

In addition, it'd probably help to have a loop (maybe with a sleep) like this ⬇️

async function main() {
  while (true) {
    await receiveAndProcessMessage();
    // Maybe add a sleep here
  }
}

Otherwise, even if we await, the program will retrieve messages once and just exit after that. That is still probably too short for the health check to know it's stable!
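A minimal sketch of such a loop with a pause between polls. The names `sleep` and `pollWhile` are hypothetical (they are not part of Copilot or the AWS SDK), and `work` stands in for the `receiveAndProcessMessage` function from the original post:

```javascript
// Resolve after `ms` milliseconds; a promise-based wrapper around setTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `work` repeatedly for as long as `shouldContinue()` returns true,
// pausing `delayMs` milliseconds between iterations.
// Returns the number of iterations that ran.
async function pollWhile(work, shouldContinue, delayMs) {
  let iterations = 0;
  while (shouldContinue()) {
    await work();
    iterations += 1;
    await sleep(delayMs);
  }
  return iterations;
}
```

In the worker this could be called as `pollWhile(receiveAndProcessMessage, () => true, 1000)`, which keeps the process alive so the task does not exit right after the first poll.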

marytal commented 1 year ago

You're so helpful, thank you!

Okay! Updates! I've finally gotten the deploy to succeed.

I contacted AWS support and it turns out the worker services need to run continuously (similar to a backend service). I was under the impression that we wanted to run a quick script to read a single message and then exit. So now I've got:

let continuePolling = true;
async function main() {
  while (continuePolling) {
    await receiveAndProcessMessage();
  }
}

main().catch((e) => {
  console.log("An error caused the worker service to stop.", e);
});

process.on("SIGTERM", () => {
  console.log("Received SIGTERM signal, will quit when all work is done");
  continuePolling = false;
});

But, now that it's up and running again, we're back to AccessDenied, so I'm going to try to add the addon again and see if that solves that!

marytal commented 1 year ago

Okay.. deployed successfully with the addons. Same error.

worker-service/addons/sqs-iam.yml:

Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: "arn:aws:sqs:us-east-2:014491063547:portal-staging-worker-service-EventsQueue-bPDGV3YdVmzb"

Outputs:
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy

Using process.env.COPILOT_QUEUE_URI for the queue URL.

worker-service/manifest.yml:

name: worker-service
type: Worker Service

image:
  build: worker-service/Dockerfile

platform: linux/x86_64
count: 1 
exec: true 

subscribe:
  topics:
    - name: jobs
      service: scheduled-job
      queue: false

environments:
  dev:
    secrets:
      DOPPLER_TOKEN: /copilot/portal/dev/secrets/DOPPLER_TOKEN_GRAPHQL
    cpu: 256
    memory: 512

  staging:
    secrets:
      DOPPLER_TOKEN: /copilot/portal/staging/secrets/DOPPLER_TOKEN_GRAPHQL
    cpu: 256 
    memory: 512 

  prod:
    secrets:
      DOPPLER_TOKEN: /copilot/portal/prod/secrets/DOPPLER_TOKEN_GRAPHQL
    cpu: 512
    memory: 1024 

I'll ask AWS support to have a look!

Lou1415926 commented 1 year ago

How weird! I tried to reproduce the issue, but my worker service was able to receive messages even without the addons 🤔. What is the version of the AWS SDK that you are using? Is it v2 or v3 🤔

marytal commented 1 year ago

I was using v2 but I switched to v3 and I see the same issue:

import {
  DeleteMessageCommand,
  Message,
  ReceiveMessageCommand,
  SQSClient,
} from "@aws-sdk/client-sqs";

const queueUrl = process.env.COPILOT_QUEUE_URI;
const sqs = new SQSClient({ region: "use-east-2" });

const receiveAndProcessMessage = async () => {
  console.log("Attempting to receive message...");

  try {
    const receiveMessageResponse = await sqs.send(
      new ReceiveMessageCommand({
        QueueUrl: queueUrl,
        MaxNumberOfMessages: 1,
        WaitTimeSeconds: 20,
        VisibilityTimeout: 60,
      })
    );

    if (receiveMessageResponse.Messages) {
      const message = receiveMessageResponse.Messages[0];
      await processMessage(message);
      await deleteMessage(message);
    } else {
      console.log("No messages in queue.");
    }
  } catch (err) {
    console.error(err);
  }
};

I'm waiting on a response from AWS support. I'll let you know if they find anything, but it's likely not a copilot-cli issue, so we can close this if you'd like!

Lou1415926 commented 1 year ago

Yeah AWS support should be able to help! Probably not related but just in case 💭 In const sqs = new SQSClient({ region: "use-east-2" });, "us-east-2" was mistyped!

marytal commented 1 year ago

(not the issue, but thanks! :) )

Lou1415926 commented 1 year ago

@marytal While we wait on AWS support, my teammate @dannyrandall thought of this possibility: did you happen to set any environment variables from inside your Dockerfile, such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY? 🤔

marytal commented 1 year ago

Hmm I use Doppler for secret management and I set up the doppler key like so:

  staging:
    secrets:
      DOPPLER_TOKEN: /copilot/portal/prod/secrets/DOPPLER_TOKEN_GRAPHQL

And then in my Dockerfile I run doppler run -- npm run start

After your comment I wanted to double check that my secret/access key are available in the app so I logged them and they seem to be accessible with process.env.AWS_ACCESS_KEY_ID, etc.

Are you thinking maybe they weren't accessible?

Lou1415926 commented 1 year ago

umm huh interesting!

I am not familiar with Doppler, so I'm not sure exactly which secrets it is injecting. The service that I used for testing doesn't have AWS_ACCESS_KEY_ID as an env var, so the credential chain falls through to the "ECS credential provider" (see this doc for the order in which credentials are selected).

Therefore, if there are other AWS credentials present in your container, it is possible that the SQS client is making calls from that credential (which doesn't have access to the SQS queue), instead of from the TaskRole.
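One way to check for this at startup is sketched below. It assumes the standard variable names read by the AWS SDK's environment credential provider; the helper name `findStaticCredentialVars` is hypothetical. It reports only the names of the variables, never their values:

```javascript
// Environment variables read by the AWS SDK's environment credential provider.
// When the access key ID and secret key are set, they are picked up before
// the ECS container (task role) credentials.
const STATIC_CREDENTIAL_VARS = [
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "AWS_SESSION_TOKEN",
];

// Return the names (not the values) of static credential variables set in `env`.
function findStaticCredentialVars(env) {
  return STATIC_CREDENTIAL_VARS.filter((name) => Boolean(env[name]));
}

// Log a warning if anything in the container would shadow the task role.
const shadowing = findStaticCredentialVars(process.env);
if (shadowing.length > 0) {
  console.warn("Static AWS credentials found in env:", shadowing.join(", "));
}
```

If the warning shows up in the task logs, the SDK is most likely not using the TaskRole.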

Lou1415926 commented 1 year ago

My awesome teammate @dannyrandall had this snippet that you can use to look at the identity being used to make calls:

try {
  const sts = new aws.STS();
  const id = await sts.getCallerIdentity().promise();
  console.log("id:", id);
} catch (err) {
  console.log("error getting identity", err);
}

This should give you information such as the ARN of the identity (so we know whether it's the task role or not) and the account ID! We can give it a try and see what we can find.
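To read the result, here is a rough sketch that splits the returned ARN into its parts. The helper name `describeCallerArn` is hypothetical, and it only handles the two shapes relevant here: an IAM user ARN and the assumed-role ARN that a task role would typically show up as.

```javascript
// Split the ARN returned by GetCallerIdentity into account, kind, and name.
//   IAM user:     arn:aws:iam::<account>:user/<user-name>
//   Assumed role: arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
function describeCallerArn(arn) {
  const parts = arn.split(":");
  if (parts.length < 6 || parts[0] !== "arn") {
    throw new Error(`Not an ARN: ${arn}`);
  }
  const accountId = parts[4];
  const resource = parts.slice(5).join(":");
  const [kind, ...rest] = resource.split("/");
  // For an assumed role, the first segment after the kind is the role name;
  // the trailing segment is the session name and is dropped here.
  const name = kind === "assumed-role" ? rest[0] : rest.join("/");
  return { accountId, kind, name };
}
```

A `kind` of `user` means the calls are made with an IAM user's access keys; `assumed-role` with the task role's name is what you would expect when the TaskRole is in use.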

marytal commented 1 year ago

Hi! I will try that, thank you! I got a response from amazon support:

I worked with our SQS internal service team with the request id you shared and they found that the receive message api call recorded in the request id was made from the IAM User arn:aws:iam::014491063547:user/tbt-portal-staging. This IAM User has no SQS Permissions on an IAM Level, or on the SQS Queue Access Policy level. Hence, we might need to check whether we are configuring this user at any point during the ECS cluster configurations

So there is some progress!

marytal commented 1 year ago

Hi again! I received a response from AWS:

As suggested by the internal SQS team, could you please provide enough permission to the IAM User arn:aws:iam::014491063547:user/tbt-portal-staging to do the receive message and delete message actions on sqs queue. I am attaching a sample SQS policy for your reference.

   {
      "Sid": "__owner_statement",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::014491063547:user/tbt-portal-staging"
      },
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage"
      ],
      "Resource": "<queue arn>"
    }

I updated my addon to be:

Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Principal:
              AWS: "arn:aws:iam::014491063547:user/tbt-portal-staging"
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: "arn:aws:sqs:us-east-2:014491063547:portal-staging-worker-service-EventsQueue-bPDGV3YdVmzb"

Outputs:
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy

But when I tried to deploy, I got an error: Policy document should not specify a principal.

Is there a way to add the permissions that they've asked me to add via copilot?

Lou1415926 commented 1 year ago

Hello @marytal - glad to hear back from you!

I think the problem here is two-fold.

The user arn:aws:iam::014491063547:user/tbt-portal-staging is probably managed outside of Copilot, because we don't create IAM users as part of the infrastructure by default. For a non-Copilot IAM identity, you should add the permissions wherever your team manages that IAM user (for example, the IAM console, the AWS CLI, or CloudFormation) instead of through Copilot addons.

However, a typical Copilot setup just uses the TaskRole that Copilot creates for you to receive/delete messages in the ECS tasks. This is the default behavior. In your case, the presence of process.env.AWS_ACCESS_KEY_ID (discussed in https://github.com/aws/copilot-cli/issues/4770#issuecomment-1533510192), which likely points to user/tbt-portal-staging, prevents the SDK from making calls as the TaskRole, because AWS_ACCESS_KEY_ID takes priority over the TaskRole in the credential chain.

If you are able to remove that AWS_ACCESS_KEY_ID env var, your task should be able to make calls as the TaskRole, which wouldn't have any permission issue in the first place.

If you are certain that user/tbt-portal-staging is expected to be the identity that receives/deletes messages, then please go ahead and add the permissions wherever user/tbt-portal-staging is managed. Otherwise, you can remove the AWS_ACCESS_KEY_ID environment variable, so that your ECS tasks use the Copilot TaskRole without permission issues.

marytal commented 1 year ago

Hi! Thanks so much for your help!!

Therefore, if there are other AWS credentials present in your container, it is possible that the SQS client is making calls from that credential (which doesn't have access to the SQS queue), instead of from the TaskRole.

^ You were totally right about this, that is exactly what was going on!

Everything is working as expected now 🎉 Haha finally 😅

Thanks again. 🎉 It feels almost sad to close this issue, been going on for so long! I'll miss you :P !