Closed marytal closed 1 year ago
Hey @marytal, you're right that this shouldn't be an issue. Let me look into this and see if there's a quick fix.
Thanks Austin!
Is there any additional information that I could provide you that would be helpful?
Here is the request ID for one of the failed attempts to access the SQS queue: 94c9f3be-9766-5612-94a2-af05d04d245b
Also, I don't know if this is relevant, but the deployment for the worker service has been flaky. I just tried to re-deploy, and in CloudFormation I saw the following events:
CREATE_FAILED: Resource handler returned message: "Error occurred during operation 'ECS Deployment Circuit Breaker was triggered'." (RequestToken: d23c1f89-7317-7bf7-70f7-8d4f5b68b775, HandlerErrorCode: GeneralServiceException)
ROLLBACK_IN_PROGRESS: The following resource(s) failed to create: [Service]. Rollback requested by user.
To be clear, it deployed properly the first time, so it rolls back to that. The code changes since then have been minimal (basically just some added logs), so I'm not sure why this keeps happening.
Sorry for the wait; I've been trying to repro this issue. Judging by the way our permissions are set up, worker services "shouldn't" be able to read from their queues, but I've seen it work in a Golang service as recently as yesterday. I'm still investigating to see if I can figure this out.
I say "shouldn't" because the queue policy is set up to allow the task role to read messages, but the task role doesn't include explicit allow permissions for sqs:ReceiveMessage.
Is there any way you could check the logs of the failed tasks? If the circuit breaker was triggered, it means a lot of tasks failed to start up. You can find them in the ECS console under the "configuration and tasks" section of the service page: scroll down to "Tasks" and select "Stopped Tasks" from the dropdown.
Hmm thanks! I did as you said and saw all these tasks:
But when I clicked into them and went to "Logs", there were no logs to display in any time range. Odd.
As for the sqs:ReceiveMessage permissions, is there a recommended way for me to go about adding those through the copilot-cli?
Oh, actually, now that I've tried to redeploy again, I'm seeing that the logs for the stopped tasks contain the AccessDenied: Access to the resource https://sqs.us-east-2.amazonaws.com/ is denied. errors, so likely I'm not rescuing them properly and they're causing the server to crash. 🤷 Okay, I won't worry about that until I get the permissions sorted.
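For what it's worth, one way to keep an AccessDenied rejection from taking down the whole process is to rescue around each poll. This is only a sketch with a hypothetical `pollOnce` stand-in, not the actual service code:

```javascript
// Sketch: rescue per-poll errors so a rejected SQS call (e.g. AccessDenied)
// is logged instead of crashing the process. `pollOnce` is a hypothetical
// stand-in for the receive/process call.
async function pollSafely(pollOnce, { onError = console.error } = {}) {
  try {
    await pollOnce();
    return true; // poll completed
  } catch (err) {
    onError("poll failed:", err);
    return false; // error swallowed; the caller decides whether to retry
  }
}
```

For example, `await pollSafely(() => Promise.reject(new Error("AccessDenied")))` resolves to `false` instead of crashing the process.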
Yes! You'd use an addon resource. You may have to hardcode the queue ARN in this policy template, but you should be able to look it up either in the console or via copilot svc show --resources.
Then you'd add the following template at copilot/svcname/addons/sqs-iam.yml:
Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: ${REPLACE_ME_YOUR_QUEUE_ARN}

Outputs:
  # 1. You need to output the IAM ManagedPolicy so that Copilot can add it as a managed policy to your ECS task role.
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy
To thicken the plot: I've just deployed a worker service using Copilot 1.27 and was able to see it process messages from its own SQS queue, so I'm not sure exactly what's going on with your service. Is there any chance the worker service is in private subnets? You didn't specify that in your manifest, but I'm trying to find other reasons the AWS SDK would fail to reach SQS; in that case, it'd be a literal lack of connectivity between the ECS tasks and the SQS endpoint.
Is there any chance the credentials the Node AWS SDK is using aren't those of the task role? I don't see how that could be the case in the code you've provided, but it could also be the problem.
Hmm, I don't think so, especially since the scheduled job is able to connect via the SDK just fine, and it's configured the same way as the worker service.
I see now that there are actually two SQS queues getting created:
In my scheduled-job execution, I publish messages to an SNS topic that gets created with my scheduled job:
const topicArn = process.env.TOPIC_ARN;
const sns = new AWS.SNS();

const sendMessage = async (messageBody) => {
  const params = {
    Message: messageBody,
    TopicArn: topicArn,
  };
  try {
    const data = await sns.publish(params).promise();
    console.log(`Message sent to ${topicArn}: ${data.MessageId}`);
  } catch (err) {
    console.error(err);
  }
};

async function main() {
  await sendMessage(
    JSON.stringify({ jobType: "scheduler", someData: "run stuff" })
  );
}
Is that possibly why there are permission issues? Why would there be two SQS queues? Does a new (non-default) one get created because I subscribed to the SNS topic created by my scheduled job?
I don't understand why there are two SQS queues; that behavior is supposed to be opt-in, like so:
subscribe:
  topics:
    - name: jobs
      service: scheduled-job
      queue: true
I hesitate to tell you to set queue: false explicitly, since it seems like you might lose some messages from the topic-specific queue, but I wonder if the queue policy on that queue is correct.
Can you share the resource definition in the CFN template (accessible with copilot svc package or via the CFN console) for the portal-staging-queue-worker-scheduledjobsjobEventsQueue-wHOmtq8dvoDi queue? And the TaskRolePolicy?
The messages are all test messages, so they can be reset.
I didn't create this portal-staging-queue-worker, so I assumed it was auto-created; but given the date, I'm thinking maybe someone else on the team was playing around and created it.
I couldn't find a resource definition... is this the information you wanted?
Side note: I've been trying to deploy with the new sqs-iam.yml, and I saw that there was briefly a new queue with 6 messages in it, but now it's gone.
The deploys still fail, with the tasks showing the same permission error.
I added the addons and they are getting deployed without issue.
Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: arn:aws:sqs:us-east-2:014491063547:portal-staging-queue-worker-EventsQueue-kjpVYuQKgALp
            # arn:aws:sqs:us-east-2:014491063547:portal-staging-queue-worker-scheduledjobjobsEventsQueue-wHOmtq8dvoDi  # full of messages
            # arn:aws:sqs:us-east-2:014491063547:portal-staging-queue-worker-EventsQueue-kjpVYuQKgALp  # empty, but the default one?

Outputs:
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy
Thanks for all of your help by the way, really appreciate it!
Oh... I think I see what's going on.
When I created the deploy, the "worker-service" queue briefly got created, but then it got deleted because the deploy was rolled back. The other two queues, I think, just belong to a separate worker service (queue-worker) that someone else created.
So what I'm going to try is deploying without referencing the SQS queue. Hopefully the deploy will then succeed, and I can copy the ARN of the created queue, add it to my addon, and deploy again with the code accessing the queue.
🧐🧐🧐
If the deployment succeeds the first time, you may not have to update the IAM addon policy at all. If those queues you saw were left over from another worker service, that would explain all the permissions problems you're experiencing.
The deployment keeps failing :'( I tried to deploy and it failed (the logs in the stopped tasks all seem fine, just logging "deployed successfully" and exiting).
So I deleted the service completely, then re-initialized and re-deployed it, and it's still failing. Do you have any other tips for figuring out why the ECS deployment fails? The stopped-task logs aren't telling me much.
I don't have much Node experience, but is there any chance these tasks are spinning up, looking for one message to consume, then exiting? If they're not staying alive, that could explain some of your issues. Are they supposed to emit any logs when they consume a message?
My only other hint is to check the CloudFormation deployment events to see if there's a reason the service can't be created, then make sure the service can deploy without subscribing to any SNS topics.
@marytal I am still confused by your error; the permissions issue you're describing shouldn't be happening. We grant the task role access to each queue in the QueuePolicy object. Can you by any chance share the full CloudFormation template for this service? It's the output of copilot svc package, or you can grab it from the "Template" tab of the CloudFormation stack page. Sorry for so many requests!
Oh, don't be sorry, I'm grateful for the help. I'm having some issues deploying the service at all right now (unrelated; it was working fine before), so I don't know if that will affect any of your research, but here it is:
Service name: worker-service
Environment: staging
# Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
AWSTemplateFormatVersion: 2010-09-09
Description: CloudFormation template that represents a worker service on Amazon ECS.
Metadata:
  Manifest: |
    # The manifest for the "worker-service" service.
    # Read the full specification for the "Worker Service" type at:
    # https://aws.github.io/copilot-cli/docs/manifest/worker-service/

    # Your service name will be used in naming your resources like log groups, ECS services, etc.
    name: worker-service
    type: Worker Service

    # Configuration for your containers and service.
    image:
      # Docker build arguments.
      build: worker-service/Dockerfile
    platform: linux/x86_64 # See https://aws.github.io/copilot-cli/docs/manifest/worker-service/#platform
    count:
      range:
        min: 1
        max: 1
        spot_from: 1
      queue_delay:
        acceptable_latency: 10m
        msg_processing_time: 60s
    exec: true # Enable running commands in your container.
    # storage:
    #   readonly_fs: true # Limit to read-only access to mounted root filesystems.

    # You can register to topics from other services.
    # The events can be received from an SQS queue via the env var $COPILOT_QUEUE_URI.
    subscribe:
      topics:
        - name: jobs
          service: scheduled-job
          queue: false

    # Optional fields for more advanced use-cases.
    #
    #variables: # Pass environment variables as key value pairs.
    #  LOG_LEVEL: info
    #secrets: # Pass secrets from AWS Systems Manager (SSM) Parameter Store.
    #  GITHUB_TOKEN: GITHUB_TOKEN # The key is the name of the environment variable, the value is the name of the SSM parameter.

    # You can override any of the values defined above by environment.
    environments:
      dev:
        secrets:
          DOPPLER_TOKEN: /copilot/portal/dev/secrets/DOPPLER_TOKEN_GRAPHQL
        cpu: 256 # Number of CPU units for the task.
        memory: 512 # Amount of memory in MiB used by the task.
      staging:
        secrets:
          DOPPLER_TOKEN: /copilot/portal/staging/secrets/DOPPLER_TOKEN_GRAPHQL
        cpu: 256 # Number of CPU units for the task.
        memory: 512 # Amount of memory in MiB used by the task.
      prod:
        secrets:
          DOPPLER_TOKEN: /copilot/portal/prod/secrets/DOPPLER_TOKEN_GRAPHQL
        cpu: 512 # Number of CPU units for the task.
        memory: 1024 # Amount of memory in MiB used by the task.
Parameters:
  AppName:
    Type: String
  EnvName:
    Type: String
  WorkloadName:
    Type: String
  ContainerImage:
    Type: String
  TaskCPU:
    Type: String
  TaskMemory:
    Type: String
  TaskCount:
    Type: Number
  AddonsTemplateURL:
    Description: 'URL of the addons nested stack template within the S3 bucket.'
    Type: String
    Default: ""
  EnvFileARN:
    Description: 'URL of the environment file.'
    Type: String
    Default: ""
  LogRetention:
    Type: Number
    Default: 30
Conditions:
  IsGovCloud: !Equals [!Ref "AWS::Partition", "aws-us-gov"]
  HasAddons: !Not [!Equals [!Ref AddonsTemplateURL, ""]]
  HasEnvFile: !Not [!Equals [!Ref EnvFileARN, ""]]
Resources:
  LogGroup:
    Metadata:
      'aws:copilot:description': 'A CloudWatch log group to hold your service logs'
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Join ['', [/copilot/, !Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
      RetentionInDays: !Ref LogRetention
  TaskDefinition:
    Metadata:
      'aws:copilot:description': 'An ECS task definition to group your containers and run them on ECS'
    Type: AWS::ECS::TaskDefinition
    DependsOn: LogGroup
    Properties:
      Family: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: !Ref TaskCPU
      Memory: !Ref TaskMemory
      ExecutionRoleArn: !GetAtt ExecutionRole.Arn
      TaskRoleArn: !GetAtt TaskRole.Arn
      ContainerDefinitions:
        - Name: !Ref WorkloadName
          Image: !Ref ContainerImage
          Secrets:
            - Name: DOPPLER_TOKEN
              ValueFrom: /copilot/portal/staging/secrets/DOPPLER_TOKEN_GRAPHQL
          Environment:
            - Name: COPILOT_APPLICATION_NAME
              Value: !Sub '${AppName}'
            - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
              Value: staging.portal.local
            - Name: COPILOT_ENVIRONMENT_NAME
              Value: !Sub '${EnvName}'
            - Name: COPILOT_SERVICE_NAME
              Value: !Sub '${WorkloadName}'
            - Name: COPILOT_QUEUE_URI
              Value: !Ref EventsQueue
          EnvironmentFiles:
            - !If
              - HasEnvFile
              - Type: s3
                Value: !Ref EnvFileARN
              - !Ref AWS::NoValue
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-region: !Ref AWS::Region
              awslogs-group: !Ref LogGroup
              awslogs-stream-prefix: copilot
  ExecutionRole:
    Metadata:
      'aws:copilot:description': 'An IAM Role for the Fargate agent to make AWS API calls on your behalf'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, SecretsPolicy]]
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Allow'
                Action:
                  - 'ssm:GetParameters'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/*'
                Condition:
                  StringEquals:
                    'ssm:ResourceTag/copilot-application': !Sub '${AppName}'
                    'ssm:ResourceTag/copilot-environment': !Sub '${EnvName}'
              - Effect: 'Allow'
                Action:
                  - 'secretsmanager:GetSecretValue'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:*'
                Condition:
                  StringEquals:
                    'secretsmanager:ResourceTag/copilot-application': !Sub '${AppName}'
                    'secretsmanager:ResourceTag/copilot-environment': !Sub '${EnvName}'
              - Effect: 'Allow'
                Action:
                  - 'kms:Decrypt'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:key/*'
        - !If
          # Optional IAM permission required by ECS task def env file
          # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html#taskdef-envfiles-iam
          # Example EnvFileARN: arn:aws:s3:::stackset-demo-infrastruc-pipelinebuiltartifactbuc-11dj7ctf52wyf/manual/1638391936/env
          - HasEnvFile
          - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicy]]
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: 'Allow'
                  Action:
                    - 's3:GetObject'
                  Resource:
                    - !Ref EnvFileARN
                - Effect: 'Allow'
                  Action:
                    - 's3:GetBucketLocation'
                  Resource:
                    - !Join
                      - ''
                      - - 'arn:'
                        - !Ref AWS::Partition
                        - ':s3:::'
                        - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARN]]]]
          - !Ref AWS::NoValue
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
  TaskRole:
    Metadata:
      'aws:copilot:description': 'An IAM role to control permissions for the containers in your tasks'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: 'DenyIAMExceptTaggedRoles'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Deny'
                Action: 'iam:*'
                Resource: '*'
              - Effect: 'Allow'
                Action: 'sts:AssumeRole'
                Resource:
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/*'
                Condition:
                  StringEquals:
                    'iam:ResourceTag/copilot-application': !Sub '${AppName}'
                    'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
        - PolicyName: 'ExecuteCommand'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: 'Allow'
                Action: ["ssmmessages:CreateControlChannel", "ssmmessages:OpenControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenDataChannel"]
                Resource: "*"
              - Effect: 'Allow'
                Action: ["logs:CreateLogStream", "logs:DescribeLogGroups", "logs:DescribeLogStreams", "logs:PutLogEvents"]
                Resource: "*"
  DynamicDesiredCountAction:
    Metadata:
      'aws:copilot:description': "A custom resource returning the ECS service's running task count"
    Type: Custom::DynamicDesiredCountFunction
    Properties:
      ServiceToken: !GetAtt DynamicDesiredCountFunction.Arn
      Cluster:
        Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
      App: !Ref AppName
      Env: !Ref EnvName
      Svc: !Ref WorkloadName
      DefaultDesiredCount: !Ref TaskCount
      # We need to force trigger this lambda function on all deployments, so we give it a random ID as input on all event types.
      UpdateID: 73191530-9589-4126-8a33-90dbd274c63f
  DynamicDesiredCountFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket: stackset-portal-infrastr-pipelinebuiltartifactbuc-xv6y38wdclzb
        S3Key: manual/scripts/custom-resources/dynamicdesiredcountfunction/acd1f00a18ceccc32a780fb208be61f3f62274d775f987fd9feec37493d9173c.zip
      Handler: "index.handler"
      Timeout: 600
      MemorySize: 512
      Role: !GetAtt 'DynamicDesiredCountFunctionRole.Arn'
      Runtime: nodejs16.x
  DynamicDesiredCountFunctionRole:
    Metadata:
      'aws:copilot:description': "An IAM Role for describing number of running tasks in your ECS service"
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: "DelegateDesiredCountAccess"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: ECS
                Effect: Allow
                Action:
                  - ecs:DescribeServices
                Resource: "*"
                Condition:
                  ArnEquals:
                    'ecs:cluster':
                      Fn::Sub:
                        - arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}
                        - ClusterName:
                            Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
              - Sid: ResourceGroups
                Effect: Allow
                Action:
                  - resource-groups:GetResources
                Resource: "*"
              - Sid: Tags
                Effect: Allow
                Action:
                  - "tag:GetResources"
                Resource: "*"
  AutoScalingRole:
    Metadata:
      'aws:copilot:description': 'An IAM role for container auto scaling'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonEC2ContainerServiceAutoscaleRole'
  AutoScalingTarget:
    Metadata:
      'aws:copilot:description': "An autoscaling target to scale your service's desired count"
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MinCapacity: 1
      MaxCapacity: 1
      ResourceId:
        Fn::Join:
          - '/'
          - - 'service'
            - Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
            - !GetAtt Service.Name
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt AutoScalingRole.Arn
  BacklogPerTaskCalculatorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName:
        Fn::Join:
          - '/'
          - - '/aws'
            - 'lambda'
            - Fn::Sub: "${BacklogPerTaskCalculatorFunction}"
      RetentionInDays: 3
  BacklogPerTaskCalculatorFunction:
    Metadata:
      'aws:copilot:description': "A Lambda function to emit BacklogPerTask metrics to CloudWatch"
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket: stackset-portal-infrastr-pipelinebuiltartifactbuc-xv6y38wdclzb
        S3Key: manual/scripts/custom-resources/backlogpertaskcalculatorfunction/bf3100e33cd3034c18d5085d79928ebca40a6ef289ce6a36bf3934e59c528275.zip
      Handler: "index.handler"
      Timeout: 600
      MemorySize: 512
      Role: !GetAtt BacklogPerTaskCalculatorRole.Arn
      Runtime: nodejs16.x
      Environment:
        Variables:
          CLUSTER_NAME:
            Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
          SERVICE_NAME: !Ref Service
          NAMESPACE: !Sub '${AppName}-${EnvName}-${WorkloadName}'
          QUEUE_NAMES:
            Fn::Join:
              - ','
              - - !GetAtt EventsQueue.QueueName
  BacklogPerTaskCalculatorRole:
    Metadata:
      'aws:copilot:description': 'An IAM role for BacklogPerTaskCalculatorFunction'
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "BacklogPerTaskCalculatorAccess"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: ECS
                Effect: Allow
                Action:
                  - ecs:DescribeServices
                Resource: "*"
                Condition:
                  ArnEquals:
                    'ecs:cluster':
                      Fn::Sub:
                        - arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}
                        - ClusterName:
                            Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
              - Sid: SQS
                Effect: Allow
                Action:
                  - sqs:GetQueueAttributes
                  - sqs:GetQueueUrl
                Resource:
                  - !GetAtt EventsQueue.Arn
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  BacklogPerTaskScheduledRule:
    Metadata:
      'aws:copilot:description': "A trigger to invoke the BacklogPerTaskCalculator Lambda function every minute"
    DependsOn:
      - BacklogPerTaskCalculatorLogGroup # Ensure log group is created before invoking.
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: "rate(1 minute)"
      State: "ENABLED"
      Targets:
        - Arn: !GetAtt BacklogPerTaskCalculatorFunction.Arn
          Id: "BacklogPerTaskCalculatorFunctionTrigger"
  PermissionToInvokeBacklogPerTaskCalculatorLambda:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref BacklogPerTaskCalculatorFunction
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt BacklogPerTaskScheduledRule.Arn
  AutoScalingPolicyEventsQueue:
    Metadata:
      'aws:copilot:description': "An autoscaling policy to maintain 10 messages/task for EventsQueue"
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: !Join ['-', [!Ref WorkloadName, BacklogPerTask, !GetAtt EventsQueue.QueueName]]
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref AutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        ScaleInCooldown: 120
        ScaleOutCooldown: 60
        CustomizedMetricSpecification:
          Namespace: !Sub '${AppName}-${EnvName}-${WorkloadName}'
          MetricName: BacklogPerTask
          Statistic: Average
          Dimensions:
            - Name: QueueName
              Value: !GetAtt EventsQueue.QueueName
          Unit: Count
        TargetValue: 10
  Service:
    DependsOn:
      - EnvControllerAction
    Metadata:
      'aws:copilot:description': 'An ECS service to run and maintain your tasks in the environment cluster'
    Type: AWS::ECS::Service
    Properties:
      PlatformVersion: LATEST
      Cluster:
        Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: !GetAtt DynamicDesiredCountAction.DesiredCount
      DeploymentConfiguration:
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
        MinimumHealthyPercent: 100
        MaximumPercent: 200
        Alarms:
          AlarmNames: []
          Enable: false
          Rollback: true
      PropagateTags: SERVICE
      EnableExecuteCommand: true
      CapacityProviderStrategy:
        - CapacityProvider: FARGATE_SPOT
          Weight: 1
        - CapacityProvider: FARGATE
          Weight: 0
          Base: 0
      ServiceConnectConfiguration: !If
        - IsGovCloud
        - !Ref AWS::NoValue
        - Enabled: False
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          Subnets:
            Fn::Split:
              - ','
              - Fn::ImportValue: !Sub '${AppName}-${EnvName}-PublicSubnets'
          SecurityGroups:
            - Fn::ImportValue: !Sub '${AppName}-${EnvName}-EnvironmentSecurityGroup'
      ServiceRegistries: !Ref 'AWS::NoValue'
  EventsKMSKey:
    Metadata:
      'aws:copilot:description': 'A KMS key to encrypt messages in your queues'
    Type: AWS::KMS::Key
    Properties:
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: "Allow key use"
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:root'
            Action:
              - "kms:Create*"
              - "kms:Describe*"
              - "kms:Enable*"
              - "kms:List*"
              - "kms:Put*"
              - "kms:Update*"
              - "kms:Revoke*"
              - "kms:Disable*"
              - "kms:Get*"
              - "kms:Delete*"
              - "kms:ScheduleKeyDeletion"
              - "kms:CancelKeyDeletion"
              - "kms:Tag*"
              - "kms:UntagResource"
              - "kms:Encrypt"
              - "kms:Decrypt"
              - "kms:ReEncrypt*"
              - "kms:GenerateDataKey*"
            Resource: '*'
          - Sid: "Allow SNS encryption"
            Effect: "Allow"
            Principal:
              Service: sns.amazonaws.com
            Action:
              - "kms:Decrypt"
              - "kms:GenerateDataKey*"
            Resource: '*'
          - Sid: "Allow SQS encryption"
            Effect: "Allow"
            Principal:
              Service: sqs.amazonaws.com
            Action:
              - "kms:Encrypt"
              - "kms:Decrypt"
              - "kms:ReEncrypt*"
              - "kms:GenerateDataKey*"
            Resource: '*'
          - Sid: "Allow task role encrypt/decrypt"
            Effect: "Allow"
            Principal:
              AWS:
                - !GetAtt TaskRole.Arn
            Action:
              - "kms:Encrypt"
              - "kms:Decrypt"
            Resource: '*'
  EventsQueue:
    Metadata:
      'aws:copilot:description': 'An events SQS queue to buffer messages'
    Type: AWS::SQS::Queue
    Properties:
      KmsMasterKeyId: !Ref EventsKMSKey
  QueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues: [!Ref 'EventsQueue']
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              AWS:
                - !GetAtt TaskRole.Arn
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: !GetAtt EventsQueue.Arn
          - Effect: Allow
            Principal:
              Service: sns.amazonaws.com
            Action:
              - sqs:SendMessage
            Resource: !GetAtt EventsQueue.Arn
            Condition:
              ArnEquals:
                aws:SourceArn: !Join ['', [!Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:', !Ref AppName, '-', !Ref EnvName, '-scheduled-job-jobs']]
  scheduledjobjobsSNSTopicSubscription:
    Metadata:
      'aws:copilot:description': 'A SNS subscription to topic jobs from service scheduled-job'
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Join ['', [!Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:', !Ref AppName, '-', !Ref EnvName, '-scheduled-job-jobs']]
      Protocol: 'sqs'
      Endpoint: !GetAtt EventsQueue.Arn
  AddonsStack:
    Metadata:
      'aws:copilot:description': 'An Addons CloudFormation Stack for your additional AWS resources'
    Type: AWS::CloudFormation::Stack
    Condition: HasAddons
    Properties:
      Parameters:
        App: !Ref AppName
        Env: !Ref EnvName
        Name: !Ref WorkloadName
      TemplateURL: !Ref AddonsTemplateURL
  EnvControllerAction:
    Metadata:
      'aws:copilot:description': "Update your environment's shared resources"
    Type: Custom::EnvControllerFunction
    Properties:
      ServiceToken: !GetAtt EnvControllerFunction.Arn
      Workload: !Ref WorkloadName
      EnvStack: !Sub '${AppName}-${EnvName}'
      Parameters: []
      EnvVersion: v1.13.0
  EnvControllerFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        S3Bucket: stackset-portal-infrastr-pipelinebuiltartifactbuc-xv6y38wdclzb
        S3Key: manual/scripts/custom-resources/envcontrollerfunction/3ffcf03598029891816b7ce2d1ff14fdd8079af4406a0cfeff1d4aa0109dcd7d.zip
      Handler: "index.handler"
      Timeout: 900
      MemorySize: 512
      Role: !GetAtt 'EnvControllerRole.Arn'
      Runtime: nodejs16.x
  EnvControllerRole:
    Metadata:
      'aws:copilot:description': "An IAM role to update your environment stack"
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "EnvControllerStackUpdate"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - cloudformation:DescribeStacks
                  - cloudformation:UpdateStack
                Resource: !Sub 'arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${AppName}-${EnvName}/*'
                Condition:
                  StringEquals:
                    'cloudformation:ResourceTag/copilot-application': !Sub '${AppName}'
                    'cloudformation:ResourceTag/copilot-environment': !Sub '${EnvName}'
        - PolicyName: "EnvControllerRolePass"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - iam:PassRole
                Resource: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${AppName}-${EnvName}-CFNExecutionRole'
                Condition:
                  StringEquals:
                    'iam:ResourceTag/copilot-application': !Sub '${AppName}'
                    'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
      ManagedPolicyArns:
        - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Heyy @marytal! Thanks for sharing the CFN snippets! Very helpful ❤️
I read through the thread, and found this particularly interesting:
I tried to deploy and it failed (logs in the stopped tasks all seem to be fine, just logging "deployed successfully" and exiting)
It looks like the service was successfully deployed and was running briefly; then it exited (peacefully). However, because it was running for such a short time, the health check was never able to tell that the service was stable. After a few "unhealthy" checks, it decided that the deployment was unsuccessful.
For example, if you deployed a service with this Python application code ⬇️ what I described above would happen: the program takes literally less than a second to run, so it confuses the health check.
print("yo!")
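The Node equivalent looks like this: once `main()` resolves and nothing else (a timer, a socket, a loop) holds the event loop open, the process exits, so ECS sees a task that stops seconds after starting. `receiveOnce` here is just an illustrative stand-in, not the actual service code:

```javascript
// Illustration: this process exits as soon as main() resolves, because
// nothing keeps the Node event loop alive afterwards.
async function receiveOnce() {
  // hypothetical stand-in for a single SQS receive
  return "one message";
}

async function main() {
  const msg = await receiveOnce();
  console.log("processed:", msg);
}

main(); // the process exits right after this promise settles
```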
Then I looked at the application code you posted in the main post.
async function main() {
  await receiveAndProcessMessage();
}

main();
Perhaps the program executed the async function main() in a different thread and then just exited immediately. I wonder if it would help if we added await in front of main()?
In addition, it'd probably help to have a loop (maybe with a sleep) like this ⬇️
async function main() {
  while (true) {
    await receiveAndProcessMessage();
    // Maybe add a sleep here
  }
}
Otherwise, even if we await, the program will receive once and just exit after that. Still probably too short for the health check to know it's stable!
You're so helpful, thank you!
Okay! Updates! I've finally gotten the deploy to succeed.
I contacted AWS support, and it turns out worker services need to run continuously (similar to a backend service). I was under the impression that we wanted a quick script that reads a single message and then exits. So now I've got:
let continuePolling = true;

async function main() {
  while (continuePolling) {
    await receiveAndProcessMessage();
  }
}

main().catch((e) => {
  console.log("An error caused the worker service to stop.", e);
});

process.on("SIGTERM", () => {
  console.log("Received SIGTERM signal, will quit when all work is done");
  continuePolling = false;
});
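If you do end up wanting a pause between polls (the "maybe add a sleep here" idea from earlier), a minimal sketch could look like this; the names are illustrative, not from the actual service:

```javascript
// Sketch: a promise-based sleep, plus a loop that backs off after a failed
// poll so repeated errors don't spin the CPU. With WaitTimeSeconds: 20 on
// the receive call, SQS long polling already provides the idle wait, so the
// sleep mainly matters on the error path.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollLoop(receiveAndProcessMessage, shouldContinue, delayMs = 1000) {
  while (shouldContinue()) {
    try {
      await receiveAndProcessMessage();
    } catch (err) {
      console.error("poll failed, backing off:", err);
      await sleep(delayMs);
    }
  }
}
```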
But, now that it's up and running again, we're back to AccessDenied, so I'm going to try adding the addon again and see if that solves it!
Okay... deployed successfully with the addons. Same error.
worker-service/addons/sqs-iam.yml:
Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: "arn:aws:sqs:us-east-2:014491063547:portal-staging-worker-service-EventsQueue-bPDGV3YdVmzb"

Outputs:
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy
I'm using process.env.COPILOT_QUEUE_URI for the queue URL.
worker-service/manifest.yml:
name: worker-service
type: Worker Service

image:
  build: worker-service/Dockerfile
platform: linux/x86_64

count: 1
exec: true

subscribe:
  topics:
    - name: jobs
      service: scheduled-job
      queue: false

environments:
  dev:
    secrets:
      DOPPLER_TOKEN: /copilot/portal/dev/secrets/DOPPLER_TOKEN_GRAPHQL
    cpu: 256
    memory: 512
  staging:
    secrets:
      DOPPLER_TOKEN: /copilot/portal/staging/secrets/DOPPLER_TOKEN_GRAPHQL
    cpu: 256
    memory: 512
  prod:
    secrets:
      DOPPLER_TOKEN: /copilot/portal/prod/secrets/DOPPLER_TOKEN_GRAPHQL
    cpu: 512
    memory: 1024
I'll ask AWS support to have a look!
How weird! I tried to reproduce the issue, but my worker service was able to receive messages even without the addons 🤔. What is the version of the AWS SDK that you are using? Is it v2 or v3 🤔
I was using v2, but I switched to v3 and I see the same issue:
import {
  DeleteMessageCommand,
  Message,
  ReceiveMessageCommand,
  SQSClient,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "use-east-2" });

// Queue URL injected by Copilot for the worker service.
const queueUrl = process.env.COPILOT_QUEUE_URI;

const receiveAndProcessMessage = async () => {
  console.log("Attempting to receive message...");
  try {
    const receiveMessageResponse = await sqs.send(
      new ReceiveMessageCommand({
        QueueUrl: queueUrl,
        MaxNumberOfMessages: 1,
        WaitTimeSeconds: 20,
        VisibilityTimeout: 60,
      })
    );
    if (receiveMessageResponse.Messages) {
      const message = receiveMessageResponse.Messages[0];
      await processMessage(message);
      await deleteMessage(message);
    } else {
      console.log("No messages in queue.");
    }
  } catch (err) {
    console.error(err);
  }
};
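One more thing that may bite while debugging: since the worker's queue is subscribed to an SNS topic, the SQS message body is usually an SNS JSON envelope rather than the raw payload. A small sketch of unwrapping it (the helper name is mine, and it assumes SNS raw message delivery is disabled on the subscription):

```javascript
// Sketch: unwrap an SNS-delivered SQS message body.
// Assumes SNS "raw message delivery" is disabled, so the SQS body is a JSON
// envelope whose actual payload sits under the "Message" key.
function unwrapSnsEnvelope(sqsBody) {
  try {
    const envelope = JSON.parse(sqsBody);
    if (envelope && envelope.Type === "Notification" && "Message" in envelope) {
      return envelope.Message;
    }
  } catch (_) {
    // Not JSON: fall through and return the body as-is.
  }
  return sqsBody; // raw delivery (or plain text): body is already the payload
}
```

If raw message delivery is enabled on the subscription, the body is already the payload and the helper just passes it through.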
I'm waiting on a response from AWS support. I'll let you know if they find anything, but it's likely not a copilot-cli issue, so we can close this if you'd like!
Yeah, AWS support should be able to help! Probably not related, but just in case 💭: in const sqs = new SQSClient({ region: "use-east-2" });, "us-east-2" was mistyped!
(not the issue, but thanks! :) )
@marytal While we wait on AWS support, my teammate @dannyrandall thought of this possibility: do you happen to have set up environment variables from inside of your Dockerfile? Like setting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY 🤔
Hmm, I use Doppler for secret management, and I set up the Doppler key like so:

staging:
  secrets:
    DOPPLER_TOKEN: /copilot/portal/prod/secrets/DOPPLER_TOKEN_GRAPHQL

And then in my Dockerfile I run doppler run -- npm run start.
After your comment I wanted to double-check that my secret/access keys are available in the app, so I logged them and they seem to be accessible with process.env.AWS_ACCESS_KEY_ID, etc.
Are you thinking maybe they weren't accessible?
Umm, huh, interesting!
I am not familiar with Doppler, so I'm not sure exactly which secrets it is injecting. The service that I used for testing doesn't have AWS_ACCESS_KEY_ID as an env var, so the credential chain falls through to the "ECS credential provider" (see this doc for the order in which credentials are selected).
Therefore, if there are other AWS credentials present in your container, it is possible that the SQS client is making calls with those credentials (which don't have access to the SQS queue) instead of with the TaskRole.
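To make that fall-through order concrete, here's a simplified sketch of how the default Node credential chain picks a source (illustrative only; the real SDK checks more sources, and the linked doc is authoritative):

```javascript
// Simplified sketch of the AWS SDK's default Node credential resolution order.
// The real chain checks more sources; this just shows why static env vars win
// over the ECS task role.
function resolveCredentialSource(env) {
  if (env.AWS_ACCESS_KEY_ID && env.AWS_SECRET_ACCESS_KEY) {
    return "environment variables"; // e.g. credentials injected by Doppler
  }
  // ECS sets this env var inside tasks so the SDK can fetch task-role creds.
  if (env.AWS_CONTAINER_CREDENTIALS_RELATIVE_URI) {
    return "ECS container credentials (task role)";
  }
  return "none";
}
```

So even inside an ECS task with a perfectly good task role, injected AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY will be used first.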
My awesome teammate @dannyrandall had this snippet (AWS SDK v2) that you can use to look at the identity being used to make calls:

const aws = require("aws-sdk"); // AWS SDK v2

try {
  const sts = new aws.STS();
  const id = await sts.getCallerIdentity().promise();
  console.log("id:", id);
} catch (err) {
  console.log("error getting identity", err);
}
This should give you information such as the ARN of the identity (so we know whether it's the task role or not) and the account ID! We can give it a try and see what we can find.
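When reading the result, the Arn field is the giveaway. A tiny helper (my own, purely string parsing) to classify what GetCallerIdentity returned:

```javascript
// Classify an STS GetCallerIdentity ARN: an IAM user vs. an assumed role
// (ECS task roles show up as "assumed-role" ARNs under the sts service).
function classifyCallerArn(arn) {
  // e.g. "arn:aws:iam::014491063547:user/tbt-portal-staging"
  // e.g. "arn:aws:sts::014491063547:assumed-role/MyTaskRole/abc123"
  const resource = arn.split(":")[5] || "";
  if (resource.startsWith("user/")) return "iam-user";
  if (resource.startsWith("assumed-role/")) return "assumed-role";
  return "other";
}
```

If the call comes back as an iam-user ARN, something in the container injected static user credentials; if it's an assumed-role ARN, you're on the task role.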
Hi! I will try that, thank you! I got a response from amazon support:
I worked with our SQS internal service team with the request id you shared and they found that the receive message api call recorded in the request id was made from the IAM User arn:aws:iam::014491063547:user/tbt-portal-staging. This IAM User has no SQS Permissions on an IAM Level, or on the SQS Queue Access Policy level. Hence, we might need to check whether we are configuring this user at any point during the ECS cluster configurations
So there is some progress!
Hi again! I received a response from AWS:
As suggested by the internal SQS team, could you please provide enough permission to the IAM User arn:aws:iam::014491063547:user/tbt-portal-staging to do the receive message and delete message actions on sqs queue. I am attaching a sample SQS policy for your reference.
{
  "Sid": "__owner_statement",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::014491063547:user/tbt-portal-staging"
  },
  "Action": [
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage"
  ],
  "Resource": "<queue arn>"
}
I updated my addon to be:
Parameters:
  App:
    Type: String
    Description: Your application's name.
  Env:
    Type: String
    Description: The environment name your service, job, or workflow is being deployed to.
  Name:
    Type: String
    Description: The name of the service, job, or workflow being deployed.

Resources:
  SQSAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: SQSActions
            Effect: Allow
            Principal:
              AWS: "arn:aws:iam::014491063547:user/tbt-portal-staging"
            Action:
              - sqs:ReceiveMessage
              - sqs:DeleteMessage
            Resource: "arn:aws:sqs:us-east-2:014491063547:portal-staging-worker-service-EventsQueue-bPDGV3YdVmzb"

Outputs:
  SQSAccessPolicyArn:
    Description: "The ARN of the ManagedPolicy to attach to the task role."
    Value: !Ref SQSAccessPolicy
But when I tried to deploy, I got an error: Policy document should not specify a principal.
Is there a way to add the permissions that they've asked me to add via copilot?
Hello @marytal - glad to hear back from you!
I think the problem here is two-fold.
The user arn:aws:iam::014491063547:user/tbt-portal-staging is probably managed outside of Copilot, because we don't create IAM users as part of the infrastructure by default. For a non-Copilot IAM identity, you should add the permissions through the interface your team uses to manage that IAM user (for example, the IAM console, the AWS CLI, or CloudFormation) instead of through Copilot addons.
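For context on the Policy document should not specify a principal error: a Principal only belongs in resource-based policies, not in identity policies like AWS::IAM::ManagedPolicy. If the goal really were to grant that IAM user access on the queue side, it would be done with an AWS::SQS::QueuePolicy instead; a sketch (the resource name and the queue URL/ARN placeholders are mine):

```yaml
# Sketch: Principal is valid in a resource-based queue policy
# (AWS::SQS::QueuePolicy), but not in an identity policy such as
# AWS::IAM::ManagedPolicy. Placeholder values below are illustrative.
EventsQueuePolicy:
  Type: AWS::SQS::QueuePolicy
  Properties:
    Queues:
      - "<queue-url>"
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: "arn:aws:iam::014491063547:user/tbt-portal-staging"
          Action:
            - sqs:ReceiveMessage
            - sqs:DeleteMessage
          Resource: "<queue-arn>"
```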
However, a typical Copilot setup just uses the TaskRole that Copilot creates for you to receive/delete messages in the ECS tasks. This is the default behavior. In your case, the presence of process.env.AWS_ACCESS_KEY_ID (discussed in https://github.com/aws/copilot-cli/issues/4770#issuecomment-1533510192), which likely points to user/tbt-portal-staging, prevents ECS from making calls as the TaskRole, because AWS_ACCESS_KEY_ID takes priority over the TaskRole in the credential chain.
If you are able to remove that AWS_ACCESS_KEY_ID env var, your task should be able to make calls as the TaskRole, which wouldn't have any permission issue in the first place.
If you are certain that user/tbt-portal-staging is expected to be the identity that receives/deletes messages, then please go ahead and add the permissions through the interface where user/tbt-portal-staging is managed. Otherwise, you can remove the AWS_ACCESS_KEY_ID environment variable so that your ECS tasks use the Copilot TaskRole without permission issues.
Hi! Thanks so much for your help!!
Therefore, if there are other AWS credentials present in your container, it is possible that the SQS client is making calls from that credential (which doesn't have access to the SQS queue), instead of from the TaskRole.
^ You were totally right about this, that is exactly what was going on!
Everything is working as expected now 🎉 Haha finally 😅
Thanks again. 🎉 It feels almost sad to close this issue, it's been going on for so long! I'll miss you :P !
Hi! I used copilot to add a scheduled job and a worker service. The scheduled job manifest looks something like this:
The worker service manifest looks something like this:
In my worker-service/Dockerfile I run my node script. It uses the AWS SDK to interact with the SQS queue created for the worker service. Here is a simplified version of the server:
I am seeing a permission error saying:
I checked, and it appears that the SQS queue (the default one that got created for this worker service) has the correct IAM role in its access policy (the same one that the created ECS service has).
I haven't messed with any permissions outside of Copilot. Is my setup incorrect? Am I missing something? It seems the service should have access to this queue without any additional trouble.
Thanks!