Hey @WillGibson sorry for the delay!! Are you still seeing the issue?
It's normal for the Service Connect logs to occasionally contain entries like "remove x cluster(s)". I've seen such entries in my own successful deployments:
[2024-01-19 03:51:40.895][33][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 2 cluster(s), remove 1 cluster(s)
Currently I am inclined to think that Service Connect has little to do with the ELB health check failure. All the ELB health check does is send a request to the task's private IP address on the health check path and wait for a response from the task; Service Connect is typically not involved in this process. But I could be wrong on that front, so I won't completely rule it out.
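For what it's worth, the pass/fail decision is driven entirely by the target group's health check settings. Using the values that appear later in this thread, the timing works out roughly like this:

```yaml
# Sketch of the health-check timing, using the target-group values
# shared later in this thread:
HealthCheckIntervalSeconds: 35   # a check is sent roughly every 35 seconds
HealthCheckTimeoutSeconds: 30    # each check waits up to 30 seconds for a response
UnhealthyThresholdCount: 3       # 3 consecutive failures mark the target unhealthy
HealthyThresholdCount: 3         # 3 consecutive successes mark it healthy again
# i.e. a previously healthy task has to keep failing for roughly
# 3 x 35s before it is marked unhealthy and drained
```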
> ...and a Celery worker is used for the health check.
How does your Celery worker respond to the health check? Do you find anything interesting in the logs of the main container (not the Service Connect sidecar)?
Hi @Lou1415926,
Yes, we are still seeing the issue. I suspect it's some kind of networking problem. It's a shame that there are no logs for the health check requests/responses, so we have no way of seeing what was actually going on when a health check failed. My best guess is a connectivity issue where the requests are simply not getting through to the service container.
I have reworded a sentence in the original post to be clearer...
"Its landing page connects to various addons plus a Celery worker. This landing page is used for the health check."
This page returns a 200 regardless of its success in using the addons etc., so the Celery worker does not have any direct effect on the web service's health.
As it happens, we added a health check to the Celery worker service yesterday. It does not use ELB and so far has not had any issues with health checks failing when all is actually OK.
@WillGibson You can go to the Target Group console to see the "Health status details":
This troubleshooting page can hopefully help you decipher the reason codes.
In addition, the "Events" tab of your service's page in the ECS console might be able to provide some clues too:
> When it's ready, it starts responding successfully to health check requests, often doing several before any woes occur
This is what baffles me the most. The first two health checks reach the container, which responds properly, but then the third somehow fails to reach it all of a sudden. Hopefully we can gather more clues from those places!
There doesn't seem to be anything helpful in the Task's Events tab corresponding to when the problem occurred, just some "port 443 is unhealthy in target-group" messages, which do not help debug it.
My colleagues @yusufsheiqh and @codeninja merged a commit adding AWS X-Ray yesterday morning which includes these changes to the health check configuration...
...and I couldn't get it to manifest the problem yesterday afternoon or this morning.
If I revert those changes (it seems to be the grace period that matters), the problem manifests again, but I'm not sure whether it's the same problem or something introduced by adding X-Ray.
It seems to be failing consistently now, though, rather than intermittently, so maybe adding X-Ray does indeed require a longer grace period. But looking at the current example, it responded to the health check with a 200 status code all 8 times before it was deemed to have failed health checks and was shut down.
All the Target Group's health status details column said while that was going on was "Health checks failed", which is not new information 😂
> ...just some "port 443 is unhealthy in target-group" messages, which do not help debug it.
From the manifest you provided above, it seems like the health check port was set to 8080. You can take a look at the CloudFormation template at Resources.TargetGroup.Properties.HealthCheckPort to check that it is indeed 8080, as in the manifest. I also assume the landing page (which returns 200 and is what the health checks hit) is on port 8080 of the container, rather than 443, right? This is probably not the reason the health checks intermittently fail, but I'd like to point it out in case it is relevant in the context of your application.
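For reference, this is the shape of the relevant target-group fields (the values match the template shared further down the thread):

```yaml
TargetGroup:
  Properties:
    Port: 443               # ALB traffic is forwarded to the nginx sidecar here
    Protocol: HTTPS
    HealthCheckPort: 8080   # but the health check goes straight to port 8080
    HealthCheckProtocol: HTTP
    HealthCheckPath: /
```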
> ...with a 200 status code all 8 times before it was deemed to have failed health checks and was shut down.
My best guess was also that the grace_period needed to be increased; too short a grace period is one of the most common reasons for occasionally failing health checks. If the issue persists, I'd recommend contacting AWS Support, as the engineers there will have more visibility into what actually happened inside your service.
In the meantime, I am also happy to check your entire manifest to validate the configuration, if you are willing to share!
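For illustration, a sketch of what that change could look like in the manifest (180s is an arbitrary larger value chosen for this example; everything else mirrors the existing config):

```yaml
http:
  healthcheck:
    path: '/'
    port: 8080
    success_codes: '200'
    healthy_threshold: 3
    unhealthy_threshold: 3
    interval: 35s
    timeout: 30s
    grace_period: 180s # hypothetical: raised from 120s to give tasks longer to settle
```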
Resources.TargetGroup.Properties.HealthCheckPort is 8080, which is what the nginx container is set to serve stuff up on.
Increasing the grace period is OK. If the problem stops troubling us, that's good, but the fact that the service can respond successfully to several health checks, then suddenly be considered to have failed, with no logs anywhere to make clear how that determination was made, is a bad smell for me.
Here is the slightly redacted manifest for our web service (would attach, but YAML is not supported by GitHub)...
# Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
AWSTemplateFormatVersion: 2010-09-09
Description: CloudFormation template that represents a load balanced web service on Amazon ECS using AWS Copilot with YAML patches.
Metadata:
Version: v1.32.1
Manifest: |
# The manifest for the "web" service.
# Read the full specification for the "Load Balanced Web Service" type at:
# https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/
# Your service name will be used in naming your resources like log groups, ECS services, etc.
name: web
type: Load Balanced Web Service
# Distribute traffic to your service.
http:
# Requests to this path will be forwarded to your service.
# To match all requests you can use the "/" path.
path: '/'
# You can specify a custom health check path. The default is "/".
# healthcheck: '/'
target_container: nginx
healthcheck:
path: '/'
port: 8080
success_codes: '200'
healthy_threshold: 3
unhealthy_threshold: 3
interval: 35s
timeout: 30s
grace_period: 120s
sidecars:
nginx:
port: 443
image: REDACTED
variables:
SERVER: localhost:8000
ipfilter:
port: 8000
image: REDACTED
variables:
PORT: 8000
SERVER: localhost:8080
APPCONFIG_PROFILES: ipfilter:default:default
IPFILTER_ENABLED: True
EMAIL: REDACTED
PROTECTED_PATHS: /
appconfig:
port: 2772
image: REDACTED
essential: true
variables:
ROLE_ARN: arn:aws:iam::REDACTED:role/AppConfigIpFilterRole
# Configuration for your containers and service.
image:
location: REDACTED
# Port exposed through your container to route traffic to it.
port: 8080
cpu: 256 # Number of CPU units for the task.
memory: 1024 # Amount of memory in MiB used by the task.
count: 1 # Number of tasks that should be running in your service.
exec: true # Enable running commands in your container.
network:
connect: true # Enable Service Connect for intra-environment traffic between services.
vpc:
placement: 'private'
storage:
readonly_fs: false
observability:
tracing: awsxray
# Optional fields for more advanced use-cases.
#
variables: # Pass environment variables as key value pairs.
SECRET_KEY: REDACTED
PORT: 8080
DEBUG: True
S3_BUCKET_NAME: REDACTED
ALLOWED_HOSTS: "*"
OTEL_PROPAGATORS: xray
OTEL_PYTHON_ID_GENERATOR: xray
OTEL_SERVICE_NAME: REDACTED-REDACTED-web
OTEL_METRICS_EXPORTER: console,otlp
OTEL_TRACES_EXPORTER: console,otlp
OTEL_TRACES_SAMPLER: traceidratio
OTEL_TRACES_SAMPLER_ARG: "0.05"
secrets: # Pass secrets from AWS Systems Manager (SSM) Parameter Store.
DJANGO_SECRET_KEY: /copilot/REDACTED/REDACTED/secrets/DJANGO_SECRET_KEY
OPENSEARCH_ENDPOINT: /copilot/REDACTED/REDACTED/secrets/REDACTED_OPENSEARCH
REDIS_ENDPOINT: /copilot/REDACTED/REDACTED/secrets/REDACTED_REDIS
DATABASE_CREDENTIALS:
secretsmanager: /copilot/REDACTED/REDACTED/secrets/REDACTED_POSTGRES
RDS_DATABASE_CREDENTIALS:
secretsmanager: /copilot/REDACTED/REDACTED/secrets/REDACTED_RDS_POSTGRES
# You can override any of the values defined above by environment.
environments:
dev:
http:
alias: v2.REDACTED.dev.uktrade.digital
ant:
http:
alias: v2.REDACTED.ant.uktrade.digital
staging:
http:
alias: v2.REDACTED.staging.uktrade.digital
sidecars:
ipfilter:
variables:
IPFILTER_ENABLED: False
REDACTED:
http:
alias: v2.REDACTED.REDACTED.uktrade.digital
REDACTED:
http:
alias: v2.REDACTED.REDACTED.uktrade.digital
REDACTED:
http:
alias: v2.REDACTED.REDACTED.uktrade.digital
REDACTED:
http:
alias: v2.REDACTED.REDACTED.uktrade.digital
REDACTED:
http:
alias: v2.REDACTED.REDACTED.uktrade.digital
Parameters:
AppName:
Type: String
EnvName:
Type: String
WorkloadName:
Type: String
ContainerImage:
Type: String
ContainerPort:
Type: Number
TaskCPU:
Type: String
TaskMemory:
Type: String
TaskCount:
Type: Number
DNSDelegated:
Type: String
AllowedValues: [true, false]
LogRetention:
Type: Number
AddonsTemplateURL:
Description: 'URL of the addons nested stack template within the S3 bucket.'
Type: String
Default: ""
EnvFileARN:
Description: 'URL of the environment file.'
Type: String
Default: ""
EnvFileARNForappconfig:
Type: String
Description: 'URL of the environment file for the appconfig sidecar.'
Default: ""
EnvFileARNForipfilter:
Type: String
Description: 'URL of the environment file for the ipfilter sidecar.'
Default: ""
EnvFileARNFornginx:
Type: String
Description: 'URL of the environment file for the nginx sidecar.'
Default: ""
ArtifactKeyARN:
Type: String
Description: 'KMS Key used for encrypting artifacts'
TargetContainer:
Type: String
TargetPort:
Type: Number
HTTPSEnabled:
Type: String
AllowedValues: [true, false]
RulePath:
Type: String
Conditions:
IsGovCloud: !Equals [!Ref "AWS::Partition", "aws-us-gov"]
HasAssociatedDomain: !Equals [!Ref DNSDelegated, true]
HasAddons: !Not [!Equals [!Ref AddonsTemplateURL, ""]]
HasEnvFile: !Not [!Equals [!Ref EnvFileARN, ""]]
HasEnvFileForappconfig: !Not [!Equals [!Ref EnvFileARNForappconfig, ""]]
HasEnvFileForipfilter: !Not [!Equals [!Ref EnvFileARNForipfilter, ""]]
HasEnvFileFornginx: !Not [!Equals [!Ref EnvFileARNFornginx, ""]]
Resources: # If a bucket URL is specified, that means the template exists.
LogGroup:
Metadata:
'aws:copilot:description': 'A CloudWatch log group to hold your service logs'
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub '/copilot/${AppName}/${EnvName}/${WorkloadName}'
RetentionInDays: !Ref LogRetention
TaskDefinition:
Metadata:
'aws:copilot:description': 'An ECS task definition to group your containers and run them on ECS'
Type: AWS::ECS::TaskDefinition
DependsOn: LogGroup
Properties:
Family: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
NetworkMode: awsvpc
RequiresCompatibilities:
- FARGATE
Cpu: !Ref TaskCPU
Memory: !Ref TaskMemory
ExecutionRoleArn: !GetAtt ExecutionRole.Arn
TaskRoleArn: !GetAtt TaskRole.Arn
ContainerDefinitions:
- Name: !Ref WorkloadName
Image: !Ref ContainerImage
Secrets:
- Name: DATABASE_CREDENTIALS
ValueFrom: !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:/copilot/REDACTED/REDACTED/secrets/REDACTED_POSTGRES'
- Name: DJANGO_SECRET_KEY
ValueFrom: /copilot/REDACTED/REDACTED/secrets/DJANGO_SECRET_KEY
- Name: OPENSEARCH_ENDPOINT
ValueFrom: /copilot/REDACTED/REDACTED/secrets/REDACTED_OPENSEARCH
- Name: RDS_DATABASE_CREDENTIALS
ValueFrom: !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:/copilot/REDACTED/REDACTED/secrets/REDACTED_RDS_POSTGRES'
- Name: REDIS_ENDPOINT
ValueFrom: /copilot/REDACTED/REDACTED/secrets/REDACTED_REDIS
Environment:
- Name: COPILOT_APPLICATION_NAME
Value: !Sub '${AppName}'
- Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
Value: REDACTED.REDACTED.local
- Name: COPILOT_ENVIRONMENT_NAME
Value: !Sub '${EnvName}'
- Name: COPILOT_SERVICE_NAME
Value: !Sub '${WorkloadName}'
- Name: COPILOT_LB_DNS
Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
- Name: ALLOWED_HOSTS
Value: "*"
- Name: DEBUG
Value: "True"
- Name: OTEL_METRICS_EXPORTER
Value: "console,otlp"
- Name: OTEL_PROPAGATORS
Value: "xray"
- Name: OTEL_PYTHON_ID_GENERATOR
Value: "xray"
- Name: OTEL_SERVICE_NAME
Value: "REDACTED-REDACTED-web"
- Name: OTEL_TRACES_EXPORTER
Value: "console,otlp"
- Name: OTEL_TRACES_SAMPLER
Value: "traceidratio"
- Name: OTEL_TRACES_SAMPLER_ARG
Value: "0.05"
- Name: PORT
Value: "8080"
- Name: S3_BUCKET_NAME
Value: "REDACTED-s3-bucket-REDACTED"
- Name: SECRET_KEY
Value: "REDACTED"
EnvironmentFiles:
- !If
- HasEnvFile
- Type: s3
Value: !Ref EnvFileARN
- !Ref AWS::NoValue
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-region: !Ref AWS::Region
awslogs-group: !Ref LogGroup
awslogs-stream-prefix: copilot
PortMappings:
- ContainerPort: 8080
Protocol: tcp
ReadonlyRootFilesystem: false
MountPoints:
- ContainerPath: /tmp
SourceVolume: temporary-fs
- Name: aws-otel-collector
Image: public.ecr.aws/aws-observability/aws-otel-collector:v0.17.0
Command:
- --config=/etc/ecs/ecs-xray.yaml
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-region: !Ref AWS::Region
awslogs-group: !Ref LogGroup
awslogs-stream-prefix: copilot
- Name: appconfig
Image: public.ecr.aws/aws-appconfig/aws-appconfig-agent:2.x
Essential: true
PortMappings:
- ContainerPort: 2772
Protocol: tcp
Environment:
- Name: COPILOT_APPLICATION_NAME
Value: !Sub '${AppName}'
- Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
Value: REDACTED.REDACTED.local
- Name: COPILOT_ENVIRONMENT_NAME
Value: !Sub '${EnvName}'
- Name: COPILOT_SERVICE_NAME
Value: !Sub '${WorkloadName}'
- Name: COPILOT_LB_DNS
Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
- Name: ROLE_ARN
Value: "arn:aws:iam::REDACTED:role/AppConfigIpFilterRole"
EnvironmentFiles:
- !If
- HasEnvFileForappconfig
- Type: "s3"
Value: !Ref EnvFileARNForappconfig
- !Ref "AWS::NoValue"
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-region: !Ref AWS::Region
awslogs-group: !Ref LogGroup
awslogs-stream-prefix: copilot
- Name: ipfilter
Image: public.ecr.aws/uktrade/ip-filter:latest
PortMappings:
- ContainerPort: 8000
Protocol: tcp
Environment:
- Name: COPILOT_APPLICATION_NAME
Value: !Sub '${AppName}'
- Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
Value: REDACTED.REDACTED.local
- Name: COPILOT_ENVIRONMENT_NAME
Value: !Sub '${EnvName}'
- Name: COPILOT_SERVICE_NAME
Value: !Sub '${WorkloadName}'
- Name: COPILOT_LB_DNS
Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
- Name: APPCONFIG_PROFILES
Value: "ipfilter:default:default"
- Name: EMAIL
Value: "REDACTED"
- Name: IPFILTER_ENABLED
Value: "True"
- Name: PORT
Value: "8000"
- Name: PROTECTED_PATHS
Value: "/"
- Name: SERVER
Value: "localhost:8080"
EnvironmentFiles:
- !If
- HasEnvFileForipfilter
- Type: "s3"
Value: !Ref EnvFileARNForipfilter
- !Ref "AWS::NoValue"
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-region: !Ref AWS::Region
awslogs-group: !Ref LogGroup
awslogs-stream-prefix: copilot
- Name: nginx
Image: REDACTED
PortMappings:
- ContainerPort: 443
Name: target
Protocol: tcp
Environment:
- Name: COPILOT_APPLICATION_NAME
Value: !Sub '${AppName}'
- Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
Value: REDACTED.REDACTED.local
- Name: COPILOT_ENVIRONMENT_NAME
Value: !Sub '${EnvName}'
- Name: COPILOT_SERVICE_NAME
Value: !Sub '${WorkloadName}'
- Name: COPILOT_LB_DNS
Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
- Name: SERVER
Value: "localhost:8000"
EnvironmentFiles:
- !If
- HasEnvFileFornginx
- Type: "s3"
Value: !Ref EnvFileARNFornginx
- !Ref "AWS::NoValue"
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-region: !Ref AWS::Region
awslogs-group: !Ref LogGroup
awslogs-stream-prefix: copilot
Volumes:
- Name: temporary-fs
ExecutionRole:
Metadata:
'aws:copilot:description': 'An IAM Role for the Fargate agent to make AWS API calls on your behalf'
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: ecs-tasks.amazonaws.com
Action: 'sts:AssumeRole'
Policies:
- PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, SecretsPolicy]]
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Allow'
Action:
- 'ssm:GetParameters'
Resource:
- !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/*'
Condition:
StringEquals:
'ssm:ResourceTag/copilot-application': !Sub '${AppName}'
'ssm:ResourceTag/copilot-environment': !Sub '${EnvName}'
- Effect: 'Allow'
Action:
- 'secretsmanager:GetSecretValue'
Resource:
- !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:*'
Condition:
StringEquals:
'secretsmanager:ResourceTag/copilot-application': !Sub '${AppName}'
'secretsmanager:ResourceTag/copilot-environment': !Sub '${EnvName}'
- Effect: 'Allow'
Action:
- 'kms:Decrypt'
Resource:
- !Ref ArtifactKeyARN
- !If
# Optional IAM permission required by ECS task def env file
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html#taskdef-envfiles-iam
# Example EnvFileARN: arn:aws:s3:::stackset-demo-infrastruc-pipelinebuiltartifactbuc-11dj7ctf52wyf/manual/1638391936/env
- HasEnvFile
- PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicy]]
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Allow'
Action:
- 's3:GetObject'
Resource:
- !Ref EnvFileARN
- Effect: 'Allow'
Action:
- 's3:GetBucketLocation'
Resource:
- !Join
- ''
- - 'arn:'
- !Ref AWS::Partition
- ':s3:::'
- !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARN]]]]
- !Ref AWS::NoValue
- !If
- HasEnvFileForappconfig
- PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicyForappconfig]]
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Allow'
Action:
- 's3:GetObject'
Resource:
- !Ref EnvFileARNForappconfig
- Effect: 'Allow'
Action:
- 's3:GetBucketLocation'
Resource:
- !Join
- ''
- - 'arn:'
- !Ref AWS::Partition
- ':s3:::'
- !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARNForappconfig]]]]
- !Ref AWS::NoValue
- !If
- HasEnvFileForipfilter
- PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicyForipfilter]]
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Allow'
Action:
- 's3:GetObject'
Resource:
- !Ref EnvFileARNForipfilter
- Effect: 'Allow'
Action:
- 's3:GetBucketLocation'
Resource:
- !Join
- ''
- - 'arn:'
- !Ref AWS::Partition
- ':s3:::'
- !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARNForipfilter]]]]
- !Ref AWS::NoValue
- !If
- HasEnvFileFornginx
- PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicyFornginx]]
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Allow'
Action:
- 's3:GetObject'
Resource:
- !Ref EnvFileARNFornginx
- Effect: 'Allow'
Action:
- 's3:GetBucketLocation'
Resource:
- !Join
- ''
- - 'arn:'
- !Ref AWS::Partition
- ':s3:::'
- !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARNFornginx]]]]
- !Ref AWS::NoValue
ManagedPolicyArns:
- !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
TaskRole:
Metadata:
'aws:copilot:description': 'An IAM role to control permissions for the containers in your tasks'
Type: AWS::IAM::Role
Properties:
ManagedPolicyArns:
- Fn::GetAtt: [AddonsStack, Outputs.appConfigAccessPolicy]
- Fn::GetAtt: [AddonsStack, Outputs.REDACTEDS3BucketAccessPolicy]
- Fn::GetAtt: [AddonsStack, Outputs.XRayAccessPolicy]
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: ecs-tasks.amazonaws.com
Action: 'sts:AssumeRole'
Policies:
- PolicyName: 'DenyIAM'
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Deny'
Action: 'iam:*'
Resource: '*'
- PolicyName: 'ExecuteCommand'
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Allow'
Action: ["ssmmessages:CreateControlChannel", "ssmmessages:OpenControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenDataChannel"]
Resource: "*"
- Effect: 'Allow'
Action: ["logs:CreateLogStream", "logs:DescribeLogGroups", "logs:DescribeLogStreams", "logs:PutLogEvents"]
Resource: "*"
- PolicyName: 'AWSDistroOpenTelemetryPolicy'
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: 'Allow'
Action:
- 'logs:PutLogEvents'
- 'logs:CreateLogGroup'
- 'logs:CreateLogStream'
- 'logs:DescribeLogStreams'
- 'logs:DescribeLogGroups'
- 'xray:PutTraceSegments'
- 'xray:PutTelemetryRecords'
- 'xray:GetSamplingRules'
- 'xray:GetSamplingTargets'
- 'xray:GetSamplingStatisticSummaries'
Resource: "*"
DiscoveryService:
Metadata:
'aws:copilot:description': 'Service discovery for your services to communicate within the VPC'
Type: AWS::ServiceDiscovery::Service
Properties:
Description: Discovery Service for the Copilot services
DnsConfig:
RoutingPolicy: MULTIVALUE
DnsRecords:
- TTL: 10
Type: A
- TTL: 10
Type: SRV
HealthCheckCustomConfig:
FailureThreshold: 1
Name: !Ref WorkloadName
NamespaceId:
Fn::ImportValue: !Sub '${AppName}-${EnvName}-ServiceDiscoveryNamespaceID'
EnvControllerAction:
Metadata:
'aws:copilot:description': "Update your environment's shared resources"
Type: Custom::EnvControllerFunction
Properties:
ServiceToken: !GetAtt EnvControllerFunction.Arn
Workload: !Ref WorkloadName
Aliases: ["v2.REDACTED.REDACTED.uktrade.digital"]
EnvStack: !Sub '${AppName}-${EnvName}'
Parameters: [ALBWorkloads, Aliases, NATWorkloads]
EnvVersion: v1.32.1
EnvControllerFunction:
Type: AWS::Lambda::Function
Properties:
Code:
S3Bucket: stackset-REDACTED-infr-pipelinebuiltartifactbuc-REDACTED
S3Key: manual/scripts/custom-resources/envcontrollerfunction/REDACTED.zip
Handler: "index.handler"
Timeout: 900
MemorySize: 512
Role: !GetAtt 'EnvControllerRole.Arn'
Runtime: nodejs16.x
EnvControllerRole:
Metadata:
'aws:copilot:description': "An IAM role to update your environment stack"
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service:
- lambda.amazonaws.com
Action:
- sts:AssumeRole
Path: /
Policies:
- PolicyName: "EnvControllerStackUpdate"
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- cloudformation:DescribeStacks
- cloudformation:UpdateStack
Resource: !Sub 'arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${AppName}-${EnvName}/*'
Condition:
StringEquals:
'cloudformation:ResourceTag/copilot-application': !Sub '${AppName}'
'cloudformation:ResourceTag/copilot-environment': !Sub '${EnvName}'
- PolicyName: "EnvControllerRolePass"
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- iam:PassRole
Resource: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${AppName}-${EnvName}-CFNExecutionRole'
Condition:
StringEquals:
'iam:ResourceTag/copilot-application': !Sub '${AppName}'
'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
ManagedPolicyArns:
- !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Service:
Metadata:
'aws:copilot:description': 'An ECS service to run and maintain your tasks in the environment cluster'
Type: AWS::ECS::Service
DependsOn:
- HTTPListenerRuleWithDomain
- HTTPSListenerRule
Properties:
PlatformVersion: LATEST
Cluster:
Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
TaskDefinition: !Ref TaskDefinition
DesiredCount: !Ref TaskCount
DeploymentConfiguration:
DeploymentCircuitBreaker:
Enable: true
Rollback: true
MinimumHealthyPercent: 100
MaximumPercent: 200
Alarms: !If
- IsGovCloud
- !Ref AWS::NoValue
- Enable: false
AlarmNames: []
Rollback: true
PropagateTags: SERVICE
EnableExecuteCommand: true
LaunchType: FARGATE
ServiceConnectConfiguration:
Enabled: True
Namespace: REDACTED.REDACTED.local
LogConfiguration:
LogDriver: awslogs
Options:
awslogs-region: !Ref AWS::Region
awslogs-group: !Ref LogGroup
awslogs-stream-prefix: copilot
Services:
- PortName: target
# Avoid using the same service with Service Discovery in a namespace.
DiscoveryName: !Join ["-", [!Ref WorkloadName, "sc"]]
ClientAliases:
- Port: !Ref TargetPort
DnsName: !Ref WorkloadName
NetworkConfiguration:
AwsvpcConfiguration:
AssignPublicIp: DISABLED
Subnets:
Fn::Split:
- ','
- Fn::ImportValue: !Sub '${AppName}-${EnvName}-PrivateSubnets'
SecurityGroups:
- Fn::ImportValue: !Sub '${AppName}-${EnvName}-EnvironmentSecurityGroup'
# This may need to be adjusted if the container takes a while to start up
HealthCheckGracePeriodSeconds: 120
LoadBalancers:
- ContainerName: nginx
ContainerPort: 443
TargetGroupArn: !Ref TargetGroup
ServiceRegistries:
- RegistryArn: !GetAtt DiscoveryService.Arn
Port: !Ref TargetPort
TargetGroup:
Metadata:
'aws:copilot:description': "A target group to connect the load balancer to your service on port 443"
Type: AWS::ElasticLoadBalancingV2::TargetGroup
Properties:
HealthCheckPath: / # Default is '/'.
HealthCheckPort: 8080 # Default is 'traffic-port'.
Matcher:
HttpCode: 200
HealthyThresholdCount: 3
UnhealthyThresholdCount: 3
HealthCheckIntervalSeconds: 35
HealthCheckTimeoutSeconds: 30
HealthCheckProtocol: HTTP
Port: 443
Protocol: HTTPS
TargetGroupAttributes:
- Key: deregistration_delay.timeout_seconds
Value: 60 # ECS Default is 300; Copilot default is 60.
- Key: stickiness.enabled
Value: false
TargetType: ip
VpcId:
Fn::ImportValue: !Sub "${AppName}-${EnvName}-VpcId"
RulePriorityFunction:
Type: AWS::Lambda::Function
Properties:
Code:
S3Bucket: stackset-REDACTED-infr-pipelinebuiltartifactbuc-REDACTED
S3Key: manual/scripts/custom-resources/rulepriorityfunction/REDACTED.zip
Handler: "index.nextAvailableRulePriorityHandler"
Timeout: 600
MemorySize: 512
Role: !GetAtt "RulePriorityFunctionRole.Arn"
Runtime: nodejs16.x
RulePriorityFunctionRole:
Metadata:
'aws:copilot:description': "An IAM Role to describe load balancer rules for assigning a priority"
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service:
- lambda.amazonaws.com
Action:
- sts:AssumeRole
Path: /
ManagedPolicyArns:
- !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Policies:
- PolicyName: "RulePriorityGeneratorAccess"
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- elasticloadbalancing:DescribeRules
Resource: "*"
HTTPSRulePriorityAction:
Metadata:
'aws:copilot:description': 'A custom resource assigning priority for HTTPS listener rules'
Type: Custom::RulePriorityFunction
Properties:
ServiceToken: !GetAtt RulePriorityFunction.Arn
RulePath: ["/"]
ListenerArn: !GetAtt EnvControllerAction.HTTPSListenerArn
HTTPRuleWithDomainPriorityAction:
Metadata:
'aws:copilot:description': 'A custom resource assigning priority for HTTP listener rules'
Type: Custom::RulePriorityFunction
Properties:
ServiceToken: !GetAtt RulePriorityFunction.Arn
RulePath: ["/"]
ListenerArn: !GetAtt EnvControllerAction.HTTPListenerArn
HTTPListenerRuleWithDomain:
Metadata:
'aws:copilot:description': 'An HTTP listener rule for path `/` that redirects HTTP to HTTPS'
Type: AWS::ElasticLoadBalancingV2::ListenerRule
Properties:
Actions:
- Type: redirect
RedirectConfig:
Protocol: HTTPS
Port: 443
Host: "#{host}"
Path: "/#{path}"
Query: "#{query}"
StatusCode: HTTP_301
Conditions:
- Field: 'host-header'
HostHeaderConfig:
Values: ["v2.REDACTED.REDACTED.uktrade.digital"]
- Field: 'path-pattern'
PathPatternConfig:
Values:
- /*
ListenerArn: !GetAtt EnvControllerAction.HTTPListenerArn
Priority: !GetAtt HTTPRuleWithDomainPriorityAction.Priority
HTTPSListenerRule:
Metadata:
'aws:copilot:description': 'An HTTPS listener rule for path `/` that forwards HTTPS traffic to your tasks'
Type: AWS::ElasticLoadBalancingV2::ListenerRule
Properties:
Actions:
- TargetGroupArn: !Ref TargetGroup
Type: forward
Conditions:
- Field: 'host-header'
HostHeaderConfig:
Values: ["v2.REDACTED.REDACTED.uktrade.digital"]
- Field: 'path-pattern'
PathPatternConfig:
Values:
- /*
ListenerArn: !GetAtt EnvControllerAction.HTTPSListenerArn
Priority: !GetAtt HTTPSRulePriorityAction.Priority
AddonsStack:
Metadata:
'aws:copilot:description': 'An Addons CloudFormation Stack for your additional AWS resources'
Type: AWS::CloudFormation::Stack
DependsOn: EnvControllerAction
Condition: HasAddons
Properties:
Parameters:
App: !Ref AppName
Env: !Ref EnvName
Name: !Ref WorkloadName
TemplateURL: !Ref AddonsTemplateURL
Outputs:
DiscoveryServiceARN:
Description: ARN of the Discovery Service.
Value: !GetAtt DiscoveryService.Arn
Export:
Name: !Sub ${AWS::StackName}-DiscoveryServiceARN
Ummm, your manifest and CloudFormation look good to me! Given that the nginx container's nginx.conf is listening on 8080, I don't think there is any issue there. Let's see whether the issue manifests again after increasing grace_period, which would indicate that there are deeper issues. Feel free to update the thread in that case.
> ...the fact that the service can respond successfully to several health checks, then suddenly be considered to have failed, with no logs anywhere to make clear how that determination was made, is a bad smell for me.
Totally agree. I think it'd be helpful if the Target Group page could show more information (e.g. the actual response code received, or whether the check timed out). I can forward this feedback for you!
> I think it'd be helpful if the Target Group page could show more information (e.g. the actual response code received, or whether the check timed out).
While you're passing on feedback, being able to view logs for ELB health checks would be super useful for debugging this kind of thing. They wouldn't need to be kept for long; 24 hours would be plenty. It could also be something we only turn on when needed, e.g. "Enable health check logging for X hours". Or perhaps just log and display details of the last X failures.
> While you're passing on feedback, being able to view logs for ELB health checks would be super useful for debugging this kind of thing. They wouldn't need to be kept for long; 24 hours would be plenty. It could also be something we only turn on when needed, e.g. "Enable health check logging for X hours". Or perhaps just log and display details of the last X failures.
Yes please, this would be a huge win!
This issue is stale because it has been open 60 days with no response activity. Remove the stale label, add a comment, or this will be closed in 14 days.
This issue is closed due to inactivity. Feel free to reopen the issue if you have any further questions!
We have started experiencing intermittent ELB health check failures during deployment of a load balanced web service.
There is nothing in the task logs to indicate any kind of failure to respond nicely to the ELB health check requests. The task is trotting along, happily responding to requests, and then it is shut down.
The sequence of events goes something like this:
cds: add 2 cluster(s), remove 3 cluster(s)
The Service Connect thing might be a coincidence, but the correlation between the add/remove cluster logs and our health check failing is too consistent to ignore.
Service Connect is enabled via network.connect: true and we have done no other configuration on that front; the relevant manifest stanza is shown below. The service in question is part of our Django test application.
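This is the stanza as it appears in the manifest (shared in full above):

```yaml
network:
  connect: true # Enable Service Connect for intra-environment traffic between services.
```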
Its landing page connects to various addons plus a Celery worker. This landing page is used for the health check. The health check configuration on the service is...
Looking in the AWS Console, all these numbers correspond with the settings in the ELB health check, except the grace period, which does not appear in there.
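That's expected, as far as I can tell: grace_period is applied to the ECS service rather than the target group, which is why it doesn't show up alongside the ELB health check settings. In the generated template (shared above) it appears as:

```yaml
Service:
  Properties:
    # This may need to be adjusted if the container takes a while to start up
    HealthCheckGracePeriodSeconds: 120
```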
The task count on the service in question is 1. We think Service Connect might be "doing something wrong", but we're not certain of that.
It seems to have begun after we (foolishly) came back to work after the new year. This CloudWatch Logs Insights query...
...run against our 7 playground environments yields...
I'm sure some more information would be helpful too, just shout and I'll try to fill any gaps.