Possible issue with Service Connect & ELB Health Check

WillGibson commented 10 months ago

We have started experiencing intermittent ELB health check failures during deployment of a load balanced web service.

There is nothing in the task logs to indicate any kind of failure to respond nicely to the ELB health check requests. It's trotting along happily responding to requests, then it gets shut down.

The sequence of events goes something like this:

1. New task definition is deployed to ECS
2. When it's ready, it starts responding successfully to health check requests, often doing several before any woes occur
3. Service Connect logs stuff like cds: add 2 cluster(s), remove 3 cluster(s)
4. Within seconds (usually a handful, but I have seen a few handfuls) the task is declared unhealthy and shut down

The Service Connect thing might be a coincidence, but the correlation between the add/remove cluster logs and our health check failing is too consistent to ignore.

Service Connect is enabled via network.connect: true and we have done no other configuration on that front.

The service in question is part of our Django test application. ~~Its landing page, which just connects to various addons and a Celery worker is used for the health check.~~ Its landing page connects to various addons plus a Celery worker. This landing page is used for the health check.

The health check configuration on the service is...

http:
  ...
  healthcheck:
    path: '/'
    port: 8080
    success_codes: '302,200'
    healthy_threshold: 3
    unhealthy_threshold: 2
    interval: 35s
    timeout: 30s
    grace_period: 101s

Looking in the AWS Console, all these numbers correspond with the settings in the ELB health check except the grace period, which does not appear there.
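For a rough feel of the timings these settings imply, here is a minimal sketch. It assumes checks fire exactly once per interval and that thresholds count consecutive results, which is a simplification of what ELB actually does. As far as I understand, Copilot's grace_period maps to the ECS service's health check grace period rather than to the target group, which would explain why it is missing from the ELB settings. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Values copied from the manifest snippet above (all in seconds)."""
    healthy_threshold: int = 3
    unhealthy_threshold: int = 2
    interval: int = 35
    timeout: int = 30
    grace_period: int = 101


def seconds_to_healthy(hc: HealthCheck) -> int:
    # Simplified: every check passes and they are spaced exactly `interval` apart.
    return hc.healthy_threshold * hc.interval


def seconds_to_unhealthy(hc: HealthCheck) -> int:
    # Simplified: consecutive failures, one per interval.
    return hc.unhealthy_threshold * hc.interval


hc = HealthCheck()
print(seconds_to_healthy(hc))                    # 105
print(seconds_to_unhealthy(hc))                  # 70
print(hc.grace_period - seconds_to_healthy(hc))  # -4
```

Under these assumptions the 101s grace period is slightly shorter than the 105s needed for three passing checks, so there is not much slack if the container is slow to start listening.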

The task count on the service in question is 1.

We think Service Connect might be "doing something wrong", but we're not certain of that.

It seems to have begun after we (foolishly) came back to work after the new year. This CloudWatch Logs Insights query...

fields @timestamp, @message, @log
| filter @message like "cds_api_helper"
| filter @message like "remove"
| filter @message not like "remove 0"
| sort @timestamp asc
| limit 1500

...run against our 7 playground environments yields...

I'm sure some more information would be helpful too, just shout and I'll try to fill any gaps.

Lou1415926 commented 10 months ago

Hey @WillGibson sorry for the delay!! Are you still seeing the issue?

It should be normal for the Service Connect logs to occasionally contain lines like "remove x clusters". I've seen such logs in my successful deployments:

[2024-01-19 03:51:40.895][33][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 2 cluster(s), remove 1 cluster(s)
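If you export these log lines and want to line them up against health check failures offline, a small parser along these lines might help. It is only a sketch: the regular expression is derived from the single log line above, and the function names are hypothetical. The `removes_clusters` helper mirrors the intent of the `not like "remove 0"` filter in the Logs Insights query earlier in the thread.

```python
import re

# Pattern based on the one example line in this thread; other Envoy versions may differ.
CDS_PATTERN = re.compile(r"cds: add (\d+) cluster\(s\), remove (\d+) cluster\(s\)")


def parse_cds_line(line: str):
    """Return (added, removed) cluster counts, or None if the line is not a cds update."""
    match = CDS_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def removes_clusters(line: str) -> bool:
    """True only for cds updates that remove at least one cluster."""
    parsed = parse_cds_line(line)
    return parsed is not None and parsed[1] > 0


line = (
    "[2024-01-19 03:51:40.895][33][info][upstream] "
    "[source/common/upstream/cds_api_helper.cc:32] "
    "cds: add 2 cluster(s), remove 1 cluster(s)"
)
print(parse_cds_line(line))    # (2, 1)
print(removes_clusters(line))  # True
```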

Currently I am tempted to think that service connect should have little to do with the ELB health check failure. What the ELB health check does is just to send a request to the private IP address of the task over the health check path, and wait for a response from the task. Typically service connect is not involved in this process. But I could be wrong on that front, so I won't completely rule that out.
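To make that concrete, here is a rough stand-in for a single health check probe that you could run from somewhere with network access to the task (for example another container in the same VPC). It assumes plain HTTP on the health check port and treats any connection error or timeout as a failure; the function name and defaults are mine, chosen to match the manifest snippet above. `http.client` is used because it does not follow redirects, so a 302 is seen as a 302 and can be compared against the success codes.

```python
import http.client


def probe(host: str, port: int, path: str = "/",
          success_codes=(200, 302), timeout: float = 30.0) -> bool:
    """Send one GET request and report whether the status is an accepted code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status in success_codes
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts and malformed responses all count as a failed check.
        return False
    finally:
        conn.close()


# Example (hypothetical private IP of the task):
# probe("10.0.1.23", 8080)
```

Running this in a loop during a deployment, next to the Service Connect logs, might show whether requests stop reaching the container at the same moment the cds updates appear.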

> and a Celery worker is used for the health check.

How does your Celery worker respond to the health check? Do you see anything interesting in the logs of the main container (not the Service Connect sidecar)?

WillGibson commented 10 months ago

Hi @Lou1415926,

Yes, we are still seeing the issue. I guess it's some kind of networking problem. It's sad that there are no logs for the health check requests/responses, so we have no way of seeing what was actually going on when the health check failed. I think it's some kind of connectivity issue and the requests are just not getting through to the service container.

I have reworded a sentence in the original post to be clearer...

"Its landing page connects to various addons plus a Celery worker. This landing page is used for the health check."

This page returns a 200 regardless of its success in using the addons etc., so the Celery worker does not have any direct effect on the web service's health.

As it happens, we added a health check to the Celery worker service yesterday. It does not use ELB and so far has not had any issues with health checks failing when all is actually OK.

Lou1415926 commented 10 months ago

@WillGibson You can go to the Target Group console to see the "Health status details".

This troubleshooting page can hopefully help you decipher the reason codes.
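If you would rather pull the same details programmatically, the sketch below formats a response shaped like what I understand the ELBv2 DescribeTargetHealth API returns (for instance via boto3's `describe_target_health`). Treat the field names as an assumption to check against the API reference. The sample data is made up for illustration, and the helper name is hypothetical.

```python
def summarise_target_health(response: dict) -> list:
    """Turn a DescribeTargetHealth-style response into one readable line per target."""
    lines = []
    for desc in response.get("TargetHealthDescriptions", []):
        target = desc.get("Target", {})
        health = desc.get("TargetHealth", {})
        lines.append(
            f"{target.get('Id')}:{target.get('Port')} "
            f"state={health.get('State')} "
            f"reason={health.get('Reason', '-')} "
            f"description={health.get('Description', '-')}"
        )
    return lines


# Illustrative sample only; not taken from a real environment.
sample = {
    "TargetHealthDescriptions": [
        {
            "Target": {"Id": "10.0.1.23", "Port": 443},
            "HealthCheckPort": "8080",
            "TargetHealth": {
                "State": "unhealthy",
                "Reason": "Target.FailedHealthChecks",
                "Description": "Health checks failed",
            },
        }
    ]
}

for line in summarise_target_health(sample):
    print(line)
```

Polling this during a deployment and logging the output with timestamps would at least record when the state flips and which reason code was given.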

In addition, the "Events" tab of your service's page in the ECS console might be able to provide some clues too:

> When it's ready, it starts responding successfully to health check requests, often doing several before any woes occur

This is what baffles me the most. The first 2 health checks are reaching the container, which is responding properly, but the third seems to have suddenly failed to reach it. Hopefully we can gather more clues from those places!
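For reference, here is a simplified model of how a consecutive-failure threshold behaves. It is a sketch of the general technique, not of ELB's exact implementation, and the function name is made up. With the unhealthy_threshold of 2 from the original post, a single missed check after two good ones should not be enough on its own; it would take two misses in a row.

```python
def first_unhealthy_index(results, unhealthy_threshold):
    """Return the index of the check at which the target flips to unhealthy, or None.

    `results` is a sequence of booleans, True for a passed check.
    """
    consecutive_failures = 0
    for index, passed in enumerate(results):
        # Any passing check resets the failure streak.
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            return index
    return None


print(first_unhealthy_index([True, True, False, True], 2))   # None
print(first_unhealthy_index([True, True, False, False], 2))  # 3
```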

WillGibson commented 10 months ago

There doesn't seem to be anything helpful corresponding to when the problem has occurred in the Task's Events tab, just some port 443 is unhealthy in target-group which does not help debug it.

My colleagues @yusufsheiqh and @codeninja merged a commit adding AWS X-Ray yesterday morning which includes these changes to the health check configuration...

...and I couldn't get it to manifest the problem yesterday afternoon or this morning.

If I revert those changes (seems to be the grace period that matters) the problem will manifest, but I'm not sure if it's the same problem or something from adding X-Ray.

It seems to be failing consistently now, rather than intermittently, so maybe adding X-Ray does indeed require a longer grace period. But looking at the current example, it has responded to the health check with a 200 status code all 8 times before it is deemed to have failed health checks and shut down.

All it said in the Target Group's health status details column while that was going on is "Health checks failed", which is not new information 😂

Lou1415926 commented 9 months ago

> just some port 443 is unhealthy in target-group which does not help debug it.

From the manifest you provided above, it seems the health check port is set to 8080. You can look at Resources.TargetGroup.Properties.HealthCheckPort in the CloudFormation template to check that it is indeed 8080, as in the manifest. I also assume the landing page (which returns 200 and is what the health checks hit) is on port 8080 of the container rather than 443, right? This is probably not the reason the health checks intermittently fail, but I'd like to point it out in case it is relevant in the context of your application.
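If you have the template as a parsed dictionary (for example loaded from its JSON form), a dotted-path lookup like the one below is a quick way to read that property. This is a minimal sketch; the `dig` helper and the template fragment are hypothetical and only illustrate the path mentioned above.

```python
def dig(template: dict, dotted_path: str):
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    node = template
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical fragment standing in for the real template.
template = {
    "Resources": {
        "TargetGroup": {
            "Properties": {"HealthCheckPort": "8080"},
        }
    }
}

print(dig(template, "Resources.TargetGroup.Properties.HealthCheckPort"))  # 8080
```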

> with a 200 status code all 8 times before it is deemed to have failed health checks and shut down.

My best guess was also that grace_period needed to be increased; a grace period that is too short is one of the most common reasons for occasionally failing health checks. If the issue persists, I'd recommend contacting AWS Support, as the engineers involved will have more visibility into what actually happened inside your service.

In the meantime, I am also happy to check your entire manifest to validate the configuration, if you are willing to share!

WillGibson commented 9 months ago

Resources.TargetGroup.Properties.HealthCheckPort is 8080, which is what the nginx container is set to serve stuff up on.

Increasing the grace period is OK. If the problem doesn't distract us any more that's good, but the fact that it can respond successfully several times to the health checks then suddenly be considered to fail, and there are no logs anywhere to make clear how it is considered to have failed, is a bad smell for me.

Here is the slightly redacted manifest for our web service (would attach, but YAML is not supported by GitHub)...

# Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
AWSTemplateFormatVersion: 2010-09-09
Description: CloudFormation template that represents a load balanced web service on Amazon ECS using AWS Copilot with YAML patches.
Metadata:
    Version: v1.32.1
    Manifest: |
        # The manifest for the "web" service.
        # Read the full specification for the "Load Balanced Web Service" type at:
        #  https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/

        # Your service name will be used in naming your resources like log groups, ECS services, etc.
        name: web
        type: Load Balanced Web Service
        # Distribute traffic to your service.
        http:
          # Requests to this path will be forwarded to your service.
          # To match all requests you can use the "/" path.
          path: '/'
          # You can specify a custom health check path. The default is "/".
          # healthcheck: '/'
          target_container: nginx
          healthcheck:
            path: '/'
            port: 8080
            success_codes: '200'
            healthy_threshold: 3
            unhealthy_threshold: 3
            interval: 35s
            timeout: 30s
            grace_period: 120s
        sidecars:
          nginx:
            port: 443
            image: REDACTED
            variables:
              SERVER: localhost:8000
          ipfilter:
            port: 8000
            image: REDACTED
            variables:
              PORT: 8000
              SERVER: localhost:8080
              APPCONFIG_PROFILES: ipfilter:default:default
              IPFILTER_ENABLED: True
              EMAIL: REDACTED
              PROTECTED_PATHS: /
          appconfig:
            port: 2772
            image: REDACTED
            essential: true
            variables:
              ROLE_ARN: arn:aws:iam::REDACTED:role/AppConfigIpFilterRole
        # Configuration for your containers and service.
        image:
          location: REDACTED
          # Port exposed through your container to route traffic to it.
          port: 8080
        cpu: 256 # Number of CPU units for the task.
        memory: 1024 # Amount of memory in MiB used by the task.
        count: 1 # Number of tasks that should be running in your service.
        exec: true # Enable running commands in your container.
        network:
          connect: true # Enable Service Connect for intra-environment traffic between services.
          vpc:
            placement: 'private'
        storage:
          readonly_fs: false
        observability:
          tracing: awsxray
        # Optional fields for more advanced use-cases.
        #
        variables: # Pass environment variables as key value pairs.
          SECRET_KEY: REDACTED
          PORT: 8080
          DEBUG: True
          S3_BUCKET_NAME: REDACTED
          ALLOWED_HOSTS: "*"
          OTEL_PROPAGATORS: xray
          OTEL_PYTHON_ID_GENERATOR: xray
          OTEL_SERVICE_NAME: REDACTED-REDACTED-web
          OTEL_METRICS_EXPORTER: console,otlp
          OTEL_TRACES_EXPORTER: console,otlp
          OTEL_TRACES_SAMPLER: traceidratio
          OTEL_TRACES_SAMPLER_ARG: "0.05"
        secrets: # Pass secrets from AWS Systems Manager (SSM) Parameter Store.
          DJANGO_SECRET_KEY: /copilot/REDACTED/REDACTED/secrets/DJANGO_SECRET_KEY
          OPENSEARCH_ENDPOINT: /copilot/REDACTED/REDACTED/secrets/REDACTED_OPENSEARCH
          REDIS_ENDPOINT: /copilot/REDACTED/REDACTED/secrets/REDACTED_REDIS
          DATABASE_CREDENTIALS:
            secretsmanager: /copilot/REDACTED/REDACTED/secrets/REDACTED_POSTGRES
          RDS_DATABASE_CREDENTIALS:
            secretsmanager: /copilot/REDACTED/REDACTED/secrets/REDACTED_RDS_POSTGRES
        # You can override any of the values defined above by environment.
        environments:
          dev:
            http:
              alias: v2.REDACTED.dev.uktrade.digital
          ant:
            http:
              alias: v2.REDACTED.ant.uktrade.digital
          staging:
            http:
              alias: v2.REDACTED.staging.uktrade.digital
            sidecars:
              ipfilter:
                variables:
                  IPFILTER_ENABLED: False
          REDACTED:
            http:
              alias: v2.REDACTED.REDACTED.uktrade.digital
          REDACTED:
            http:
              alias: v2.REDACTED.REDACTED.uktrade.digital
          REDACTED:
            http:
              alias: v2.REDACTED.REDACTED.uktrade.digital
          REDACTED:
            http:
              alias: v2.REDACTED.REDACTED.uktrade.digital
          REDACTED:
            http:
              alias: v2.REDACTED.REDACTED.uktrade.digital
Parameters:
    AppName:
        Type: String
    EnvName:
        Type: String
    WorkloadName:
        Type: String
    ContainerImage:
        Type: String
    ContainerPort:
        Type: Number
    TaskCPU:
        Type: String
    TaskMemory:
        Type: String
    TaskCount:
        Type: Number
    DNSDelegated:
        Type: String
        AllowedValues: [true, false]
    LogRetention:
        Type: Number
    AddonsTemplateURL:
        Description: 'URL of the addons nested stack template within the S3 bucket.'
        Type: String
        Default: ""
    EnvFileARN:
        Description: 'URL of the environment file.'
        Type: String
        Default: ""
    EnvFileARNForappconfig:
        Type: String
        Description: 'URL of the environment file for the appconfig sidecar.'
        Default: ""
    EnvFileARNForipfilter:
        Type: String
        Description: 'URL of the environment file for the ipfilter sidecar.'
        Default: ""
    EnvFileARNFornginx:
        Type: String
        Description: 'URL of the environment file for the nginx sidecar.'
        Default: ""
    ArtifactKeyARN:
        Type: String
        Description: 'KMS Key used for encrypting artifacts'
    TargetContainer:
        Type: String
    TargetPort:
        Type: Number
    HTTPSEnabled:
        Type: String
        AllowedValues: [true, false]
    RulePath:
        Type: String
Conditions:
    IsGovCloud: !Equals [!Ref "AWS::Partition", "aws-us-gov"]
    HasAssociatedDomain: !Equals [!Ref DNSDelegated, true]
    HasAddons: !Not [!Equals [!Ref AddonsTemplateURL, ""]]
    HasEnvFile: !Not [!Equals [!Ref EnvFileARN, ""]]
    HasEnvFileForappconfig: !Not [!Equals [!Ref EnvFileARNForappconfig, ""]]
    HasEnvFileForipfilter: !Not [!Equals [!Ref EnvFileARNForipfilter, ""]]
    HasEnvFileFornginx: !Not [!Equals [!Ref EnvFileARNFornginx, ""]]
Resources: # If a bucket URL is specified, that means the template exists.
    LogGroup:
        Metadata:
            'aws:copilot:description': 'A CloudWatch log group to hold your service logs'
        Type: AWS::Logs::LogGroup
        Properties:
            LogGroupName: !Sub '/copilot/${AppName}/${EnvName}/${WorkloadName}'
            RetentionInDays: !Ref LogRetention
    TaskDefinition:
        Metadata:
            'aws:copilot:description': 'An ECS task definition to group your containers and run them on ECS'
        Type: AWS::ECS::TaskDefinition
        DependsOn: LogGroup
        Properties:
            Family: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName]]
            NetworkMode: awsvpc
            RequiresCompatibilities:
                - FARGATE
            Cpu: !Ref TaskCPU
            Memory: !Ref TaskMemory
            ExecutionRoleArn: !GetAtt ExecutionRole.Arn
            TaskRoleArn: !GetAtt TaskRole.Arn
            ContainerDefinitions:
                - Name: !Ref WorkloadName
                  Image: !Ref ContainerImage
                  Secrets:
                    - Name: DATABASE_CREDENTIALS
                      ValueFrom: !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:/copilot/REDACTED/REDACTED/secrets/REDACTED_POSTGRES'
                    - Name: DJANGO_SECRET_KEY
                      ValueFrom: /copilot/REDACTED/REDACTED/secrets/DJANGO_SECRET_KEY
                    - Name: OPENSEARCH_ENDPOINT
                      ValueFrom: /copilot/REDACTED/REDACTED/secrets/REDACTED_OPENSEARCH
                    - Name: RDS_DATABASE_CREDENTIALS
                      ValueFrom: !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:/copilot/REDACTED/REDACTED/secrets/REDACTED_RDS_POSTGRES'
                    - Name: REDIS_ENDPOINT
                      ValueFrom: /copilot/REDACTED/REDACTED/secrets/REDACTED_REDIS
                  Environment:
                    - Name: COPILOT_APPLICATION_NAME
                      Value: !Sub '${AppName}'
                    - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
                      Value: REDACTED.REDACTED.local
                    - Name: COPILOT_ENVIRONMENT_NAME
                      Value: !Sub '${EnvName}'
                    - Name: COPILOT_SERVICE_NAME
                      Value: !Sub '${WorkloadName}'
                    - Name: COPILOT_LB_DNS
                      Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
                    - Name: ALLOWED_HOSTS
                      Value: "*"
                    - Name: DEBUG
                      Value: "True"
                    - Name: OTEL_METRICS_EXPORTER
                      Value: "console,otlp"
                    - Name: OTEL_PROPAGATORS
                      Value: "xray"
                    - Name: OTEL_PYTHON_ID_GENERATOR
                      Value: "xray"
                    - Name: OTEL_SERVICE_NAME
                      Value: "REDACTED-REDACTED-web"
                    - Name: OTEL_TRACES_EXPORTER
                      Value: "console,otlp"
                    - Name: OTEL_TRACES_SAMPLER
                      Value: "traceidratio"
                    - Name: OTEL_TRACES_SAMPLER_ARG
                      Value: "0.05"
                    - Name: PORT
                      Value: "8080"
                    - Name: S3_BUCKET_NAME
                      Value: "REDACTED-s3-bucket-REDACTED"
                    - Name: SECRET_KEY
                      Value: "REDACTED"
                  EnvironmentFiles:
                    - !If
                      - HasEnvFile
                      - Type: s3
                        Value: !Ref EnvFileARN
                      - !Ref AWS::NoValue
                  LogConfiguration:
                    LogDriver: awslogs
                    Options:
                        awslogs-region: !Ref AWS::Region
                        awslogs-group: !Ref LogGroup
                        awslogs-stream-prefix: copilot
                  PortMappings:
                    - ContainerPort: 8080
                      Protocol: tcp
                  ReadonlyRootFilesystem: false
                  MountPoints:
                    - ContainerPath: /tmp
                      SourceVolume: temporary-fs
                - Name: aws-otel-collector
                  Image: public.ecr.aws/aws-observability/aws-otel-collector:v0.17.0
                  Command:
                    - --config=/etc/ecs/ecs-xray.yaml
                  LogConfiguration:
                    LogDriver: awslogs
                    Options:
                        awslogs-region: !Ref AWS::Region
                        awslogs-group: !Ref LogGroup
                        awslogs-stream-prefix: copilot
                - Name: appconfig
                  Image: public.ecr.aws/aws-appconfig/aws-appconfig-agent:2.x
                  Essential: true
                  PortMappings:
                    - ContainerPort: 2772
                      Protocol: tcp
                  Environment:
                    - Name: COPILOT_APPLICATION_NAME
                      Value: !Sub '${AppName}'
                    - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
                      Value: REDACTED.REDACTED.local
                    - Name: COPILOT_ENVIRONMENT_NAME
                      Value: !Sub '${EnvName}'
                    - Name: COPILOT_SERVICE_NAME
                      Value: !Sub '${WorkloadName}'
                    - Name: COPILOT_LB_DNS
                      Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
                    - Name: ROLE_ARN
                      Value: "arn:aws:iam::REDACTED:role/AppConfigIpFilterRole"
                  EnvironmentFiles:
                    - !If
                      - HasEnvFileForappconfig
                      - Type: "s3"
                        Value: !Ref EnvFileARNForappconfig
                      - !Ref "AWS::NoValue"
                  LogConfiguration:
                    LogDriver: awslogs
                    Options:
                        awslogs-region: !Ref AWS::Region
                        awslogs-group: !Ref LogGroup
                        awslogs-stream-prefix: copilot
                - Name: ipfilter
                  Image: public.ecr.aws/uktrade/ip-filter:latest
                  PortMappings:
                    - ContainerPort: 8000
                      Protocol: tcp
                  Environment:
                    - Name: COPILOT_APPLICATION_NAME
                      Value: !Sub '${AppName}'
                    - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
                      Value: REDACTED.REDACTED.local
                    - Name: COPILOT_ENVIRONMENT_NAME
                      Value: !Sub '${EnvName}'
                    - Name: COPILOT_SERVICE_NAME
                      Value: !Sub '${WorkloadName}'
                    - Name: COPILOT_LB_DNS
                      Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
                    - Name: APPCONFIG_PROFILES
                      Value: "ipfilter:default:default"
                    - Name: EMAIL
                      Value: "REDACTED"
                    - Name: IPFILTER_ENABLED
                      Value: "True"
                    - Name: PORT
                      Value: "8000"
                    - Name: PROTECTED_PATHS
                      Value: "/"
                    - Name: SERVER
                      Value: "localhost:8080"
                  EnvironmentFiles:
                    - !If
                      - HasEnvFileForipfilter
                      - Type: "s3"
                        Value: !Ref EnvFileARNForipfilter
                      - !Ref "AWS::NoValue"
                  LogConfiguration:
                    LogDriver: awslogs
                    Options:
                        awslogs-region: !Ref AWS::Region
                        awslogs-group: !Ref LogGroup
                        awslogs-stream-prefix: copilot
                - Name: nginx
                  Image: REDACTED
                  PortMappings:
                    - ContainerPort: 443
                      Name: target
                      Protocol: tcp
                  Environment:
                    - Name: COPILOT_APPLICATION_NAME
                      Value: !Sub '${AppName}'
                    - Name: COPILOT_SERVICE_DISCOVERY_ENDPOINT
                      Value: REDACTED.REDACTED.local
                    - Name: COPILOT_ENVIRONMENT_NAME
                      Value: !Sub '${EnvName}'
                    - Name: COPILOT_SERVICE_NAME
                      Value: !Sub '${WorkloadName}'
                    - Name: COPILOT_LB_DNS
                      Value: !GetAtt EnvControllerAction.PublicLoadBalancerDNSName
                    - Name: SERVER
                      Value: "localhost:8000"
                  EnvironmentFiles:
                    - !If
                      - HasEnvFileFornginx
                      - Type: "s3"
                        Value: !Ref EnvFileARNFornginx
                      - !Ref "AWS::NoValue"
                  LogConfiguration:
                    LogDriver: awslogs
                    Options:
                        awslogs-region: !Ref AWS::Region
                        awslogs-group: !Ref LogGroup
                        awslogs-stream-prefix: copilot
            Volumes:
                - Name: temporary-fs
    ExecutionRole:
        Metadata:
            'aws:copilot:description': 'An IAM Role for the Fargate agent to make AWS API calls on your behalf'
        Type: AWS::IAM::Role
        Properties:
            AssumeRolePolicyDocument:
                Version: '2012-10-17'
                Statement:
                    - Effect: Allow
                      Principal:
                        Service: ecs-tasks.amazonaws.com
                      Action: 'sts:AssumeRole'
            Policies:
                - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, SecretsPolicy]]
                  PolicyDocument:
                    Version: '2012-10-17'
                    Statement:
                        - Effect: 'Allow'
                          Action:
                            - 'ssm:GetParameters'
                          Resource:
                            - !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/*'
                          Condition:
                            StringEquals:
                                'ssm:ResourceTag/copilot-application': !Sub '${AppName}'
                                'ssm:ResourceTag/copilot-environment': !Sub '${EnvName}'
                        - Effect: 'Allow'
                          Action:
                            - 'secretsmanager:GetSecretValue'
                          Resource:
                            - !Sub 'arn:${AWS::Partition}:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:*'
                          Condition:
                            StringEquals:
                                'secretsmanager:ResourceTag/copilot-application': !Sub '${AppName}'
                                'secretsmanager:ResourceTag/copilot-environment': !Sub '${EnvName}'
                        - Effect: 'Allow'
                          Action:
                            - 'kms:Decrypt'
                          Resource:
                            - !Ref ArtifactKeyARN
                - !If
                  # Optional IAM permission required by ECS task def env file
                  # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/taskdef-envfiles.html#taskdef-envfiles-iam
                  # Example EnvFileARN: arn:aws:s3:::stackset-demo-infrastruc-pipelinebuiltartifactbuc-11dj7ctf52wyf/manual/1638391936/env
                  - HasEnvFile
                  - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicy]]
                    PolicyDocument:
                        Version: '2012-10-17'
                        Statement:
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetObject'
                              Resource:
                                - !Ref EnvFileARN
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetBucketLocation'
                              Resource:
                                - !Join
                                  - ''
                                  - - 'arn:'
                                    - !Ref AWS::Partition
                                    - ':s3:::'
                                    - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARN]]]]
                  - !Ref AWS::NoValue
                - !If
                  - HasEnvFileForappconfig
                  - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicyForappconfig]]
                    PolicyDocument:
                        Version: '2012-10-17'
                        Statement:
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetObject'
                              Resource:
                                - !Ref EnvFileARNForappconfig
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetBucketLocation'
                              Resource:
                                - !Join
                                  - ''
                                  - - 'arn:'
                                    - !Ref AWS::Partition
                                    - ':s3:::'
                                    - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARNForappconfig]]]]
                  - !Ref AWS::NoValue
                - !If
                  - HasEnvFileForipfilter
                  - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicyForipfilter]]
                    PolicyDocument:
                        Version: '2012-10-17'
                        Statement:
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetObject'
                              Resource:
                                - !Ref EnvFileARNForipfilter
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetBucketLocation'
                              Resource:
                                - !Join
                                  - ''
                                  - - 'arn:'
                                    - !Ref AWS::Partition
                                    - ':s3:::'
                                    - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARNForipfilter]]]]
                  - !Ref AWS::NoValue
                - !If
                  - HasEnvFileFornginx
                  - PolicyName: !Join ['', [!Ref AppName, '-', !Ref EnvName, '-', !Ref WorkloadName, GetEnvFilePolicyFornginx]]
                    PolicyDocument:
                        Version: '2012-10-17'
                        Statement:
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetObject'
                              Resource:
                                - !Ref EnvFileARNFornginx
                            - Effect: 'Allow'
                              Action:
                                - 's3:GetBucketLocation'
                              Resource:
                                - !Join
                                  - ''
                                  - - 'arn:'
                                    - !Ref AWS::Partition
                                    - ':s3:::'
                                    - !Select [0, !Split ['/', !Select [5, !Split [':', !Ref EnvFileARNFornginx]]]]
                  - !Ref AWS::NoValue
            ManagedPolicyArns:
                - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'
    TaskRole:
        Metadata:
            'aws:copilot:description': 'An IAM role to control permissions for the containers in your tasks'
        Type: AWS::IAM::Role
        Properties:
            ManagedPolicyArns:
                - Fn::GetAtt: [AddonsStack, Outputs.appConfigAccessPolicy]
                - Fn::GetAtt: [AddonsStack, Outputs.REDACTEDS3BucketAccessPolicy]
                - Fn::GetAtt: [AddonsStack, Outputs.XRayAccessPolicy]
            AssumeRolePolicyDocument:
                Version: '2012-10-17'
                Statement:
                    - Effect: Allow
                      Principal:
                        Service: ecs-tasks.amazonaws.com
                      Action: 'sts:AssumeRole'
            Policies:
                - PolicyName: 'DenyIAM'
                  PolicyDocument:
                    Version: '2012-10-17'
                    Statement:
                        - Effect: 'Deny'
                          Action: 'iam:*'
                          Resource: '*'
                - PolicyName: 'ExecuteCommand'
                  PolicyDocument:
                    Version: '2012-10-17'
                    Statement:
                        - Effect: 'Allow'
                          Action: ["ssmmessages:CreateControlChannel", "ssmmessages:OpenControlChannel", "ssmmessages:CreateDataChannel", "ssmmessages:OpenDataChannel"]
                          Resource: "*"
                        - Effect: 'Allow'
                          Action: ["logs:CreateLogStream", "logs:DescribeLogGroups", "logs:DescribeLogStreams", "logs:PutLogEvents"]
                          Resource: "*"
                - PolicyName: 'AWSDistroOpenTelemetryPolicy'
                  PolicyDocument:
                    Version: '2012-10-17'
                    Statement:
                        - Effect: 'Allow'
                          Action:
                            - 'logs:PutLogEvents'
                            - 'logs:CreateLogGroup'
                            - 'logs:CreateLogStream'
                            - 'logs:DescribeLogStreams'
                            - 'logs:DescribeLogGroups'
                            - 'xray:PutTraceSegments'
                            - 'xray:PutTelemetryRecords'
                            - 'xray:GetSamplingRules'
                            - 'xray:GetSamplingTargets'
                            - 'xray:GetSamplingStatisticSummaries'
                          Resource: "*"
    DiscoveryService:
        Metadata:
            'aws:copilot:description': 'Service discovery for your services to communicate within the VPC'
        Type: AWS::ServiceDiscovery::Service
        Properties:
            Description: Discovery Service for the Copilot services
            DnsConfig:
                RoutingPolicy: MULTIVALUE
                DnsRecords:
                    - TTL: 10
                      Type: A
                    - TTL: 10
                      Type: SRV
            HealthCheckCustomConfig:
                FailureThreshold: 1
            Name: !Ref WorkloadName
            NamespaceId:
                Fn::ImportValue: !Sub '${AppName}-${EnvName}-ServiceDiscoveryNamespaceID'
    EnvControllerAction:
        Metadata:
            'aws:copilot:description': "Update your environment's shared resources"
        Type: Custom::EnvControllerFunction
        Properties:
            ServiceToken: !GetAtt EnvControllerFunction.Arn
            Workload: !Ref WorkloadName
            Aliases: ["v2.REDACTED.REDACTED.uktrade.digital"]
            EnvStack: !Sub '${AppName}-${EnvName}'
            Parameters: [ALBWorkloads, Aliases, NATWorkloads]
            EnvVersion: v1.32.1
    EnvControllerFunction:
        Type: AWS::Lambda::Function
        Properties:
            Code:
                S3Bucket: stackset-REDACTED-infr-pipelinebuiltartifactbuc-REDACTED
                S3Key: manual/scripts/custom-resources/envcontrollerfunction/REDACTED.zip
            Handler: "index.handler"
            Timeout: 900
            MemorySize: 512
            Role: !GetAtt 'EnvControllerRole.Arn'
            Runtime: nodejs16.x
    EnvControllerRole:
        Metadata:
            'aws:copilot:description': "An IAM role to update your environment stack"
        Type: AWS::IAM::Role
        Properties:
            AssumeRolePolicyDocument:
                Version: '2012-10-17'
                Statement:
                    - Effect: Allow
                      Principal:
                        Service:
                            - lambda.amazonaws.com
                      Action:
                        - sts:AssumeRole
            Path: /
            Policies:
                - PolicyName: "EnvControllerStackUpdate"
                  PolicyDocument:
                    Version: '2012-10-17'
                    Statement:
                        - Effect: Allow
                          Action:
                            - cloudformation:DescribeStacks
                            - cloudformation:UpdateStack
                          Resource: !Sub 'arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/${AppName}-${EnvName}/*'
                          Condition:
                            StringEquals:
                                'cloudformation:ResourceTag/copilot-application': !Sub '${AppName}'
                                'cloudformation:ResourceTag/copilot-environment': !Sub '${EnvName}'
                - PolicyName: "EnvControllerRolePass"
                  PolicyDocument:
                    Version: '2012-10-17'
                    Statement:
                        - Effect: Allow
                          Action:
                            - iam:PassRole
                          Resource: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${AppName}-${EnvName}-CFNExecutionRole'
                          Condition:
                            StringEquals:
                                'iam:ResourceTag/copilot-application': !Sub '${AppName}'
                                'iam:ResourceTag/copilot-environment': !Sub '${EnvName}'
            ManagedPolicyArns:
                - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
    Service:
        Metadata:
            'aws:copilot:description': 'An ECS service to run and maintain your tasks in the environment cluster'
        Type: AWS::ECS::Service
        DependsOn:
            - HTTPListenerRuleWithDomain
            - HTTPSListenerRule
        Properties:
            PlatformVersion: LATEST
            Cluster:
                Fn::ImportValue: !Sub '${AppName}-${EnvName}-ClusterId'
            TaskDefinition: !Ref TaskDefinition
            DesiredCount: !Ref TaskCount
            DeploymentConfiguration:
                DeploymentCircuitBreaker:
                    Enable: true
                    Rollback: true
                MinimumHealthyPercent: 100
                MaximumPercent: 200
                Alarms: !If
                    - IsGovCloud
                    - !Ref AWS::NoValue
                    - Enable: false
                      AlarmNames: []
                      Rollback: true
            PropagateTags: SERVICE
            EnableExecuteCommand: true
            LaunchType: FARGATE
            ServiceConnectConfiguration:
                Enabled: True
                Namespace: REDACTED.REDACTED.local
                LogConfiguration:
                    LogDriver: awslogs
                    Options:
                        awslogs-region: !Ref AWS::Region
                        awslogs-group: !Ref LogGroup
                        awslogs-stream-prefix: copilot
                Services:
                    - PortName: target
                      # Avoid using the same service with Service Discovery in a namespace.
                      DiscoveryName: !Join ["-", [!Ref WorkloadName, "sc"]]
                      ClientAliases:
                        - Port: !Ref TargetPort
                          DnsName: !Ref WorkloadName
            NetworkConfiguration:
                AwsvpcConfiguration:
                    AssignPublicIp: DISABLED
                    Subnets:
                        Fn::Split:
                            - ','
                            - Fn::ImportValue: !Sub '${AppName}-${EnvName}-PrivateSubnets'
                    SecurityGroups:
                        - Fn::ImportValue: !Sub '${AppName}-${EnvName}-EnvironmentSecurityGroup'
            # This may need to be adjusted if the container takes a while to start up
            HealthCheckGracePeriodSeconds: 120
            LoadBalancers:
                - ContainerName: nginx
                  ContainerPort: 443
                  TargetGroupArn: !Ref TargetGroup
            ServiceRegistries:
                - RegistryArn: !GetAtt DiscoveryService.Arn
                  Port: !Ref TargetPort
    TargetGroup:
        Metadata:
            'aws:copilot:description': "A target group to connect the load balancer to your service on port 443"
        Type: AWS::ElasticLoadBalancingV2::TargetGroup
        Properties:
            HealthCheckPath: / # Default is '/'.
            HealthCheckPort: 8080 # Default is 'traffic-port'.
            Matcher:
                HttpCode: 200
            HealthyThresholdCount: 3
            UnhealthyThresholdCount: 3
            HealthCheckIntervalSeconds: 35
            HealthCheckTimeoutSeconds: 30
            HealthCheckProtocol: HTTP
            Port: 443
            Protocol: HTTPS
            TargetGroupAttributes:
                - Key: deregistration_delay.timeout_seconds
                  Value: 60 # ECS Default is 300; Copilot default is 60.
                - Key: stickiness.enabled
                  Value: false
            TargetType: ip
            VpcId:
                Fn::ImportValue: !Sub "${AppName}-${EnvName}-VpcId"
    RulePriorityFunction:
        Type: AWS::Lambda::Function
        Properties:
            Code:
                S3Bucket: stackset-REDACTED-infr-pipelinebuiltartifactbuc-REDACTED
                S3Key: manual/scripts/custom-resources/rulepriorityfunction/REDACTED.zip
            Handler: "index.nextAvailableRulePriorityHandler"
            Timeout: 600
            MemorySize: 512
            Role: !GetAtt "RulePriorityFunctionRole.Arn"
            Runtime: nodejs16.x
    RulePriorityFunctionRole:
        Metadata:
            'aws:copilot:description': "An IAM Role to describe load balancer rules for assigning a priority"
        Type: AWS::IAM::Role
        Properties:
            AssumeRolePolicyDocument:
                Version: '2012-10-17'
                Statement:
                    - Effect: Allow
                      Principal:
                        Service:
                            - lambda.amazonaws.com
                      Action:
                        - sts:AssumeRole
            Path: /
            ManagedPolicyArns:
                - !Sub arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
            Policies:
                - PolicyName: "RulePriorityGeneratorAccess"
                  PolicyDocument:
                    Version: '2012-10-17'
                    Statement:
                        - Effect: Allow
                          Action:
                            - elasticloadbalancing:DescribeRules
                          Resource: "*"
    HTTPSRulePriorityAction:
        Metadata:
            'aws:copilot:description': 'A custom resource assigning priority for HTTPS listener rules'
        Type: Custom::RulePriorityFunction
        Properties:
            ServiceToken: !GetAtt RulePriorityFunction.Arn
            RulePath: ["/"]
            ListenerArn: !GetAtt EnvControllerAction.HTTPSListenerArn
    HTTPRuleWithDomainPriorityAction:
        Metadata:
            'aws:copilot:description': 'A custom resource assigning priority for HTTP listener rules'
        Type: Custom::RulePriorityFunction
        Properties:
            ServiceToken: !GetAtt RulePriorityFunction.Arn
            RulePath: ["/"]
            ListenerArn: !GetAtt EnvControllerAction.HTTPListenerArn
    HTTPListenerRuleWithDomain:
        Metadata:
            'aws:copilot:description': 'An HTTP listener rule for path `/` that redirects HTTP to HTTPS'
        Type: AWS::ElasticLoadBalancingV2::ListenerRule
        Properties:
            Actions:
                - Type: redirect
                  RedirectConfig:
                    Protocol: HTTPS
                    Port: 443
                    Host: "#{host}"
                    Path: "/#{path}"
                    Query: "#{query}"
                    StatusCode: HTTP_301
            Conditions:
                - Field: 'host-header'
                  HostHeaderConfig:
                    Values: ["v2.REDACTED.REDACTED.uktrade.digital"]
                - Field: 'path-pattern'
                  PathPatternConfig:
                    Values:
                        - /*
            ListenerArn: !GetAtt EnvControllerAction.HTTPListenerArn
            Priority: !GetAtt HTTPRuleWithDomainPriorityAction.Priority
    HTTPSListenerRule:
        Metadata:
            'aws:copilot:description': 'An HTTPS listener rule for path `/` that forwards HTTPS traffic to your tasks'
        Type: AWS::ElasticLoadBalancingV2::ListenerRule
        Properties:
            Actions:
                - TargetGroupArn: !Ref TargetGroup
                  Type: forward
            Conditions:
                - Field: 'host-header'
                  HostHeaderConfig:
                    Values: ["v2.REDACTED.REDACTED.uktrade.digital"]
                - Field: 'path-pattern'
                  PathPatternConfig:
                    Values:
                        - /*
            ListenerArn: !GetAtt EnvControllerAction.HTTPSListenerArn
            Priority: !GetAtt HTTPSRulePriorityAction.Priority
    AddonsStack:
        Metadata:
            'aws:copilot:description': 'An Addons CloudFormation Stack for your additional AWS resources'
        Type: AWS::CloudFormation::Stack
        DependsOn: EnvControllerAction
        Condition: HasAddons
        Properties:
            Parameters:
                App: !Ref AppName
                Env: !Ref EnvName
                Name: !Ref WorkloadName
            TemplateURL: !Ref AddonsTemplateURL
Outputs:
    DiscoveryServiceARN:
        Description: ARN of the Discovery Service.
        Value: !GetAtt DiscoveryService.Arn
        Export:
            Name: !Sub ${AWS::StackName}-DiscoveryServiceARN

Lou1415926 commented 9 months ago

Ummm, your manifest and CloudFormation template look good to me! Given that the nginx container's nginx.conf is listening on 8080, I don't think there is any issue with the configuration. Let's see whether the issue shows up again after increasing grace_period; if it does, that would indicate a deeper issue. Feel free to update the thread in that case.
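
For reference, a minimal sketch of what increasing grace_period could look like in the service manifest. It reuses the health check settings posted at the top of this issue; the 180s value is purely illustrative (an assumption, not a recommendation), and the other fields under http that were elided in the original post are left out here too. As far as I know, grace_period is applied on the ECS service (HealthCheckGracePeriodSeconds) rather than on the target group, which would explain why it does not show up in the ELB health check settings in the console.

```yaml
# Sketch of the manifest's health check block; only grace_period changes.
http:
  healthcheck:
    path: '/'
    port: 8080
    success_codes: '302,200'
    healthy_threshold: 3
    unhealthy_threshold: 2
    interval: 35s
    timeout: 30s
    grace_period: 180s # Illustrative value (assumption); previously 101s.
```

If the task is still marked unhealthy after the longer grace period has elapsed, startup time is unlikely to be the cause.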

> the fact that it can respond successfully several times to the health checks then suddenly be considered to fail, and there are no logs anywhere to make clear how it is considered to have failed, is a bad smell for me.

Totally agree. I think it'd be helpful if the TargetGroup page could show more information (e.g. the actual response code received, or whether it has timed out). I can forward this feedback for you!

WillGibson commented 9 months ago

> I think it'd be helpful if the TargetGroup page can show more information (e.g. the actual response code received, or whether it has timed out).

While you're passing on feedback, being able to view logs for ELB health checks would be super useful for debugging this kind of thing. They wouldn't need to be kept for long, 24 hours would be plenty. It could also be something we only turn on when needed maybe, e.g. "Enable health check logging for X hours". Or perhaps just log and display details of the last X failures.

ssyberg commented 9 months ago

> While you're passing on feedback, being able to view logs for ELB health checks would be super useful for debugging this kind of thing. They wouldn't need to be kept for long, 24 hours would be plenty. It could also be something we only turn on when needed maybe, e.g. "Enable health check logging for X hours". Or perhaps just log and display details of the last X failures.

Yes please, this would be a huge win!

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 60 days with no response activity. Remove the stale label, add a comment, or this will be closed in 14 days.

github-actions[bot] commented 7 months ago

This issue is closed due to inactivity. Feel free to reopen the issue if you have any further questions!

aws / copilot-cli

Possible issue with Service Connect & ELB Health Check #5613