aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.5k stars 3.84k forks source link

(aws-ecs): ELB TG can't connect to ECS EC2 instances ( healthcheck failed ) #14764

Open Insidexa opened 3 years ago

Insidexa commented 3 years ago

ELB TG can't connect to ECS EC2 instances ( healthcheck failed ) when use cluster.AsgCapacity over cluster.addCapacity .

Reproduction Steps


const.taskDefinition = new ecs.TaskDefinition(this, 'Backend', {
    family: 'someFamily',
    compatibility: ecs.Compatibility.EC2,
    executionRole,
    networkMode: ecs.NetworkMode.BRIDGE,
    taskRole,
});
taskDefinition.addContainer('backend', {
    image: ecs.ContainerImage.fromRegistry('hashicorp/http-echo'),
    memoryLimitMiB: 512,
    command: [
        `-listen=:${containerPort}`,
        '-text="hello world"'
    ],
    environment: {},
    portMappings: [
        {
            containerPort: containerPort,
            protocol: ecs.Protocol.TCP,
        },
    ],
});

const sg = new ec2.SecurityGroup(this, `SG${identifier}`, {
    vpc: this.cluster.vpc,
});

const autoScalingGroup = new autoscaling.AutoScalingGroup(this, `asg${identifier}`, {
    vpc: this.cluster.vpc,
    instanceType: new ec2.InstanceType(instanceType),
    machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
    minCapacity: clusterMinCapacity,
    maxCapacity: clusterMaxCapacity,
    desiredCapacity: clusterDesiredCapacity,
    associatePublicIpAddress: true,
    cooldown: cdk.Duration.minutes(1),
    keyName: clusterKeyName,
    securityGroup: sg,
});

const asgProvider = new ecs.AsgCapacityProvider(this, `AsgProvider${identifier}`, {
    autoScalingGroup,
    canContainersAccessInstanceRole: true,
    enableManagedScaling: false,
    enableManagedTerminationProtection: false,
});

this.cluster.addAsgCapacityProvider(asgProvider);

What did you expect to happen?

I expect aws-ecs library automatically create security group with required inbound rules or have some method to allow connect ELB TG to EC2 instances

I expect method addAsgCapacityProvider add automatically access ELB TG to EC2 instances.

Normal SG created with cluster.addCapacity
image

What actually happened?

Actually EC2 instances create with only my security group inbound rules ( SSH ).
image

How I temporarily fixed this issue. I compared security group where create with cluster.addCapacity() and created SG with ASG provider.

This code fix trouble but I think this code must be default in aws-cdk.
Or I don't understand from AWS CDK ECS last update and deprecation cluster.addCapacity

this.ecsPatternService.loadBalancer.connections.allowTo(sg, ec2.Port.tcpRange(32768, 65535), `allow ELB TG connect to EC2 ${instanceType}`);

Environment

Insidexa commented 3 years ago

So, I use default CDK code without custom Load Balancer or Security Group.
When I use cluster.addCapacity.

I got next SG for EC2 instances in CloudFormation. Also you can see what EC2 SG source ingress is ServiceLBSecurityGroupA319D865.
And ServiceLBSecurityGroup Egress to EC2.

    "Clustert3smallInstancesInstanceSecurityGroupfromEcsAlbAutoscalingStackServiceLBSecurityGroup5C549F7132768655358E64B116": {
      "Type": "AWS::EC2::SecurityGroupIngress",
      "Properties": {
        "IpProtocol": "tcp",
        "Description": "Load balancer to target",
        "FromPort": 32768,
        "GroupId": {
          "Fn::GetAtt": [
            "Clustert3smallInstancesInstanceSecurityGroupE1595C07",
            "GroupId"
          ]
        },
        "SourceSecurityGroupId": {
          "Fn::GetAtt": [
            "ServiceLBSecurityGroupA319D865",
            "GroupId"
          ]
        },
        "ToPort": 65535
      },
      "Metadata": {
        "aws:cdk:path": "EcsAlbAutoscalingStack/Cluster/t3smallInstances/InstanceSecurityGroup/from EcsAlbAutoscalingStackServiceLBSecurityGroup5C549F71:32768-65535"
      }
    },
    "ServiceLBSecurityGrouptoEcsAlbAutoscalingStackClustert3smallInstancesInstanceSecurityGroupB12A13003276865535889A18E2": {
      "Type": "AWS::EC2::SecurityGroupEgress",
      "Properties": {
        "GroupId": {
          "Fn::GetAtt": [
            "ServiceLBSecurityGroupA319D865",
            "GroupId"
          ]
        },
        "IpProtocol": "tcp",
        "Description": "Load balancer to target",
        "DestinationSecurityGroupId": {
          "Fn::GetAtt": [
            "Clustert3smallInstancesInstanceSecurityGroupE1595C07",
            "GroupId"
          ]
        },
        "FromPort": 32768,
        "ToPort": 65535
      },
      "Metadata": {
        "aws:cdk:path": "EcsAlbAutoscalingStack/Service/LB/SecurityGroup/to EcsAlbAutoscalingStackClustert3smallInstancesInstanceSecurityGroupB12A1300:32768-65535"
      }
    },

But when I use cluster.addAsgCapacityProvider I got next SG for EC2 instances. That all what I found about SG with AsgProvider.
I don't find any SG for this LB with description Load balancer to target.

        "ecsasgt3microInstanceSecurityGroupEB660B39": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "EcsAlbAutoscalingStack/ecs/asgt3.micro/InstanceSecurityGroup",
        "SecurityGroupEgress": [
          {
            "CidrIp": "0.0.0.0/0",
            "Description": "Allow all outbound traffic by default",
            "IpProtocol": "-1"
          }
        ],
        "Tags": [
          {
            "Key": "Name",
            "Value": "EcsAlbAutoscalingStack/ecs/asgt3.micro"
          }
        ],
        "VpcId": "vpc-c13aaeaa"
      },
      "Metadata": {
        "aws:cdk:path": "EcsAlbAutoscalingStack/ecs/asgt3.micro/InstanceSecurityGroup/Resource"
      }
    },
Insidexa commented 3 years ago

I found problem, in deprecated version cluster.addCapacity call in method this.addAutoScalingGroup and in method addAutoScalingGroup have next line this.connections.connections.addSecurityGroup(...autoScalingGroup.connections.securityGroups);

But in new version what recommended addAsgCapacityProvider over cluster.addCapacity this line I not found

When I add manual this line into my code I get correct SG

MrArnoldPalmer commented 3 years ago

@Insidexa to clarify, the issue you were having was because of usage of a deprecated method and using the updated replacement fixes it?

Insidexa commented 3 years ago

@MrArnoldPalmer I have problem because I use new method cluster.addAsgCapacityProvider instead of deprecated method cluster.addCapacity.

MrArnoldPalmer commented 3 years ago

I see, sounds like a bug. Are you able to submit a PR with the added line so we can take a look?

Insidexa commented 3 years ago

@MrArnoldPalmer Sure, I will try to create my first PR to open source :)

peterwoodworth commented 3 years ago

Hey @Insidexa, any update on the PR?

Insidexa commented 3 years ago

@MrArnoldPalmer @peterwoodworth Pull request - https://github.com/aws/aws-cdk/pull/15107 Sorry for long response

andreialecu commented 3 years ago

Just ran into this because I noticed VSCode mentioned that addCapacity is deprecated. Which I guess happened in a recent cdk update.

After migrating to addAsgCapacityProvider there was this diff on a bunch of security groups which seemed suspect:

Security Group Changes
┌───┬───────────────────────────────────────────────────────────────────────────────────────┬─────┬─────────────────┬───────────────────────────────────────────────────────────────────────────────────────┐
│   │ Group                                                                                 │ Dir │ Protocol        │ Peer                                                                                  │
├───┼───────────────────────────────────────────────────────────────────────────────────────┼─────┼─────────────────┼───────────────────────────────────────────────────────────────────────────────────────┤
│ - │ {"Fn::ImportValue":"*********-All-ALB:ExportsOutputFnGetAttLBSecurityGroup8A41EA2BGro │ Out │ TCP 32768-65535 │ {"Fn::ImportValue":"*********-All-EcsCluster-Production:ExportsOutputFnGetAttClusterS │
│   │ upId851EE1F6"}                                                                        │     │                 │ calingInstanceSecurityGroup8C9BAE52GroupId5449321F"}                                  │
├───┼───────────────────────────────────────────────────────────────────────────────────────┼─────┼─────────────────┼───────────────────────────────────────────────────────────────────────────────────────┤
│ - │ {"Fn::ImportValue":"*********-All-EcsCluster-Production:ExportsOutputFnGetAttClusterS │ In  │ TCP 32768-65535 │ {"Fn::ImportValue":"*********-All-ALB:ExportsOutputFnGetAttLBSecurityGroup8A41EA2BGro │
│   │ calingInstanceSecurityGroup8C9BAE52GroupId5449321F"}                                  │     │                 │ upId851EE1F6"}                                                                        │
└───┴───────────────────────────────────────────────────────────────────────────────────────┴─────┴─────────────────┴───────────────────────────────────────────────────────────────────────────────────────┘

I was able to manually add it back by adding this line below the cluster initialization:

    cluster.connections.connections.addSecurityGroup(
      ...autoScalingGroup.connections.securityGroups
    );

Now the diff properly shows the ports being allowed again.

@Insidexa will you be able to add tests to the PR you opened? If not I could open a new PR with the fix and some tests.

vincent-dm commented 2 years ago

Any news on this?

The latest CDK release is now printing warnings about our usage of Cluster.addCapacity, but we cannot switch to Cluster.addAsgCapacityProvider because of this bug...

NGL321 commented 2 years ago

I have marked this as in-progress since there is a PR with the fix waiting on a review. Once merged this issues should be resolved

metametadata commented 2 years ago

We've recently migrated from Cluster.addCapacity to Cluster.addAsgCapacityProvider and this problem was not reproducible. I.e. it maybe is already fixed in software.amazon.awscdk/aws-cdk-lib "2.3.0".

rix0rrr commented 2 years ago

Mitch, assigning back to you for triage. I'm missing the context to know if something has happened here recently and whether this is still relevant or no...

spg commented 2 years ago

Temporary fix:

autoScalingGroup.connections.allowFrom(loadBalancer, ec2.Port.tcpRange(32768, 65535))
robbiecooray1 commented 2 years ago

@rix0rrr still an issue, specifically when upgrading from CDK v1 to v2 where AddAutoScalingGroup is deprecated. Above workaround from @spg works.

With .AddAutoScalingGroup, the following rules are in place. With .AddAsgCapacityProvider, the following diff appears (i.e. the rules get dropped), making the service unavailable for requests.

Tested with cdk v2.33.0.

Group Dir Protocol Peer
- ${alb/SecurityGroup.GroupId} Out TCP 32768 - 65535 ${clusterSG.GroupId}
- ${clusterSG.GroupId} In TCP 32768 - 65535 ${alb/SecurityGroup.GroupId}