aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws_fsx: resource sg-xxx has a dependent object #31521

Open gmarchand opened 6 days ago

gmarchand commented 6 days ago

Describe the bug

I created an FSx for Lustre file system. I can't delete the stack because of a dependency between the security group and the ENI of the FSx for Lustre file system.

Regression Issue

Last Known Working CDK Version

No response

Expected Behavior

Able to delete the stack without going to the AWS Console

Current Behavior

destroying... [6/7] 2:43:14 PM | DELETE_FAILED | AWS::EC2::SecurityGroup | lustresg6B5C6047 resource sg-05712a293a30dce52 has a dependent object (Service: Ec2, Status Code: 400, Request ID: 62184a39-ecfc-4eb7-8ba2-6c96f113b959)

 ❌  batch-ffmpeg-storage-stack: destroy failed Error: The stack named batch-ffmpeg-storage-stack is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [lustresg6B5C6047]. ): resource sg-xxxx has a dependent object (Service: Ec2, Status Code: 400, Request ID: xxx)
    at destroyStack (/Users/xxxx/.local/share/mise/installs/node/18.18.2/lib/node_modules/aws-cdk/lib/index.js:463:2157)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async CdkToolkit.destroy (/Users/xxxx/.local/share/mise/installs/node/18.18.2/lib/node_modules/aws-cdk/lib/index.js:466:208221)
    at async exec4 (/Users/xxxx/.local/share/mise/installs/node/18.18.2/lib/node_modules/aws-cdk/lib/index.js:521:54490)

The stack named batch-ffmpeg-storage-stack is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [lustresgxxx]. ): resource sg-xxxx has a dependent object (Service: Ec2, Status Code: 400, Request ID: xxx-ecfc-4eb7-8ba2-6c96f113b959)
task: Failed to run task "cdk:destroy": exit status 1

Reproduction Steps

        lustre_subnet = self.vpc.select_subnets(
            subnet_type=ec2.SubnetType.PRIVATE_ISOLATED
        ).subnets[0]

        lustre_sg = ec2.SecurityGroup(
            self,
            "LustreSecurityGroup",
            vpc=self.vpc,
            description="Security group for FSx Lustre",
            allow_all_outbound=True,
        )
        lustre_sg.add_ingress_rule(
            peer=ec2.Peer.ipv4(self.vpc.vpc_cidr_block),
            connection=ec2.Port.tcp(988),
            description="FSx Lustre client port",
        )

        lustre_fs = fsx.LustreFileSystem(
            self,
            "LustreFileSystem",
            vpc=self.vpc,
            vpc_subnet=lustre_subnet,
            security_group=lustre_sg,
            storage_capacity_gib=self.node.try_get_context(
                "batch-ffmpeg:lustre-fs:storage_capacity_gi_b"
            ),
            lustre_configuration=fsx.LustreConfiguration(
                deployment_type=fsx.LustreDeploymentType.SCRATCH_2,
                export_path=self.s3_bucket.s3_url_for_object(),
                import_path=self.s3_bucket.s3_url_for_object(),
                auto_import_policy=fsx.LustreAutoImportPolicy.NEW_CHANGED_DELETED,
            ),
        )

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.159.1 (build c66f4e3)

Framework Version

No response

Node.js Version

v18.18.2

OS

macOS

Language

Python

Language Version

3.12.6

Other information

No response

khushail commented 5 days ago

Hi @gmarchand, thanks for reaching out.

There are several similar past issues where a dependency between a security group and an ENI caused resource deletion to fail. For reference:

  1. https://github.com/aws/aws-cdk/issues/9970
  2. https://github.com/aws/aws-cdk/issues/12130

Please confirm whether these issues match what you are seeing.

gmarchand commented 4 days ago

I don't think it's the same issue. Those seem to be about ALBs, where the ENIs are released later than the ALB itself, so dependent resources can't be deleted even though the ALB is reported as deleted.

khushail commented 3 days ago

@gmarchand thanks for your patience.

Yes, I am able to repro the above scenario with the shared code.

Python code -

        # Create S3 bucket for Lustre file system
        s3_bucket = s3.Bucket(self, "bucketForLustreFS")

        vpc = ec2.Vpc.from_lookup(self, "VPC", is_default=True)

        lustre_subnet = vpc.select_subnets(
            subnet_type=ec2.SubnetType.PUBLIC
        ).subnets[0]

        lustre_sg = ec2.SecurityGroup(
            self,
            "LustreSecurityGroup",
            vpc=vpc,
            description="Security group for FSx Lustre",
            allow_all_outbound=True,
        )
        lustre_sg.add_ingress_rule(
            peer=ec2.Peer.ipv4(vpc.vpc_cidr_block),
            connection=ec2.Port.tcp(988),
            description="FSx Lustre client port",
        )

        lustre_fs = fsx.LustreFileSystem(
            self,
            "LustreFileSystem",
            vpc=vpc,
            vpc_subnet=lustre_subnet,
            security_group=lustre_sg,
            storage_capacity_gib=1200,
            lustre_configuration=fsx.LustreConfiguration(
                deployment_type=fsx.LustreDeploymentType.SCRATCH_2,
                export_path=s3_bucket.s3_url_for_object(),
                import_path=s3_bucket.s3_url_for_object(),
                auto_import_policy=fsx.LustreAutoImportPolicy.NEW_CHANGED_DELETED,
            ),
        )

Error snippet -

4:00:45 PM | DELETE_FAILED        | AWS::EC2::SecurityGroup        | LustreSecurityGroupEAA9048E
resource sg-05f4fbb8ccd2ca27c has a dependent object (Service: Ec2, Status Code: 400, Request ID: eab5f5a7-6e0a-4bf8-94a0-c8f04bea7eea)

 ❌  LusterIssuePythonStack: destroy failed Error: The stack named LusterIssuePythonStack is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [LustreSecurityGroupEAA9048E]. )
    at destroyStack (/usr/local/lib/node_modules/aws-cdk/lib/index.js:468:2157)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async CdkToolkit.destroy (/usr/local/lib/node_modules/aws-cdk/lib/index.js:471:208654)
    at async exec4 (/usr/local/lib/node_modules/aws-cdk/lib/index.js:526:54490)

The stack named LusterIssuePythonStack is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [LustreSecurityGroupEAA9048E]. )

In the console, the ENIs that still depend on the security group are listed, which is why the security group cannot be deleted:

Screenshot 2024-09-24 at 4 22 00 PM

References -

  1. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#eni-basics
  2. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/requester-managed-eni.html

Analysis -

An AWS re:Post article states: "You can't delete a security group that's associated with a requester-managed network interface. Requester-managed network interfaces are automatically created for managed resources, such as Application Load Balancer nodes. Some AWS services and resources have security groups that are always attached to the elastic network interface. Examples include AWS Lambda, Amazon FSx, Amazon ElastiCache for Redis, and ElastiCache for Memcached."

I assume this is why deleting the security group produces this error. However, I am requesting input from the team as well.
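
To check the same thing from a script instead of the console, a minimal sketch along these lines could list the ENIs that still reference the security group and flag the requester-managed ones. It assumes boto3 is installed and credentials are configured; `summarize_enis` and `list_enis_for_security_group` are hypothetical helper names that are not part of the CDK or of this issue.

```python
from typing import Any


def summarize_enis(interfaces: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Reduce DescribeNetworkInterfaces entries to the fields relevant here."""
    return [
        {
            "id": eni["NetworkInterfaceId"],
            # True for ENIs created by an AWS service on your behalf
            "requester_managed": eni.get("RequesterManaged", False),
            "status": eni.get("Status", ""),
            "description": eni.get("Description", ""),
        }
        for eni in interfaces
    ]


def list_enis_for_security_group(sg_id: str) -> list[dict[str, Any]]:
    """Summarize every ENI that references the given security group."""
    import boto3  # imported lazily so summarize_enis works without boto3

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_network_interfaces")
    pages = paginator.paginate(Filters=[{"Name": "group-id", "Values": [sg_id]}])
    interfaces = [eni for page in pages for eni in page["NetworkInterfaces"]]
    return summarize_enis(interfaces)
```

Calling `list_enis_for_security_group("sg-05f4fbb8ccd2ca27c")` with the ID from the error above should return the same ENIs the console shows; any entry with `requester_managed` set to `True` is one you cannot detach or delete yourself.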

CC: @pahud

pahud commented 3 days ago

You can't delete a security group that's associated with a requester-managed network interface. Requester-managed network interfaces are automatically created for managed resources, such as Application Load Balancer nodes. Some AWS services and resources have security groups that are always attached to the elastic network interface. Examples include AWS Lambda, Amazon FSx, Amazon ElastiCache for Redis, and ElastiCache for Memcached.

Yes, if a security group is associated with an ENI that you don't manage yourself, you won't be able to delete that security group. A similar case is a Lambda function with VPC support, where an AWS-managed ENI is created and associated with your SG, and you can't delete the SG just by destroying the CFN stack. All you can do is wait. I'm not sure whether there's a better solution, but this is what I read in https://repost.aws/knowledge-center/lambda-eni-find-delete

I am guessing this might be a similar case.
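
If waiting really is the only option, the wait can at least be scripted. The sketch below polls until no ENI references the security group and then asks CloudFormation to delete the stack again. It assumes boto3 and configured credentials; `wait_until_released` and `retry_stack_delete` are hypothetical helper names, the timeout and polling interval are arbitrary, and nothing in this thread confirms that the FSx ENIs are released within any particular time.

```python
import time
from typing import Callable


def wait_until_released(
    count_enis: Callable[[], int],
    timeout_s: float = 1800.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until count_enis() returns 0; return False once the timeout is reached."""
    waited = 0.0
    while True:
        if count_enis() == 0:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


def retry_stack_delete(stack_name: str, sg_id: str) -> bool:
    """Wait for the ENIs on sg_id to disappear, then retry the stack deletion."""
    import boto3  # imported lazily so wait_until_released works without boto3

    ec2 = boto3.client("ec2")
    cfn = boto3.client("cloudformation")

    def count_enis() -> int:
        resp = ec2.describe_network_interfaces(
            Filters=[{"Name": "group-id", "Values": [sg_id]}]
        )
        return len(resp["NetworkInterfaces"])

    if not wait_until_released(count_enis):
        return False  # ENIs are still attached; deleting again would fail the same way
    cfn.delete_stack(StackName=stack_name)  # retries deletion of a DELETE_FAILED stack
    return True
```

For the reproduction above this would be called as `retry_stack_delete("LusterIssuePythonStack", "sg-05f4fbb8ccd2ca27c")`. Running `cdk destroy` again after the wait should have the same effect.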