aws-cloudformation / cloudformation-coverage-roadmap

The AWS CloudFormation Public Coverage Roadmap
https://aws.amazon.com/cloudformation/
Creative Commons Attribution Share Alike 4.0 International

CloudFormation should wait for SageMaker to clean up ENIs when deleting SageMaker endpoints hosting VPC-connected models #1327

Open petermeansrock opened 1 year ago

petermeansrock commented 1 year ago

Name of the resource

AWS::SageMaker::Endpoint

Resource name

No response

Description

When SageMaker provisions EC2 instances to deploy a customer's AWS::SageMaker::Endpoint resource for VPC-connected models, it creates Elastic Network Interfaces (ENIs) in the customer's account, outside of the associated CloudFormation stack. When the stack is deleted, CloudFormation successfully deletes the endpoint resource and then the model, but then fails to delete the associated security group(s) and subnet(s). Because ENIs are still attached to these networking resources, stack deletion fails after 15 minutes with errors like:

resource sg-<id> has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: <request-id>; Proxy: null)
Resource handler returned message: "The subnet 'subnet-<id>' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400, Request ID: <request-id>, Extended Request ID: null)" (RequestToken: <token>, HandlerErrorCode: InvalidRequest)

Just as CloudFormation waits for Lambda-created ENIs to be cleaned up on function deletion, shouldn't CloudFormation do the same with SageMaker?

Other Details

No response

jasonmeverett commented 3 months ago

Has there been any update on this? This hangup is significantly impacting the runtime of some CDK stack deployment workflows we're building.

enano9311 commented 3 months ago

Same. This issue means we either need to add custom handling when deleting SageMaker endpoints via CloudFormation or deal with each delete attempt taking ~15 min to time out.
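
For anyone wondering what such custom handling might look like, here is a minimal sketch, not an official workaround. It polls EC2 until no ENIs reference a given security group, so that the security group and subnet deletions are only attempted once SageMaker has cleaned up. The function name `wait_for_enis_released` and its parameters are hypothetical; the `ec2_client` argument is assumed to behave like a boto3 EC2 client, and only its `describe_network_interfaces` call with the `group-id` filter is used.

```python
import time


def wait_for_enis_released(ec2_client, security_group_id,
                           timeout_s=900, poll_s=15, sleep=time.sleep):
    """Poll until no ENIs reference the security group (hypothetical helper).

    Returns True once no ENIs are reported, or False if ENIs are still
    present after roughly timeout_s seconds of waiting.
    """
    waited = 0
    while True:
        # Simplification: only the first page of results is inspected,
        # which is enough to tell whether any matching ENIs remain.
        response = ec2_client.describe_network_interfaces(
            Filters=[{"Name": "group-id", "Values": [security_group_id]}]
        )
        remaining = [eni["NetworkInterfaceId"]
                     for eni in response["NetworkInterfaces"]]
        if not remaining:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

With boto3 the client would be `boto3.client("ec2")`, and the helper could run from a custom resource or a pre-delete script after the endpoint is deleted and before the stack removes its networking resources. The 900-second default only mirrors the ~15 minute timeout mentioned in this thread; how long SageMaker actually takes to release its ENIs is not something this sketch assumes.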

AlJohri commented 1 month ago

This issue is severely slowing down our ability to delete SageMaker endpoints during blue/green deployments.