Open lucavb opened 4 months ago
Hi @lucavb, thanks for reporting this! Could you please clarify which resource you are referring to that fails to delete here? Is it the helm resource which installs Karpenter?
Hey @andskli, it seems to be the CustomResource that installs either the EC2NodeClass or the NodePool. As I have said, the cluster is basically without nodes at that point, and the custom resource that should remove those two resources just times out. Does that help?
Edit: So I recreated my example in our account and here you see the failed resource:
The resource that could not be removed:
And the lambda that times out:
I'm not sure if this is related, but we also just ran into an issue deleting the stack. In our case it failed when trying to delete the NodeRole and a NodeClass. In CloudFormation, the event error message points to the instance profile:
Karpenter Node Role:
Resource handler returned message: "Cannot delete entity, must remove roles from instance profile first.
A node class that we provisioned using karpenter.addNodeClass
CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [0a56d113-0455-4c30-bca2-9b64cb2be7fa]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version.
All we did in this case was add Karpenter to an existing stack, provision a node class to test, and then try to tear it down.
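For reference, the IAM error above says the role is still attached to an instance profile. A manual cleanup could look roughly like the sketch below. It only assembles the two standard AWS CLI calls (`aws iam remove-role-from-instance-profile` and `aws iam delete-instance-profile`). The helper name `instanceProfileCleanupCommands` and both resource names are hypothetical placeholders, not values from this stack.

```typescript
// Hypothetical helper; not part of cdk-eks-karpenter or the AWS SDK.
// It only builds the CLI invocations, it does not run them.
function instanceProfileCleanupCommands(profileName: string, roleName: string): string[][] {
  return [
    // Detach the role first, otherwise IAM refuses to delete the role.
    ['aws', 'iam', 'remove-role-from-instance-profile',
      '--instance-profile-name', profileName, '--role-name', roleName],
    // Then delete the now-empty instance profile.
    ['aws', 'iam', 'delete-instance-profile', '--instance-profile-name', profileName],
  ];
}

// Placeholder names; substitute the ones shown in the CloudFormation event.
for (const cmd of instanceProfileCleanupCommands('my-instance-profile', 'my-karpenter-node-role')) {
  console.log(cmd.join(' '));
}
```

After that, retrying the stack deletion should at least get past the role, assuming nothing else still references it.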
@andskli is there any update on this?
Have not had much time to look at this. Had a quick check-in and I am able to reproduce using your example, so thanks for that @lucavb.
Leaving the following as a note to future self or anyone willing to pick this issue up in the next few weeks as I won't be able to (summer holiday):
What seems to happen is that the EC2NodeClass doesn't get deleted because of the finalizer applied to the resource. I am not sure exactly how to solve this. Perhaps we can utilize dependencies between NodePool and EC2NodeClass in a clever way somehow, or perhaps we can work on getting a force delete/remove finalizer option into the upstream CDK resource which addEC2NodeClass() and addNodePool() use under the hood.
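As a rough illustration of the "remove finalizer" idea, here is a minimal sketch of what such an option would have to send to the cluster. It assumes the standard `kubectl patch --type merge` command and a JSON merge patch that sets `metadata.finalizers` to null. The helper names `buildFinalizerPatch` and `kubectlPatchArgs` and the resource name `default` are hypothetical and exist in neither cdk-eks-karpenter nor the CDK.

```typescript
// Hypothetical helpers; not part of cdk-eks-karpenter or aws-cdk-lib.

/** JSON merge patch that clears every finalizer on a Kubernetes object. */
function buildFinalizerPatch(): { metadata: { finalizers: null } } {
  // In a JSON merge patch, setting a field to null removes that field.
  return { metadata: { finalizers: null } };
}

/** Arguments for `kubectl patch` that apply the patch to one cluster-scoped resource. */
function kubectlPatchArgs(kind: string, name: string): string[] {
  return ['patch', kind, name, '--type', 'merge', '-p', JSON.stringify(buildFinalizerPatch())];
}

// 'default' is a placeholder resource name.
console.log(['kubectl', ...kubectlPatchArgs('ec2nodeclass', 'default')].join(' '));
```

Note that clearing finalizers skips whatever cleanup the controller would normally perform, so this is more of a last resort than a fix.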
Hey,
we have been using cdk-eks-karpenter for a while now and we have been experiencing issues with the removal of stacks where Karpenter has been installed using this package. Basically, CloudFormation triggers the delete on the CustomResource that installed the YAML file into the cluster, and that delete then fails / times out. In the EKS console all the nodes have already been removed and the cluster pretty much only still exists on paper (I can no longer connect to it with kubectl). Eventually the CustomResource times out after 1h and CloudFormation fails.

We have produced this sort of minimal example where the error still occurs and where we do nothing more than create a cluster within our pre-created VPC and then install Karpenter using this package.