aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

karpenter does not delete aws network interface after scale down #5582

Open Mieszko96 opened 10 months ago

Mieszko96 commented 10 months ago

Description

Observed Behavior:

  1. Install a 1-node EKS cluster.
  2. Scale 1 pod from 1 to 700 replicas.
  3. Nodes scale up from 1 to 5.
  4. Uninstall the NodeClass and NodePool.
  5. Nodes scale down from 5 to 1.
  6. A network interface remains with status "Available".
  7. terraform destroy fails because it can't delete the VPC while this network interface still exists.

Expected Behavior:

The network interface should be deleted after scale-down.

Versions:

tzneal commented 10 months ago

There is a cleanup process for leaked ENIs in the VPC CNI. How long is your cluster staying up in total after you scale down?

Mieszko96 commented 10 months ago

> There is a cleanup process for leaked ENIs in the VPC CNI. How long is your cluster staying up in total after you scale down?

A few minutes, depending on how I write my Terraform :).

But I tested with a 20-minute wait and it sometimes worked and sometimes didn't, so I'm a little confused.

Mieszko96 commented 10 months ago

@tzneal

17.36 - terraform destroy -target 'module.karpenter[0].helm_release.karpenter_provisioner'

module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 10s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 20s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 30s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 40s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 50s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 1m0s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 1m10s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 1m20s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 1m30s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Still destroying... [id=karpenter-provisioner, 1m40s elapsed]
module.karpenter[0].helm_release.karpenter_provisioner: Destruction complete after 1m42s

The cluster scaled down while this terraform destroy was running.

17.38 - scaled down

17.40 - network interfaces still present

aws-K8S-i-0645b6195e3cdf9a7 | – | Available
aws-K8S-i-0b8a85cb97a8a5209 | – | Available

17.44 - still 2 network interfaces

Also, in CloudTrail I see that on February 01, 2024, at 17:37:25 there was an attempt to delete those network interfaces.

I assume the deletion was attempted while the network interfaces were still attached. Does Karpenter retry the deletion, or are my assumptions wrong?

17.50 - still 2 network interfaces, and I don't see anything new in CloudTrail
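For anyone checking whether they are hitting the same leak, here is a minimal Python sketch that picks out detached ENIs of the kind listed above. It assumes the `aws-K8S-` prefix shown in the listing appears in the ENI description; the function names are hypothetical, and `list_vpc_interfaces` needs boto3 and AWS credentials.

```python
from typing import Any, Dict, List

# Assumption: the "aws-K8S-" prefix seen in the listing above is stored in the
# ENI description. Adjust this if your ENIs are labelled differently.
CNI_DESCRIPTION_PREFIX = "aws-K8S-"


def find_leaked_enis(interfaces: List[Dict[str, Any]]) -> List[str]:
    """Return IDs of detached ENIs whose description carries the prefix."""
    leaked = []
    for eni in interfaces:
        detached = eni.get("Status") == "available"
        from_cni = eni.get("Description", "").startswith(CNI_DESCRIPTION_PREFIX)
        if detached and from_cni:
            leaked.append(eni["NetworkInterfaceId"])
    return leaked


def list_vpc_interfaces(vpc_id: str) -> List[Dict[str, Any]]:
    """Fetch every ENI in a VPC (boto3 is imported lazily)."""
    import boto3

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_network_interfaces")
    pages = paginator.paginate(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
    return [eni for page in pages for eni in page["NetworkInterfaces"]]
```

Calling `find_leaked_enis(list_vpc_interfaces(vpc_id))` before `terraform destroy` would show which interfaces are likely to block the VPC deletion.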

Mieszko96 commented 10 months ago

If my assumption is correct (that the deletion was attempted while the network interfaces were still attached):

Could a retry mechanism be introduced for deleting those network interfaces? Or could the delete be issued only once the network interface is in the Available status, since an interface that is still attached can't be deleted?
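The retry idea could look roughly like the Python sketch below. This is not how Karpenter or the VPC CNI implements cleanup; it is only an illustration of "wait until the interface is available, then delete", written against a boto3-style EC2 client. The function name, attempt count, and delays are assumptions.

```python
import time
from typing import Any, Callable


def delete_eni_with_retry(
    ec2: Any,
    eni_id: str,
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Delete an ENI once it is detached, backing off between checks.

    `ec2` is expected to behave like boto3.client("ec2").
    Returns True if the ENI was deleted or is already gone, False otherwise.
    """
    for attempt in range(attempts):
        # Filtering by ID (rather than passing NetworkInterfaceIds) yields an
        # empty list instead of an error when the ENI no longer exists.
        resp = ec2.describe_network_interfaces(
            Filters=[{"Name": "network-interface-id", "Values": [eni_id]}]
        )
        interfaces = resp.get("NetworkInterfaces", [])
        if not interfaces:
            return True  # already cleaned up by something else
        if interfaces[0].get("Status") == "available":
            ec2.delete_network_interface(NetworkInterfaceId=eni_id)
            return True
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return False
```

With the defaults this waits 2, 4, 8, and 16 seconds between checks, so an interface that is still detaching when the first delete would have failed gets several more chances.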

tzneal commented 10 months ago

We're working on improving this, but this is a known issue with very short-lived nodes/clusters. I'll leave this issue open to track fixing it.

tanpsingh commented 2 months ago

@tzneal @engedaam Facing the same issue. Terraform is not able to delete a security group that is attached to an ENI, and this ENI was attached to a terminated Karpenter node.
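To confirm which ENIs are holding a security group in this situation, a short Python sketch along these lines may help. It assumes a boto3-style EC2 client; the function name is hypothetical, and pagination is left out for brevity.

```python
from typing import Any, Dict, List


def enis_blocking_security_group(ec2: Any, group_id: str) -> List[Dict[str, str]]:
    """List the ENIs that still reference a security group, with their status.

    `ec2` is expected to behave like boto3.client("ec2"). Only the first
    page of results is read, which is enough for a handful of leaked ENIs.
    """
    resp = ec2.describe_network_interfaces(
        Filters=[{"Name": "group-id", "Values": [group_id]}]
    )
    return [
        {"id": eni["NetworkInterfaceId"], "status": eni["Status"]}
        for eni in resp["NetworkInterfaces"]
    ]
```

Any entry returned with status `available` is detached and can be deleted, after which the security group deletion should no longer be blocked by it.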

CtrlAltDft commented 2 months ago

@tzneal Any update on the improvement? We're facing the same issue: ENIs are left behind when a node is terminated.