Mirantis / hmc

Apache License 2.0

HelmRelease should not be removed until all objects are removed #217

Closed Kshatrix closed 1 day ago

Kshatrix commented 3 weeks ago

helm-controller has a timeout (5 minutes by default) that is used in helm uninstall --wait. After 5 minutes, it removes the finalizer from the HelmRelease, leaving resources behind.
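To make the difference between the current and the requested behaviour concrete, here is a minimal toy model in Python. It is only a sketch: the `Release` class, the `uninstall` function, the `release-finalizer` name and the deletion times are all hypothetical and are not part of helm-controller or hmc.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    """Toy stand-in for a HelmRelease and the objects it owns (hypothetical)."""
    finalizers: set = field(default_factory=lambda: {"release-finalizer"})
    # object name -> seconds until the object is really gone from the cluster
    deletion_time: dict = field(default_factory=dict)


def uninstall(release, timeout_s, block_until_clean=False):
    """Return the objects still present when the finalizer is removed.

    block_until_clean=False: stop waiting at the timeout (behaviour reported
    in this issue). block_until_clean=True: keep the finalizer until every
    object is gone (behaviour requested in the issue title).
    """
    slowest = max(release.deletion_time.values(), default=0)
    waited = slowest if block_until_clean else min(slowest, timeout_s)
    leftovers = sorted(n for n, t in release.deletion_time.items() if t > waited)
    release.finalizers.discard("release-finalizer")
    return leftovers


if __name__ == "__main__":
    objs = {"Machine": 420, "AWSCluster": 30}
    print(uninstall(Release(deletion_time=dict(objs)), timeout_s=300))
    print(uninstall(Release(deletion_time=dict(objs)), 300, block_until_clean=True))
```

In this model the first call reports `Machine` as left behind because it needs longer than the 300-second timeout, while the second call reports nothing left behind.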

slysunkin commented 3 weeks ago

https://fluxcd.io/flux/installation/uninstall/ "Note that the uninstall command will not remove any Kubernetes objects or Helm releases that were reconciled on the cluster by Flux. It is safe to uninstall Flux and rerun the bootstrap, any existing workloads will not be affected."

slysunkin commented 2 weeks ago

Sometimes AWS control plane instances (i.e. -aws-dev-cp-) may have issues with removal of the security group and internet gateway. This results in a ControlPlane or Cluster that is not deleted, so the security groups and internet gateways have to be removed manually in the AWS console.

slysunkin commented 2 weeks ago

According to the most recent investigation, the root cause of this issue is the order in which HelmRelease objects are removed. Even if I enforce "foreground" removal (child objects first, then parents), it is still quite often the case that the AWSCluster object (i.e. "AWSCluster/-aws-dev") is removed before the Machine objects (for example, "Machine/-aws-dev-md-"). In that case the AWSCluster no longer exists and the Machine becomes orphaned, so the whole Cluster is stuck until the 5-minute timeout expires. Setting a finalizer (just a manual one) on the AWSCluster helped to resolve the issue: I can wait until the Machine is deleted, remove my finalizer from the AWSCluster, and then all remaining objects (AWSCluster and then Cluster) are deleted.
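The finalizer workaround described above can be sketched with a small in-memory model. This is not the code from the PR and does not call any Kubernetes API: `FakeCluster`, `GUARD`, `release_guard_when_machines_gone` and the object names are hypothetical, chosen only to show the ordering.

```python
class FakeCluster:
    """In-memory stand-in for API-server finalizer handling (toy model)."""

    def __init__(self):
        self.objects = {}      # name -> set of finalizers
        self.deleting = set()  # names already marked for deletion

    def create(self, name, finalizers=()):
        self.objects[name] = set(finalizers)

    def delete(self, name):
        # An object that still has finalizers is only marked, not removed.
        if name in self.objects:
            self.deleting.add(name)
            self._collect(name)

    def remove_finalizer(self, name, finalizer):
        self.objects[name].discard(finalizer)
        self._collect(name)

    def _collect(self, name):
        # Really remove the object once it is marked and has no finalizers.
        if name in self.deleting and not self.objects[name]:
            del self.objects[name]
            self.deleting.discard(name)


GUARD = "example.invalid/wait-for-machines"  # hypothetical finalizer name


def release_guard_when_machines_gone(cluster, infra_name, machine_prefix):
    """Drop the guard finalizer only once no matching Machine remains."""
    if any(n.startswith(machine_prefix) for n in cluster.objects):
        return False  # Machines still exist, keep the infrastructure object
    if infra_name in cluster.objects:
        cluster.remove_finalizer(infra_name, GUARD)
    return True


if __name__ == "__main__":
    c = FakeCluster()
    c.create("AWSCluster/demo", [GUARD])
    c.create("Machine/demo-md-0")
    c.delete("AWSCluster/demo")  # only marked, the guard keeps it alive
    print(release_guard_when_machines_gone(c, "AWSCluster/demo", "Machine/demo-md-"))
    c.delete("Machine/demo-md-0")
    print(release_guard_when_machines_gone(c, "AWSCluster/demo", "Machine/demo-md-"))
    print(sorted(c.objects))
```

In the model, deleting the AWSCluster first only marks it for deletion; it disappears after the Machine is gone and the guard finalizer is removed, which mirrors the manual sequence in the comment.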

slysunkin commented 2 weeks ago

PR https://github.com/Mirantis/hmc/pull/242 was created

slysunkin commented 1 week ago

The same functionality should be implemented for Azure.