Closed: runningman84 closed this issue 1 year ago
See https://github.com/fluxcd/helm-controller/issues/149#issuecomment-1454241601. In combination with a sensible retry configuration, this should ensure that, from the next release on, interrupted releases terminate gracefully (by being marked as "failed") and are then retried once the controller is rescheduled onto a new node.
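For context, a retry configuration along these lines would let the controller remediate a release that was marked failed after an interrupted run. This is a minimal sketch; the chart reference, intervals, and retry counts are placeholders, not recommendations:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: default
spec:
  interval: 5m
  chart:
    spec:
      chart: podinfo
      sourceRef:
        kind: HelmRepository
        name: podinfo
  install:
    remediation:
      retries: 3              # retry a failed install up to 3 times
  upgrade:
    remediation:
      retries: 3              # retry a failed upgrade up to 3 times
      remediateLastFailure: true
```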
This should now happen in >=v0.31.0, see also #644.
We have clusters in EKS where the workers are managed by Karpenter. The worker nodes are spot instances, so the cluster is quite dynamic: nodes appear and disappear every few minutes.
Running the helm-controller on these nodes is risky, because long-running helm install operations can be interrupted when the node running the controller pod goes away.
It would be great if the helm-controller would wait for in-flight operations before shutting down (which may still be an issue when a spot node is terminated within its 2-minute interruption window), or would otherwise ensure that the given helm release does not stay stuck in progress.
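One way the shutdown window could be stretched today is by extending the controller pod's termination grace period via a kustomize patch. This is only a sketch, assuming the stock flux-system layout produced by `flux bootstrap`; the 300s value is arbitrary and, as noted above, cannot help beyond the 2-minute spot interruption notice:

```yaml
# flux-system/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: helm-controller
        namespace: flux-system
      spec:
        template:
          spec:
            # give in-flight Helm operations more time before SIGKILL
            terminationGracePeriodSeconds: 300
```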
Another idea could be to use Jobs or Pods to run each individual Helm operation, instead of doing everything in the controller's main loop.
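To illustrate that idea, a one-off Helm operation delegated to a Job might look roughly like the sketch below. This is purely hypothetical, not something the helm-controller supports; the image tag, chart reference, and timeout are placeholders, and the pod would need a service account with sufficient RBAC for Helm's in-cluster access:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: helm-install-podinfo    # hypothetical per-release Job
spec:
  backoffLimit: 3               # the Job, not the controller pod, carries the retries
  template:
    spec:
      restartPolicy: Never
      # serviceAccountName: helm-operator  # assumed to have RBAC for the target namespace
      containers:
        - name: helm
          image: alpine/helm:3.11.1        # community Helm image, entrypoint is `helm`
          args:
            - upgrade
            - --install
            - podinfo
            - oci://ghcr.io/stefanprodan/charts/podinfo
            - --wait
            - --timeout
            - 10m
```

With this shape, a controller pod being evicted would not abort the operation itself; the Job survives the eviction and Kubernetes reschedules it on surviving capacity.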
tl;dr: I would like to run the helm-controller on short-lived nodes without manual cleanups … right now I run it on Fargate, which is quite expensive compared to spot instances.