Closed: runningman84 closed this issue 1 year ago
See https://github.com/fluxcd/helm-controller/issues/149#issuecomment-1454241601. In combination with a sensible retry configuration, this should ensure that, from the next release on, interrupted releases terminate gracefully (by being marked as "failed") and are then retried once the controller is rescheduled onto a new node.
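For context, a retry configuration along these lines would let the controller remediate a release that was marked failed after an interrupted run. This is a minimal sketch; the chart reference, intervals, and retry counts are placeholders, not recommendations:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: default
spec:
  interval: 5m
  chart:
    spec:
      chart: podinfo
      sourceRef:
        kind: HelmRepository
        name: podinfo
  install:
    remediation:
      retries: 3              # retry a failed install up to 3 times
  upgrade:
    remediation:
      retries: 3              # retry a failed upgrade up to 3 times
      remediateLastFailure: true
```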
This should now happen in >=v0.31.0, see also #644.
We have clusters in EKS where the workers are managed by Karpenter. The worker nodes are spot instances, so the cluster is quite dynamic: nodes appear and disappear every few minutes.
Running the helm-controller on these nodes is risky, because long-running helm install operations can be interrupted when the node running the controller pod goes away.
It would be great if the helm-controller would wait for in-flight operations before shutting down (which may still be an issue when a spot node is terminated within its 2-minute interruption window), or would otherwise ensure that the given helm release does not stay stuck in progress.
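One way the shutdown window could be stretched today is by extending the controller pod's termination grace period via a kustomize patch. This is only a sketch, assuming the stock flux-system layout produced by `flux bootstrap`; the 300s value is arbitrary and, as noted above, cannot help beyond the 2-minute spot interruption notice:

```yaml
# flux-system/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: helm-controller
        namespace: flux-system
      spec:
        template:
          spec:
            # give in-flight Helm operations more time before SIGKILL
            terminationGracePeriodSeconds: 300
```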
Another idea could be to use Jobs or Pods to run each individual Helm operation, instead of doing everything in the controller's main loop.
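To illustrate that idea, a one-off Helm operation delegated to a Job might look roughly like the sketch below. This is purely hypothetical, not something the helm-controller supports; the image tag, chart reference, and timeout are placeholders, and the pod would need a service account with sufficient RBAC for Helm's in-cluster access:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: helm-install-podinfo    # hypothetical per-release Job
spec:
  backoffLimit: 3               # the Job, not the controller pod, carries the retries
  template:
    spec:
      restartPolicy: Never
      # serviceAccountName: helm-operator  # assumed to have RBAC for the target namespace
      containers:
        - name: helm
          image: alpine/helm:3.11.1        # community Helm image, entrypoint is `helm`
          args:
            - upgrade
            - --install
            - podinfo
            - oci://ghcr.io/stefanprodan/charts/podinfo
            - --wait
            - --timeout
            - 10m
```

With this shape, a controller pod being evicted would not abort the operation itself; the Job survives the eviction and Kubernetes reschedules it on surviving capacity.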
tl;dr: I would like to run the helm-controller on short-lived nodes without manual cleanups … right now I run it on Fargate, which is quite expensive compared to spot instances.