StackStorm / stackstorm-k8s

K8s Helm Chart that codifies a StackStorm (aka "IFTTT for Ops", https://stackstorm.com/) Highly Available fleet as a simple-to-use, reproducible infrastructure-as-code app
https://helm.stackstorm.com/
Apache License 2.0

Some K8S resources do not get deleted after 'helm delete'. #101

Open erenatas opened 4 years ago

erenatas commented 4 years ago

Hello,

I would like to report what I believe is an issue. After deleting the deployed Helm chart, I can see that some job pods remain. Here is an example of kubectl get po after helm delete --purge stackstorm on Helm 2 or helm delete stackstorm on Helm 3:

❯ kubectl get po   
NAME                                                              READY   STATUS                       RESTARTS   AGE
stackstorm-job-st2-apikey-load-j8rkv                              0/1     Completed                    0          6d21h
stackstorm-job-st2-key-load-wbdxp                                 0/1     Completed                    0          6d21h
stackstorm-job-st2-register-content-l6rmh                         0/1     Completed                    0          6d21h

Not only do these pods remain, but the jobs do as well. In kubectl get job, I can see:

❯ kubectl get job
NAME                                   COMPLETIONS   DURATION   AGE
stackstorm-job-st2-apikey-load         1/1           45s        6d21h
stackstorm-job-st2-key-load            1/1           9s         6d21h
stackstorm-job-st2-register-content    1/1           25s        6d21h

There are also PV and PVC objects remaining, but I believe that is intended?

Thanks!

Edit: I believe the reason is related to this. I think both the jobs and their pods should be deleted on helm delete, and, at the very least, pods of jobs should be deleted after they have completed successfully.
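For now, the leftover jobs (and with them their completed pods, which are removed as part of the cascading delete) can be cleaned up by hand after helm delete, for example using the job names from the listing above:

❯ kubectl delete job stackstorm-job-st2-apikey-load stackstorm-job-st2-key-load stackstorm-job-st2-register-content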

arm4b commented 4 years ago

Yes, that's caused by the fact that hooks are not managed by Helm and there is no garbage collector for them yet: https://helm.sh/docs/topics/charts_hooks/#hook-resources-are-not-managed-with-corresponding-releases

The only workaround is adding the "helm.sh/hook-delete-policy": hook-succeeded annotation (https://helm.sh/docs/topics/charts_hooks/#hook-deletion-policies), which is not desirable. Instead of deleting a successful job immediately, we want to keep it around for informational reasons, so the user can grab the logs and see which content was registered; otherwise this Helm magic would go unnoticed. Sadly, Helm doesn't delete those jobs once the release is removed.
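For illustration, that annotation would sit next to the existing helm.sh/hook annotation in the hook job's metadata, roughly like this (a minimal sketch; the hook types shown are assumed for the example, not necessarily what this chart's templates use):

apiVersion: batch/v1
kind: Job
metadata:
  name: stackstorm-job-st2-register-content
  annotations:
    "helm.sh/hook": post-install,post-upgrade   # hook types assumed for illustration
    # with this policy Helm removes the Job right after it succeeds,
    # which is exactly why the logs mentioned above would be lost
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  # job spec omitted; only the annotation placement is relevant here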

It's actually a good topic to discuss, and I'm inviting others to provide their feedback. Depending on that feedback, it might be a good idea to change the behavior before the chart reaches a stable state.

arm4b commented 4 years ago

An alternative, per the Helm docs' advice, is trying ttlSecondsAfterFinished (https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#clean-up-finished-jobs-automatically) and the TTL controller (https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) to clean up the jobs automatically after some reasonable delay. However, this feature still looks to be in alpha state.
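For reference, a sketch of what that means at the Kubernetes level: the TTL is a plain field on the Job spec. The delay and the container below are illustrative placeholders, not the chart's real job definition:

apiVersion: batch/v1
kind: Job
metadata:
  name: stackstorm-job-st2-register-content
spec:
  # the TTL-after-finished controller deletes the Job (and its pods)
  # this many seconds after it completes; 86400s = 1 day is just an example
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: noop
          image: busybox   # placeholder container, not the chart's actual job image
          command: ["true"]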

cognifloyd commented 3 years ago

> An alternative, per the Helm docs' advice, is trying ttlSecondsAfterFinished (https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#clean-up-finished-jobs-automatically) and the TTL controller (https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/) to clean up the jobs automatically after some reasonable delay. However, this feature still looks to be in alpha state.

It looks like this hit beta state in k8s v1.21 - so it's probably safe to start adding something like this.

I'm guessing the ttl would need to be defined in values.yaml (perhaps in jobs.ttlSecondsAfterFinished), probably with a large default like 604800 (1 week).
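A sketch of how that could be wired up, assuming the (not yet existing) jobs.ttlSecondsAfterFinished value suggested above; the template file name is illustrative. In values.yaml:

jobs:
  # automatically delete finished job objects (and their pods) after one week
  ttlSecondsAfterFinished: 604800

and in the job template (e.g. templates/jobs.yaml):

spec:
  {{- if .Values.jobs.ttlSecondsAfterFinished }}
  ttlSecondsAfterFinished: {{ .Values.jobs.ttlSecondsAfterFinished }}
  {{- end }}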

cognifloyd commented 2 years ago

Quick follow-up if anyone wants to work on this. The TTL-after-finished Controller feature hit stable in k8s v1.23.

NOTE: This feature will only clean up the old jobs. The only other hook we have is the one used for running tests (a Pod), so it is unlikely to be present (and probably should not be) in any production cluster. That test Pod is only deleted automatically if it succeeds - if there's a failure, it has to be cleaned up manually (or just use a disposable cluster for testing).

cognifloyd commented 1 year ago

k8s 1.22 was EOL on 2022-10-28, so we can safely use ttlSecondsAfterFinished now. A PR to implement this would be welcome!