Update terminatingpod sanity check

IBM / operator-for-redis-cluster

IBM Operator for Redis Cluster

https://ibm.github.io/operator-for-redis-cluster

MIT License


Update terminatingpod sanity check #85

Closed: 4n4nd closed this 1 year ago

4n4nd commented 1 year ago

The operator now disassociates pods that have been stuck in the Terminating state for more than a minute. It also adds the label "redis-operator.k8s.io/marked-for-termination" = "true" to those pods, so that they can be found and cleaned up later if needed.

Resolves #84

cin commented 1 year ago

Hey @4n4nd, I haven't had a chance to look into this yet, but the tests are failing due to https://github.com/IBM/operator-for-redis-cluster/actions/runs/3873873803/jobs/6629153343#step:4:34. Looks like a linter problem?

cin commented 1 year ago

It looks like all the tests pass. I will test this out ASAP! Thanks!

cin commented 1 year ago

@4n4nd, do you have a reliable way of reproducing this issue?

4n4nd commented 1 year ago

> @4n4nd, do you have a reliable way of reproducing this issue?

I believe you can set tolerations on the pods so that k8s does not evict them even when the node is down (docs).

The bigger issue is that because one sanity check (the terminatingpod check) fails every time, the operator never performs the other checks. And since this operator is designed to take action only when one of the sanity checks calls for it, it essentially gets stuck in a non-reconciling state.
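
To illustrate the failure mode, here is a small sketch of a check chain that stops at the first check reporting a problem. It is an assumption about the general pattern, not the operator's actual control flow; `sanityCheck`, `runUntilFirstAction`, and the check name "other-check" are hypothetical.

```go
package main

import "fmt"

// sanityCheck is a hypothetical check. Run returns true when the check
// found a problem and took (or requested) an action.
type sanityCheck struct {
	Name string
	Run  func() bool
}

// runUntilFirstAction runs the checks in order, stops at the first one
// that reports an action, and returns the names of the checks that ran.
func runUntilFirstAction(checks []sanityCheck) []string {
	var ran []string
	for _, c := range checks {
		ran = append(ran, c.Name)
		if c.Run() {
			break // later checks are skipped for this pass
		}
	}
	return ran
}

func main() {
	checks := []sanityCheck{
		// A pod stuck in Terminating makes this check trip on every pass.
		{Name: "terminatingpod", Run: func() bool { return true }},
		{Name: "other-check", Run: func() bool { return false }},
	}
	// Every pass ends after the first check, so "other-check" never runs.
	fmt.Println(runUntilFirstAction(checks))
}
```

In this sketch, once the first check stops tripping (for example because the stuck pod has been disassociated), the remaining checks run again on the next pass.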