druid-io / druid-operator

Druid Kubernetes Operator

deleteOrphanPvc might remove pvc of pending container #305

Open applike-ss opened 2 years ago

applike-ss commented 2 years ago

We are observing an issue daily where a historical node cannot come up again because its PVC got removed. deleteOrphanPvc is enabled and the Druid cluster is running on spot instances, which the cloud provider can take from us at any point in time. I have seen that there was an issue in the past for deleteOrphanPvc with a race condition (#150) and we might be running into a similar problem.

My assumption right now is that this happens:

At least I could reproduce the exact same scenario manually in our cluster by spawning a StatefulSet with bigger resource requests than we have headroom for, and then removing the PVC while the new node was still being spawned.

In addition to applying a similar fix as for #150 here, I would suggest also killing the pod if it stays in a pending state for longer than a configurable timespan (5 minute default maybe?).
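To make that concrete, here is a minimal sketch of what such a check could look like, written against controller-runtime. It is only an illustration of the suggestion, not operator code; `pendingPodTimeout` and `deleteLongPendingPods` are made-up names.

```go
// Illustrative sketch, not the operator's actual code: delete pods that have
// been stuck in Pending for longer than a configurable timeout, so the
// StatefulSet controller can recreate them.
package reconciler

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pendingPodTimeout is the suggested configurable timespan (5 minute default).
const pendingPodTimeout = 5 * time.Minute

// deleteLongPendingPods deletes every pod that has been Pending longer than
// pendingPodTimeout.
func deleteLongPendingPods(ctx context.Context, c client.Client, pods []corev1.Pod) error {
	for i := range pods {
		pod := &pods[i]
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		if time.Since(pod.CreationTimestamp.Time) < pendingPodTimeout {
			continue
		}
		// Delete the pod; the StatefulSet controller will recreate it.
		if err := c.Delete(ctx, pod); client.IgnoreNotFound(err) != nil {
			return err
		}
	}
	return nil
}
```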

AdheipSingh commented 2 years ago

Thanks for checking on this. Few things:

  1. The operator will not delete a PVC if a pod is in pending state.
  2. The operator will not delete PVCs unless all the STS in the CR are in running state, i.e. replicas should be available (see the sketch below).
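A rough illustration of the kind of guard those two points describe, assuming the operator already has the pod and StatefulSet lists at hand (function and package names here are illustrative, not the operator's actual source):

```go
// Rough illustration of the guards described above (not the operator's
// actual source): orphan-PVC deletion is skipped while any pod is Pending
// or while any StatefulSet in the CR has unavailable replicas.
package reconciler

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// safeToDeleteOrphanPVCs returns true only when no pod is Pending and every
// StatefulSet reports all of its desired replicas as ready.
func safeToDeleteOrphanPVCs(pods []corev1.Pod, stsList []appsv1.StatefulSet) bool {
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodPending {
			return false
		}
	}
	for _, sts := range stsList {
		desired := int32(1)
		if sts.Spec.Replicas != nil {
			desired = *sts.Spec.Replicas
		}
		if sts.Status.ReadyReplicas != desired {
			return false
		}
	}
	return true
}
```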

Also, when running on spot we did see this happen at times, which was the reason for adding the #150 fix.

Can you list the exact steps to re-create the scenario? After fixing #150 we didn't see the issue arise. I'll have a look and see if I can re-create the issue.

applike-ss commented 2 years ago

That is interesting, I was about to ask whether it could be a race condition when the pod is being terminated and then, for a few milliseconds, doesn't exist in any state (between termination and the pending state).

Maybe I should've mentioned that we are running the operator in version 0.0.9, which seems to be the latest release. That release was also made two months after the fix for #150 was merged, so I do assume the fix is in the binary as well.

> Can you list the exact steps to re-create the scenario.

If you mean druid-wise, I can not. If you mean in general, the steps are like this:

The issue does not happen for us when scaling druid pods up or down to/from a higher number (as mentioned in #150), but instead every now and then when a spot termination happens.

Btw, I can see in our logs that something removed the PVC, and I assure you that neither of us with access to the cluster did it at that time. See this screenshot (ignore that the entries are out of order): [screenshot]

We have 5 replicas in total for this StatefulSet, so it cannot simply be that one last terminated pod or something like that.

AdheipSingh commented 2 years ago

@applike-ss how frequently do you see this issue? Did you face this on every spot node that was killed?

applike-ss commented 2 years ago

We did see it once a day, starting a few days before I reported the issue. We do not see it on every spot node that gets killed. Over the last 4 days the issue wasn't visible at all.

AdheipSingh commented 2 years ago

Hmmm, interesting. I am not sure what extra conditional can be added to check for this race. This feature was tested and is running on large spot infrastructure running Druid. If you have any suggestions/improvements, feel free to point them out.

applike-ss commented 2 years ago

In fact, it happened again just now.

I would like to suggest a "kill pod" feature that deletes the pod if it stays in pending state longer than a configurable amount of time (default 5m?).

In addition to that, maybe we could have a delayed PVC removal (1m default?) which, prior to removal, checks again whether the PVC is still orphaned and aborts the pending deletion if it has been bound again?
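A minimal sketch of what that delayed removal could look like, assuming a reconcile loop that runs repeatedly and can track when a PVC was first seen as orphaned. All names here (`orphanPVCGracePeriod`, `reconcileOrphanPVC`, the `referencedByPod` flag) are hypothetical, not existing operator configuration.

```go
// Hypothetical sketch of the delayed-removal idea: remember when a PVC was
// first seen orphaned, and only delete it after a grace period, re-checking
// right before deletion that it is still unreferenced.
package reconciler

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// orphanPVCGracePeriod is the suggested delay before an orphan PVC is removed.
const orphanPVCGracePeriod = 1 * time.Minute

// orphanSince tracks when each PVC was first observed as orphaned.
var orphanSince = map[string]time.Time{}

// reconcileOrphanPVC would be called on every reconcile for each candidate
// PVC; referencedByPod reports whether any pod currently uses the claim.
func reconcileOrphanPVC(ctx context.Context, c client.Client, pvc *corev1.PersistentVolumeClaim, referencedByPod bool) error {
	key := pvc.Namespace + "/" + pvc.Name
	if referencedByPod {
		// The PVC is in use (again); abort any pending deletion.
		delete(orphanSince, key)
		return nil
	}
	first, seen := orphanSince[key]
	if !seen {
		// Start the grace period; deletion happens on a later reconcile.
		orphanSince[key] = time.Now()
		return nil
	}
	if time.Since(first) < orphanPVCGracePeriod {
		return nil
	}
	delete(orphanSince, key)
	return client.IgnoreNotFound(c.Delete(ctx, pvc))
}
```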