druid-io / druid-operator

Druid Kubernetes Operator

deleteOrphanPvc might remove pvc of pending container #305

Open applike-ss opened 2 years ago

applike-ss commented 2 years ago

We are observing an issue daily where a historical node cannot come up again because its PVC got removed. deleteOrphanPvc is enabled and the Druid cluster is running on spot instances, which the cloud provider can take from us at any point in time. I have seen that there was an issue in the past for deleteOrphanPvc with a race condition (#150) and we might be running into a similar problem.

My assumption right now is that this happens:

At least I could reproduce the exact same scenario manually in our cluster by spawning a StatefulSet with bigger resource requests than we have headroom for, and then removing the PVC while the new node was still being spawned.

In addition to applying a similar fix as for #150 here, I would suggest also killing the pod if it stays in a pending state for longer than a configurable timespan (5 minute default maybe?).
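To make that concrete, here is a minimal sketch of what such a check could look like, written against controller-runtime. It is only an illustration of the suggestion, not operator code; `pendingPodTimeout` and `deleteLongPendingPods` are made-up names.

```go
// Illustrative sketch, not the operator's actual code: delete pods that have
// been stuck in Pending for longer than a configurable timeout, so the
// StatefulSet controller can recreate them.
package reconciler

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pendingPodTimeout is the suggested configurable timespan (5 minute default).
const pendingPodTimeout = 5 * time.Minute

// deleteLongPendingPods deletes every pod that has been Pending longer than
// pendingPodTimeout.
func deleteLongPendingPods(ctx context.Context, c client.Client, pods []corev1.Pod) error {
	for i := range pods {
		pod := &pods[i]
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		if time.Since(pod.CreationTimestamp.Time) < pendingPodTimeout {
			continue
		}
		// Delete the pod; the StatefulSet controller will recreate it.
		if err := c.Delete(ctx, pod); client.IgnoreNotFound(err) != nil {
			return err
		}
	}
	return nil
}
```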

AdheipSingh commented 2 years ago

Thanks for checking on this. Few things:

  1. The operator will not delete a PVC if a pod is in pending state.
  2. The operator will not delete PVCs unless all the STS in the CR are in running state, i.e. replicas should be available (see the sketch below).
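A rough illustration of the kind of guard those two points describe, assuming the operator already has the pod and StatefulSet lists at hand (function and package names here are illustrative, not the operator's actual source):

```go
// Rough illustration of the guards described above (not the operator's
// actual source): orphan-PVC deletion is skipped while any pod is Pending
// or while any StatefulSet in the CR has unavailable replicas.
package reconciler

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// safeToDeleteOrphanPVCs returns true only when no pod is Pending and every
// StatefulSet reports all of its desired replicas as ready.
func safeToDeleteOrphanPVCs(pods []corev1.Pod, stsList []appsv1.StatefulSet) bool {
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodPending {
			return false
		}
	}
	for _, sts := range stsList {
		desired := int32(1)
		if sts.Spec.Replicas != nil {
			desired = *sts.Spec.Replicas
		}
		if sts.Status.ReadyReplicas != desired {
			return false
		}
	}
	return true
}
```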

Also, when running on spot we did see this happen at times, which was the reason for adding the #150 fix.

Can you list the exact steps to re-create the scenario? After fixing #150 we didn't see the issue arise. I'll have a look and see if I can re-create the issue.

applike-ss commented 2 years ago

That is interesting, I was about to ask whether it could be a race condition when the pod is being terminated and then, for a few milliseconds, doesn't exist in any state (between termination and the pending state).

Maybe I should've mentioned that we are running the operator in version 0.0.9, which seems to be the latest release. That release was also made two months after the fix for #150 was merged, so I do assume the fix is in the binary as well.

> Can you list the exact steps to re-create the scenario.

If you mean druid-wise, I can not. If you mean in general, the steps are like this:

The issue does not happen for us when scaling druid pods up or down to/from a higher number (as mentioned in #150), but instead every now and then when a spot termination happens.

Btw, I can see in our logs that something removed the PVC, and I assure you that neither of us with access to the cluster did it at that time. See this screenshot (ignore that the entries are out of order): [screenshot]

We have 5 replicas in total for this StatefulSet, so it cannot simply be that one last terminated pod or something like that.

AdheipSingh commented 2 years ago

@applike-ss how frequently do you see this issue? Did you face this on every spot node that was killed?

applike-ss commented 2 years ago

We did see it once a day, starting a few days before I reported the issue. We do not see it on every spot node that gets killed. Over the last 4 days the issue wasn't visible at all.

AdheipSingh commented 2 years ago

Hmmm, interesting. I am not sure what extra conditional can be added to check for this race. This feature was tested and is running on large spot infrastructure running Druid. If you have any suggestions/improvements, feel free to point them out.

applike-ss commented 2 years ago

In fact, it happened again just now.

I would like to suggest a "kill pod" feature that deletes the pod if it stays in pending state longer than a configurable amount of time (default 5m?).

In addition to that, maybe we could have a delayed PVC removal (1m default?) which, prior to removal, checks again whether the PVC is still orphaned and aborts the pending deletion if it has been bound again?
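A minimal sketch of what that delayed removal could look like, assuming a reconcile loop that runs repeatedly and can track when a PVC was first seen as orphaned. All names here (`orphanPVCGracePeriod`, `reconcileOrphanPVC`, the `referencedByPod` flag) are hypothetical, not existing operator configuration.

```go
// Hypothetical sketch of the delayed-removal idea: remember when a PVC was
// first seen orphaned, and only delete it after a grace period, re-checking
// right before deletion that it is still unreferenced.
package reconciler

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// orphanPVCGracePeriod is the suggested delay before an orphan PVC is removed.
const orphanPVCGracePeriod = 1 * time.Minute

// orphanSince tracks when each PVC was first observed as orphaned.
var orphanSince = map[string]time.Time{}

// reconcileOrphanPVC would be called on every reconcile for each candidate
// PVC; referencedByPod reports whether any pod currently uses the claim.
func reconcileOrphanPVC(ctx context.Context, c client.Client, pvc *corev1.PersistentVolumeClaim, referencedByPod bool) error {
	key := pvc.Namespace + "/" + pvc.Name
	if referencedByPod {
		// The PVC is in use (again); abort any pending deletion.
		delete(orphanSince, key)
		return nil
	}
	first, seen := orphanSince[key]
	if !seen {
		// Start the grace period; deletion happens on a later reconcile.
		orphanSince[key] = time.Now()
		return nil
	}
	if time.Since(first) < orphanPVCGracePeriod {
		return nil
	}
	delete(orphanSince, key)
	return client.IgnoreNotFound(c.Delete(ctx, pvc))
}
```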