datainfrahq / druid-operator

Apache Druid On Kubernetes
Other
101 stars 42 forks source link

Updates to CR Wrongly Bumps Generation Field #154

Closed dave-mccowan closed 6 months ago

dave-mccowan commented 6 months ago

When druid-operator updates the CR, for example in updateFinalizers(), the generation field in the CR is bumped from 1 to 2. This introduces a subtle and sometimes serious bug. In deployDruidCluster(), the code checks m.Generation > 1, as the key for whether to use RollingDeploy. Since generation is bumped at the very beginning due to updating the finalizers, this check always passes and this code treats all deployments as updates, even if it is the initial deploy.

The initial deploy is supposed to be in parallel instead of in rolling, even if rollingDeploy is true. With this bug, the initial deploy is also done as rolling. This causes the initial deploy to be slower and it logs lots of errors.

This bug is even more serious if authentication is configured. In this case, the initial components won't completely come up until they've contacted conductor. But, conductor will not start until the earlier components have completed their start up. So, druid deploy is deadlocked.

In operator v1.2.1, I worked around this issue by adding the finalizers to the CR myself, so no update to the CR was made by the operator, generation remained set to 1, and the deploy completed. Updating to v1.2.3, I find that generation is getting bumped again and my initial deployment is deadlocked.

To recreate the bug, simply do an initial deploy and observe the generation field in the CR. It should be 1. To recreate the adverse side effects, set rollingDeploy to true. Components will start in rolling order instead of in parallel. To observe even worse side effects, add basic auth to the CR.