datainfrahq / druid-operator

Apache Druid On Kubernetes

Operator takes down druid cluster upon re-creation #170

Closed mariuskimmina closed 2 months ago

mariuskimmina commented 3 months ago

We currently have the druid-operator and the druid cluster in the same namespace. We'll soon add a second druid cluster, and to keep things cleaner we would like to move the operator to its own namespace. While we were testing this, we ran into a couple of issues.

First, here is how we imagined this should work (rough commands sketched after the list):

  1. Take down the current operator
  2. All CRDs remain and the druid cluster stays alive
  3. Bring up the operator again in the new namespace
  4. The operator picks up the existing CRDs and ClusterRoles and things keep working
  5. No clusters get destroyed
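
In plain helm terms, that plan would look roughly like this (we actually drive this through Terraform, see the config further down the thread; the dedicated namespace name druid-operator-system is just a placeholder):

# Step 1: remove the operator release; the CRD and the Druid CR should stay behind
helm uninstall druid-operator -n druid

# Step 3: bring the operator back up in a namespace of its own
helm repo add datainfra https://charts.datainfra.io
helm install druid-operator datainfra/druid-operator \
  --namespace druid-operator-system --create-namespace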

Steps 1 and 2 worked as expected: we could bring down the operator without affecting our existing druid cluster. On step 3 we faced a couple of issues. First, the operator chart did not support Helm's --skip-crds flag, which prevented the new operator from coming up while the CRD already existed; a fix for this already got merged here.

A similar issue then occurs for ClusterRoles. We fixed this in our local chart by adding an option to skip the ClusterRoles as well, and of course we can also open a PR for this.
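
With both workarounds in place, the reinstall looks roughly like this (note: rbac.skipClusterRoles is the value name from our local chart and is not upstream, so treat it as hypothetical):

# --skip-crds leaves the pre-existing CRD untouched;
# rbac.skipClusterRoles is our local, not-yet-upstream chart value
helm install druid-operator datainfra/druid-operator \
  --namespace druid-operator-system --create-namespace \
  --skip-crds \
  --set rbac.skipClusterRoles=true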

Now, with both of the above in place, we were able to bring up the druid operator in its own namespace. But once the operator was up, it somehow removed the existing CRD, and because of the owner dependency the whole druid cluster was gone with it. We have yet to figure out why exactly the CRD got removed. The operator also did not create a new one; we were left with a running operator but no druid cluster.

We saw the following events in our kubernetes cluster, which seem to be related:

druid             11m         Normal    DruidOperatorUpdateSuccess        druid/sb2                                                                        Updated [sb2:*v1alpha1.Druid].
druid             11m         Normal    DruidOperatorDeleteSuccess        druid/sb2                                                                        Successfully deleted object [data-volume-druid-sb2-middlemanagers-2:*v1.PersistentVolumeClaim] in namespace [druid]

That said, we haven't yet found the root cause of the operator removing the CRD.
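
For anyone trying to reproduce this, these are the kinds of lookups we used while digging (assuming the CRD name druids.druid.apache.org, which follows from the druid.apache.org/v1alpha1 API group in the events above):

# is the CRD still there, and has it been marked for deletion?
kubectl get crd druids.druid.apache.org \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'

# which objects in the druid namespace point at an owner (e.g. the Druid CR)?
kubectl get pvc -n druid \
  -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.metadata.ownerReferences[*].name}{"\n"}{end}'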

AdheipSingh commented 3 months ago

The operator will not mess up any current state unless something changes in the desired state, i.e. the CR.

I faced a similar issue a while back when I had uninstalled the operator chart. But then we added --keep-crds, which worked fine.

Did you mark the CR for deletion, or did Helm do it? I'm curious. Also, is your operator running at cluster scope?

mariuskimmina commented 3 months ago

We are using Helm via Terraform:

resource "helm_release" "druid_operator" {
  name       = "druid-operator"
  repository = "https://charts.datainfra.io"
  chart      = "druid-operator"
  namespace  = var.namespace
  version    = var.operator_chart_version

  set {
    name  = "resources.requests.cpu"
    value = var.operator_cpu_request
  }

  set {
    name  = "resources.requests.memory"
    value = var.operator_memory_request
  }

  set {
    name  = "resources.limits.memory"
    value = var.operator_memory_limit
  }

  set {
    name  = "env.WATCH_NAMESPACE"
    value = var.watch_namespace
  }
}

We are using watch_namespace to limit the operator to only the two namespaces we will have druid clusters in.
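
For reference, watching two namespaces boils down to a comma-separated WATCH_NAMESPACE value. In plain helm terms it would look like this (the second namespace name is made up, and commas must be escaped inside --set):

helm upgrade --install druid-operator datainfra/druid-operator \
  --namespace druid-operator-system \
  --set "env.WATCH_NAMESPACE=druid\,druid-analytics"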

We first took down the current operator in the druid namespace by applying our druid module again, which no longer contains the operator:

terraform apply --target module.druid

This removed the whole operator helm chart. We then re-created the operator by applying its new module:

terraform apply --target module.druid_operator

This brought up the new operator successfully, as described above. Only once the new operator was running did the existing CRD get removed (which resulted in the CR being removed as well).

AdheipSingh commented 3 months ago

I am pretty sure tf is re-creating the CRDs. Try using the --keep-crds flag; not sure where to add it in tf.

mariuskimmina commented 3 months ago

I don't think that's the case. See, we ran into #169 before, so we had a case where the druid operator was applied successfully and terraform was done, but the operator was unable to start because of the exec format error. While the operator was unable to start, the CRD and cluster were still there. Only once the operator started running did the CRD and cluster get removed.

AdheipSingh commented 3 months ago

For such issues, I'll suggest getting connected here: https://calendly.com/adheip-singh/30-min-meeting?month=2024-06.

I am confused by the terminology being used and mentioned:

While the operator was unable to start, the CRD and cluster were still there. Only once the operator started running did the CRD and cluster get removed.

Please look into the following:

In your case it's tf > helm for applying, and the operator for reconciling. There is an abstraction between the two points mentioned above. If you send in a bad config, the operator will reconcile it. So I'll suggest looking into tf and what config it is applying.

Please note the operator performs lookups for the CRD. The operator does not perform any lookups for the CR; applying config for the CR is an entirely event-driven mechanism. The operator won't delete any CR until a deletion timestamp is set, and the operator will never delete a CRD. The way you are applying configurations to the CRD and CR is something to be looked into.
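
A quick way to check whether a deletion timestamp was ever set on the CR (using the CR name from the events you posted; an empty result means nothing has asked for deletion):

kubectl get druid sb2 -n druid \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'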

Regarding issue #169, once I get time I will push an amd64 image.

mariuskimmina commented 2 months ago

Small update on this before our meeting later:

We found that both the CRD and the CR do have a deletionTimestamp set after removing the operator's helm chart. That said, the actual deletion only happens once the new (re-created) druid-operator starts running. We do have keep-crds set to true (I don't think this does anything, though: the flag isn't used anywhere, and Helm 3 doesn't automatically delete CRDs the way Helm 2 did).
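
That behaviour is consistent with finalizers: the API server sets deletionTimestamp right away, but the object is only actually removed once all finalizers are cleared, which for the CR only happens when the operator reconciles again. A sketch of how we inspected both (same names as above):

# a set deletionTimestamp plus remaining finalizers means deletion is pending,
# waiting for whoever owns the finalizer (here: the operator) to clear it
kubectl get druid sb2 -n druid \
  -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}{"\n"}'
kubectl get crd druids.druid.apache.org \
  -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}{"\n"}'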

TessaIO commented 2 months ago

@AdheipSingh we just tested the scenario we discussed in the meeting, and Helm was the one responsible for deleting the CRD. In fact, according to https://github.com/helm/helm/issues/7279#issuecomment-568019521, if a helm release was deployed with the CRD under the templates folder, then when you uninstall the release Helm will try to delete the CRD. It should be noted that we installed the Helm chart before https://github.com/datainfrahq/druid-operator/pull/162 was merged, so what happened makes sense. Thanks for your time again.
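
For anyone else hitting this: CRDs shipped via a chart's crds/ directory are not part of the release manifest and are left alone on uninstall, while CRDs rendered from templates/ are deleted with the release (a helm.sh/resource-policy: keep annotation on the templated CRD is the usual guard). A sketch of how to check which case you're in, using the release name and the namespace placeholder from earlier in the thread:

# if this prints a CustomResourceDefinition, the CRD was rendered from templates/
# and helm uninstall will delete it together with the release
helm get manifest druid-operator -n druid-operator-system | grep -B1 -A2 'kind: CustomResourceDefinition'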