mariuskimmina closed this 2 months ago
The operator will not mess up any current state unless something changes in the desired state, i.e. the CR.
I faced a similar issue a while back when I uninstalled the operator chart. But then we added `--keep-crds`, which worked fine.
Did you mark the CR for deletion, or did Helm do it? I'm curious. Also, is your operator running cluster-scoped?
We are using Helm via Terraform:

```hcl
resource "helm_release" "druid_operator" {
  name       = "druid-operator"
  repository = "https://charts.datainfra.io"
  chart      = "druid-operator"
  namespace  = var.namespace
  version    = var.operator_chart_version

  set {
    name  = "resources.requests.cpu"
    value = var.operator_cpu_request
  }

  set {
    name  = "resources.requests.memory"
    value = var.operator_memory_request
  }

  set {
    name  = "resources.limits.memory"
    value = var.operator_memory_limit
  }

  set {
    name  = "env.WATCH_NAMESPACE"
    value = var.watch_namespace
  }
}
```
We are using `watch_namespace` to limit the operator to the two namespaces that will contain our druid clusters.
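For context, operators built on the operator-sdk conventions typically read `WATCH_NAMESPACE` as a comma-separated list of namespaces. A sketch of the variable we pass (the namespace names here are placeholders, not our real ones):

```hcl
variable "watch_namespace" {
  description = "Comma-separated list of namespaces the operator should watch."
  type        = string
  # "druid-a,druid-b" are placeholder names for illustration only.
  default     = "druid-a,druid-b"
}
```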
We first took down the current operator in the `druid` namespace by applying our druid module again, which no longer contains the operator:

```shell
terraform apply --target module.druid
```

This removed the whole operator Helm chart. We then re-created the operator by applying its new module:

```shell
terraform apply --target module.druid_operator
```

This brought up the new operator successfully, as described above. Only once the new operator was running did the existing CRD get removed (which resulted in the CR being removed).
I am pretty sure tf is re-creating the CRDs. Try using the `--keep-crds` flag; I'm not sure where to add it in tf.
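If it helps, the Terraform Helm provider does expose a `skip_crds` argument on `helm_release`, which maps to Helm's `--skip-crds` install behavior. A minimal sketch, not tested against this particular chart:

```hcl
resource "helm_release" "druid_operator" {
  name       = "druid-operator"
  repository = "https://charts.datainfra.io"
  chart      = "druid-operator"

  # skip_crds tells the provider not to install CRDs shipped in the
  # chart's crds/ directory; it does not affect CRDs that are rendered
  # from templates/, which Helm treats as ordinary release resources.
  skip_crds = true
}
```

Note that this only controls install-time behavior; it is not the same as keeping CRDs on uninstall.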
I don't think that's the case. See, we ran into #169 before, so we had a case where the druid operator was applied successfully and terraform was done, but the operator was unable to start because of the exec format error. While the operator was unable to start, the CRD and cluster were still there. Only once the operator started running did the CRD and cluster get removed.
For such issues, I'll suggest getting connected here: https://calendly.com/adheip-singh/30-min-meeting?month=2024-06.
I am confused by the terminology being used and mentioned:

> While the operator was unable to start, the CRD and cluster were still there. Only once the operator started running did the CRD and cluster get removed.

Please look into:
In your case it's tf > helm for applying, and the operator for reconciling. There is an abstraction between the two points mentioned above. If you send in a bad config, the operator will reconcile it. So I'll suggest looking into tf to see what config it is applying.
Please note that the operator performs lookups for the CRD. The operator does not perform any lookups for the CR; applying config for the CR is an entirely event-driven mechanism. The operator won't delete any CR until a deletion timestamp is set, and the operator will never delete a CRD. The way you are applying configurations to the CRD and CR is something to be looked into.
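One way to check this from the outside is to inspect the `deletionTimestamp` and finalizers directly. A command sketch; the CRD name `druids.druid.apache.org`, the namespace `druid`, and the CR name `my-cluster` are assumptions based on the chart defaults, not taken from this thread:

```shell
# Does the CRD carry a deletionTimestamp, i.e. has something asked for deletion?
kubectl get crd druids.druid.apache.org \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'

# Same check for the CR, plus its finalizers, which block the actual
# deletion until whoever owns them removes them.
kubectl get druid my-cluster -n druid \
  -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}{"\n"}'
```

If the timestamp is set but the object lingers, something (often the operator) still has to clear the finalizers, which would explain deletion only completing once the operator runs.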
Regarding issue #169, once I get time I will push an amd64 image.
Small update on this before our meeting later:
We found that both the CRD and the CR do have a `deletionTimestamp` set after removing the operator's Helm chart. That said, the actual deletion only happens once the new (re-created) druid-operator starts running. We do have `keep-crds` set to true (I don't think this does anything though: the flag isn't used anywhere in the chart, and Helm 3 doesn't automatically delete CRDs; Helm 2 did).
@AdheipSingh we just tested the scenario we discussed in the meeting and Helm was the one responsible for deleting the CRD.
In fact, according to https://github.com/helm/helm/issues/7279#issuecomment-568019521, if a Helm release was deployed with the CRD under the `templates` folder, then when you uninstall the release Helm will try to delete the CRD. It should be noted that we installed the Helm chart before https://github.com/datainfrahq/druid-operator/pull/162 was merged, so what happened makes sense.
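For anyone hitting the same thing: one rough way to verify whether a chart renders its CRD from `templates/` (deleted on `helm uninstall`) rather than shipping it in `crds/` (which Helm 3 leaves alone) is to render the templates and look for the CRD kind:

```shell
# If the CRD appears in the rendered output, Helm manages it as a regular
# release resource and will delete it when the release is uninstalled.
helm template druid-operator druid-operator \
  --repo https://charts.datainfra.io \
  | grep -B1 -A1 'kind: CustomResourceDefinition'
```

This is a sketch; the chart and repo names are taken from the Terraform config above, and output depends on the chart version you render.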
Thanks for your time again
We currently have the druid-operator and the druid cluster in the same namespace. We'll soon add a second druid cluster, and to keep things cleaner we would like to move the operator to its own namespace, but while we were testing this we ran into a couple of issues.
First, here is how we imagined this should work:
Steps 1 and 2 worked as expected: we could bring down the operator without affecting our existing druid cluster. On step 3 we faced a couple of issues. First, the operator chart did not support Helm's `--skip-crds` flag, which prevented the new operator from coming up while the CRD already existed; a fix for this already got merged here. A similar issue then occurs for ClusterRoles; we fixed this in our local chart by adding an option to skip the ClusterRoles as well, and of course we can also open a PR for this.
Now, with both of the above in place, we were able to bring up the druid operator in its own namespace. But once the operator was up, it somehow removed the existing CRD, and because of the owner dependency the whole druid cluster was gone with it. We have yet to figure out why exactly the CRD got removed. It also did not create a new one; we were left with a running operator but no druid cluster.
We saw the following events in our kubernetes cluster which seem to be related
That said, we haven't yet found the root cause of the operator removing the CRD