cockroachdb / cockroach-operator

k8s operator for CRDB
Apache License 2.0
286 stars 95 forks source link

Operator is vulnerable to misoperations and drives the cluster to broken state #975

Open hoyhbx opened 1 year ago

hoyhbx commented 1 year ago

Hello cockroachdb operator developers,

We found that many properties in the CR are very easy to drive the cluster into a broken state if not handled carefully. For example, when specifying a bad value for the properties in spec.Affinity, the operator will perform a rolling upgrade and the restarted pod will fail to get scheduled due to the unsatisfiable Affinity. There are a lot of other examples include spec.Affinity, spec.additionalArgs which will crash the restarted cockroachdb process, spec.podEnvVariables.0.valueFrom.configMapKeyRef prevents the pod to start if the configMap does not exist in the cluster.

A concrete example is to submit the following CR to a 3-node cluster with 5 crdb replicas

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - test-cluster
        topologyKey: kubernetes.io/hostname
...

cockroachdb will lose 2 replicas, leaving them unscheduled

It causes severe consequences in production. We believe these are misoperation vulnerabilities where the operator fails to reject the misoperation from users. The operator uses controller-gen to automatically generate validations for many properties, but these static validations fall short for validating more complicated contraints, e.g. to reject an invalid nodeSelector needs knowledge of which nodes are available in Kubernetes cluster. Validating the cockroachdb args also requires application specific information.

We want to open this issue to discuss what do you think should be the best practice to handle this issue, or what functionalities should the Kubernetes provide to make this validation easier. Is there a way to prevent the bad operation from happening in the first place, or there is a way for cockroach-operator to automatically recognize the statefulSet is stuck and perform an automatic recovery. If you know of any practical code fixes for this issue, we are also happy to send a PR for that.