coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0
1.75k stars 741 forks

ETCDCluster stuck in failing state when secret missing upon creation #2129

Open rafaltrojniak opened 4 years ago

rafaltrojniak commented 4 years ago

The problem description

I have a working integration between cert-manager and etcd-operator to manage their TLS certificates. Unfortunately, because all manifests are deployed to the cluster at the same time and the EtcdCluster manifest is applied before the certificates exist, the resulting EtcdCluster gets stuck in a failed state (see the debug info below).

Deleting and re-creating the same EtcdCluster resolves the situation, but that is a manual step, so the deployment is no longer fully automated.

The expected behavior

I would expect the operator to re-validate the dependencies of failed clusters periodically (for example, every 10s) and, once the missing dependency (the secret in this case) appears, to resume the creation process.

debug information

The resulting EtcdCluster object looks like this:

apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  annotations:
    etcd.database.coreos.com/scope: clusterwide
    kubectl.kubernetes.io/last-applied-configuration: [...]
  creationTimestamp: "2019-10-18T13:14:28Z"
  generation: 1
  labels:
    app: catalog-apiserver-etcd
    appId: servicecatalog
  name: catalog-apiserver-etcd
  namespace: etcdtest
  resourceVersion: "61728169"
  selfLink: /apis/etcd.database.coreos.com/v1beta2/namespaces/etcdtest/etcdclusters/catalog-apiserver-etcd
  uid: 36dd75a3-f1a9-11e9-8639-0ab63d02cdd0
spec:
  TLS:
    static:
      member:
        peerSecret: catalog-apiserver-etcd-peer-renamed
        serverSecret: catalog-apiserver-etcd-server-renamed
      operatorSecret: catalog-apiserver-etcd-operator-renamed
  pod:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference: {}
          weight: 1
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: etcd_cluster
                operator: In
                values:
                - catalog-apiserver-etcd
            topologyKey: kubernetes.io/hostname
          weight: 100
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      dataSource: null
      resources:
        requests:
          storage: 1Gi
    resources: {}
  repository: quay.io/coreos/etcd
  size: 3
  version: 3.2.25
status:
  currentVersion: ""
  members: {}
  phase: Failed
  reason: secrets "catalog-apiserver-etcd-operator-renamed" not found
  size: 0
  targetVersion: ""

This status persists even though the secret exists by now:

$ kubectl get secret catalog-apiserver-etcd-operator-renamed
NAME                                      TYPE     DATA   AGE
catalog-apiserver-etcd-operator-renamed   Opaque   3      20m

Operator logs contain:

time="2019-10-18T13:14:28Z" level=error msg="cluster failed to setup: secrets \"catalog-apiserver-etcd-operator-renamed\" not found" cluster-name=catalog-apiserver-etcd cluster-namespace=etcdtest pkg=cluster
time="2019-10-18T13:14:28Z" level=warning msg="fail to handle event: ignore failed cluster (catalog-apiserver-etcd). Please delete its CR" pkg=controller