coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

etcd-operator should add status to etcdclusters/<clustername> when lost quorum #2067

Open nvtkaszpir opened 5 years ago

nvtkaszpir commented 5 years ago

Right now etcd-operator does not update the status of etcdclusters/<clustername> when the cluster loses quorum.

Steps to reproduce:

  1. create etcd-operator deployment
    kubectl apply -f etcd-operator.deployment.yaml
# etcd-operator.deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: etcd-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: etcd-operator
    spec:
      containers:
      - name: etcd-operator
        image: quay.io/coreos/etcd-operator:v0.9.4
        command:
        - etcd-operator
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
  2. create an etcd cluster custom resource with 3 members:
    kubectl apply -f etcd-cluster.crd.yaml
# etcd-cluster.crd.yaml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "etcd"
spec:
  size: 3
  version: "3.2.13"
  3. wait until the cluster is set up

    kubectl get etcdclusters/etcd -o yaml
    apiVersion: etcd.database.coreos.com/v1beta2
    kind: EtcdCluster
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"etcd.database.coreos.com/v1beta2","kind":"EtcdCluster","metadata":{"annotations":{},"labels":{"etcd-operator-managed":"true"},"name":"etcd","namespace":"default"},"spec":{"size":3,"version":"3.2.13"}}
      creationTimestamp: "2019-03-13T23:37:49Z"
      generation: 1
      labels:
        etcd-operator-managed: "true"
      name: etcd
      namespace: default
      resourceVersion: "2182831"
      selfLink: /apis/etcd.database.coreos.com/v1beta2/namespaces/default/etcdclusters/etcd
      uid: 037a6ceb-45e9-11e9-8b71-42010a8a000a
    spec:
      repository: quay.io/coreos/etcd
      size: 3
      version: 3.2.13
    status:
      clientPort: 2379
      conditions:
      - lastTransitionTime: "2019-03-13T23:38:30Z"
        lastUpdateTime: "2019-03-13T23:38:30Z"
        reason: Cluster available
        status: "True"
        type: Available
      currentVersion: 3.2.13
      members:
        ready:
        - etcd-6r6rpjsmtk
        - etcd-r5fdrln4sh
        - etcd-xkdcxc95vg
      phase: Running
      serviceName: etcd-client
      size: 3
      targetVersion: ""
    kubectl get pods
    NAME                             READY   STATUS    RESTARTS   AGE
    etcd-6r6rpjsmtk                  1/1     Running   0          43s
    etcd-operator-5c6bddb7f6-lxwqb   1/1     Running   0          93s
    etcd-r5fdrln4sh                  1/1     Running   0          27s
    etcd-xkdcxc95vg                  1/1     Running   0          51s
  4. kill 2 pods out of 3:

    kubectl delete pod/etcd-6r6rpjsmtk pod/etcd-r5fdrln4sh
    pod "etcd-6r6rpjsmtk" deleted
    pod "etcd-r5fdrln4sh" deleted
  5. check the etcd-operator log; it reports that quorum was lost:

    stern etcd-operator
    ...
    etcd-operator-5c6bddb7f6-lxwqb etcd-operator time="2019-03-13T23:41:58Z" level=info msg="cluster membership: etcd-6r6rpjsmtk,etcd-r5fdrln4sh,etcd-xkdcxc95vg" cluster-name=etcd cluster-namespace=default pkg=cluster
    etcd-operator-5c6bddb7f6-lxwqb etcd-operator time="2019-03-13T23:41:58Z" level=info msg="Finish reconciling" cluster-name=etcd cluster-namespace=default pkg=cluster
    etcd-operator-5c6bddb7f6-lxwqb etcd-operator time="2019-03-13T23:41:58Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd cluster-namespace=default pkg=cluster
  6. check etcdclusters/etcd again:

    kubectl get etcdclusters/etcd -o yaml
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"etcd.database.coreos.com/v1beta2","kind":"EtcdCluster","metadata":{"annotations":{},"labels":{"etcd-operator-managed":"true"},"name":"etcd","namespace":"default"},"spec":{"size":3,"version":"3.2.13"}}
  creationTimestamp: "2019-03-13T23:37:49Z"
  generation: 1
  labels:
    etcd-operator-managed: "true"
  name: etcd
  namespace: default
  resourceVersion: "2182831"
  selfLink: /apis/etcd.database.coreos.com/v1beta2/namespaces/default/etcdclusters/etcd
  uid: 037a6ceb-45e9-11e9-8b71-42010a8a000a
spec:
  repository: quay.io/coreos/etcd
  size: 3
  version: 3.2.13
status:
  clientPort: 2379
  conditions:
  - lastTransitionTime: "2019-03-13T23:38:30Z"
    lastUpdateTime: "2019-03-13T23:38:30Z"
    reason: Cluster available
    status: "True"
    type: Available
  currentVersion: 3.2.13
  members:
    ready:
    - etcd-6r6rpjsmtk
    - etcd-r5fdrln4sh
    - etcd-xkdcxc95vg
  phase: Running
  serviceName: etcd-client
  size: 3
  targetVersion: ""

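Inspect the status section: it is unchanged and still reports phase: Running, the Available condition as "True", and all three members as ready.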
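One way to pull the relevant number out of that output is to count the entries under status.members.ready. The snippet below is only a sketch: count_ready_members is a hypothetical helper (not part of etcd-operator or kubectl), and it assumes the YAML layout shown in this report, read from stdin.

```shell
#!/bin/sh
# Hypothetical helper, not part of etcd-operator or kubectl.
# Reads EtcdCluster YAML on stdin and prints how many members are
# listed under status.members.ready.
count_ready_members() {
  awk '
    /^ *ready:/         { in_ready = 1; next }  # the ready list starts here
    in_ready && /^ *- / { n++; next }           # one "- name" entry per member
    in_ready            { in_ready = 0 }        # any other line ends the list
    END                 { print n + 0 }         # prints 0 if nothing was found
  '
}
```

Piping the resource into it, for example `kubectl get etcdclusters/etcd -o yaml | count_ready_members`, would print 3 for the status in this report, even though two of the three pods have been deleted.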

I believe the status should indicate that the cluster is in a bad state.
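For context, "lost quorum" comes down to majority arithmetic: an etcd cluster of N members needs floor(N/2) + 1 of them alive. The sketch below only illustrates that arithmetic with the numbers from this report (size 3, two pods deleted, one left). quorum_size and has_quorum are hypothetical names chosen for illustration, not functions from etcd-operator.

```shell
#!/bin/sh
# Illustrative only: majority quorum for a cluster of N members
# is floor(N/2) + 1. Shell integer division already rounds down.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit status 0) if the live members still form a quorum.
# $1 = configured cluster size, $2 = members currently alive
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Numbers from this report: size 3, one member left after the deletes.
if has_quorum 3 1; then
  echo "quorum ok"
else
  echo "lost quorum"
fi
```

With size 3 the quorum is 2, so a single surviving member is below it and the script prints "lost quorum". This is the state the operator log reports but the etcdclusters/etcd status does not reflect.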

nvtkaszpir commented 5 years ago

This is a duplicate of #1973, but with a much better description ;)