coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

The operator gets stuck in terminating state #2119

Open pavelnikolov opened 5 years ago

pavelnikolov commented 5 years ago

I have an issue with the operator that I am unable to reproduce consistently, but it keeps happening every now and again. I have a 3-node cluster set up on DigitalOcean's hosted Kubernetes:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Here is my operator definition:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-operator
  namespace: etcd
spec:
  replicas: 1
  selector:
    matchLabels:
      name: etcd-operator
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        name: etcd-operator
    spec:
      containers:
        - name: etcd-operator
          image: quay.io/coreos/etcd-operator:v0.9.4
          command:
          - etcd-operator
          # Uncomment to act for resources in all namespaces. More information in doc/user/clusterwide.md
          #- -cluster-wide
          env:
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          resources:
            limits:
              cpu: 300m
              memory: 200Mi
            requests:
              cpu: 50m
              memory: 50Mi
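
With replicas: 1 and strategy: Recreate there should never be more than one operator pod at a time. As a minimal sketch (namespace and label taken from the manifest above), this is roughly how the actual state can be checked:

$ kubectl -n etcd get deployment etcd-operator
$ kubectl -n etcd get replicasets -l name=etcd-operator
# -o wide also shows which node each operator pod landed on
$ kubectl -n etcd get pods -l name=etcd-operator -o wide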

At some point a second operator pod appears, the first one loses the leader election and gets stuck in the Terminating state, with a final log message like this:

level=fatal msg="leader election lost"

What's really strange to me is that my deployment then shows 2 out of 1 replicas. Any ideas why this might be happening?
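
For completeness, a minimal sketch of the kind of checks that could narrow down why the old pod never finishes terminating (the pod name below is a placeholder, and the force delete is only a workaround, not a fix):

# Look for finalizers or a lingering deletionTimestamp on the stuck pod
$ kubectl -n etcd get pod etcd-operator-xxxxxxxxxx-yyyyy -o yaml | grep -E -A2 'finalizers|deletionTimestamp'

# Check whether the node that ran the stuck pod went NotReady
$ kubectl get nodes
$ kubectl -n etcd describe pod etcd-operator-xxxxxxxxxx-yyyyy

# Last resort: remove the pod without waiting for kubelet confirmation
$ kubectl -n etcd delete pod etcd-operator-xxxxxxxxxx-yyyyy --grace-period=0 --force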