coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

Cluster unable to sustain load when restoring a large number of entries #1968

Open sbarbari opened 6 years ago

sbarbari commented 6 years ago

We have been deploying an etcd cluster (etcd-operator 0.9.2 / etcd 3.2.18) in our project opencord/voltha and have failed to keep a reliable cluster running.

Brief description of the problem

While doing some performance tests with our deployed system, we noticed that the cluster exhibited reliability issues once it reached a certain load.

Our test scenario consists of loading our system to 80% of its saturation point. Once the system reaches that state, we delete one component to force a recovery. A new instance is spawned to replace the deleted one and begins retrieving its internal state from the etcd cluster. At this point we start seeing degradation in the etcd cluster. The etcd-operator detects the issue, removes the problematic member, and spawns a new etcd instance, which then tries to restore the data from the previous member. Occasionally this restore phase succeeds and the cluster returns to normal, but most of the time it fails, the instance is terminated, and other instances follow the same path until the operator declares the cluster dead.

Reproducing the problem

I simplified my deployment and was able to reproduce the problem by running only the etcd cluster along with a cluster of test containers. The following sections describe how to deploy my environment.

Environment

Our system is deployed in a virtual environment hosted on a physical server.

Physical server:

Virtual environment:

Configuration and Installation

Kubernetes cluster installation

The kubernetes cluster was deployed using the voltha-k8s-playground project with some modifications to run in a libvirt based environment.

To use the bento box mentioned above, you will need to run the following commands.

Install some plugins

vagrant plugin install vagrant-libvirt
vagrant plugin install vagrant-mutate

Download and convert vagrant box

vagrant box add bento/ubuntu-16.04
vagrant mutate bento/ubuntu-16.04 libvirt

The following Vagrantfile was used

# -*- mode: ruby -*-
# # vi: set ft=ruby :

ssh_pub_key = File.readlines("#{Dir.home}/.ssh/id_rsa.pub").first.strip

Vagrant.configure(2) do |config|

  (1..3).each do |i|
    config.vm.define "k8s#{i}" do |s|
      s.ssh.forward_agent = true
      s.vm.box = "bento/ubuntu-16.04"
      s.vm.hostname = "k8s#{i}"
      s.vm.synced_folder ".", "/vagrant", type: "nfs"

      s.vm.network "private_network", ip: "172.42.42.10#{i}", netmask: "255.255.255.0",
        auto_config: true

      s.vm.provider "virtualbox" do |v|
        v.name = "k8s#{i}"
        v.memory = 3072
        v.cpus = 2
        v.gui = false
      end
      s.vm.provider "libvirt" do |v|
        v.memory = 6144
        v.cpus = 2
        #v.storage :file, :size => '50G', :type => 'raw', :bus => 'virtio'
        v.storage :file, :size => '50G', :type => 'qcow2'
      end

      s.vm.provision "file", source: "#{Dir.home}/.ssh/id_rsa", destination: "/home/vagrant/.ssh/id_rsa"
      s.vm.provision :shell, :inline =>"
        echo '#{ssh_pub_key}' >> /home/vagrant/.ssh/authorized_keys
      ", privileged: false

      s.vm.provision :shell, path: "scripts/bootstrap_ansible.sh"

      if i == 3
        s.vm.provision :shell, :inline =>"
          echo '#{ssh_pub_key}' >> /home/vagrant/.ssh/authorized_keys
        ", privileged: false

        s.vm.provision "ansible_local" do |ansible|
          ansible.playbook = "kubespray/cluster.yml"
          ansible.config_file = "kubespray/ansible.cfg"
          ansible.inventory_path = "kubespray/inventory/voltha/hosts.ini"
          ansible.become = true
          ansible.become_user = "root"
          ansible.verbose = true
          ansible.limit = "all"
        end
      end
    end
  end

  if Vagrant.has_plugin?("vagrant-cachier")
    config.cache.scope = :box
  end

end

GlusterFS installation

Login

vagrant ssh k8s1

Wipe storage and load module

cd /vagrant && ansible kube-node -a "wipefs -af /dev/vda" -b --become-user=root -i kubespray/inventory/voltha/hosts.ini
cd /vagrant && ansible kube-node -a "modprobe dm_thin_pool" -b --become-user=root -i kubespray/inventory/voltha/hosts.ini

Install glusterfs client and dependencies

cd /vagrant && ansible kube-node -a "add-apt-repository ppa:gluster/glusterfs-3.12" -b --become-user=root -i kubespray/inventory/voltha/hosts.ini
cd /vagrant && ansible kube-node -a "apt-get update" -b --become-user=root -i kubespray/inventory/voltha/hosts.ini
cd /vagrant && ansible kube-node -a "apt-get install -y glusterfs-client" -b --become-user=root -i kubespray/inventory/voltha/hosts.ini

Clone gluster-kubernetes framework

cd /vagrant && git clone https://github.com/gluster/gluster-kubernetes.git

Create a topology

cat <<EOF >> /vagrant/gluster-kubernetes/deploy/topology.json
{
  "clusters": [
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": [
                "k8s1"
              ],
              "storage": [
                "172.42.42.101"
              ]
            },
            "zone": 1
          },
          "devices": [
            "/dev/vda"
          ]
        },
        {
          "node": {
            "hostnames": {
              "manage": [
                "k8s2"
              ],
              "storage": [
                "172.42.42.102"
              ]
            },
            "zone": 1
          },
          "devices": [
            "/dev/vda"
          ]
        },
        {
          "node": {
            "hostnames": {
              "manage": [
                "k8s3"
              ],
              "storage": [
                "172.42.42.103"
              ]
            },
            "zone": 1
          },
          "devices": [
            "/dev/vda"
          ]
        }
      ]
    }
  ]
}
EOF
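Before running gk-deploy, the topology file can be sanity-checked. The following is not part of the original procedure, just a small sketch that parses a heketi topology of the shape above and lists each node's manage hostname, storage IP, and devices:

```python
import json

def summarize_topology(raw):
    """Return (hostname, storage_ip, devices) tuples for each node in a heketi topology."""
    topo = json.loads(raw)
    nodes = []
    for cluster in topo["clusters"]:
        for entry in cluster["nodes"]:
            node = entry["node"]
            nodes.append((node["hostnames"]["manage"][0],
                          node["hostnames"]["storage"][0],
                          entry["devices"]))
    return nodes

if __name__ == "__main__":
    # On the VM this would be open("/vagrant/gluster-kubernetes/deploy/topology.json").read()
    sample = ('{"clusters":[{"nodes":[{"node":{"hostnames":{"manage":["k8s1"],'
              '"storage":["172.42.42.101"]},"zone":1},"devices":["/dev/vda"]}]}]}')
    print(summarize_topology(sample))
```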

Install GlusterFS server cluster

cd /vagrant/gluster-kubernetes/deploy && ./gk-deploy -gyv --object-capacity 150Gi

Setup kubernetes manifests

Create the storage class manifest. Update the resturl value with the heketi endpoint returned by the previous command.

cat <<EOF >> /vagrant/glusterfs-storage.yml
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: glusterfs-storage
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://10.233.102.198:8080"
  restuser: "admin"
  restuserkey: "My Secret"
  volumetype: "replicate:2"
EOF

Create cluster role manifests

cat <<EOF >> /vagrant/cluster_role_binding.yml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: etcd-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcd-operator
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
EOF

cat <<EOF >> /vagrant/cluster_role.yml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: etcd-operator
rules:
- apiGroups:
  - etcd.database.coreos.com
  resources:
  - etcdclusters
  verbs:
  - "*"
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - "*"
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  verbs:
  - "*"
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - "*"
EOF

Create operator manifest

cat <<EOF >> /vagrant/etcd_operator.yml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: etcd-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: etcd-operator
      annotations:
        cni: "calico"
    spec:
      containers:
      - name: etcd-operator
        image: quay.io/coreos/etcd-operator:v0.9.2
        command:
        - etcd-operator
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
EOF

Create etcd cluster manifest

cat <<EOF >> /vagrant/etcd_cluster.yml
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: etcd
spec:
  size: 3
  version: "3.2.18"
  pod:
    annotations:
      cni: "calico"
    etcdEnv:
      - name: ETCD_QUOTA_BACKEND_BYTES
        value: "6442450944"
      - name: ETCD_MAX_WALS
        value: "0"
      - name: ETCD_HEARTBEAT_INTERVAL
        value: "250"
      - name: ETCD_ELECTION_TIMEOUT
        value: "5000"
      - name: ETCD_SNAPSHOT_COUNT
        value: "5000"
    persistentVolumeClaimSpec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
      storageClassName: glusterfs-storage
EOF
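As a quick sanity check on the quota tuning above (not from the original report): the load script below writes 1500 values of 1 MiB each, which fits comfortably under the configured ETCD_QUOTA_BACKEND_BYTES, ignoring etcd's own MVCC/index overhead:

```python
# Back-of-the-envelope check that the test payload fits under the backend quota.
KEYS = 1500                     # keys written by the load script
VALUE_BYTES = 1024 * 1024       # each value is 1 MiB of random data
QUOTA_BYTES = 6442450944        # ETCD_QUOTA_BACKEND_BYTES from the manifest

payload = KEYS * VALUE_BYTES
print(payload)                          # 1572864000 bytes, ~1.5 GiB
print(payload / float(QUOTA_BYTES))     # roughly a quarter of the quota
```

So the failures seen later are not a simple quota exhaustion.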

Create etcd tester manifests

cat <<EOF >> /vagrant/etcd_tester_config.yml
apiVersion: v1
data:
  leader_assert.py: |
    import argparse
    import logging
    import os
    import random
    from twisted.internet.defer import Deferred, inlineCallbacks
    from txaioetcd import Client, KeySet, Transaction, CompCreated, CompValue, CompVersion, OpGet, OpSet, Failed

    log = logging.getLogger('etcd-client')
    logging.basicConfig(level=os.environ.get('LOGLEVEL', 'INFO'))

    def asleep(dt):
        """
        Async (event driven) wait for given time period (in seconds)
        :param dt: Delay in seconds
        :return: Deferred to be fired with value None when time expires.
        """
        d = Deferred()
        reactor.callLater(dt, lambda: d.callback(None))
        return d

    class EtcdClient(object):

        def __init__(self, host, port, reactor):
            log.info('__init__')
            self.host = host
            self.port = port
            self.kv_client = None
            self.reactor = reactor
            self.startup_components()

        def startup_components(self):
            log.info('startup_components')
            url = u'http://' + self.host + u':' + self.port
            log.info('Etcd endpoint {}'.format(url))
            self.kv_client = Client(self.reactor, url)
            log.info('getting all keys')

            # range() end is exclusive; cover all 1500 keys written by the load script
            for idx in range(0, 500):
                self.reactor.callLater(10, self._reconcile_data, 'group/1/entry', idx)
            for idx in range(500, 1000):
                self.reactor.callLater(10, self._reconcile_data, 'group/2/entry', idx)
            for idx in range(1000, 1500):
                self.reactor.callLater(10, self._reconcile_data, 'group/3/entry', idx)

            #    try:
            #        key = b''.join(['group1_', str(i)])
            #        #result = yield self.kv_client.get(KeySet(key, prefix=True))
            #        result = yield self.kv_client.get(key)
            #        log.info('Result - key:{} len:{}'.format(key, len(result.kvs)))
            #    except Exception as e:
            #        log.exception('Problem while getting group 1 keys')

            log.info('got all keys')

        @inlineCallbacks
        def _reconcile_data(self, prefix, idx):
            try:
                key = b''.join([prefix, str(idx)])
                log.info('getting key:{}'.format(key))
                result = yield self.kv_client.get(key)
                log.info('got key:{} len:{}'.format(key, len(result.kvs)))
            except Exception as e:
                log.exception('Problem while getting key:{}'.format(key))

        @inlineCallbacks
        def _kill_leadership(self):
            log.info('_kill_leadership')

            status = yield self.kv_client.status()
            log.info('status:')
            log.info(status)

            # Delete old key
            yield self.kv_client.delete(b'service/voltha/leader')
            log.info('Deleted leader key...confirming:')
            result = yield self.kv_client.get(b'service/voltha/leader')
            log.info(result)

            txn = Transaction(
                compare=[
                    CompVersion(b'service/voltha/leader', '==', 0)
                ],
                success=[],
                failure=[]
            )

            try:
                result = yield self.kv_client.submit(txn)
            except Failed as failed:
                log.info('FAILED to delete leader key!')
                for response in failed.responses:
                    print(response)
            else:
                log.info('Leader key DELETED successfully')
                for response in result.responses:
                    log.info(response)

            self.reactor.callLater(5, self._leadership_tracking_loop)

        @inlineCallbacks
        def _leadership_tracking_loop(self):
    #        log.info('_leadership_tracking_loop')
            result = yield self.kv_client.get(b'service/voltha/leader')
            if result.kvs:
                kv = result.kvs[0]
                log.info('_leadership_tracking_loop: Leader = "%s"', kv.value)
            else:
                # Initial startup or leader failure, try to grab
                log.info('_leadership_tracking_loop: Leader not found')
                instance_id = os.environ.get('HOSTNAME', 'voltha-core')

                # Create lease
                lease_duration = 9 * random.randint(1,10)
                log.info('%s trying to assert leadership for %d sec', instance_id, lease_duration)
                lease = yield self.kv_client.lease(lease_duration)

                # Test-and-set
                txn = Transaction(
                    compare=[
                        CompVersion(b'service/voltha/leader', '==', 0)
                    ],
                    success=[
                        OpSet(b'service/voltha/leader', instance_id, lease=lease),
                        OpGet(b'service/voltha/leader')
                    ],
                    failure=[]
                )
                try:
                    result = yield self.kv_client.submit(txn)
                except Failed as failed:
                    log.info('Leader key PRESENT')
                    for response in failed.responses:
                        log.info(response)
                else:
                    log.info('Leader key ABSENT')
                    for response in result.responses:
                        log.info(response)

                # Check if assertion succeeded
                log.info('Confirming assertion')
                result = yield self.kv_client.get(b'service/voltha/leader')
                if result.kvs:
                    kv = result.kvs[0]
                    leader = kv.value
                    log.info('service/voltha/leader = "%s"' % leader)
                    log.info('instance_id = "%s"' % instance_id)
                    if leader == instance_id:
                        log.info("************* I'M THE LEADER! *************")
                    else:
                        log.info("************* FALSE ALARM: I'm NOT the leader *************")

            loop_interval = random.randint(2,11)
            self.reactor.callLater(loop_interval, self._leadership_tracking_loop)

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()

        _help = 'Etcd host name'
        parser.add_argument('-H', '--host', dest='host', action='store', help=_help)

        _help = 'Etcd port'
        parser.add_argument('-P', '--port', dest='port', action='store', help=_help)

        args = parser.parse_args()
        host = args.host
        port = args.port

        from twisted.internet import reactor
        etcd = EtcdClient(host, port, reactor)
        reactor.callLater(0, etcd._kill_leadership)
        reactor.run()

kind: ConfigMap
metadata:
  name: etcd-tester-config
EOF
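The core of leader_assert.py is a compare-and-set transaction guarded by CompVersion(b'service/voltha/leader', '==', 0): the write succeeds only if the leader key has never been written. A minimal model of that semantics (an illustration only, not a drop-in for the txaioetcd calls):

```python
def try_assert_leadership(store, instance_id):
    """Model of the tester's election step. store maps key -> (value, version).
    Returns the id of whoever holds leadership after the attempt."""
    key = 'service/voltha/leader'
    value, version = store.get(key, (None, 0))
    if version == 0:                      # compare phase: key absent / never written
        store[key] = (instance_id, 1)     # success ops: OpSet + OpGet
        return instance_id
    return value                          # failure branch: someone else already leads

store = {}
print(try_assert_leadership(store, 'pod-a'))   # pod-a wins the empty key
print(try_assert_leadership(store, 'pod-b'))   # pod-b loses; pod-a is still leader
```

In the real script the winning OpSet is attached to a lease, so leadership expires and the loop races to reacquire it, which is what generates the steady transaction load against the cluster.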
cat <<EOF >> /vagrant/etcd_tester.yml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: etcd-tester
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: etcd-tester
      annotations:
        cni: "calico"
    spec:
      containers:
        - name: rg
          image: voltha/voltha-tester
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command: [ "/usr/bin/python", "/root/leader_assert.py", "--host", "etcd.default.svc.cluster.local", "--port", "2379" ]
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: etcd-tester-config
              mountPath: /root/leader_assert.py
              subPath: leader_assert.py
      volumes:
        - name: etcd-tester-config
          configMap:
            name: etcd-tester-config
EOF

Deployment

Start the etcd cluster

kubectl apply -f /vagrant/glusterfs-storage.yml
kubectl apply -f /vagrant/cluster_role.yml
kubectl apply -f /vagrant/cluster_role_binding.yml
kubectl apply -f /vagrant/etcd_operator.yml
kubectl apply -f /vagrant/etcd_cluster.yml

Start the etcd tester cluster. The etcd-tester deployment is an etcd client that simulates the leadership functionality of our system.

kubectl apply -f /vagrant/etcd_tester_config.yml
kubectl apply -f /vagrant/etcd_tester.yml

Execute the following script to produce 1500 keys in the etcd cluster. The script is executed from within an etcd instance.

Enter the etcd instance

kubectl exec etcd-<instance id> -ti sh

Copy the following script in the etcd instance session to execute it

counter=0
while true; do
  if [ "$counter" -eq 1500 ]; then
    break
  elif [ "$counter" -lt 500 ]; then
    key_prefix="group/1/entry"
  elif [ "$counter" -lt 1000 ]; then
    key_prefix="group/2/entry"
  else
    key_prefix="group/3/entry"
  fi
  key=$key_prefix$counter
  # put a 1 MiB random value (etcdctl reads the value from stdin);
  # suppress dd/etcdctl chatter and stop the loop on failure
  dd if=/dev/urandom bs=1024 count=1024 2>/dev/null | ETCDCTL_API=3 etcdctl put "$key" > /dev/null || break
  echo "$key"
  counter=$((counter+1))
done
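The counter-to-prefix mapping used by the loop above, written as a function so it can be checked against the ranges the tester reads back:

```python
def prefix_for(counter):
    """Key prefix for a given counter, mirroring the shell load script."""
    if 0 <= counter < 500:
        return 'group/1/entry'
    if counter < 1000:
        return 'group/2/entry'
    if counter < 1500:
        return 'group/3/entry'
    raise ValueError('counter out of range: %d' % counter)

print(prefix_for(0) + '0')        # group/1/entry0
print(prefix_for(1499) + '1499')  # group/3/entry1499
```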

Trigger Cluster Failure

Once the key generation script has completed, you can issue the following command to try to trigger the cluster failure.

Delete one of the etcd-tester instances

kubectl delete pod etcd-tester-<instance id>

When a new etcd-tester instance is spawned to replace the deleted one, it starts loading data from the etcd cluster. The etcd instances then report a ready status of 0/1 and their logs show connection errors. The operator usually re-spawns a new instance, but it eventually fails as well. The cluster failure may occur on the first try, or you may need to issue the command again.
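While waiting for the failure, each member can be polled via etcd's /health endpoint. A small sketch of the response handling, assuming the etcd 3.2 behavior of reporting the flag as the string "true" (the in-cluster URL in the comment is hypothetical):

```python
import json

def etcd_is_healthy(body):
    """Parse the JSON body returned by etcd's /health endpoint.
    etcd 3.2 reports the flag as the string "true", not a boolean."""
    return json.loads(body).get('health') == 'true'

# Live usage against the cluster service would look roughly like:
#   import urllib2  # python2, as used by the tester image
#   body = urllib2.urlopen('http://etcd-client.default.svc:2379/health', timeout=2).read()
print(etcd_is_healthy('{"health": "true"}'))   # True
print(etcd_is_healthy('{"health": "false"}'))  # False
```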

Logs

Etcd Operator

...

time="2018-06-06T14:21:57Z" level=info msg="start running..." cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:05Z" level=info msg="skip reconciliation: running ([]), pending ([etcd-8vfjdbfvnx])" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:13Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:13Z" level=info msg="running members: etcd-8vfjdbfvnx" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:13Z" level=info msg="cluster membership: etcd-8vfjdbfvnx" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:18Z" level=info msg="added member (etcd-qlrsndshdv)" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:18Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:26Z" level=info msg="skip reconciliation: running ([etcd-8vfjdbfvnx]), pending ([etcd-qlrsndshdv])" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:34Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:34Z" level=info msg="running members: etcd-8vfjdbfvnx,etcd-qlrsndshdv" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:34Z" level=info msg="cluster membership: etcd-8vfjdbfvnx,etcd-qlrsndshdv" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:36Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:36Z" level=error msg="failed to reconcile: fail to add new member (etcd-7wwrg6c947): etcdserver: unhealthy cluster" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:44Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:44Z" level=info msg="running members: etcd-8vfjdbfvnx,etcd-qlrsndshdv" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:44Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-8vfjdbfvnx" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:44Z" level=info msg="added member (etcd-vk2zv2bbsl)" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:44Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:22:52Z" level=info msg="skip reconciliation: running ([etcd-8vfjdbfvnx etcd-qlrsndshdv]), pending ([etcd-vk2zv2bbsl])" cluster-name=etcd pkg=cluster
time="2018-06-06T14:23:00Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:23:00Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster

...

time="2018-06-06T14:49:10Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:18Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:18Z" level=info msg="running members: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:18Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-8vfjdbfvnx,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:18Z" level=info msg="removing one dead member" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:18Z" level=info msg="removing dead member \"etcd-8vfjdbfvnx\"" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:18Z" level=info msg="removed member (etcd-8vfjdbfvnx) with ID (7176662864922386385)" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:18Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:26Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:26Z" level=info msg="running members: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:26Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:31Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:31Z" level=error msg="failed to reconcile: fail to add new member (etcd-6v8h5d8974): context deadline exceeded" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:39Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:39Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:39Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:39Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:39Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:47Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:47Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:47Z" level=info msg="cluster membership: etcd-vk2zv2bbsl,etcd-qlrsndshdv" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:47Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:47Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:55Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:55Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:55Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:55Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:49:55Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:03Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:03Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:03Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:03Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:03Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:11Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:11Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:11Z" level=info msg="cluster membership: etcd-vk2zv2bbsl,etcd-qlrsndshdv" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:11Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:11Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:19Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:19Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:19Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:19Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:19Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:27Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:27Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:27Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:27Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:27Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:35Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:35Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:35Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:35Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:35Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:43Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:43Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:43Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:43Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:43Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:51Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:51Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:51Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:51Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:51Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:59Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:59Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:59Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:59Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:50:59Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:07Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:07Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:07Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:07Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:07Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:15Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:15Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:15Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:15Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:15Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:23Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:23Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:23Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:23Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:23Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:31Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:31Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:31Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:31Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:31Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:39Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:39Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:39Z" level=info msg="cluster membership: etcd-vk2zv2bbsl,etcd-qlrsndshdv" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:39Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:39Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:47Z" level=info msg="Start reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:47Z" level=info msg="running members: etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:47Z" level=info msg="cluster membership: etcd-qlrsndshdv,etcd-vk2zv2bbsl" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:47Z" level=info msg="Finish reconciling" cluster-name=etcd pkg=cluster
time="2018-06-06T14:51:47Z" level=error msg="failed to reconcile: lost quorum" cluster-name=etcd pkg=cluster
time="2018-06-06T14:52:00Z" level=error msg="failed to update members: list members failed: creating etcd client failed: dial tcp 10.233.70.3:2379: connect: connection refused" cluster-name=etcd pkg=cluster
time="2018-06-06T14:52:08Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster
time="2018-06-06T14:52:16Z" level=warning msg="all etcd pods are dead." cluster-name=etcd pkg=cluster

Etcd

2018-06-06 14:42:42.692364 W | etcdserver: apply entries took too long [268.466851ms for 1 entries]
2018-06-06 14:42:42.692391 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:43.560049 W | etcdserver: apply entries took too long [710.87362ms for 1 entries]
2018-06-06 14:42:43.560164 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:44.097188 W | etcdserver: apply entries took too long [366.497657ms for 1 entries]
2018-06-06 14:42:44.099156 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:44.299414 I | wal: segmented wal file /var/etcd/data/member/wal/0000000000000018-0000000000000625.wal is created
2018-06-06 14:42:44.819691 W | etcdserver: apply entries took too long [469.074347ms for 1 entries]
2018-06-06 14:42:44.819719 W | etcdserver: avoid queries with large range/delete range!

...

2018-06-06 14:46:41.886089 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 25.979255ms)
2018-06-06 14:46:41.886220 W | etcdserver: server is likely overloaded
2018-06-06 14:46:41.886357 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 27.885743ms)
2018-06-06 14:46:41.886442 W | etcdserver: server is likely overloaded
2018-06-06 14:46:44.152179 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 29.027882ms)
2018-06-06 14:46:44.152205 W | etcdserver: server is likely overloaded
2018-06-06 14:46:44.152213 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 29.076613ms)
2018-06-06 14:46:44.152217 W | etcdserver: server is likely overloaded
WARNING: 2018/06/06 14:46:58 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:08 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2018-06-06 14:47:11.757937 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 13.330617ms)
2018-06-06 14:47:11.758680 W | etcdserver: server is likely overloaded
2018-06-06 14:47:11.758699 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 14.101156ms)
2018-06-06 14:47:11.758704 W | etcdserver: server is likely overloaded
2018-06-06 14:47:12.991020 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 1.925275ms)
2018-06-06 14:47:12.991235 W | etcdserver: server is likely overloaded
2018-06-06 14:47:12.991364 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 2.256268ms)
2018-06-06 14:47:12.991409 W | etcdserver: server is likely overloaded
WARNING: 2018/06/06 14:47:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:23 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:23 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
2018-06-06 14:47:23.491639 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 506.094197ms)
2018-06-06 14:47:23.491661 W | etcdserver: server is likely overloaded
2018-06-06 14:47:23.491670 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 506.130679ms)

...

2018-06-06 14:48:48.928082 I | raft: 6398a0da206f9fd1 [logterm: 3, index: 1663, vote: 6398a0da206f9fd1] ignored MsgVote from 3338a611114a25c6 [logterm: 3, index: 1661] at term 3: lease is not expired (remaining ticks: 3)
2018-06-06 14:48:48.928123 I | raft: 6398a0da206f9fd1 [term: 3] received a MsgHeartbeat message with higher term from 3338a611114a25c6 [term: 4]
2018-06-06 14:48:48.930402 I | raft: 6398a0da206f9fd1 became follower at term 4
WARNING: 2018/06/06 14:48:48 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2018-06-06 14:48:49.561528 I | raft: raft.node: 6398a0da206f9fd1 changed leader from 6398a0da206f9fd1 to 3338a611114a25c6 at term 4
2018-06-06 14:48:49.561717 I | raft: found conflict at index 1662 [existing term: 3, conflicting term: 4]
2018-06-06 14:48:49.561739 I | raft: replace the unstable entries from index 1662
2018-06-06 14:48:49.561910 N | pkg/osutil: received terminated signal, shutting down...
2018-06-06 14:48:49.564003 W | rafthttp: closed an existing TCP streaming connection with peer 3338a611114a25c6 (stream Message writer)
2018-06-06 14:48:49.564180 I | rafthttp: established a TCP streaming connection with peer 3338a611114a25c6 (stream Message writer)
2018-06-06 14:48:49.564292 W | rafthttp: closed an existing TCP streaming connection with peer 6f80204d3efb278c (stream Message writer)
2018-06-06 14:48:49.565098 I | rafthttp: established a TCP streaming connection with peer 6f80204d3efb278c (stream Message writer)
2018-06-06 14:48:49.570320 W | rafthttp: closed an existing TCP streaming connection with peer 6f80204d3efb278c (stream MsgApp v2 writer)
2018-06-06 14:48:49.570406 I | rafthttp: established a TCP streaming connection with peer 6f80204d3efb278c (stream MsgApp v2 writer)
2018-06-06 14:48:49.570523 W | rafthttp: closed an existing TCP streaming connection with peer 3338a611114a25c6 (stream MsgApp v2 writer)
2018-06-06 14:48:49.570565 I | rafthttp: established a TCP streaming connection with peer 3338a611114a25c6 (stream MsgApp v2 writer)
WARNING: 2018/06/06 14:48:49 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
WARNING: 2018/06/06 14:48:50 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
WARNING: 2018/06/06 14:48:52 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
WARNING: 2018/06/06 14:48:53 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2018-06-06 14:48:55.647774 W | etcdserver: timed out waiting for read index response
WARNING: 2018/06/06 14:48:55 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
WARNING: 2018/06/06 14:49:00 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
2018-06-06 14:49:05.621050 I | etcdserver: skipped leadership transfer for stopping non-leader member
WARNING: 2018/06/06 14:49:05 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:05 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"

...

WARNING: 2018/06/06 14:49:06 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:06 grpc: addrConn.transportMonitor exits due to: context canceled
WARNING: 2018/06/06 14:49:06 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
2018-06-06 14:22:31.458261 I | pkg/flags: recognized and used environment variable ETCD_ELECTION_TIMEOUT=5000
2018-06-06 14:22:31.458372 I | pkg/flags: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=250
2018-06-06 14:22:31.458392 I | pkg/flags: recognized and used environment variable ETCD_MAX_WALS=0
2018-06-06 14:22:31.458444 I | pkg/flags: recognized and used environment variable ETCD_QUOTA_BACKEND_BYTES=6442450944
2018-06-06 14:22:31.458474 I | pkg/flags: recognized and used environment variable ETCD_SNAPSHOT_COUNT=5000
2018-06-06 14:22:31.458614 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_SERVICE_HOST=10.233.26.252
2018-06-06 14:22:31.458627 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_PORT_2379_TCP_PORT=2379
2018-06-06 14:22:31.458635 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_PORT_2379_TCP=tcp://10.233.26.252:2379
2018-06-06 14:22:31.458644 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_PORT=tcp://10.233.26.252:2379
2018-06-06 14:22:31.458648 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_PORT_2379_TCP_PROTO=tcp
2018-06-06 14:22:31.458771 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_SERVICE_PORT_CLIENT=2379
2018-06-06 14:22:31.458786 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_SERVICE_PORT=2379
2018-06-06 14:22:31.458794 W | pkg/flags: unrecognized environment variable ETCD_CLIENT_PORT_2379_TCP_ADDR=10.233.26.252
2018-06-06 14:22:31.458829 I | etcdmain: etcd Version: 3.2.18
2018-06-06 14:22:31.458887 I | etcdmain: Git SHA: eddf599c6
2018-06-06 14:22:31.458898 I | etcdmain: Go Version: go1.8.7
2018-06-06 14:22:31.458961 I | etcdmain: Go OS/Arch: linux/amd64
2018-06-06 14:22:31.458975 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2018-06-06 14:22:31.461681 I | embed: listening for peers on http://0.0.0.0:2380
2018-06-06 14:22:31.461755 I | embed: listening for client requests on 0.0.0.0:2379
2018-06-06 14:22:31.509915 I | pkg/netutil: resolving etcd-8vfjdbfvnx.etcd.voltha.svc:2380 to 10.233.102.198:2380
2018-06-06 14:22:31.511601 I | pkg/netutil: resolving etcd-8vfjdbfvnx.etcd.voltha.svc:2380 to 10.233.102.198:2380
2018-06-06 14:22:31.513389 I | pkg/netutil: resolving etcd-qlrsndshdv.etcd.voltha.svc:2380 to 10.233.68.3:2380
2018-06-06 14:22:31.515269 I | pkg/netutil: resolving etcd-qlrsndshdv.etcd.voltha.svc:2380 to 10.233.68.3:2380
2018-06-06 14:22:31.516040 I | etcdserver: name = etcd-qlrsndshdv
2018-06-06 14:22:31.516089 I | etcdserver: data dir = /var/etcd/data
2018-06-06 14:22:31.516117 I | etcdserver: member dir = /var/etcd/data/member
2018-06-06 14:22:31.516141 I | etcdserver: heartbeat = 250ms
2018-06-06 14:22:31.516174 I | etcdserver: election = 5000ms
2018-06-06 14:22:31.516195 I | etcdserver: snapshot count = 5000
2018-06-06 14:22:31.516244 I | etcdserver: advertise client URLs = http://etcd-qlrsndshdv.etcd.voltha.svc:2379
2018-06-06 14:22:31.542118 I | etcdserver: starting member 3338a611114a25c6 in cluster ca10d9cda4aabbc0
2018-06-06 14:22:31.542175 I | raft: 3338a611114a25c6 became follower at term 0
2018-06-06 14:22:31.542189 I | raft: newRaft 3338a611114a25c6 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-06-06 14:22:31.542195 I | raft: 3338a611114a25c6 became follower at term 1
2018-06-06 14:22:31.586788 W | auth: simple token is not cryptographically signed
2018-06-06 14:22:31.603990 I | rafthttp: started HTTP pipelining with peer 6398a0da206f9fd1
2018-06-06 14:22:31.604027 I | rafthttp: starting peer 6398a0da206f9fd1...
2018-06-06 14:22:31.604041 I | rafthttp: started HTTP pipelining with peer 6398a0da206f9fd1
2018-06-06 14:22:31.606347 I | rafthttp: started streaming with peer 6398a0da206f9fd1 (writer)
2018-06-06 14:22:31.606377 I | rafthttp: started streaming with peer 6398a0da206f9fd1 (writer)
2018-06-06 14:22:31.607317 I | rafthttp: started peer 6398a0da206f9fd1
2018-06-06 14:22:31.607343 I | rafthttp: added peer 6398a0da206f9fd1
2018-06-06 14:22:31.607358 I | etcdserver: starting server... [version: 3.2.18, cluster version: to_be_decided]
2018-06-06 14:22:31.607369 I | rafthttp: started streaming with peer 6398a0da206f9fd1 (stream MsgApp v2 reader)
2018-06-06 14:22:31.608130 I | rafthttp: started streaming with peer 6398a0da206f9fd1 (stream Message reader)
2018-06-06 14:22:31.609097 I | rafthttp: peer 6398a0da206f9fd1 became active
2018-06-06 14:22:31.609114 I | rafthttp: established a TCP streaming connection with peer 6398a0da206f9fd1 (stream Message writer)
2018-06-06 14:22:31.609410 I | rafthttp: established a TCP streaming connection with peer 6398a0da206f9fd1 (stream MsgApp v2 writer)
2018-06-06 14:22:31.619210 I | rafthttp: established a TCP streaming connection with peer 6398a0da206f9fd1 (stream MsgApp v2 reader)
2018-06-06 14:22:31.619240 I | rafthttp: established a TCP streaming connection with peer 6398a0da206f9fd1 (stream Message reader)
2018-06-06 14:22:36.443092 I | raft: 3338a611114a25c6 [term: 1] received a MsgVote message with higher term from 6398a0da206f9fd1 [term: 3]
2018-06-06 14:22:36.443351 I | raft: 3338a611114a25c6 became follower at term 3
2018-06-06 14:22:36.443451 I | raft: 3338a611114a25c6 [logterm: 0, index: 0, vote: 0] cast MsgVote for 6398a0da206f9fd1 [logterm: 2, index: 5] at term 3
2018-06-06 14:22:36.453171 I | raft: raft.node: 3338a611114a25c6 elected leader 6398a0da206f9fd1 at term 3
2018-06-06 14:22:36.460030 I | etcdserver/membership: added member 6398a0da206f9fd1 [http://etcd-8vfjdbfvnx.etcd.voltha.svc:2380] to cluster ca10d9cda4aabbc0
2018-06-06 14:22:36.460351 N | etcdserver/membership: set the initial cluster version to 3.2
2018-06-06 14:22:36.460577 I | etcdserver/api: enabled capabilities for version 3.2
2018-06-06 14:22:36.460858 I | etcdserver/membership: added member 3338a611114a25c6 [http://etcd-qlrsndshdv.etcd.voltha.svc:2380] to cluster ca10d9cda4aabbc0
2018-06-06 14:22:36.465676 I | etcdserver: published {Name:etcd-qlrsndshdv ClientURLs:[http://etcd-qlrsndshdv.etcd.voltha.svc:2379]} to cluster ca10d9cda4aabbc0
2018-06-06 14:22:36.465878 I | embed: ready to serve client requests
2018-06-06 14:22:36.466402 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2018-06-06 14:22:36.467522 W | etcdserver: not healthy for reconfigure, rejecting member add {ID:b5087f8b5b58ee92 RaftAttributes:{PeerURLs:[http://etcd-7wwrg6c947.etcd.voltha.svc:2380]} Attributes:{Name: ClientURLs:[]}}
2018-06-06 14:22:44.502644 I | etcdserver/membership: added member 6f80204d3efb278c [http://etcd-vk2zv2bbsl.etcd.voltha.svc:2380] to cluster ca10d9cda4aabbc0
2018-06-06 14:22:44.502687 I | rafthttp: starting peer 6f80204d3efb278c...
2018-06-06 14:22:44.502726 I | rafthttp: started HTTP pipelining with peer 6f80204d3efb278c
2018-06-06 14:22:44.503252 I | rafthttp: started streaming with peer 6f80204d3efb278c (writer)
2018-06-06 14:22:44.503529 I | rafthttp: started streaming with peer 6f80204d3efb278c (writer)
2018-06-06 14:22:44.503999 I | rafthttp: started peer 6f80204d3efb278c
2018-06-06 14:22:44.504038 I | rafthttp: added peer 6f80204d3efb278c
2018-06-06 14:22:44.504786 I | rafthttp: started streaming with peer 6f80204d3efb278c (stream MsgApp v2 reader)
2018-06-06 14:22:44.505312 I | rafthttp: started streaming with peer 6f80204d3efb278c (stream Message reader)
2018-06-06 14:22:49.506575 W | rafthttp: health check for peer 6f80204d3efb278c could not connect: dial tcp: lookup etcd-vk2zv2bbsl.etcd.voltha.svc on 10.233.0.3:53: no such host
2018-06-06 14:22:54.507003 W | rafthttp: health check for peer 6f80204d3efb278c could not connect: dial tcp: lookup etcd-vk2zv2bbsl.etcd.voltha.svc on 10.233.0.3:53: no such host
2018-06-06 14:22:59.507213 W | rafthttp: health check for peer 6f80204d3efb278c could not connect: dial tcp 10.233.70.3:2380: getsockopt: connection refused
2018-06-06 14:23:00.223315 I | rafthttp: peer 6f80204d3efb278c became active
2018-06-06 14:23:00.223857 I | rafthttp: established a TCP streaming connection with peer 6f80204d3efb278c (stream Message writer)
2018-06-06 14:23:00.224294 I | rafthttp: established a TCP streaming connection with peer 6f80204d3efb278c (stream MsgApp v2 writer)
2018-06-06 14:23:00.231870 I | rafthttp: established a TCP streaming connection with peer 6f80204d3efb278c (stream Message reader)
2018-06-06 14:23:00.237835 I | rafthttp: established a TCP streaming connection with peer 6f80204d3efb278c (stream MsgApp v2 reader)
2018-06-06 14:25:15.259015 W | etcdserver: apply entries took too long [164.019066ms for 1 entries]
2018-06-06 14:25:15.259090 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:28:16.612753 W | etcdserver: apply entries took too long [122.239226ms for 1 entries]
2018-06-06 14:28:16.612799 W | etcdserver: avoid queries with large range/delete range!

...

2018-06-06 14:42:44.188171 W | etcdserver: apply entries took too long [259.61503ms for 1 entries]
2018-06-06 14:42:44.188212 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:44.548649 I | wal: segmented wal file /var/etcd/data/member/wal/0000000000000018-0000000000000625.wal is created
2018-06-06 14:42:44.944672 W | etcdserver: apply entries took too long [387.834872ms for 1 entries]
2018-06-06 14:42:44.944735 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:45.596188 W | etcdserver: apply entries took too long [398.203983ms for 1 entries]
2018-06-06 14:42:45.596225 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:46.386990 W | etcdserver: apply entries took too long [618.439161ms for 1 entries]
2018-06-06 14:42:46.387090 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:48.385658 W | etcdserver: apply entries took too long [124.894069ms for 1 entries]
2018-06-06 14:42:48.385730 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:49.773485 W | etcdserver: apply entries took too long [184.624159ms for 1 entries]
2018-06-06 14:42:49.773535 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:50.354581 W | etcdserver: apply entries took too long [119.126028ms for 1 entries]
2018-06-06 14:42:50.354615 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:51.130465 W | etcdserver: apply entries took too long [196.798654ms for 1 entries]
2018-06-06 14:42:51.130508 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:51.723926 W | etcdserver: apply entries took too long [250.724299ms for 1 entries]
2018-06-06 14:42:51.724000 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:42:52.495984 W | etcdserver: apply entries took too long [413.399996ms for 1 entries]
2018-06-06 14:42:52.496060 W | etcdserver: avoid queries with large range/delete range!
WARNING: 2018/06/06 14:47:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:12 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:37 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:37 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:38 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:43 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:48 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:47:58 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:48:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:48:03 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:48:08 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2018-06-06 14:48:10.437708 W | etcdserver: apply entries took too long [1m21.966261143s for 1 entries]
2018-06-06 14:48:10.437754 W | etcdserver: avoid queries with large range/delete range!
2018-06-06 14:48:46.290893 W | rafthttp: lost the TCP streaming connection with peer 6398a0da206f9fd1 (stream Message reader)
2018-06-06 14:48:46.292886 E | rafthttp: failed to read 6398a0da206f9fd1 on stream Message (read tcp 10.233.68.3:43826->10.233.102.198:2380: i/o timeout)
2018-06-06 14:48:46.292973 I | rafthttp: peer 6398a0da206f9fd1 became inactive
2018-06-06 14:48:46.298757 W | rafthttp: lost the TCP streaming connection with peer 6398a0da206f9fd1 (stream MsgApp v2 reader)
WARNING: 2018/06/06 14:48:47 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2018-06-06 14:48:48.426884 I | raft: 3338a611114a25c6 is starting a new election at term 3
2018-06-06 14:48:48.427392 I | raft: 3338a611114a25c6 became candidate at term 4
2018-06-06 14:48:48.431828 I | raft: 3338a611114a25c6 received MsgVoteResp from 3338a611114a25c6 at term 4
2018-06-06 14:48:48.431911 I | raft: 3338a611114a25c6 [logterm: 3, index: 1661] sent MsgVote request to 6398a0da206f9fd1 at term 4
2018-06-06 14:48:48.431945 I | raft: 3338a611114a25c6 [logterm: 3, index: 1661] sent MsgVote request to 6f80204d3efb278c at term 4
2018-06-06 14:48:48.432006 I | raft: raft.node: 3338a611114a25c6 lost leader 6398a0da206f9fd1 at term 4
2018-06-06 14:48:48.473680 I | raft: 3338a611114a25c6 received MsgVoteResp from 6f80204d3efb278c at term 4
2018-06-06 14:48:48.474272 I | raft: 3338a611114a25c6 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections
2018-06-06 14:48:48.474332 I | raft: 3338a611114a25c6 became leader at term 4
2018-06-06 14:48:48.474788 I | raft: raft.node: 3338a611114a25c6 elected leader 3338a611114a25c6 at term 4
2018-06-06 14:48:49.303383 I | rafthttp: peer 6398a0da206f9fd1 became active
2018-06-06 14:48:49.303517 I | rafthttp: established a TCP streaming connection with peer 6398a0da206f9fd1 (stream Message reader)
2018-06-06 14:48:49.304236 I | rafthttp: established a TCP streaming connection with peer 6398a0da206f9fd1 (stream MsgApp v2 reader)
2018-06-06 14:48:53.116213 W | etcdserver: timed out waiting for read index response
2018-06-06 14:48:55.857435 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 57.154996ms)
2018-06-06 14:48:55.858095 W | etcdserver: server is likely overloaded
2018-06-06 14:48:55.858491 W | etcdserver: failed to send out heartbeat on time (exceeded the 250ms timeout for 59.794219ms)
2018-06-06 14:48:55.858796 W | etcdserver: server is likely overloaded
2018-06-06 14:49:04.182721 N | pkg/osutil: received terminated signal, shutting down...
WARNING: 2018/06/06 14:49:04 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
WARNING: 2018/06/06 14:49:05 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
WARNING: 2018/06/06 14:49:06 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
WARNING: 2018/06/06 14:49:07 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:09 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
2018-06-06 14:49:11.741700 W | rafthttp: lost the TCP streaming connection with peer 6398a0da206f9fd1 (stream MsgApp v2 reader)
2018-06-06 14:49:11.741740 E | rafthttp: failed to read 6398a0da206f9fd1 on stream MsgApp v2 (unexpected EOF)
2018-06-06 14:49:11.741761 I | rafthttp: peer 6398a0da206f9fd1 became inactive
2018-06-06 14:49:11.741891 W | rafthttp: lost the TCP streaming connection with peer 6398a0da206f9fd1 (stream Message reader)
2018-06-06 14:49:12.355510 W | rafthttp: lost the TCP streaming connection with peer 6398a0da206f9fd1 (stream Message writer)
WARNING: 2018/06/06 14:49:13 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: getsockopt: connection refused"; Reconnecting to {0.0.0.0:2379 0  <nil>}
2018-06-06 14:49:14.615069 W | etcdserver: failed to reach the peerURL(http://etcd-8vfjdbfvnx.etcd.voltha.svc:2380) of member 6398a0da206f9fd1 (Get http://etcd-8vfjdbfvnx.etcd.voltha.svc:2380/version: dial tcp 10.233.102.198:2380: i/o timeout)
2018-06-06 14:49:14.615110 W | etcdserver: cannot get the version of member 6398a0da206f9fd1 (Get http://etcd-8vfjdbfvnx.etcd.voltha.svc:2380/version: dial tcp 10.233.102.198:2380: i/o timeout)
2018-06-06 14:49:16.805076 W | rafthttp: lost the TCP streaming connection with peer 6398a0da206f9fd1 (stream MsgApp v2 writer)
2018-06-06 14:49:19.356282 I | etcdserver: 3338a611114a25c6 starts leadership transfer from 3338a611114a25c6 to 6f80204d3efb278c
WARNING: 2018/06/06 14:49:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2018-06-06 14:49:19.367598 I | raft: 3338a611114a25c6 [term 4] starts to transfer leadership to 6f80204d3efb278c
2018-06-06 14:49:19.367631 I | raft: 3338a611114a25c6 sends MsgTimeoutNow to 6f80204d3efb278c immediately as 6f80204d3efb278c already has up-to-date log
WARNING: 2018/06/06 14:49:19 grpc: addrConn.transportMonitor exits due to: context canceled
2018-06-06 14:49:19.386449 I | raft: 3338a611114a25c6 [term: 4] received a MsgVote message with higher term from 6f80204d3efb278c [term: 5]
2018-06-06 14:49:19.386492 I | raft: 3338a611114a25c6 became follower at term 5
2018-06-06 14:49:19.386507 I | raft: 3338a611114a25c6 [logterm: 4, index: 1668, vote: 0] cast MsgVote for 6f80204d3efb278c [logterm: 4, index: 1668] at term 5
2018-06-06 14:49:19.386515 I | raft: raft.node: 3338a611114a25c6 lost leader 3338a611114a25c6 at term 5
2018-06-06 14:49:19.510260 I | raft: raft.node: 3338a611114a25c6 elected leader 6f80204d3efb278c at term 5
WARNING: 2018/06/06 14:49:19 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
2018-06-06 14:49:19.789617 I | etcdserver: 3338a611114a25c6 finished leadership transfer from 3338a611114a25c6 to 6f80204d3efb278c (took 433.325885ms)
WARNING: 2018/06/06 14:49:19 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:20 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:20 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:20 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:20 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
2018-06-06 14:49:20.841437 W | etcdserver: failed to reach the peerURL(http://etcd-8vfjdbfvnx.etcd.voltha.svc:2380) of member 6398a0da206f9fd1 (Get http://etcd-8vfjdbfvnx.etcd.voltha.svc:2380/version: dial tcp 10.233.102.198:2380: i/o timeout)
2018-06-06 14:49:20.841468 W | etcdserver: cannot get the version of member 6398a0da206f9fd1 (Get http://etcd-8vfjdbfvnx.etcd.voltha.svc:2380/version: dial tcp 10.233.102.198:2380: i/o timeout)
WARNING: 2018/06/06 14:49:20 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:21 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
WARNING: 2018/06/06 14:49:33 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
sbarbari commented 6 years ago

This issue has not yet been acknowledged. Can anyone provide recommendations?

nvtkaszpir commented 5 years ago

Can you provide more info about the underlying hardware of the system hosting the libvirt machines? The logs explicitly point to disk performance issues: `apply entries took too long`
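To see how severe those slowdowns are, here is a minimal sketch (my own helper, not part of etcd or the operator) that scans log lines like the ones above for the slow-apply warnings and converts the Go-style durations into seconds:

```python
import re

# Matches warnings such as:
#   etcdserver: apply entries took too long [1m21.966261143s for 1 entries]
APPLY_RE = re.compile(
    r"apply entries took too long \[([0-9a-z.]+) for (\d+) entries\]"
)

def parse_duration(text: str) -> float:
    """Convert a Go-style duration (e.g. '1m21.96s' or '268.4ms') to seconds."""
    total = 0.0
    # 'ms' is tried before 'm' so '268.4ms' is not parsed as minutes.
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", text):
        total += float(value) * {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}[unit]
    return total

def slow_applies(log_lines):
    """Yield (duration_seconds, entry_count) for each slow-apply warning."""
    for line in log_lines:
        m = APPLY_RE.search(line)
        if m:
            yield parse_duration(m.group(1)), int(m.group(2))
```

Feeding the logs above through `slow_applies` shows individual applies ranging from ~120 ms up to the 1m21s outlier at 14:48:10, which is far beyond what a healthy etcd backend disk should produce.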

Also, putting k8s, GlusterFS, and etcd together on 2-core/6 GB VMs backed by qcow2 images will generally lead to poor performance. Things to try:

But first of all, don't use Gluster as the backend storage for etcd.
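If faster disks are not an option, one knob set worth experimenting with is the raft timing, sketched here from the env vars already visible in the logs above (`ETCD_HEARTBEAT_INTERVAL=250`, `ETCD_ELECTION_TIMEOUT=5000`). The values below are illustrative assumptions, not a tested recommendation:

```shell
# Slower disks tolerate longer raft timing; the cluster in the logs already
# runs heartbeat=250ms / election=5000ms. Upstream guidance is to keep the
# election timeout roughly 10x the heartbeat interval.
export ETCD_HEARTBEAT_INTERVAL=500   # ms; etcd default is 100
export ETCD_ELECTION_TIMEOUT=5000    # ms; keep at ~10x the heartbeat
```

With the etcd-operator these would be passed through the cluster spec's pod environment rather than exported by hand; the point is only that loosening the timing reduces spurious "failed to send out heartbeat on time" leader churn on overloaded I/O, at the cost of slower failure detection.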