Closed rkharya closed 8 years ago
@rkharya
looks like UCP is not able to add the re-commissioned node to its internal KV store:
```
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: INFO[0040] Starting UCP Controller replica containers
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: ERRO[0040] Server response: {"message":"etcdserver: peerURL exists"}
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: ERRO[0040] Failed to start KV store. Run "docker logs ucp-kv" for more details
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: FATA[0040] Failed to add member to KV store: {"message":"etcdserver: peerURL exists"}
```
Right now, when we decommission a node, we just stop the UCP containers; based on my understanding of the UCP installer, there is no option to cleanly take a node out of the UCP cluster.
Unless this is a UCP bug, my guess is that our current way of decommissioning is getting UCP into trouble. In any case, we will need to bring this issue up with the Docker folks and ask them for the recommended steps for cleanly removing a node from a UCP cluster.
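In the meantime, one workaround worth trying (an assumption on my part, not an official UCP procedure) is to remove the stale etcd member entry from a healthy controller before re-commissioning the node, so the re-add does not collide with the old peerURL. A minimal dry-run sketch, using the member-list line captured from Docker-4 in this issue as sample input:

```shell
# Sketch only: find the stale etcd member for a node being re-commissioned
# and emit the "etcdctl member remove" command to run on a healthy controller.
# member_list() replays sample output from this issue; in practice you would
# pipe in live `etcdctl member list` output instead.
NODE_NAME="Docker-4-FCH19517CER"   # node being re-commissioned

member_list() {
cat <<'EOF'
aae67246ba4f3b45: name=Docker-4-FCH19517CER peerURLs=http://10.65.122.67:2380,http://10.65.122.67:7001 clientURLs=http://10.65.122.67:2379,http://10.65.122.67:4001 isLeader=true
EOF
}

# The member ID is the token before the first ':'; match on the name= field.
MEMBER_ID=$(member_list | awk -v n="name=$NODE_NAME" '$2 == n { sub(":", "", $1); print $1 }')

if [ -n "$MEMBER_ID" ]; then
    # Dry run: print the command instead of mutating a live cluster.
    echo "etcdctl member remove $MEMBER_ID"
else
    echo "no stale member found for $NODE_NAME"
fi
```

This only prints the removal command; whether removing the member before re-commissioning is safe for UCP's own ucp-kv store is exactly the question we need Docker to answer.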
@mapuri Informed Uday about this.
Closing this, as it is not a cluster issue; it needs to be tracked with the UCP folks.
Docker - 1.11.1-cs2
UCP - 1.1.0
Contiv cluster - v0.1-05-14-2016.00-33-02.UTC
Problem - 3-node UCP master setup. Docker-2/3/4 were commissioned as service-masters, and the UCP dashboard displayed all 3 controllers as healthy. Node Docker-4 was decommissioned from the Contiv cluster and then commissioned again. After successful re-commissioning, the UCP services failed to start on it, and the UCP dashboard reports this node in a failed state.
Steps -
```
[cluster-admin@Docker-2 ~]$ etcdctl member list
c5583e158c122eef: name=Docker-2-FLM19379EU8 peerURLs=http://10.65.122.65:2380,http://10.65.122.65:7001 clientURLs=http://10.65.122.65:2379,http://10.65.122.65:4001 isLeader=false
c69f2991ffeeb3a6: name=Docker-3-FCH19517CF9 peerURLs=http://10.65.122.66:2380,http://10.65.122.66:7001 clientURLs=http://10.65.122.66:2379,http://10.65.122.66:4001 isLeader=true
```
```
[cluster-admin@Docker-2 ~]$ ifconfig -a|grep enp6s0_0
enp6s0_0: flags=195<UP,BROADCAST,RUNNING,NOARP> mtu 1500
```
```
[cluster-admin@Docker-4 ~]$ etcdctl member list
aae67246ba4f3b45: name=Docker-4-FCH19517CER peerURLs=http://10.65.122.67:2380,http://10.65.122.67:7001 clientURLs=http://10.65.122.67:2379,http://10.65.122.67:4001 isLeader=true
```
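Note that the re-commissioned Docker-4 now sees only itself as a single-member cluster, while still advertising the same peerURLs as before. This matches the failure mode: etcd refuses a member-add when the new member's peerURL is already registered, which is what the `{"message":"etcdserver: peerURL exists"}` response indicates. A toy simulation of that validation step (hypothetical data modeled on this issue, not etcd's actual implementation):

```shell
# Simulate etcd's member-add validation: a member-add is rejected when the
# proposed peerURL is already registered ("etcdserver: peerURL exists").
# The URL list is hypothetical, modeled on the addresses in this issue;
# it assumes a stale entry for Docker-4 is still present in the KV store.
existing_peer_urls="http://10.65.122.65:2380
http://10.65.122.66:2380
http://10.65.122.67:2380"

new_peer_url="http://10.65.122.67:2380"   # re-joining Docker-4

if printf '%s\n' "$existing_peer_urls" | grep -qx "$new_peer_url"; then
    result='{"message":"etcdserver: peerURL exists"}'
else
    result="member added"
fi
echo "$result"
```

So the error itself is expected behavior from etcd; the open question is why UCP's KV store still held the old member entry for Docker-4 after decommissioning.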
```
[cluster-admin@Docker-4 ~]$ docker ps
CONTAINER ID   IMAGE                          COMMAND     CREATED        STATUS        PORTS            NAMES
94499aa40c13   quay.io/coreos/etcd:v2.3.1     "/etcd"     18 hours ago   Up 18 hours                    etcd
3af4192f05bb   skynetservices/skydns:latest   "/skydns"   44 hours ago   Up 18 hours   53/tcp, 53/udp   defaultdns
```
```
[cluster-admin@Docker-4 ~]$ sudo systemctl status -l -n 1000 ucp.service
● ucp.service - Ucp
   Loaded: loaded (/etc/systemd/system/ucp.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2016-06-02 18:27:07 IST; 18h ago
  Process: 30039 ExecStop=/usr/bin/ucp.sh stop (code=exited, status=0/SUCCESS)
  Process: 28442 ExecStart=/usr/bin/ucp.sh start (code=exited, status=1/FAILURE)
 Main PID: 28442 (code=exited, status=1/FAILURE)
```
```
Jun 02 18:26:21 Docker-4.cisco.com systemd[1]: Started Ucp.
Jun 02 18:26:21 Docker-4.cisco.com systemd[1]: Starting Ucp...
Jun 02 18:26:21 Docker-4.cisco.com ucp.sh[28442]: starting ucp on Docker-4-FCH19517CER[10.65.122.67]
Jun 02 18:26:22 Docker-4.cisco.com ucp.sh[28442]: INFO[0000] Your engine version 1.11.1-cs2 is compatible
Jun 02 18:26:22 Docker-4.cisco.com ucp.sh[28442]: WARN[0000] Your system uses devicemapper. We can not accurately detect available storage space. Please make sure you have at least 3.00 GB available in /var/lib/docker
Jun 02 18:26:24 Docker-4.cisco.com ucp.sh[28442]: INFO[0002] All required images are present
Jun 02 18:26:25 Docker-4.cisco.com ucp.sh[28442]: INFO[0000] This engine will join UCP and advertise itself with host address 10.65.122.67 - If this is incorrect, please specify an alternative address with the '--host-address' flag
Jun 02 18:26:25 Docker-4.cisco.com ucp.sh[28442]: INFO[0000] Verifying your system is compatible with UCP
Jun 02 18:26:25 Docker-4.cisco.com ucp.sh[28442]: INFO[0000] Checking that required ports are available and accessible
Jun 02 18:27:02 Docker-4.cisco.com ucp.sh[28442]: INFO[0037] Starting local swarm containers
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: INFO[0040] Starting UCP Controller replica containers
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: ERRO[0040] Server response: {"message":"etcdserver: peerURL exists"}
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: ERRO[0040] Failed to start KV store. Run "docker logs ucp-kv" for more details
Jun 02 18:27:05 Docker-4.cisco.com ucp.sh[28442]: FATA[0040] Failed to add member to KV store: {"message":"etcdserver: peerURL exists"}
Jun 02 18:27:06 Docker-4.cisco.com systemd[1]: ucp.service: main process exited, code=exited, status=1/FAILURE
Jun 02 18:27:06 Docker-4.cisco.com ucp.sh[30039]: 986ef385fd95
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: 2d60e11357c4
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: 986ef385fd95
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: 2d60e11357c4
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-auth-api-certs
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-auth-store-certs
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-auth-store-data
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-auth-worker-certs
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-auth-worker-data
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-client-root-ca
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-cluster-root-ca
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-controller-client-certs
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-controller-server-certs
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-kv-certs
Jun 02 18:27:07 Docker-4.cisco.com ucp.sh[30039]: ucp-node-certs
Jun 02 18:27:07 Docker-4.cisco.com systemd[1]: Unit ucp.service entered failed state.
Jun 02 18:27:07 Docker-4.cisco.com systemd[1]: ucp.service failed.
```
```
[cluster-admin@Docker-1 ~]$ clusterctl nodes get | egrep "inventory_name|status"
Ceph-1-FCH1936V1EJ: prev_status: Unallocated
Ceph-1-FCH1936V1EJ: status: Unallocated
Ceph-1-FCH1936V1EJ: inventory_name: Ceph-1-FCH1936V1EJ
Ceph-2-FCH1936V1EX: prev_status: Unallocated
Ceph-2-FCH1936V1EX: status: Unallocated
Ceph-2-FCH1936V1EX: inventory_name: Ceph-2-FCH1936V1EX
Ceph-3-FCH1936V1EZ: prev_status: Unallocated
Ceph-3-FCH1936V1EZ: status: Unallocated
Ceph-3-FCH1936V1EZ: inventory_name: Ceph-3-FCH1936V1EZ
Docker-1-FLM19379EUC: prev_status: Provisioning
Docker-1-FLM19379EUC: status: Allocated
Docker-1-FLM19379EUC: inventory_name: Docker-1-FLM19379EUC
Docker-2-FLM19379EU8: prev_status: Allocated
Docker-2-FLM19379EU8: status: Allocated
Docker-2-FLM19379EU8: inventory_name: Docker-2-FLM19379EU8
Docker-3-FCH19517CF9: prev_status: Allocated
Docker-3-FCH19517CF9: status: Allocated
Docker-3-FCH19517CF9: inventory_name: Docker-3-FCH19517CF9
Docker-4-FCH19517CER: prev_status: Provisioning
Docker-4-FCH19517CER: status: Allocated
Docker-4-FCH19517CER: inventory_name: Docker-4-FCH19517CER
Docker-5-FCH19517CAT: prev_status: Allocated
Docker-5-FCH19517CAT: status: Allocated
Docker-5-FCH19517CAT: inventory_name: Docker-5-FCH19517CAT
Docker-6-FCH1945JJ4F: prev_status: Allocated
Docker-6-FCH1945JJ4F: status: Allocated
Docker-6-FCH1945JJ4F: inventory_name: Docker-6-FCH1945JJ4F
[cluster-admin@Docker-1 ~]$ clusterctl job get last
Description: commissionEvent: nodes:[Docker-4-FCH19517CER] extra-vars:{} host-group:service-master
Status: Complete
Error:
Logs: [DEPRECATION WARNING]: Instead of sudo/sudo_user, use become/become_user and make sure become_method is 'sudo' (default). This feature will be removed in a future release. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
```