etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.28k stars 9.71k forks source link

etcd2 fails to start after reboot - member node has previously registered with discovery service token #6292

Closed simoninkin closed 8 years ago

simoninkin commented 8 years ago

I'm running into this...problem. I have a cloud config that is being pushed from XenOrchestra (via docker-supplemental-pack for XenServer 7) to the VM. It's sort of fine and etcd2 works fine....until I reboot the machine. My cloud config is parsed again, and when the units are initialized, the etcd2 service attempts to rediscover already existing node, thus it fails..

gyuho commented 8 years ago

Can you provide error logs from etcd?

gyuho commented 8 years ago

Ping @simoninkin ?

gyuho commented 8 years ago

You might want to try using a new token.

simoninkin commented 8 years ago

@gyuho sorry, I was too busy with other stuff

You might want to try using a new token.

Maybe I'm not understanding the logic correctly, but are you suggesting using a new token each time the machine is restarted? I though all of this cloud-config stuff was created to make our lives simpler? Shouldn't the cluster members stay in the cluster upon reboot? Because, say, there are 100+ servers running CoreOS in 5 or more clusters. Then, upon update each member would have to be added and deleted from the cluster by hand. What is the point of cloud-config then? Or are the machines not supposed to be restarted at all?...

Do you still need the logs?

Sorry if I sound a bit pushy, I am just trying to figure out hows and whys.

Thanks, Simon

gyuho commented 8 years ago

@simoninkin You need to generate a new discovery token per bootstrap. Can you share your etcd error logs so that we can make sure what's happening?

simoninkin commented 8 years ago

I'll simulate this behaviour tomorrow morning and add the logs.

simoninkin commented 8 years ago

Ok, finally found time to do this. So basically, I made a mistake of coming to conclusions prematurely. I was trying to reboot the machine before the bootstraping took place and not all members were discovered in the cluster. This time I've started all 3 machines and they discovered each other. Then I rebooted one of them and this is what I got:

-- Logs begin at Fri 2016-09-02 15:07:15 UTC, end at Fri 2016-09-02 15:09:36 UTC. -- Sep 02 15:07:34 cgcoreossimon2 systemd[1]: Starting etcd2... Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.202:2379 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/57e3996361eee042738b74631ef2858a Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.202:2380 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://192.168.1.202:2380,http://192.168.1.202:7001 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: recognized and used environment variable ETCD_NAME=cgcoreossimon2 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: etcd Version: 2.3.2 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: Git SHA: ce63f10 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: Go Version: go1.5.4 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: Go OS/Arch: linux/amd64 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: setting maximum number of CPUs to 1, total number of available CPUs is 1 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: listening for peers on http://192.168.1.202:2380 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: listening for peers on http://192.168.1.202:7001 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: listening for client requests on http://0.0.0.0:2379 Sep 02 15:07:34 cgcoreossimon2 etcd2[832]: listening for client requests on http://0.0.0.0:4001 Sep 02 15:07:36 cgcoreossimon2 etcd2[832]: found peer e72ae65ad982d9d5 in the cluster Sep 02 15:07:36 cgcoreossimon2 etcd2[832]: found self b36d590337db8b09 in the cluster Sep 02 15:07:36 cgcoreossimon2 etcd2[832]: found 2 peer(s), waiting for 1 more Sep 02 15:08:36 cgcoreossimon2 etcd2[832]: error #0: client: etcd member https://discovery.etcd.io returns server error [Gateway Timeout] Sep 02 15:08:36 cgcoreossimon2 etcd2[832]: waiting for other nodes: error connecting to https://discovery.etcd.io, retrying in 2s Sep 02 15:08:38 cgcoreossimon2 etcd2[832]: found peer e72ae65ad982d9d5 in the cluster Sep 02 15:08:38 cgcoreossimon2 etcd2[832]: found self b36d590337db8b09 in the cluster Sep 02 15:08:38 cgcoreossimon2 etcd2[832]: found 2 peer(s), waiting for 1 more Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: found peer 741ddd4ca87f2464 in the cluster Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: found 3 needed peer(s) Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: name = cgcoreossimon2 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: data dir = /var/lib/etcd2 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: member dir = /var/lib/etcd2/member Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: heartbeat = 100ms Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: election = 1000ms Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: snapshot count = 10000 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: discovery URL= https://discovery.etcd.io/57e3996361eee042738b74631ef2858a Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: advertise client URLs = http://192.168.1.202:2379 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: initial advertise peer URLs = http://192.168.1.202:2380 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: initial cluster = cgcoreossimon2=http://192.168.1.202:2380 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: starting member b36d590337db8b09 in cluster 8c4e0441c72a4a70 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: b36d590337db8b09 became follower at term 0 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: newRaft b36d590337db8b09 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0] Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: b36d590337db8b09 became follower at term 1 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: starting server... [version: 2.3.2, cluster version: to_be_decided] Sep 02 15:08:40 cgcoreossimon2 systemd[1]: Started etcd2. Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: added member 741ddd4ca87f2464 [http://192.168.1.203:2380] to cluster 8c4e0441c72a4a70 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: added local member b36d590337db8b09 [http://192.168.1.202:2380] to cluster 8c4e0441c72a4a70 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: added member e72ae65ad982d9d5 [http://192.168.1.201:2380] to cluster 8c4e0441c72a4a70 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: the connection with e72ae65ad982d9d5 became active Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: b36d590337db8b09 [term: 1] received a MsgVote message with higher term from e72ae65ad982d9d5 [term: 2] Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: b36d590337db8b09 became follower at term 2 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: b36d590337db8b09 [logterm: 1, index: 3, vote: 0] voted for e72ae65ad982d9d5 [logterm: 1, index: 3] at term 2 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: raft.node: b36d590337db8b09 elected leader e72ae65ad982d9d5 at term 2 Sep 02 15:08:40 cgcoreossimon2 etcd2[832]: published {Name:cgcoreossimon2 ClientURLs:[http://192.168.1.202:2379]} to cluster 8c4e0441c72a4a70 Sep 02 15:08:41 cgcoreossimon2 etcd2[832]: the connection with 741ddd4ca87f2464 became active Sep 02 15:08:41 cgcoreossimon2 etcd2[832]: set the initial cluster version to 2.3 Sep 02 15:11:48 cgcoreossimon2 etcd2[832]: received terminated signal, shutting down... Sep 02 15:11:48 cgcoreossimon2 etcd2[832]: failed to read 741ddd4ca87f2464 on stream MsgApp v2 (net/http: request canceled) Sep 02 15:11:48 cgcoreossimon2 etcd2[832]: the connection with 741ddd4ca87f2464 became inactive Sep 02 15:11:48 cgcoreossimon2 etcd2[832]: failed to read e72ae65ad982d9d5 on stream Message (net/http: request canceled) Sep 02 15:11:48 cgcoreossimon2 etcd2[832]: the connection with e72ae65ad982d9d5 became inactive Sep 02 15:11:48 cgcoreossimon2 systemd[1]: Stopping etcd2... Sep 02 15:11:48 cgcoreossimon2 systemd[1]: Stopped etcd2. -- Reboot -- Sep 02 15:12:09 cgcoreossimon2 systemd[1]: Starting etcd2... Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://192.168.1.202:2379 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/57e3996361eee042738b74631ef2858a Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.1.202:2380 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://192.168.1.202:2380,http://192.168.1.202:7001 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: recognized and used environment variable ETCD_NAME=cgcoreossimon2 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: etcd Version: 2.3.2 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: Git SHA: ce63f10 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: Go Version: go1.5.4 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: Go OS/Arch: linux/amd64 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: setting maximum number of CPUs to 1, total number of available CPUs is 1 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: the server is already initialized as member before, starting as etcd member... Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: listening for peers on http://192.168.1.202:2380 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: listening for peers on http://192.168.1.202:7001 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: listening for client requests on http://0.0.0.0:2379 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: listening for client requests on http://0.0.0.0:4001 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: discovery token ignored since a cluster has already been initialized. Valid log found at "/var/lib/etcd2/member/wal" Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: name = cgcoreossimon2 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: data dir = /var/lib/etcd2 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: member dir = /var/lib/etcd2/member Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: heartbeat = 100ms Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: election = 1000ms Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: snapshot count = 10000 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: discovery URL= https://discovery.etcd.io/57e3996361eee042738b74631ef2858a Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: advertise client URLs = http://192.168.1.202:2379 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: restarting member b36d590337db8b09 in cluster 8c4e0441c72a4a70 at commit index 528 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: b36d590337db8b09 became follower at term 2 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: newRaft b36d590337db8b09 [peers: [], term: 2, commit: 528, applied: 0, lastindex: 528, lastterm: 2] Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: starting server... [version: 2.3.2, cluster version: to_be_decided] Sep 02 15:12:09 cgcoreossimon2 systemd[1]: Started etcd2. Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: raft.node: b36d590337db8b09 elected leader e72ae65ad982d9d5 at term 2 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: failed to find member e72ae65ad982d9d5 in cluster 8c4e0441c72a4a70 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: added member 741ddd4ca87f2464 [http://192.168.1.203:2380] to cluster 8c4e0441c72a4a70 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: added local member b36d590337db8b09 [http://192.168.1.202:2380] to cluster 8c4e0441c72a4a70 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: added member e72ae65ad982d9d5 [http://192.168.1.201:2380] to cluster 8c4e0441c72a4a70 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: set the initial cluster version to 2.3 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: failed to find member 741ddd4ca87f2464 in cluster 8c4e0441c72a4a70 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: failed to find member e72ae65ad982d9d5 in cluster 8c4e0441c72a4a70 Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: the connection with 741ddd4ca87f2464 became active Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: the connection with e72ae65ad982d9d5 became active Sep 02 15:12:09 cgcoreossimon2 etcd2[712]: published {Name:cgcoreossimon2 ClientURLs:[http://192.168.1.202:2379]} to cluster 8c4e0441c72a4a70

Is this a healthy behaviour?

xiang90 commented 8 years ago

Yes. It is. Glad you figured it out.

simoninkin commented 8 years ago

Thanks for your help! And sorry for taking you time on this.