aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0
1.95k stars 284 forks

EKS-Anywhere etcd permanently removed from cluster on node reboot #6847

Open czomo opened 10 months ago

czomo commented 10 months ago

What happened: I am using EKS-A on Intel NUCs, a 3-node control plane deployed with the Tinkerbell provider. The cluster works fine for a while after provisioning. After rebooting one of the nodes for maintenance, etcd starts malfunctioning: the member is immediately removed from the etcd cluster and the node becomes NotReady.

What you expected to happen: After a reboot the cluster should return to its previous configuration without any disruption to workloads. The etcd member shouldn't leave the cluster.

How to reproduce it (as minimally and precisely as possible):

  1. Provision EKS-A on Intel NUCs with Tinkerbell, with 3 control-plane nodes.
  2. Restart one of the nodes.
  3. Watch etcd.
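For step 3, one way to watch membership from a healthy node while the reboot happens. This is a sketch, not an EKS-A-documented procedure: the pod name `etcd-eksa-01` assumes kubeadm's `etcd-<node-name>` naming, and the node name and certificate paths are inferred from the log output below.

```shell
# Sketch: list etcd members via the etcd pod on a healthy CP node.
# "etcd-eksa-01" is an assumed kubeadm-style pod name; cert paths are
# the ones that appear in the etcd logs in this issue.
watch_members() {
  kubectl -n kube-system exec etcd-eksa-01 -- etcdctl \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert /etc/kubernetes/pki/etcd/server.crt \
    --key /etc/kubernetes/pki/etcd/server.key \
    member list -w table
}
```

Calling `watch_members` repeatedly during the reboot should show whether the member is actively removed or drops out on its own.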

Anything else we need to know?:

  1. I tried to re-add the faulty member to etcd, but it is removed again after a while.
  2. After the node is rebooted it is no longer a control-plane node: it loses all of its labels and annotations and stays in NotReady state. The etcd configuration of the faulty node no longer points to the rest of the members.

etcd logs:

{"level":"info","ts":"2023-10-17T10:54:19.138406Z","caller":"etcdserver/corrupt.go:95","msg":"starting initial corruption check","local-member-id":"9e7e12990119e47f","timeout":"7s"}
{"level":"info","ts":"2023-10-17T10:54:19.138438Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"9e7e12990119e47f","remote-peer-id":"a91069ae26d20721"}
{"level":"warn","ts":"2023-10-17T10:54:19.14612Z","caller":"etcdserver/corrupt.go:398","msg":"failed hash kv request","local-member-id":"9e7e12990119e47f","requested-revision":1002595,"remote-peer-endpoint":"https://192.168.27.11:2380/","error":"etcdserver: mvcc: required revision has been compacted"}
{"level":"warn","ts":"2023-10-17T10:54:19.152547Z","caller":"etcdserver/corrupt.go:398","msg":"failed hash kv request","local-member-id":"9e7e12990119e47f","requested-revision":1002595,"remote-peer-endpoint":"https://192.168.27.12:2380/","error":"etcdserver: mvcc: required revision has been compacted"}
{"level":"info","ts":"2023-10-17T10:54:19.152567Z","caller":"etcdserver/corrupt.go:165","msg":"initial corruption checking passed; no corruption","local-member-id":"9e7e12990119e47f"}
{"level":"info","ts":"2023-10-17T10:54:19.152586Z","caller":"etcdserver/server.go:845","msg":"starting etcd server","local-member-id":"9e7e12990119e47f","local-server-version":"3.5.8","cluster-id":"5545aa7cd99e03c","cluster-version":"3.5"}
{"level":"warn","ts":"2023-10-17T10:54:19.152612Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2023-10-17T10:54:19.152644Z","caller":"etcdserver/server.go:1128","msg":"data-dir used by this member must be removed"}
{"level":"info","ts":"2023-10-17T10:54:19.152641Z","caller":"etcdserver/server.go:754","msg":"starting initial election tick advance","election-ticks":10}
{"level":"info","ts":"2023-10-17T10:54:19.152647Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap.db","max":5,"interval":"30s"}
{"level":"warn","ts":"2023-10-17T10:54:19.15268Z","caller":"etcdserver/server.go:2083","msg":"failed to publish local member to cluster through raft","local-member-id":"9e7e12990119e47f","local-member-attributes":"{Name:eksa-01 ClientURLs:[https://192.168.27.10:2379/]}","request-path":"/0/members/9e7e12990119e47f/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"info","ts":"2023-10-17T10:54:19.152692Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
{"level":"info","ts":"2023-10-17T10:54:19.152701Z","caller":"fileutil/purge.go:44","msg":"started to purge file","dir":"/var/lib/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}
{"level":"warn","ts":"2023-10-17T10:54:19.152695Z","caller":"etcdserver/server.go:2073","msg":"stopped publish because server is stopped","local-member-id":"9e7e12990119e47f","local-member-attributes":"{Name:eksa-01 ClientURLs:[https://192.168.27.10:2379/]}","publish-timeout":"7s","error":"etcdserver: server stopped"}
{"level":"warn","ts":"2023-10-17T10:54:19.152734Z","caller":"etcdserver/server.go:2745","msg":"server has stopped; skipping GoAttach"}
{"level":"info","ts":"2023-10-17T10:54:19.152754Z","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"72285395cb2ca380"}
{"level":"info","ts":"2023-10-17T10:54:19.152771Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"72285395cb2ca380"}
{"level":"info","ts":"2023-10-17T10:54:19.152791Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"72285395cb2ca380"}
{"level":"info","ts":"2023-10-17T10:54:19.152816Z","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"9e7e12990119e47f","remote-peer-id":"72285395cb2ca380"}
{"level":"info","ts":"2023-10-17T10:54:19.152825Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"9e7e12990119e47f","remote-peer-id":"72285395cb2ca380"}
{"level":"info","ts":"2023-10-17T10:54:19.152842Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"9e7e12990119e47f","remote-peer-id":"72285395cb2ca380"}
{"level":"info","ts":"2023-10-17T10:54:19.152854Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"72285395cb2ca380"}
{"level":"info","ts":"2023-10-17T10:54:19.152858Z","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"a91069ae26d20721"}
{"level":"info","ts":"2023-10-17T10:54:19.152865Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"a91069ae26d20721"}
{"level":"info","ts":"2023-10-17T10:54:19.152882Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"a91069ae26d20721"}
{"level":"info","ts":"2023-10-17T10:54:19.152899Z","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"9e7e12990119e47f","remote-peer-id":"a91069ae26d20721"}
{"level":"info","ts":"2023-10-17T10:54:19.152942Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"9e7e12990119e47f","remote-peer-id":"a91069ae26d20721"}
{"level":"info","ts":"2023-10-17T10:54:19.152971Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"9e7e12990119e47f","remote-peer-id":"a91069ae26d20721"}
{"level":"info","ts":"2023-10-17T10:54:19.152979Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"a91069ae26d20721"}
{"level":"info","ts":"2023-10-17T10:54:19.153752Z","caller":"embed/etcd.go:726","msg":"starting with client TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/server.crt, key = /etc/kubernetes/pki/etcd/server.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2023-10-17T10:54:19.153809Z","caller":"embed/etcd.go:597","msg":"serving peer traffic","address":"192.168.27.10:2380"}
{"level":"info","ts":"2023-10-17T10:54:19.153819Z","caller":"embed/etcd.go:569","msg":"cmux::serve","address":"192.168.27.10:2380"}
{"level":"info","ts":"2023-10-17T10:54:19.153848Z","caller":"embed/etcd.go:278","msg":"now serving peer/client/metrics","local-member-id":"9e7e12990119e47f","initial-advertise-peer-urls":["https://192.168.27.10:2380/"],"listen-peer-urls":["https://192.168.27.10:2380/"],"advertise-client-urls":["https://192.168.27.10:2379/"],"listen-client-urls":["https://127.0.0.1:2379/","https://192.168.27.10:2379/"],"listen-metrics-urls":["http://127.0.0.1:2381/"]}
{"level":"info","ts":"2023-10-17T10:54:19.153863Z","caller":"embed/etcd.go:855","msg":"serving metrics","address":"http://127.0.0.1:2381/"}
{"level":"info","ts":"2023-10-17T10:54:19.158266Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
{"level":"info","ts":"2023-10-17T10:54:19.158281Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
{"level":"warn","ts":"2023-10-18T11:42:46.521552Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"72285395cb2ca380","remote-peer-id":"9382d72df1c17134","error":"context canceled"}
{"level":"warn","ts":"2023-10-18T11:42:46.521565Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"9382d72df1c17134","error":"failed to read 9382d72df1c17134 on stream MsgApp v2 (context canceled)"}
{"level":"info","ts":"2023-10-18T11:42:46.521574Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"72285395cb2ca380","remote-peer-id":"9382d72df1c17134"}
{"level":"warn","ts":"2023-10-18T11:42:46.521603Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"72285395cb2ca380","remote-peer-id":"9382d72df1c17134","error":"context canceled"}
{"level":"info","ts":"2023-10-18T11:42:46.52161Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"72285395cb2ca380","remote-peer-id":"9382d72df1c17134"}
{"level":"info","ts":"2023-10-18T11:42:46.521629Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"9382d72df1c17134"}
{"level":"info","ts":"2023-10-18T11:42:46.521639Z","caller":"rafthttp/transport.go:355","msg":"removed remote peer","local-member-id":"72285395cb2ca380","removed-remote-peer-id":"9382d72df1c17134"}
{"level":"warn","ts":"2023-10-18T11:32:38.862633Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"9382d72df1c17134","rtt":"10.990561ms","error":"dial tcp 192.168.27.10:2380: connect: connection refused"}
{"level":"warn","ts":"2023-10-18T11:32:38.862786Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"9382d72df1c17134","rtt":"1.115333ms","error":"dial tcp 192.168.27.10:2380: connect: connection refused"}
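The decisive lines in the trace above are the pair from etcdserver/server.go:1127-1128: the remaining members have already removed this member ID, so on startup the server exits and reports that its data dir must be removed. A self-contained way to triage a saved log for that condition (the two sample lines are copied from the output above):

```shell
# Two lines copied from the log output above, then the grep that spots
# the fatal condition in any saved etcd log file.
cat > /tmp/etcd-sample.log <<'EOF'
{"level":"warn","ts":"2023-10-17T10:54:19.152612Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2023-10-17T10:54:19.152644Z","caller":"etcdserver/server.go:1128","msg":"data-dir used by this member must be removed"}
EOF
grep -c 'permanently removed' /tmp/etcd-sample.log   # prints 1
```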

Environment:

pokearu commented 10 months ago

Hi @czomo, thanks for using EKS-A.

Tinkerbell (bare metal) with EKS-A uses stacked etcd, not external etcd. So if one of the 3 CP nodes goes down, etcd is left with an even number of healthy members and can run into issues.

Did you notice, when you rebooted the CP node, whether it tried to join back into the cluster? We run cloud-init, and that should join the node back.
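For context, the general etcd quorum arithmetic behind this: a cluster of n members needs floor(n/2)+1 of them to commit writes, so 3 members tolerate 1 down and 2 members tolerate none. A minimal sketch:

```shell
# General etcd quorum arithmetic: quorum(n) = floor(n/2) + 1,
# fault tolerance = n - quorum(n).
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }
echo "3 members: quorum $(quorum 3), tolerates $(tolerance 3) down"   # quorum 2, tolerates 1
echo "2 members: quorum $(quorum 2), tolerates $(tolerance 2) down"   # quorum 2, tolerates 0
```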

czomo commented 10 months ago

Hi @pokearu

Did you notice when you rebooted the CP node

I observed it both when rebooting the CP node and during networking issues.

did it try to join back into the cluster?

Yes, multiple times. However, each time the etcd member is removed from the cluster again after some time.

We run cloud-init and that would join the node back.

The problem is that this seems not to be working as expected. As I mentioned above, the rebooted CP comes up with missing annotations and labels (it's no longer marked as a CP). Any idea what could have happened for cloud-init to be modified? Could you point me to where cloud-init is located?
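While waiting for an answer here, the generic stacked-etcd recovery for a permanently-removed member looks roughly like the sketch below. This is not an EKS-A-specific or documented procedure; the member name, peer URL, and paths are taken from the logs above, and the commands are wrapped in a function so nothing runs on sourcing.

```shell
# Generic stacked-etcd recovery sketch (NOT an EKS-A-documented procedure).
# eksa-01, 192.168.27.10 and the PKI paths come from the logs above.
ETCDCTL="etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key"

recover_removed_member() {
  # 1. On a healthy CP node: confirm the old member ID is really gone.
  $ETCDCTL member list -w table
  # 2. On the broken node: the server itself logged that this data dir
  #    must be removed before the node can rejoin with a fresh identity.
  rm -rf /var/lib/etcd/member
  # 3. On a healthy CP node: register the node as a new member, then
  #    restart etcd (and kubelet) on the broken node so it rejoins.
  $ETCDCTL member add eksa-01 --peer-urls=https://192.168.27.10:2380
}
```

As for cloud-init itself: on most distributions its rendered user data lives under /var/lib/cloud/instance/ and its logs at /var/log/cloud-init.log and /var/log/cloud-init-output.log, which may help confirm whether the rejoin logic actually ran on reboot.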