etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
46.79k stars 9.64k forks source link

When the long TCP connection between two nodes in the cluster is not disconnected, but data transmission is no longer possible through the connection, etcd will not perform a quick reconnection. #18052

Open SingleThread opened 1 month ago

SingleThread commented 1 month ago

Bug report criteria

What happened?

I have an etcd cluster with five members. I simulated a scenario where the TCP connection was not disconnected but data transmission could not be performed. I found that the long connection established as a client would still last for a long time before a reconnection request could be initiated. image

My simulation process is as follows:

  1. Restart node A to create a new TCP persistent connection with node B. Use the following command to check the current connection between node A and node B. lsof -i:2380 | grep B IP:etcd-server image
  2. Configure the firewall on node B iptables -A INPUT -p tcp --dport 2380 -s node A IP -j REJECT image
  3. Observe how long the TCP connection established between node A and node B lasts before the connection request is reinitiated. image
  4. Finally, after about seventeen minutes, the original TCP connection was destroyed and a new connection request was reinitiated (the requested random port changed). etcd 2309775 arbiter 15u IPv4 6180799 0t0 TCP Arbitration-DC1-01:35146->52.135.41.222:etcd-server (ESTABLISHED) etcd 2309775 arbiter 16u IPv4 6180800 0t0 TCP Arbitration-DC1-01:35156->52.135.41.222:etcd-server (ESTABLISHED)
  5. About seventeen minutes after the original TCP connection was established, one end of the firewall was configured, and the etcd log showed that the connection was invalid. {"level":"warn","ts":"2024-05-22T16:45:28.045+0800","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"f3f3a4077540c49f","error":"failed to heartbeat f3f3a4077540c49f on stream MsgApp v2 (write tcp 52.135.41.222:2380->52.135.41.221:44720: write: connection timed out)"}

What did you expect to happen?

In a scenario where the TCP connection is not disconnected but data transmission cannot be performed, etcd should be able to quickly detect that the current long connection is unavailable and should reinitiate a request to establish a connection.

How can we reproduce it (as minimally and precisely as possible)?

The problem can be reproduced by configuring the firewall and rejecting port 2380 requests.

Anything else we need to know?

System default TCP connection configuration image

Etcd version (please run commands below)

etcd Version: 3.5.4 Git SHA: 1f6298c3c Go Version: go1.19.6 Go OS/Arch: linux/amd64

Etcd configuration (command line flags or environment variables)

etcd -name xxxx -data-dir=xxxx -cert-file=xxxx/server.crt -key-file=xxxx/server.tkey --peer-client-cert-auth --peer-trusted-ca-file=xxxx/server.crt --peer-cert-file=xxxx/server.crt --peer-key-file=xxxx/server.tkey -advertise-client-urls https://52.135.41.221:2379,http://127.0.0.1:4001 -listen-client-urls https://52.135.41.221:2379,http://127.0.0.1:4001 -initial-advertise-peer-urls http://52.135.41.221:2380 -listen-peer-urls http://52.135.41.221:2380 -initial-cluster-token arbitration-etcd-cluster -initial-cluster arbitration-etcd0=http://52.135.41.221:2380,arbitration-etcd1=http://52.135.41.222:2380,arbitration-etcd2=http://52.135.41.223:2380,arbitration-etcd3=http://52.135.41.224:2380,arbitration-etcd4=http://52.135.41.225:2380 -initial-cluster-state new -heartbeat-interval=1000 -election-timeout=10000 -enable-v2=true --cipher-suites TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 --logger=zap --log-level info

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console $ etcdctl member list -w table # paste output here $ etcdctl --endpoints= endpoint status -w table # paste output here ```

Relevant log output

No response

tjungblu commented 1 month ago

could you kindly upload some log files from those two peers? if you also bring a tcpdump/PCAP along, that would be useful.

just a note to myself, POST also has some interesting implementation: https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/api/rafthttp/pipeline.go#L133 also going through the history, #12870 seems vaguely related

SingleThread commented 1 month ago

could you kindly upload some log files from those two peers? if you also bring a tcpdump/PCAP along, that would be useful.

just a note to myself, POST also has some interesting implementation: https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/api/rafthttp/pipeline.go#L133 also going through the history, #12870 seems vaguely related

I missed some information. The etcd API version we use is v2. I can organize the relevant logs and send them to you, as well as tcpdump/PCAP

SingleThread commented 1 month ago

could you kindly upload some log files from those two peers? if you also bring a tcpdump/PCAP along, that would be useful.

just a note to myself, POST also has some interesting implementation: https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/api/rafthttp/pipeline.go#L133 also going through the history, #12870 seems vaguely related

I need to explain the simulation process again

preconditions: The etcd services of the five members of the cluster are all running normally, and there are no shut down members. 以下信息有些

  1. DC1-02 performs packet capture ./tcpdump_x86 tcp port 2380 and host 52.135.41.221 -w /var/log/etcd.cap

  2. DC1-01 monitors connection changes while : ; do lsof -i:2380 | grep 52.135.41.222:etcd-server>>/var/log/check.log; echo $(date)>>/var/log/check.log; sleep 1; done;

  3. DC1-01 restarts the service su - arbiter -s /bin/bash -c "sh xxx/bin/service.sh restart"

  4. DC1-02 configure firewall iptables -A INPUT -p tcp --dport 2380 -s 52.135.41.221 -j REJECT

  5. Recovery iptables -D INPUT -p tcp --dport 2380 -s 52.135.41.221 -j REJECT

The following log is the log I obtained from the above operation when 52.135.41.221 and 52.135.41.222 are connected normally

2024-05-23T08:56:34.286+0800 I restored the firewall policy on 52.135.41.222, and the log on 52.135.41.221 shows that the TCP connection was restored.

{"level":"info","ts":"2024-05-23T08:56:34.286+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T08:56:35.309+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"}

Thu May 23 09:08:45 AM CST 2024 At this point in time I started configuring the 52.135.41.222 firewall The etcd log of member 52.135.41.221 always shows the connection 52.135.41.222 However, through the command lsof -i:2380 | grep 52.135.41.222:etcd-server, we can see that the TCP connection will not be disconnected or reconnected for a long time.

{"level":"warn","ts":"2024-05-23T09:09:29.748+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"cee2c0be4d046b6","rtt":"233.675µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"}

Check the continuously monitored connection information below. At CST 2024 Thu May 23 09:27:23 AM CST 2024, the original TCP connection was disconnected, and 52.135.41.221 retried to establish a connection with 2.135.41.222.

Thu May 23 09:27:22 AM CST 2024 etcd 261386 arbiter 14u IPv4 12165446 0t0 TCP Arbitration-DC1-01:38948->52.135.41.222:etcd-server (ESTABLISHED) etcd 261386 arbiter 15u IPv4 12165447 0t0 TCP Arbitration-DC1-01:38954->52.135.41.222:etcd-server (ESTABLISHED) Thu May 23 09:27:23 AM CST 2024 etcd 261386 arbiter 14u IPv4 12307212 0t0 TCP Arbitration-DC1-01:53342->52.135.41.222:etcd-server (SYN_SENT) etcd 261386 arbiter 15u IPv4 12307213 0t0 TCP Arbitration-DC1-01:53346->52.135.41.222:etcd-server (SYN_SENT) 52.135.41.222-etcd.log The tcp packet capture data file format is cap format and does not support uploading. image

SingleThread commented 1 month ago

could you kindly upload some log files from those two peers? if you also bring a tcpdump/PCAP along, that would be useful.

just a note to myself, POST also has some interesting implementation: https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/api/rafthttp/pipeline.go#L133 also going through the history, #12870 seems vaguely related

Supplementing the log of 52.135.41.221. The file upload failed. I can only paste part of the log. Most of the printing in the middle is the same.

{"level":"warn","ts":"2024-05-23T09:09:09.745+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"cee2c0be4d046b6","rtt":"233.675µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"}

Until 2024-05-23T09:27:23.437+0800, the log printing changed

status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"cee2c0be4d046b6","rtt":"285.081µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"} {"level":"warn","ts":"2024-05-23T09:27:23.437+0800","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6","error":"read tcp 52.135.41.221:38948->52.135.41.222:2380: read: connection refused"} {"level":"warn","ts":"2024-05-23T09:27:23.437+0800","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"cee2c0be4d046b6","error":"failed to read cee2c0be4d046b6 on stream MsgApp v2 (read tcp 52.135.41.221:38948->52.135.41.222:2380: read: connection refused)"} {"level":"warn","ts":"2024-05-23T09:27:23.437+0800","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6","error":"read tcp 52.135.41.221:38954->52.135.41.222:2380: read: connection refused"}


{"level":"info","ts":"2024-05-23T08:56:34.286+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T08:56:35.309+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:23.394+0800","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/opt/arbitration/arbitration-etcd/foss/etcd/etcd","-name","arbitration-etcd0","-data-dir=/opt/arbitration/arbitration-etcd/arbitration-etcd0","-cert-file=/opt/arbitration/arbitration-etcd/keystore/server.crt","-key-file=/opt/arbitration/arbitration-etcd/keystore/server.tkey","--peer-client-cert-auth","--peer-trusted-ca-file=/opt/arbitration/arbitration-etcd/keystore/server.crt","--peer-cert-file=/opt/arbitration/arbitration-etcd/keystore/server.crt","--peer-key-file=/opt/arbitration/arbitration-etcd/keystore/server.tkey","-advertise-client-urls","https://52.135.41.221:2379,http://127.0.0.1:4001","-listen-client-urls","https://52.135.41.221:2379,http://127.0.0.1:4001","-initial-advertise-peer-urls","http://52.135.41.221:2380","-listen-peer-urls","http://52.135.41.221:2380","-initial-cluster-token","arbitration-etcd-cluster","-initial-cluster","arbitration-etcd0=http://52.135.41.221:2380,arbitration-etcd1=http://52.135.41.222:2380,arbitration-etcd2=http://52.135.41.223:2380,arbitration-etcd3=http://52.135.41.224:2380,arbitration-etcd4=http://52.135.41.225:2380","-initial-cluster-state","new","-heartbeat-interval=1000","-election-timeout=10000","-enable-v2=true","--cipher-suites","TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","--logger=zap","--log-level","info"]} {"level":"info","ts":"2024-05-23T09:08:23.395+0800","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/opt/arbitration/arbitration-etcd/arbitration-etcd0","dir-type":"member"} {"level":"info","ts":"2024-05-23T09:08:23.395+0800","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["http://52.135.41.221:2380"]} {"level":"info","ts":"2024-05-23T09:08:23.395+0800","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /opt/arbitration/arbitration-etcd/keystore/server.crt, key = /opt/arbitration/arbitration-etcd/keystore/server.tkey, client-cert=, client-key=, trusted-ca = /opt/arbitration/arbitration-etcd/keystore/server.crt, client-cert-auth = true, crl-file = ","cipher-suites":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"]} {"level":"warn","ts":"2024-05-23T09:08:23.395+0800","caller":"embed/etcd.go:508","msg":"scheme is HTTP while key and cert files are present; ignoring key and cert files","peer-url":"http://52.135.41.221:2380"} {"level":"warn","ts":"2024-05-23T09:08:23.395+0800","caller":"embed/etcd.go:511","msg":"scheme is HTTP while --peer-client-cert-auth is enabled; ignoring client cert auth for this URL","peer-url":"http://52.135.41.221:2380"} {"level":"info","ts":"2024-05-23T09:08:23.395+0800","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["http://127.0.0.1:4001","https://52.135.41.221:2379"]} {"level":"warn","ts":"2024-05-23T09:08:23.395+0800","caller":"embed/etcd.go:607","msg":"scheme is HTTP while key and cert files are present; ignoring key and cert files","client-url":"http://127.0.0.1:4001"} {"level":"info","ts":"2024-05-23T09:08:23.395+0800","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"1f6298c3c","go-version":"go1.19.6","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"arbitration-etcd0","data-dir":"/opt/arbitration/arbitration-etcd/arbitration-etcd0","wal-dir":"","wal-dir-dedicated":"","member-dir":"/opt/arbitration/arbitration-etcd/arbitration-etcd0/member","force-new-cluster":false,"heartbeat-interval":"1s","election-timeout":"10s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://52.135.41.221:2380"],"listen-peer-urls":["http://52.135.41.221:2380"],"advertise-client-urls":["http://127.0.0.1:4001","https://52.135.41.221:2379"],"listen-client-urls":["http://127.0.0.1:4001","https://52.135.41.221:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"} {"level":"info","ts":"2024-05-23T09:08:23.396+0800","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/opt/arbitration/arbitration-etcd/arbitration-etcd0/member/snap/db","took":"194.408µs"} {"level":"warn","ts":"2024-05-23T09:08:24.572+0800","caller":"etcdserver/server.go:714","msg":"detected custom v2store content. Etcd v3.5 is the last version allowing to access it using API v2. Please remove the content."} {"level":"info","ts":"2024-05-23T09:08:24.572+0800","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":1300023,"snapshot-size":"349 kB"} {"level":"info","ts":"2024-05-23T09:08:24.572+0800","caller":"etcdserver/server.go:521","msg":"recovered v3 backend from snapshot","backend-size-bytes":20480,"backend-size":"20 kB","backend-size-in-use-bytes":12288,"backend-size-in-use":"12 kB"} {"level":"info","ts":"2024-05-23T09:08:24.709+0800","caller":"etcdserver/raft.go:483","msg":"restarting local member","cluster-id":"f1f8153b914f5d3c","local-member-id":"f3f3a4077540c49f","commit-index":1322772} {"level":"info","ts":"2024-05-23T09:08:24.710+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"f3f3a4077540c49f switched to configuration voters=(931730602507388598 1280822766390070114 2311079438966115826 10506311064610083283 17578574122496083103)"} {"level":"info","ts":"2024-05-23T09:08:24.710+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"f3f3a4077540c49f became follower at term 2"} {"level":"info","ts":"2024-05-23T09:08:24.710+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"newRaft f3f3a4077540c49f [peers: [cee2c0be4d046b6,11c665851ae9cb62,20129a814f48cdf2,91cdea3f99d711d3,f3f3a4077540c49f], term: 2, commit: 1322772, applied: 1300023, lastindex: 1322776, lastterm: 2]"} {"level":"info","ts":"2024-05-23T09:08:24.711+0800","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5"} {"level":"info","ts":"2024-05-23T09:08:24.711+0800","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"f1f8153b914f5d3c","local-member-id":"f3f3a4077540c49f","recovered-remote-peer-id":"20129a814f48cdf2","recovered-remote-peer-urls":["http://52.135.41.224:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.711+0800","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"f1f8153b914f5d3c","local-member-id":"f3f3a4077540c49f","recovered-remote-peer-id":"91cdea3f99d711d3","recovered-remote-peer-urls":["http://52.135.41.223:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.711+0800","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"f1f8153b914f5d3c","local-member-id":"f3f3a4077540c49f","recovered-remote-peer-id":"cee2c0be4d046b6","recovered-remote-peer-urls":["http://52.135.41.222:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.711+0800","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"f1f8153b914f5d3c","local-member-id":"f3f3a4077540c49f","recovered-remote-peer-id":"f3f3a4077540c49f","recovered-remote-peer-urls":["http://52.135.41.221:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.711+0800","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"f1f8153b914f5d3c","local-member-id":"f3f3a4077540c49f","recovered-remote-peer-id":"11c665851ae9cb62","recovered-remote-peer-urls":["http://52.135.41.225:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.711+0800","caller":"membership/cluster.go:287","msg":"set cluster version from store","cluster-version":"3.5"} {"level":"warn","ts":"2024-05-23T09:08:24.718+0800","caller":"auth/store.go:1220","msg":"simple token is not cryptographically signed"} {"level":"info","ts":"2024-05-23T09:08:24.727+0800","caller":"mvcc/kvstore.go:415","msg":"kvstore restored","current-rev":1} {"level":"info","ts":"2024-05-23T09:08:24.733+0800","caller":"etcdserver/quota.go:94","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"} {"level":"info","ts":"2024-05-23T09:08:24.738+0800","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.738+0800","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.739+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.741+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.742+0800","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.742+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.742+0800","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6","remote-peer-urls":["http://52.135.41.222:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.742+0800","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.742+0800","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.742+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.743+0800","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.743+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.743+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.743+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.743+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.744+0800","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.744+0800","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62","remote-peer-urls":["http://52.135.41.225:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.744+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.744+0800","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.744+0800","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.744+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.748+0800","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.748+0800","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2","remote-peer-urls":["http://52.135.41.224:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.748+0800","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.748+0800","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.749+0800","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.749+0800","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3","remote-peer-urls":["http://52.135.41.223:2380"]} {"level":"info","ts":"2024-05-23T09:08:24.749+0800","caller":"etcdserver/server.go:842","msg":"starting etcd server","local-member-id":"f3f3a4077540c49f","local-server-version":"3.5.4","cluster-id":"f1f8153b914f5d3c","cluster-version":"3.5"} {"level":"info","ts":"2024-05-23T09:08:24.750+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.750+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.750+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.750+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.750+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.750+0800","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.751+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.751+0800","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.751+0800","caller":"etcdserver/server.go:752","msg":"starting initial election tick advance","election-ticks":10} {"level":"info","ts":"2024-05-23T09:08:24.752+0800","caller":"embed/etcd.go:688","msg":"starting with client TLS","tls-info":"cert = /opt/arbitration/arbitration-etcd/keystore/server.crt, key = /opt/arbitration/arbitration-etcd/keystore/server.tkey, client-cert=, client-key=, trusted-ca = , client-cert-auth = false, crl-file = ","cipher-suites":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"]} {"level":"warn","ts":"2024-05-23T09:08:24.752+0800","caller":"embed/etcd.go:701","msg":"Flag enable-v2 is deprecated and will get removed in etcd 3.6."} {"level":"info","ts":"2024-05-23T09:08:24.752+0800","caller":"embed/etcd.go:277","msg":"now serving peer/client/metrics","local-member-id":"f3f3a4077540c49f","initial-advertise-peer-urls":["http://52.135.41.221:2380"],"listen-peer-urls":["http://52.135.41.221:2380"],"advertise-client-urls":["http://127.0.0.1:4001","https://52.135.41.221:2379"],"listen-client-urls":["http://127.0.0.1:4001","https://52.135.41.221:2379"],"listen-metrics-urls":[]} {"level":"info","ts":"2024-05-23T09:08:24.752+0800","caller":"embed/etcd.go:581","msg":"serving peer traffic","address":"52.135.41.221:2380"} {"level":"info","ts":"2024-05-23T09:08:24.752+0800","caller":"embed/etcd.go:553","msg":"cmux::serve","address":"52.135.41.221:2380"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"91cdea3f99d711d3","stream-type":"stream Message"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"91cdea3f99d711d3","stream-type":"stream MsgApp v2"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"cee2c0be4d046b6","stream-type":"stream Message"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"cee2c0be4d046b6","stream-type":"stream MsgApp v2"} {"level":"info","ts":"2024-05-23T09:08:24.753+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"cee2c0be4d046b6"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"11c665851ae9cb62","stream-type":"stream MsgApp v2"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"11c665851ae9cb62","stream-type":"stream Message"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"20129a814f48cdf2","stream-type":"stream Message"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:249","msg":"set message encoder","from":"f3f3a4077540c49f","to":"20129a814f48cdf2","stream-type":"stream MsgApp v2"} {"level":"info","ts":"2024-05-23T09:08:24.754+0800","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.755+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.755+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"11c665851ae9cb62"} {"level":"info","ts":"2024-05-23T09:08:24.756+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: f3f3a4077540c49f elected leader 11c665851ae9cb62 at term 2"} {"level":"info","ts":"2024-05-23T09:08:24.770+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.770+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.770+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"91cdea3f99d711d3"} {"level":"info","ts":"2024-05-23T09:08:24.770+0800","caller":"rafthttp/stream.go:412","msg":"established TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"f3f3a4077540c49f","remote-peer-id":"20129a814f48cdf2"} {"level":"info","ts":"2024-05-23T09:08:24.802+0800","caller":"etcdserver/server.go:775","msg":"initialized peer connections; fast-forwarding election ticks","local-member-id":"f3f3a4077540c49f","forward-ticks":8,"forward-duration":"8s","election-ticks":10,"election-timeout":"10s","active-remote-members":4} {"level":"info","ts":"2024-05-23T09:08:24.873+0800","caller":"etcdserver/server.go:2042","msg":"published local member to cluster through raft","local-member-id":"f3f3a4077540c49f","local-member-attributes":"{Name:arbitration-etcd0 ClientURLs:[http://127.0.0.1:4001 https://52.135.41.221:2379]}","request-path":"/0/members/f3f3a4077540c49f/attributes","cluster-id":"f1f8153b914f5d3c","publish-timeout":"25s"} {"level":"info","ts":"2024-05-23T09:08:24.873+0800","caller":"embed/serve.go:98","msg":"ready to serve client requests"} {"level":"info","ts":"2024-05-23T09:08:24.873+0800","caller":"embed/serve.go:98","msg":"ready to serve client requests"} {"level":"info","ts":"2024-05-23T09:08:24.874+0800","caller":"embed/serve.go:188","msg":"serving client traffic securely","address":"52.135.41.221:2379"} {"level":"info","ts":"2024-05-23T09:08:24.875+0800","caller":"embed/serve.go:140","msg":"serving client traffic insecurely; this is strongly discouraged!","address":"127.0.0.1:4001"} {"level":"info","ts":"2024-05-23T09:08:24.875+0800","caller":"etcdmain/main.go:44","msg":"notifying init daemon"} {"level":"info","ts":"2024-05-23T09:08:24.875+0800","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"} {"level":"warn","ts":"2024-05-23T09:08:59.743+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"cee2c0be4d046b6","rtt":"285.081µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"} {"level":"warn","ts":"2024-05-23T09:08:59.744+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"cee2c0be4d046b6","rtt":"233.675µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"} {"level":"warn","ts":"2024-05-23T09:09:04.744+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"cee2c0be4d046b6","rtt":"285.081µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"} {"level":"warn","ts":"2024-05-23T09:09:04.744+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"cee2c0be4d046b6","rtt":"233.675µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"} {"level":"warn","ts":"2024-05-23T09:09:09.745+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"cee2c0be4d046b6","rtt":"233.675µs","error":"dial tcp 52.135.41.222:2380: connect: connection refused"}

tjungblu commented 1 month ago

The tcp packet capture data file format is cap format and does not support uploading.

I think you can tar/zip everything and upload


seems the deactivation routine triggered correctly, didn't it?

"msg":"lost TCP streaming connection with remote peer"

edit: ok now I get it - the TCP connection is not closed when we detect an inactive peer, is that the bug you're referring to?

SingleThread commented 1 month ago

The tcp packet capture data file format is cap format and does not support uploading.

I think you can tar/zip everything and upload

seems the deactivation routine triggered correctly, didn't it?

"msg":"lost TCP streaming connection with remote peer"

edit: ok now I get it - the TCP connection is not closed when we detect an inactive peer, is that the bug you're referring to?

Yes, in some rapid situations, it is possible that the TCP connection is not disconnected, but the data packet cannot be sent (such as an instantaneous traffic switch on the switch, and there is no corresponding session synchronization after the switch). You only need to re-establish the connection, and the cluster members Inactive nodes will not be detected during this time. I would like to know how the etcd keepalived mechanism works in this case, and whether there is any configuration that can optimize this scenario.

Since there may be security monitoring on the local network and file uploading may trigger monitoring, I can only provide relevant information through stickers.

SingleThread commented 1 month ago

The tcp packet capture data file format is cap format and does not support uploading.

I think you can tar/zip everything and upload

seems the deactivation routine triggered correctly, didn't it?

"msg":"lost TCP streaming connection with remote peer"

edit: ok now I get it - the TCP connection is not closed when we detect an inactive peer, is that the bug you're referring to?

Through tcpdump packet capture, we found that there are many TCP Retransmission messages among members.

370 29.149013 52.135.41.221 52.135.41.222 TCP 78 [TCP Dup ACK 366#1] 38970 → 2380 [ACK] Seq=335 Ack=348 Win=64128 Len=0 TSval=2927475326 TSecr=4135473181 SLE=347 SRE=348 370 29.149013 52.135.41.221 52.135.41.222 TCP 78 [TCP Dup ACK 366#1] 38970 → 2380 [ACK] Seq=335 Ack=348 Win=64128 Len=0 TSval=2927475326 TSecr=4135473181 SLE=347 SRE=348 371 29.275624 52.135.41.221 52.135.41.222 TCP 72 2380 → 39042 [PSH, ACK] Seq=207 Ack=326 Win=64896 Len=6 TSval=2927475453 TSecr=4135471641 [TCP segment of a reassembled PDU] 372 29.275656 52.135.41.222 52.135.41.221 TCP 66 39042 → 2380 [ACK] Seq=326 Ack=213 Win=64128 Len=0 TSval=4135473307 TSecr=2927475453 373 29.277870 52.135.41.221 52.135.41.222 TCP 110 2380 → 39038 [PSH, ACK] Seq=587 Ack=327 Win=64896 Len=44 TSval=2927475455 TSecr=4135471642 [TCP segment of a reassembled PDU] 374 29.277896 52.135.41.222 52.135.41.221 TCP 66 39038 → 2380 [ACK] Seq=327 Ack=631 Win=64128 Len=0 TSval=4135473310 TSecr=2927475455 375 29.356709 52.135.41.222 52.135.41.221 TCP 66 [TCP Retransmission] 2380 → 38970 [FIN, ACK] Seq=347 Ack=223 Win=65152 Len=0 TSval=4135473389 TSecr=2927465123 376 29.357021 52.135.41.221 52.135.41.222 TCP 78 [TCP Dup ACK 366#2] 38970 → 2380 [ACK] Seq=335 Ack=348 Win=64128 Len=0 TSval=2927475534 TSecr=4135473389 SLE=347 SRE=348 377 29.511081 52.135.41.221 52.135.41.222 TCP 177 [TCP Retransmission] 38970 → 2380 [FIN, PSH, ACK] Seq=223 Ack=348 Win=64128 Len=111 TSval=2927475689 TSecr=4135473389 378 29.764708 52.135.41.222 52.135.41.221 TCP 66 [TCP Retransmission] 2380 → 38970 [FIN, ACK] Seq=347 Ack=223 Win=65152 Len=0 TSval=4135473797 TSecr=2927465123 379 29.765029 52.135.41.221 52.135.41.222 TCP 78 [TCP Dup ACK 366#3] 38970 → 2380 [ACK] Seq=335 Ack=348 Win=64128 Len=0 TSval=2927475942 TSecr=4135473797 SLE=347 SRE=348 380 29.959178 52.135.41.221 52.135.41.222 TCP 74 [TCP Retransmission] 36470 → 2380 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=2927476137 TSecr=0 WS=128 381 30.580686 52.135.41.222 52.135.41.221 TCP 66 [TCP Retransmission] 2380 → 38970 [FIN, ACK] Seq=347 Ack=223 Win=65152 Len=0 TSval=4135474613 TSecr=2927465123 382 30.580959 52.135.41.221 52.135.41.222 TCP 78 [TCP Dup ACK 366#4] 38970 → 2380 [ACK] Seq=335 Ack=348 Win=64128 Len=0 TSval=2927476758 TSecr=4135474613 SLE=347 SRE=348

tjungblu commented 1 month ago

Just trying to make it a bit easier to repro locally. Starting two etcd members:

$ bin/etcd --name=one --data-dir=one-datadir --initial-advertise-peer-urls="http://localhost:2700" --listen-peer-urls="http://localhost:2700" --enable-v2 --listen-client-urls="http://localhost:2379" --initial-cluster="one=http://localhost:2700,two=http://localhost:2701" --advertise-client-urls="http://localhost:2379,http://localhost:23790"

$ bin/etcd --name=two --data-dir=two-datadir --initial-advertise-peer-urls="http://localhost:2700,http://localhost:2701" --listen-peer-urls="http://localhost:2701" --enable-v2 --listen-client-urls="http://localhost:23790" --initial-cluster="one=http://localhost:2700,two=http://localhost:2701" --advertise-client-urls="http://localhost:2379,http://localhost:23790"

$ etcdctl member list
9fc778a76c8b088a, started, one, http://localhost:2700, http://localhost:2379,http://localhost:23790, false
e019c701f3740e30, started, two, http://localhost:2701, http://localhost:2379,http://localhost:23790, false

$ sudo iptables -A INPUT -p tcp --dport 2701 -j REJECT

{"level":"warn","ts":"2024-05-23T10:32:56.56916+0200","logger":"raft","caller":"etcdserver/zap_raft.go:85","msg":"9fc778a76c8b088a stepped down to follower since quorum is not active"} {"level":"warn","ts":"2024-05-23T10:33:16.63537+0200","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"e019c701f3740e30","rtt":"398.876µs","error":"dial tcp [::1]:2701: connect: connection refused"}

The prober(s) seem to detect it just fine. Quickly stepping through the code with a debugger doesn't trigger the actual connection closure however.

My expectation would be that the heartbeat fails, so it would eventually error out and close the underlying connection here: https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/api/rafthttp/stream.go#L178-L202

The encoder seems to happily continue to write on the response stream without any problems though.

SingleThread commented 1 month ago

The prober(s) seem to detect it just fine. Quickly stepping through the code with a debugger doesn't trigger the actual connection closure however.

It is true that connection problems can be detected quickly. That is because after you configure the firewall policy, new connection requests continue to be initiated, thus responding to failures. However, long connections have been established between the two cluster members before configuring the firewall policy. It will not be disconnected because of the firewall policy, so I don't fully understand the heartbeat mechanism. After adding the firewall policy, what is the timing for the long connection that has been established to be disconnected? For example, after how many times it has been detected, Or how long it took for no packets to be sent.