allenporter closed this issue 3 years ago.
I looked closer at the code and see this isn't necessarily a lost TCP connection due to a flaky network, but a failure to parse a message on the wire here: https://github.com/etcd-io/etcd/blob/b7e5f5bc1213c5d6a5ad37d269f1e9ad9086f53a/etcdserver/api/rafthttp/msgappv2_codec.go#L245
io.ReadFull successfully returns data, but the first byte is not a valid message type. When this happened again this morning, I got a different type that was invalid.
Apr 09 02:11:57 ruby etcd[614]: lost the TCP streaming connection with peer a92007feedce6099 (stream MsgApp v2 reader)
Apr 09 02:11:57 ruby etcd[614]: failed to read a92007feedce6099 on stream MsgApp v2 (failed to parse type 210 in msgappv2 stream)
Apr 09 02:11:57 ruby etcd[614]: peer a92007feedce6099 became inactive (message send to peer failed)
Apr 09 02:11:57 ruby etcd[614]: unmarshal should never fail (proto: wrong wireType = 5 for field Dir)
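Roughly, the failing step amounts to the following simplified, standalone sketch (a paraphrase of the linked decoder, with the real payload handling omitted): a single type byte is read with io.ReadFull, and anything outside the few known message types is rejected with the "failed to parse type ..." error seen above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Simplified stand-ins for the msgappv2 message types; the real decoder in
// rafthttp/msgappv2_codec.go only accepts a handful of valid type bytes.
const (
	msgTypeLinkHeartbeat uint8 = 0
	msgTypeAppEntries    uint8 = 1
	msgTypeApp           uint8 = 2
)

// decodeType mimics the first step of the decoder: read exactly one byte and
// make sure it is a known message type. Corruption of that byte (e.g. 210 or
// 97 as in the logs) surfaces as a parse error and the stream is dropped.
func decodeType(r io.Reader) (uint8, error) {
	var buf [1]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return 0, err
	}
	switch buf[0] {
	case msgTypeLinkHeartbeat, msgTypeAppEntries, msgTypeApp:
		return buf[0], nil
	default:
		return 0, fmt.Errorf("failed to parse type %d in msgappv2 stream", buf[0])
	}
}

func main() {
	// A corrupted stream whose first byte is not a valid message type.
	_, err := decodeType(bytes.NewReader([]byte{210}))
	fmt.Println(err) // failed to parse type 210 in msgappv2 stream
}
```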
This happened 2 or 3 more times in the last 12 days after removing the member, wiping the database, and re-adding the member. Each time a different member could fail with a similar result. For example, today's crash looks similar:
Apr 22 17:43:56 gilbert etcd[5930]: lost the TCP streaming connection with peer fdb2ed28f2a14137 (stream MsgApp v2 reader)
Apr 22 17:43:56 gilbert etcd[5930]: failed to read fdb2ed28f2a14137 on stream MsgApp v2 (failed to parse type 97 in msgappv2 stream)
Apr 22 17:43:56 gilbert etcd[5930]: peer fdb2ed28f2a14137 became inactive (message send to peer failed)
Apr 22 17:43:56 gilbert etcd[5930]: unmarshal should never fail (proto: wrong wireType = 6 for field Sorted)
Apr 22 17:43:56 gilbert etcd[5930]: panic: unmarshal should never fail (proto: wrong wireType = 6 for field Sorted)
Apr 22 17:43:56 gilbert etcd[5930]: goroutine 197 [running]:
Apr 22 17:43:56 gilbert etcd[5930]: github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc00010b660, 0x10bdffe, 0x20, 0xc001257058, 0x1, 0x1)
Apr 22 17:43:56 gilbert etcd[5930]: /root/go/pkg/mod/github.com/coreos/pkg@v0.0.0-20160727233714-3ac0863d7acf/capnslog/pkg_logger.go:75 +0x135
Apr 22 17:43:56 gilbert etcd[5930]: go.etcd.io/etcd/pkg/pbutil.MustUnmarshal(0x1277b60, 0xc0001be420, 0xc003466000, 0x2ac, 0x2c0)
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/pkg/pbutil/pbutil.go:42 +0xbb
Apr 22 17:43:56 gilbert etcd[5930]: go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntryNormal(0xc000200600, 0xc0012574d0)
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/etcdserver/server.go:2198 +0xaba
Apr 22 17:43:56 gilbert etcd[5930]: go.etcd.io/etcd/etcdserver.(*EtcdServer).apply(0xc000200600, 0xc007007680, 0x2, 0x2, 0xc0001665a0, 0x42f5bf, 0x10f5c20, 0xc00004c500)
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/etcdserver/server.go:2138 +0x596
Apr 22 17:43:56 gilbert etcd[5930]: go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntries(0xc000200600, 0xc0001665a0, 0xc0099e1000)
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/etcdserver/server.go:1390 +0xd4
Apr 22 17:43:56 gilbert etcd[5930]: go.etcd.io/etcd/etcdserver.(*EtcdServer).applyAll(0xc000200600, 0xc0001665a0, 0xc0099e1000)
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/etcdserver/server.go:1114 +0x88
Apr 22 17:43:56 gilbert etcd[5930]: go.etcd.io/etcd/etcdserver.(*EtcdServer).run.func8(0x12979a0, 0xc00007a640)
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/etcdserver/server.go:1059 +0x3c
Apr 22 17:43:56 gilbert etcd[5930]: go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc0001ab140)
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/pkg/schedule/schedule.go:157 +0xdb
Apr 22 17:43:56 gilbert etcd[5930]: created by go.etcd.io/etcd/pkg/schedule.NewFIFOScheduler
Apr 22 17:43:56 gilbert etcd[5930]: /tmp/etcd-release-3.4.15/etcd/release/etcd/pkg/schedule/schedule.go:70 +0x13d
Apr 22 17:43:56 gilbert systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
I'm considering learning Go and figuring out which debug statements to add to a custom build to aid in debugging, though I'm very open to other suggestions.
If a member crashes, does the problem persist after restarting the member (i.e. does the entry deserialization fail again during server restart), or does it crash and come back healthy?
It's this boolean field: https://github.com/etcd-io/etcd/blob/bd4f8e2b6c6a0bdcd52f4593f68d9f2415ab5293/api/etcdserverpb/etcdserver.proto#L24. Wire types in protos should be 0-5 (https://developers.google.com/protocol-buffers/docs/encoding), so '6' is very surprising.
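To make the 0-5 constraint concrete: the wire type is just the low 3 bits of each field tag, so values 6 and 7 simply aren't defined by the encoding. A minimal standalone illustration (the tag bytes here are hypothetical examples, not taken from the actual data):

```go
package main

import "fmt"

// A protobuf field tag encodes (field_number << 3) | wire_type, so the wire
// type occupies the low 3 bits. Only values 0-5 are defined by the encoding
// (varint, 64-bit, length-delimited, start group, end group, 32-bit); 6 and 7
// are unused, so seeing wire type 6 means the bytes themselves were corrupted.
func splitTag(tag uint64) (fieldNumber, wireType uint64) {
	return tag >> 3, tag & 0x7
}

func main() {
	// 0x28 -> field 5, wire type 0 (a legitimate varint/bool tag);
	// 0x2e -> field 5, wire type 6 (never produced by a correct encoder).
	for _, tag := range []uint64{0x28, 0x2e} {
		field, wt := splitTag(tag)
		fmt.Printf("tag %#x -> field %d, wire type %d\n", tag, field, wt)
	}
}
```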
But maybe the real problem starts earlier, during raft replication:
Apr 22 17:43:56 gilbert etcd[5930]: lost the TCP streaming connection with peer fdb2ed28f2a14137 (stream MsgApp v2 reader)
Apr 22 17:43:56 gilbert etcd[5930]: failed to read fdb2ed28f2a14137 on stream MsgApp v2 (failed to parse type 97 in msgappv2 stream)
https://github.com/etcd-io/etcd/blob/bd4f8e2b6c6a0bdcd52f4593f68d9f2415ab5293/server/etcdserver/api/rafthttp/msgappv2_codec.go#L30-L32 shows there are only 3 supported message types, and 97 is not one of them.
I suspect something is intercepting the over-network communication between peers. It might be worth trying to run the setup with mTLS for peer<->peer communication to guarantee integrity.
For debugging the io.Reader, wrapping https://github.com/etcd-io/etcd/blob/bd4f8e2b6c6a0bdcd52f4593f68d9f2415ab5293/server/etcdserver/api/rafthttp/msgappv2_codec.go#L163 with io.TeeReader (https://golang.org/pkg/io/#TeeReader) might be a way to go.
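A rough sketch of that idea (hypothetical helper name and dump path, not a patch against the actual codec): wrap the stream reader in io.TeeReader so every byte the decoder consumes is also written to a dump file that can be inspected after a parse failure.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// wrapWithDump wraps an arbitrary stream reader so that every byte read from
// it is also appended to a dump file. Handing the msgappv2 decoder the wrapped
// reader (instead of the raw connection) would leave a byte-for-byte record of
// what the decoder actually saw when a parse error occurs. The dump file is
// intentionally left open for the lifetime of the stream in this sketch.
func wrapWithDump(r io.Reader, path string) (io.Reader, error) {
	f, err := os.Create(path)
	if err != nil {
		return nil, err
	}
	return io.TeeReader(r, f), nil
}

func main() {
	// Stand-in for the peer connection; in a debug build this would be the
	// reader the msgappv2 decoder is constructed with.
	conn := strings.NewReader("raw bytes from the peer stream")

	r, err := wrapWithDump(conn, "/tmp/msgappv2-stream.dump")
	if err != nil {
		log.Fatal(err)
	}

	buf := make([]byte, 8)
	n, _ := io.ReadFull(r, buf)
	fmt.Printf("decoder read %q; the same bytes are now in the dump file\n", buf[:n])
}
```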
The problem persists if I restart the member.
Good point on end-to-end integrity; I hadn't considered that and was only thinking about a bug in sending/receiving within etcd itself. These hosts are VMs, so that's very possible, and I can explore it.
Maybe by running tcpdump I could also capture the corruption as it happens.
I'm only not using TLS because it's a little more complex to automate the setup with Ansible, and I hadn't considered the value from an integrity perspective.
Thank you for the debugging suggestions; I'll give this a try and report back.
I got a packet capture of this happening and was able to confirm your theory about network corruption. I dug into the response, and in the middle of the AppEntries the sending side sees:
0180 02 76 31 7a 00 8a 01 b3 02 0a 07 6b 75 62 65 6c .v1z.......kubel
0190 65 74 12 06 55 70 64 61 74 65 1a 16 63 6f 6f 72 et..Update..coor
01a0 64 69 6e 61 74 69 6f 6e 2e 6b 38 73 2e 69 6f 2f dination.k8s.io/
01b0 76 31 22 08 08 89 a4 ed 81 06 10 00 32 08 46 69 v1".........2.Fi
01c0 65 6c 64 73 56 31 3a f3 01 0a f0 01 7b 22 66 3a eldsV1:.....{"f:
01d0 6d 65 74 61 64 61 74 61 22 3a 7b 22 66 3a 6f 77 metadata":{"f:ow
01e0 6e 65 72 52 65 66 65 72 65 6e 63 65 73 22 3a 7b nerReferences":{
01f0 22 2e 22 3a 7b 7d 2c 22 6b 3a 7b 5c 22 75 69 64 ".":{},"k:{\"uid
0200 5c 22 3a 5c 22 31 33 37 66 32 62 31 33 2d 32 63 \":\"137f2b13-2c
0210 36 31 2d 34 34 65 39 2d 62 33 38 37 2d 64 33 62 61-44e9-b387-d3b
0220 61 64 32 31 32 64 65 65 30 5c 22 7d 22 3a 7b 22 ad212dee0\"}":{"
0230 2e 22 3a 7b 7d 2c 22 66 3a 61 70 69 56 65 72 73 .":{},"f:apiVers
0240 69 6f 6e 22 3a 7b 7d 2c 22 66 3a 6b 69 6e 64 22 ion":{},"f:kind"
0250 3a 7b 7d 2c 22 66 3a 6e 61 6d 65 22 3a 7b 7d 2c :{},"f:name":{},
0260 22 66 3a 75 69 64 22 3a 7b 7d 7d 7d 7d 2c 22 66 "f:uid":{}}}},"f
0270 3a 73 70 65 63 22 3a 7b 22 66 3a 68 6f 6c 64 65 :spec":{"f:holde
0280 72 49 64 65 6e 74 69 74 79 22 3a 7b 7d 2c 22 66 rIdentity":{},"f
0290 3a 6c 65 61 73 65 44 75 72 61 74 69 6f 6e 53 65 :leaseDurationSe
02a0 63 6f 6e 64 73 22 3a 7b 7d 2c 22 66 3a 72 65 6e conds":{},"f:ren
02b0 65 77 54 69 6d 65 22 3a 7b 7d 7d 7d 12 16 0a 05 ewTime":{}}}....
02c0 6f 72 62 69 74 10 28 22 0b 08 ba ec b1 84 06 10 orbit.("........
02d0 9e 8e db 1d 1a 00 22 00 1a 2a 0a 28 0a 26 2f 72 ......"..*.(.&/r
02e0 65 67 69 73 74 72 79 2f 6c 65 61 73 65 73 2f 6b egistry/leases/k
02f0 75 62 65 2d 6e 6f 64 65 2d 6c 65 61 73 65 2f 6f ube-node-lease/o
0300 72 62 69 74 a2 06 0a 08 c1 b1 d0 b0 cd 9f de b0 rbit............
0310 3c 00 00 00 00 01 6a 6b 96 01 00 00 00 00 00 00 <.....jk........
0320 00 01 00 00 00 00 00 00 02 b9 08 00 10 cf 33 18 ..............3.
0330 9b d7 a9 0b 22 ac 05 32 9c 05 0a 2f 10 02 1a 26 ...."..2.../...&
0340 2f 72 65 67 69 73 74 72 79 2f 6c 65 61 73 65 73 /registry/leases
And the receiving side sees:
0180 02 76 31 7a 00 8a 01 b3 02 0a 07 6b 75 62 65 6c .v1z.......kubel
0190 65 74 12 06 55 70 64 61 74 65 1a 16 63 6f 6f 72 et..Update..coor
01a0 64 69 6e 61 74 69 6f 6e 2e 6b 38 73 2e 69 6f 2f dination.k8s.io/
01b0 76 31 22 08 08 89 a4 ed 81 06 10 00 32 08 46 69 v1".........2.Fi
01c0 65 6c 64 73 56 31 3a f3 01 0a f0 01 7b 22 66 3a eldsV1:.....{"f:
01d0 6d 65 74 61 64 61 74 61 22 3a 7b 22 66 3a 6f 77 metadata":{"f:ow
01e0 6e 65 72 52 65 66 65 72 65 6e 63 65 73 22 3a 7b nerReferences":{
01f0 22 2e 22 3a 7b 7d 2c 22 6b 3a 7b 5c 22 75 69 64 ".":{},"k:{\"uid
0200 5c 22 3a 5c 22 31 33 37 66 32 62 31 33 2d 32 63 \":\"137f2b13-2c
0210 36 31 2d 34 34 65 39 2d 62 33 38 37 2d 64 33 62 61-44e9-b387-d3b
0220 61 64 32 31 32 64 65 65 30 5c 22 7d 22 3a 7b 22 ad212dee0\"}":{"
0230 2e 22 3a 7b 7d 2c 22 66 3a 61 70 69 56 65 72 73 .":{},"f:apiVers
0240 65 6c 64 73 56 31 3a f3 01 0a f0 01 7b 22 66 3a eldsV1:.....{"f:
0250 6d 65 74 61 64 61 74 61 22 3a 7b 22 66 3a 6f 77 metadata":{"f:ow
0260 6e 65 72 52 65 66 65 72 65 6e 63 65 73 22 3a 7b nerReferences":{
0270 22 2e 22 3a 7b 7d 2c 22 6b 3a 7b 5c 22 75 69 64 ".":{},"k:{\"uid
0280 5c 22 3a 5c 22 31 33 37 66 32 62 31 33 2d 32 63 \":\"137f2b13-2c
0290 36 31 2d 34 34 65 39 2d 62 33 38 37 2d 64 33 62 61-44e9-b387-d3b
02a0 61 64 32 31 32 64 65 65 30 5c 22 7d 22 3a 7b 22 ad212dee0\"}":{"
02b0 2e 22 3a 7b 7d 2c 22 66 3a 61 70 69 56 65 72 73 .":{},"f:apiVers
02c0 69 6f 6e 22 3a 7b 7d 2c 22 66 3a 6b 69 6e 64 22 ion":{},"f:kind"
02d0 3a 7b 7d 2c 22 66 3a 6e 61 6d 65 22 3a 7b 7d 2c :{},"f:name":{},
02e0 22 66 3a 75 69 64 22 3a 7b 7d 7d 7d 7d 2c 22 66 "f:uid":{}}}},"f
02f0 3a 73 70 65 63 22 3a 7b 22 66 3a 68 6f 6c 64 65 :spec":{"f:holde
0300 72 49 64 65 6e 74 69 74 79 22 3a 7b 7d 2c 22 66 rIdentity":{},"f
0310 3a 6c 65 61 73 65 44 75 72 61 74 69 6f 6e 53 65 :leaseDurationSe
0320 63 6f 6e 64 73 22 3a 7b 7d 2c 22 66 3a 72 65 6e conds":{},"f:ren
0330 65 77 54 69 6d 65 22 3a 7b 7d 7d 7d 12 16 0a 05 ewTime":{}}}....
0340 6f 72 62 69 74 10 28 22 0b 08 ba ec b1 84 06 10 orbit.("........
The chunk starting at 0240 on the receiving side was a repeat of data from earlier in the stream (it matches the bytes starting at 01c0 above).
Closing, since I don't think etcd can be expected to handle this kind of corruption gracefully. Thanks for your assistance.
I have a 3-node etcd setup and see a panic every 2-5 days, which moves between nodes. I typically rebuild the node, and then it happens again after a few days. I can reproduce this in my environment, but I'm not sure I have been able to reproduce it with a fresh setup -- very likely a misconfiguration or problem on my end, and I would love any help or suggestions for further diagnosis.
It seems like, after a read failure (more on that below), a partially written proto causes a crash. Note that the invalid proto field may be different each time this happens.
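The stack traces show the panic coming from pkg/pbutil.MustUnmarshal, which treats any decode failure on an applied entry as fatal. A simplified, standalone sketch of that pattern (stdlib stand-ins only, not the actual etcd code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unmarshaler matches the shape pbutil.MustUnmarshal works with: anything
// exposing Unmarshal([]byte) error (protobuf-generated messages in etcd).
type unmarshaler interface {
	Unmarshal(data []byte) error
}

// mustUnmarshal mirrors the "must" pattern seen in the stack traces: an
// applied raft entry is expected to always decode, so any unmarshal error is
// fatal. If corrupted bytes make it into the replicated log, this is where the
// process panics with "unmarshal should never fail".
func mustUnmarshal(m unmarshaler, data []byte) {
	if err := m.Unmarshal(data); err != nil {
		panic(fmt.Sprintf("unmarshal should never fail (%v)", err))
	}
}

// jsonEntry is a stdlib stand-in for a protobuf message, only so this sketch
// runs without extra dependencies.
type jsonEntry struct{ Fields map[string]interface{} }

func (e *jsonEntry) Unmarshal(data []byte) error {
	return json.Unmarshal(data, &e.Fields)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	// A corrupted payload that cannot be decoded triggers the panic path.
	mustUnmarshal(&jsonEntry{}, []byte("\x00\x01 not a valid entry"))
}
```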
Version information is the same for all nodes:
Here is an example from the host that crashed:
This is what the cluster looks like, with the status after the crash:
Each node is an Ubuntu VM running on a separate physical host. Below are the config files with environment variables and the systemd config.
cfg01
cfg02
cfg03
The log files for each of these hosts are attached. You may also notice that ETCD_ELECTION_TIMEOUT and ETCD_HEARTBEAT_INTERVAL were added as an attempt to keep the connections alive longer, which didn't change anything.
I also notice that the cluster logs many warnings, probably due to I/O delay, which may be a catalyst for this:
Each time this happens, the crash looks very similar: lost the TCP streaming connection with peer, then failed to read XXX on stream, then panic: unmarshal should never fail. The VM disks run on SSD, though they don't have tons of free disk I/O and may be a bit constrained -- there are also warnings about slow reads, which may be why requests time out, though I didn't see an obvious correlation at the time of the crash. My assumption is that my network/hosts are flaky and/or slow, which should be addressed, but also that etcd should be resilient to this type of failure unless it is a misconfiguration. I would love any help on further ways to understand the internals of what is happening here (e.g. I'm happy to run commands against the db to inspect the database, and I'm not an expert at diagnosing disk I/O problems), or if it makes sense I can run a custom build with additional logging, but I haven't tried that yet.
Full logs attached. cfg01.prod.log.txt cfg02.prod.log.txt cfg03.prod.log.txt