Looks like an issue with raft
cc @tbg @pavelkalinnikov
Could be an issue with raft, though I would instead expect it to be an issue with how etcd integrates with raft. Hard to know without further investigation. Let me talk it through with Pavel.
The issue is pretty hard to reproduce, only once or twice per hundred runs. Please let me know what data you need to confirm where the issue is.
I've investigated this a bit. I can see that this is running on raft @ eaa6808e1f7a.
The panic occurs here, e.g. one I'm looking at reports:
panic: index, 7, is out of range [1]
From the stack trace we're seeing that maybeAppend is called with (index=601, logTerm=2, committed=602, ents=<slice with 1 entry>). Normally, the ents slice should contain entries with contiguous Indexes starting from index+1, so in this case we must see ents[0].Index == 602. However, I believe ents[0].Index == 609 here, because findConflict apparently returned the index of the only entry it has scanned.
We're also seeing that there are two panics happening, at around the same time, with almost matching indexes. Presumably the cluster has a leader and 2 followers. The leader sends a weird MsgApp message to both followers and kills them.
The question is how we get a MsgApp message in which Message.Entries[0].Index != Message.Index + 1. Can it happen that the raftLog.entries function that fetches them returns misplaced entries?
I think we can add some ad-hoc logging or assertions (both on the sending and receiving end, to localize where it occurs) to confirm that the Message.Entries[0].Index == Message.Index + 1 invariant is broken.
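For reference, a rough sketch of the kind of ad-hoc assertion this suggests. checkMsgAppInvariant is a hypothetical helper, not the actual code on the branch; it only mirrors the invariant discussed here:

```go
package raftcheck

import (
	"fmt"

	"go.etcd.io/raft/v3/raftpb"
)

// checkMsgAppInvariant is a hypothetical helper mirroring the invariant
// discussed above: in a MsgApp, the first entry (if any) must directly
// follow the index the message claims to append after, and the entries
// themselves must be contiguous.
func checkMsgAppInvariant(m raftpb.Message) error {
	if len(m.Entries) == 0 {
		return nil // empty appends carry no entries to validate
	}
	if got, want := m.Entries[0].Index, m.Index+1; got != want {
		return fmt.Errorf("entries invariant broken: entry[0].Index is %d, want %d", got, want)
	}
	for i := 1; i < len(m.Entries); i++ {
		if m.Entries[i].Index != m.Entries[i-1].Index+1 {
			return fmt.Errorf("non-contiguous entries at position %d", i)
		}
	}
	return nil
}
```

Such a check could be called both where the MsgApp is built and where it is received, so a failure would localize which side breaks the invariant.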
@serathius I've created a branch in my fork on top of the commit you're running against: https://github.com/pavelkalinnikov/raft/commits/msg-app-invariant. If you could run the test with this commit included (or cherry-picked, whatever works for you), this might help localize where the issue occurs.
Awesome! On it.
Got it. There are two stack traces written at once from two separate members; you can distinguish them by name (TestRobustnessClusterOfSize3LowTraffic-test-2, TestRobustnessClusterOfSize3LowTraffic-test-0) or by pid (714482, 714483).
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-1) (714484): {"level":"warn","ts":"2023-03-31T16:42:20.590682+0200","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6108155669721489885,"retry-timeout":"500ms"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-1) (714484): {"level":"info","ts":"2023-03-31T16:42:20.59113+0200","caller":"traceutil/trace.go:171","msg":"trace[530800350] linearizableReadLoop","detail":"{readStateIndex:281; appliedIndex:284; }","duration":"1.002163677s","start":"2023-03-31T16:42:19.588942+0200","end":"2023-03-31T16:42:20.591106+0200","steps":["trace[530800350] 'read index received' (duration: 1.002157415s)","trace[530800350] 'applied index is now lower than readState.Index' (duration: 5.2µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): {"level":"info","ts":"2023-03-31T16:42:20.59132+0200","caller":"traceutil/trace.go:171","msg":"trace[1978863180] linearizableReadLoop","detail":"{readStateIndex:281; appliedIndex:282; }","duration":"1.020670431s","start":"2023-03-31T16:42:19.570632+0200","end":"2023-03-31T16:42:20.591302+0200","steps":["trace[1978863180] 'read index received' (duration: 1.020666434s)","trace[1978863180] 'applied index is now lower than readState.Index' (duration: 2.895µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): {"level":"info","ts":"2023-03-31T16:42:20.591933+0200","caller":"traceutil/trace.go:171","msg":"trace[858852315] linearizableReadLoop","detail":"{readStateIndex:284; appliedIndex:283; }","duration":"1.022533199s","start":"2023-03-31T16:42:19.569387+0200","end":"2023-03-31T16:42:20.59192+0200","steps":["trace[858852315] 'read index received' (duration: 1.022344014s)","trace[858852315] 'applied index is now lower than readState.Index' (duration: 188.474µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): {"level":"panic","ts":"2023-03-31T16:42:20.592092+0200","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90","msg":"entries invariant broken: entry[0].Index is 290, want 285","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): {"level":"panic","ts":"2023-03-31T16:42:20.592137+0200","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90","msg":"entries invariant broken: entry[0].Index is 290, want 285","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): panic: entries invariant broken: entry[0].Index is 290, want 285
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): goroutine 193 [running]:
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): panic: entries invariant broken: entry[0].Index is 290, want 285
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): goroutine 47 [running]:
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x36?, {0x0?, 0x0?, 0xc00017ca00?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x36?, {0x0?, 0x0?, 0xc00033a6e0?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000c1a00, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0004ec1a0, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap.(*SugaredLogger).log(0xc00024c080, 0x4, {0x108b574?, 0x40dc27?}, {0xc0001f86a0?, 0xece6c0?, 0xc000295401?}, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap.(*SugaredLogger).Panicf(...)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap@v1.24.0/sugar.go:189
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap.(*SugaredLogger).log(0xc00011c318, 0x4, {0x108b574?, 0x40dc27?}, {0xc001a94db0?, 0xece6c0?, 0xc0002de501?}, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap.(*SugaredLogger).Panicf(...)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap@v1.24.0/sugar.go:189
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0x10368e0?, {0x108b574?, 0x2?}, {0xc0001f86a0?, 0x0?, 0x7f2caa95c878?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0x0?, {0x108b574?, 0x2?}, {0xc001a94db0?, 0x0?, 0x1d45d40?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc0004d45b0, 0x11c, 0x2, 0x11c, {0xc001ca5c20?, 0x1, 0x1})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc0004e3f10, 0x11c, 0x2, 0x11c, {0xc001a30f30?, 0x1, 0x1})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521 +0x77
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521 +0x77
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.stepFollower(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476 +0x2f8
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.stepFollower(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476 +0x2f8
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raft).Step(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raft).Step(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025 +0x1375
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*node).run(0xc0006942a0)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025 +0x1375
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*node).run(0xc0006ae840)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): created by go.etcd.io/raft/v3.StartNode
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): created by go.etcd.io/raft/v3.StartNode
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
The strange part is that this fires at the latest stage: on the follower, upon receiving MsgApp. I added the assertions to the code that creates the MsgApp message, and they are not triggered. So some corruption might be happening while the message is in transit?
@serathius What does the robustness "blackhole" mode do? Does it fiddle with the message content, drop messages, or something else?
We run a proxy in front of each member's peer port and blackhole the traffic there. https://github.com/etcd-io/etcd/blob/6a995d2d63091418b3fdfcbbeb56d3c705d6dd3d/pkg/proxy/server.go#L103-L115 It should just drop data.
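To illustrate the pattern under discussion, here is a simplified, hypothetical model of a byte-stream blackhole forwarder. The names, buffer size, and structure are illustrative assumptions and do not reproduce the actual pkg/proxy code:

```go
package proxysketch

import (
	"io"
	"sync/atomic"
)

// blackholeForwarder models a forwarding loop that copies bytes between two
// connections in fixed-size chunks and, while the blackhole is enabled,
// silently discards the chunk it just read. Note that the unit of dropping
// is one Read() of the buffer, not a whole peer message, so a message can
// be cut in the middle if the mode is toggled between two reads.
type blackholeForwarder struct {
	blackhole atomic.Bool
}

func (f *blackholeForwarder) forward(dst io.Writer, src io.Reader) error {
	buf := make([]byte, 48*1024) // buffer size is illustrative
	for {
		n, err := src.Read(buf)
		if n > 0 && !f.blackhole.Load() {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
	}
}
```

If the toggle flips between two reads, part of a peer message may already have been forwarded while the rest is discarded, which is consistent with the corruption hypothesis below.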
Looking at the implementation here, I see that the modifiers are applied to a byte stream. Is it possible that, as a result of turning this thing on/off, some message is assembled from non-contiguous parts of the byte stream? Say, could this MsgApp message actually be the result of 2 messages truncated and concatenated in a weird way? Protobuf is pretty flexible w.r.t. duplicate tags etc., so in theory a "valid" message could be created this way.
@serathius Also, is it possible that this stream drops only a part of the message (which represents a prefix of Entries), so that the message gets assembled from partial bytes?
The problem with such an application-level "dropping" is that it doesn't look like a stream corruption from the TCP point of view. The stream is tampered with, but to the TCP receiver this is still a valid stream, so it just feeds a modified stream to the application layer. All sorts of application-level corruption can happen this way.
Real TCP-level packet loss would have been spotted by the receiver, which wouldn't be able to proceed from the middle of the stream when it recovers.
UPD: It might be HTTP that gRPC relies on, not necessarily directly TCP (not sure what the actual stack is). The proxy that you linked tampers with the stream above HTTP, right? And this gets retransmitted over HTTP, and fed into gRPC at the receiver end.
I think you might be right. I have tested peer communication with TLS and I'm unable to reproduce. Need to investigate how the proxy manages to execute a man-in-the-middle attack (I don't provide certs to it).
As for the proxy layer, it looks like an L4 (TCP) proxy and not L7 (HTTP, gRPC). However, one thing I noticed is that packets are buffered, and the code that drops traffic runs on bytes rather than whole packets, which means they could be split in the middle. That would be the incorrect behavior you mentioned that results in an unnatural network failure.
Hmm, the buffer used in the proxy is 48 KiB, not the 64 KiB maximum TCP packet size. Maybe that's the cause of packets being cut. I'm looking for an alternative way to implement the proxy to compare. If time permits, I would like to confirm that this bug is really caused by a packet being cut in half.
cc @aojea
Closing based on the fact that TLS fixes the issue. Replacing pkg/proxy with a proper network failure simulator should be tracked in a separate issue.
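As a rough idea of the design direction, a failure simulator could drop traffic at connection granularity instead of editing the byte stream, so peers observe an ordinary connection failure rather than a silently corrupted stream. The sketch below is hypothetical (connLevelBlackhole and pipe are made-up names, not an existing etcd API):

```go
package failpoint

import (
	"net"
	"sync/atomic"
)

// connLevelBlackhole simulates a partition by tearing down connections while
// it is active, instead of discarding bytes in the middle of the stream.
type connLevelBlackhole struct {
	active atomic.Bool
}

// pipe forwards bytes in both directions between a and c until the blackhole
// is activated or either side fails, then closes both connections.
func (b *connLevelBlackhole) pipe(a, c net.Conn) {
	done := make(chan struct{}, 2)
	copyHalf := func(dst, src net.Conn) {
		buf := make([]byte, 32*1024)
		for {
			if b.active.Load() {
				// Simulate a partition: close the connection cleanly from
				// the proxy's side rather than dropping arbitrary bytes.
				dst.Close()
				src.Close()
				break
			}
			n, err := src.Read(buf)
			if n > 0 {
				if _, werr := dst.Write(buf[:n]); werr != nil {
					break
				}
			}
			if err != nil {
				break
			}
		}
		done <- struct{}{}
	}
	go copyHalf(a, c)
	go copyHalf(c, a)
	<-done
	<-done
}
```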
What happened?
Robustness nightly runs are failing due to an etcd panic. https://github.com/etcd-io/etcd/actions/runs/4562802681
What did you expect to happen?
Etcd should not panic due to network issues
How can we reproduce it (as minimally and precisely as possible)?
Run the robustness blackhole tests enough times. See https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml
Anything else we need to know?
No response
Etcd version (please run commands below)
N/A
Etcd configuration (command line flags or environment variables)
N/A
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
N/A
Relevant log output