etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.85k stars 9.77k forks source link

Etcd panic in robustness tests blackhole failpoint #15595

Closed serathius closed 1 year ago

serathius commented 1 year ago

What happened?

Robustness Nightly are failing due to etcd panic. https://github.com/etcd-io/etcd/actions/runs/4562802681

What did you expect to happen?

Etcd should not panic due to network issues

How can we reproduce it (as minimally and precisely as possible)?

Run robustness tests blackhole tests enough times. See https://github.com/etcd-io/etcd/actions/workflows/robustness-nightly.yaml

Anything else we need to know?

No response

Etcd version (please run commands below)

N/A

Etcd configuration (command line flags or environment variables)

N/A

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

2023-03-30T10:25:27.3144193Z     logger.go:130: 2023-03-30T10:25:26.312Z    INFO    Triggering failpoint
2023-03-30T10:25:27.3144599Z            {"failpoint": "blackhole"}
2023-03-30T10:25:27.3145371Z     failpoints.go:328: Blackholing traffic from and to member "TestRobustnessClusterOfSize3HighTraffic-test-1"
2023-03-30T10:25:27.3146769Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): {"level":"warn","ts":"2023-03-30T10:25:26.811397Z","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":11854185818228622438,"retry-timeout":"500ms"}
2023-03-30T10:25:27.3148279Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): {"level":"warn","ts":"2023-03-30T10:25:26.812944Z","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":14186768950229826371,"retry-timeout":"500ms"}
2023-03-30T10:25:27.3149756Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-1) (28035): {"level":"warn","ts":"2023-03-30T10:25:26.813508Z","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6108155643657295548,"retry-timeout":"500ms"}
2023-03-30T10:25:27.3265010Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): {"level":"warn","ts":"2023-03-30T10:25:27.312399Z","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":11854185818228622438,"retry-timeout":"500ms"}
2023-03-30T10:25:27.3266115Z     failpoints.go:332: Traffic restored from and to member "TestRobustnessClusterOfSize3HighTraffic-test-1"
2023-03-30T10:25:27.3267790Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): {"level":"info","ts":"2023-03-30T10:25:27.314167Z","caller":"traceutil/trace.go:171","msg":"trace[1342822156] linearizableReadLoop","detail":"{readStateIndex:601; appliedIndex:602; }","duration":"1.003816961s","start":"2023-03-30T10:25:26.310341Z","end":"2023-03-30T10:25:27.314158Z","steps":["trace[1342822156] 'read index received'  (duration: 1.003814761s)","trace[1342822156] 'applied index is now lower than readState.Index'  (duration: 1.5µs)"],"step_count":2}
2023-03-30T10:25:27.3269807Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-1) (28035): {"level":"info","ts":"2023-03-30T10:25:27.313778Z","caller":"traceutil/trace.go:171","msg":"trace[1664249991] linearizableReadLoop","detail":"{readStateIndex:601; appliedIndex:602; }","duration":"1.000933919s","start":"2023-03-30T10:25:26.31283Z","end":"2023-03-30T10:25:27.313764Z","steps":["trace[1664249991] 'read index received'  (duration: 1.000931819s)","trace[1664249991] 'applied index is now lower than readState.Index'  (duration: 1.5µs)"],"step_count":2}
2023-03-30T10:25:27.3271798Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): {"level":"info","ts":"2023-03-30T10:25:27.315185Z","caller":"traceutil/trace.go:171","msg":"trace[2074038161] linearizableReadLoop","detail":"{readStateIndex:601; appliedIndex:601; }","duration":"1.002760046s","start":"2023-03-30T10:25:26.312412Z","end":"2023-03-30T10:25:27.315172Z","steps":["trace[2074038161] 'read index received'  (duration: 1.002757745s)","trace[2074038161] 'applied index is now lower than readState.Index'  (duration: 1.601µs)"],"step_count":2}
2023-03-30T10:25:27.3274893Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): {"level":"panic","ts":"2023-03-30T10:25:27.31731Z","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:99","msg":"index, 7, is out of range [1]","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:99\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1517\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1472\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1021\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
2023-03-30T10:25:27.3277939Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): {"level":"panic","ts":"2023-03-30T10:25:27.317178Z","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:99","msg":"index, 6, is out of range [1]","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:99\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1517\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1472\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1021\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
2023-03-30T10:25:27.3279523Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): panic: index, 7, is out of range [1]
2023-03-30T10:25:27.3280186Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): 
2023-03-30T10:25:27.3280872Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): goroutine 138 [running]:
2023-03-30T10:25:27.3281737Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x0?, {0x0?, 0x0?, 0xc0010faee0?})
2023-03-30T10:25:27.3282611Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
2023-03-30T10:25:27.3283467Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0004840d0, {0x0, 0x0, 0x0})
2023-03-30T10:25:27.3284295Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
2023-03-30T10:25:27.3285253Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.uber.org/zap.(*SugaredLogger).log(0xc0000123a0, 0x4, {0x108f80c?, 0x40dc27?}, {0xc0010fafc0?, 0xeced00?, 0x1?}, {0x0, 0x0, 0x0})
2023-03-30T10:25:27.3286094Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
2023-03-30T10:25:27.3286869Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.uber.org/zap.(*SugaredLogger).Panicf(...)
2023-03-30T10:25:27.3287622Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.uber.org/zap@v1.24.0/sugar.go:189
2023-03-30T10:25:27.3288638Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc000460230?, {0x108f80c?, 0x1?}, {0xc0010fafc0?, 0x0?, 0x203000?})
2023-03-30T10:25:27.3289528Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
2023-03-30T10:25:27.3290438Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc000460230, 0x259, 0x2, 0x25a, {0xc001ae8c60?, 0x1, 0x1})
2023-03-30T10:25:27.3291677Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:99 +0x190
2023-03-30T10:25:27.3292770Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc0004c9b80, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x259, {0xc001ae8c60, 0x1, 0x1}, ...})
2023-03-30T10:25:27.3293723Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1517 +0x77
2023-03-30T10:25:27.3294782Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.etcd.io/raft/v3.stepFollower(0xc0004c9b80, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x259, {0xc001ae8c60, 0x1, 0x1}, ...})
2023-03-30T10:25:27.3295693Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1472 +0x2f8
2023-03-30T10:25:27.3296667Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.etcd.io/raft/v3.(*raft).Step(0xc0004c9b80, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x259, {0xc001ae8c60, 0x1, 0x1}, ...})
2023-03-30T10:25:27.3297577Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1021 +0x1375
2023-03-30T10:25:27.3298364Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): go.etcd.io/raft/v3.(*node).run(0xc000492de0)
2023-03-30T10:25:27.3299191Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
2023-03-30T10:25:27.3299952Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026): created by go.etcd.io/raft/v3.StartNode
2023-03-30T10:25:27.3300787Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-0) (28026):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
2023-03-30T10:25:27.3301657Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): panic: index, 6, is out of range [1]
2023-03-30T10:25:27.3302324Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): 
2023-03-30T10:25:27.3303003Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): goroutine 152 [running]:
2023-03-30T10:25:27.3303876Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x203000?, {0x0?, 0x0?, 0xc00007c080?})
2023-03-30T10:25:27.3304736Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
2023-03-30T10:25:27.3305589Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00048d380, {0x0, 0x0, 0x0})
2023-03-30T10:25:27.3306489Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
2023-03-30T10:25:27.3307446Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.uber.org/zap.(*SugaredLogger).log(0xc000486508, 0x4, {0x108f80c?, 0x40dc27?}, {0xc00198f280?, 0xeced00?, 0x1?}, {0x0, 0x0, 0x0})
2023-03-30T10:25:27.3308268Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
2023-03-30T10:25:27.3309043Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.uber.org/zap.(*SugaredLogger).Panicf(...)
2023-03-30T10:25:27.3309869Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.uber.org/zap@v1.24.0/sugar.go:189
2023-03-30T10:25:27.3310857Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc00017e4d0?, {0x108f80c?, 0x1?}, {0xc00198f280?, 0x0?, 0x203000?})
2023-03-30T10:25:27.3311754Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
2023-03-30T10:25:27.3312735Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc00017e4d0, 0x25a, 0x2, 0x25a, {0xc0019a90e0?, 0x1, 0x1})
2023-03-30T10:25:27.3313619Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:99 +0x190
2023-03-30T10:25:27.3314643Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc00056f1e0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x25a, {0xc0019a90e0, 0x1, 0x1}, ...})
2023-03-30T10:25:27.3315569Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1517 +0x77
2023-03-30T10:25:27.3316553Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.etcd.io/raft/v3.stepFollower(0xc00056f1e0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x25a, {0xc0019a90e0, 0x1, 0x1}, ...})
2023-03-30T10:25:27.3317464Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1472 +0x2f8
2023-03-30T10:25:27.3318537Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.etcd.io/raft/v3.(*raft).Step(0xc00056f1e0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x25a, {0xc0019a90e0, 0x1, 0x1}, ...})
2023-03-30T10:25:27.3319435Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1021 +0x1375
2023-03-30T10:25:27.3320224Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): go.etcd.io/raft/v3.(*node).run(0xc00055b380)
2023-03-30T10:25:27.3321052Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
2023-03-30T10:25:27.3321837Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025): created by go.etcd.io/raft/v3.StartNode
2023-03-30T10:25:27.3322660Z /home/runner/work/etcd/etcd/bin/etcd (TestRobustnessClusterOfSize3HighTraffic-test-2) (28025):     go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
serathius commented 1 year ago

Looks like an issue with raft

serathius commented 1 year ago

cc @tbg @pavelkalinnikov

tbg commented 1 year ago

Could be an issue with raft, though I would instead expect it to be some issue with how etcd implements with raft. Hard to know without a further investigation. Let me talk it through with Pavel

serathius commented 1 year ago

The issue is pretty hard to reproduce, only once or twice per hundred runs. Please let me know what data you need to confirm where the issue is.

pav-kv commented 1 year ago

I've investigated this a bit. I can see that this is running on raft @ eaa6808e1f7a.

The panic occurs here, e.g. one I'm looking at reports:

panic: index, 7, is out of range [1]

From the stack trace we're seeing that maybeAppend is called with (index=601, logTerm=2, committed=602, ents=<slice with 1 entry>). Normally, the ents slice should contain entries with contiguous Indexes starting from index+1, in this case we must see ents[0].Index == 602. However, I believe ents[0].Index == 609 here, because the findConflict apparently returned the index of the only entry it has scanned.

We're also seeing that there are two panics happening, at around the same time, with almost matching indexes. Presumably the cluster has a leader and 2 followers. The leader sends a weird MsgApp message to both followers and kills them.

The question is how we get a MsgApp message in which Message.Entries[0].Index != Message.Index + 1. Can it happen that the raftLog.entries function that fetches them returns misplaced entries?

pav-kv commented 1 year ago

I think we can add some ad-hoc logging or assertions (both on the sending and receiving end, to localize where it occurs) to confirm that the Message.Entries[0].Index == Message.Index + 1 invariant is broken.

pav-kv commented 1 year ago

@serathius I've created a branch in my fork on top of the commit you're running against: https://github.com/pavelkalinnikov/raft/commits/msg-app-invariant. If you could run the test with this commit included (or cherry-picked, whatever works for you), this might help localizing where the issue occurs.

serathius commented 1 year ago

@serathius I've created a branch in my fork on top of the commit you're running against: https://github.com/pavelkalinnikov/raft/commits/msg-app-invariant. If you could run the test with this commit included (or cherry-picked, whatever works for you), this might help localizing where the issue occurs.

Awesome! On it.

serathius commented 1 year ago

Updated https://github.com/etcd-io/etcd/pull/15600

serathius commented 1 year ago

Got it, there are two stacktraces writen at once from two separate members, you can distinguish them by name (TestRobustnessClusterOfSize3LowTraffic-test-2, TestRobustnessClusterOfSize3LowTraffic-test-0) or pid (714482, 714483)

/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-1) (714484): {"level":"warn","ts":"2023-03-31T16:42:20.590682+0200","caller":"etcdserver/v3_server.go:831","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":6108155669721489885,"retry-timeout":"500ms"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-1) (714484): {"level":"info","ts":"2023-03-31T16:42:20.59113+0200","caller":"traceutil/trace.go:171","msg":"trace[530800350] linearizableReadLoop","detail":"{readStateIndex:281; appliedIndex:284; }","duration":"1.002163677s","start":"2023-03-31T16:42:19.588942+0200","end":"2023-03-31T16:42:20.591106+0200","steps":["trace[530800350] 'read index received'  (duration: 1.002157415s)","trace[530800350] 'applied index is now lower than readState.Index'  (duration: 5.2µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): {"level":"info","ts":"2023-03-31T16:42:20.59132+0200","caller":"traceutil/trace.go:171","msg":"trace[1978863180] linearizableReadLoop","detail":"{readStateIndex:281; appliedIndex:282; }","duration":"1.020670431s","start":"2023-03-31T16:42:19.570632+0200","end":"2023-03-31T16:42:20.591302+0200","steps":["trace[1978863180] 'read index received'  (duration: 1.020666434s)","trace[1978863180] 'applied index is now lower than readState.Index'  (duration: 2.895µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): {"level":"info","ts":"2023-03-31T16:42:20.591933+0200","caller":"traceutil/trace.go:171","msg":"trace[858852315] linearizableReadLoop","detail":"{readStateIndex:284; appliedIndex:283; }","duration":"1.022533199s","start":"2023-03-31T16:42:19.569387+0200","end":"2023-03-31T16:42:20.59192+0200","steps":["trace[858852315] 'read index received'  (duration: 1.022344014s)","trace[858852315] 'applied index is now lower than readState.Index'  (duration: 188.474µs)"],"step_count":2}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): {"level":"panic","ts":"2023-03-31T16:42:20.592092+0200","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90","msg":"entries invariant broken: entry[0].Index is 290, want 285","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): {"level":"panic","ts":"2023-03-31T16:42:20.592137+0200","logger":"raft","caller":"v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90","msg":"entries invariant broken: entry[0].Index is 290, want 285","stacktrace":"go.etcd.io/raft/v3.(*raftLog).maybeAppend\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90\ngo.etcd.io/raft/v3.(*raft).handleAppendEntries\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521\ngo.etcd.io/raft/v3.stepFollower\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476\ngo.etcd.io/raft/v3.(*raft).Step\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025\ngo.etcd.io/raft/v3.(*node).run\n\tgo.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357"}
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): panic: entries invariant broken: entry[0].Index is 290, want 285
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): 
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): goroutine 193 [running]:
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): panic: entries invariant broken: entry[0].Index is 290, want 285
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): 
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): goroutine 47 [running]:
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x36?, {0x0?, 0x0?, 0xc00017ca00?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x36?, {0x0?, 0x0?, 0xc00033a6e0?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0000c1a00, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0004ec1a0, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap.(*SugaredLogger).log(0xc00024c080, 0x4, {0x108b574?, 0x40dc27?}, {0xc0001f86a0?, 0xece6c0?, 0xc000295401?}, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.uber.org/zap.(*SugaredLogger).Panicf(...)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.uber.org/zap@v1.24.0/sugar.go:189
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap.(*SugaredLogger).log(0xc00011c318, 0x4, {0x108b574?, 0x40dc27?}, {0xc001a94db0?, 0xece6c0?, 0xc0002de501?}, {0x0, 0x0, 0x0})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.uber.org/zap@v1.24.0/sugar.go:295 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.uber.org/zap.(*SugaredLogger).Panicf(...)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.uber.org/zap@v1.24.0/sugar.go:189
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0x10368e0?, {0x108b574?, 0x2?}, {0xc0001f86a0?, 0x0?, 0x7f2caa95c878?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0x0?, {0x108b574?, 0x2?}, {0xc001a94db0?, 0x0?, 0x1d45d40?})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:102 +0x49
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc0004d45b0, 0x11c, 0x2, 0x11c, {0xc001ca5c20?, 0x1, 0x1})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raftLog).maybeAppend(0xc0004e3f10, 0x11c, 0x2, 0x11c, {0xc001a30f30?, 0x1, 0x1})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/log.go:90 +0xee
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raft).handleAppendEntries(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521 +0x77
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1521 +0x77
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.stepFollower(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476 +0x2f8
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.stepFollower(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1476 +0x2f8
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*raft).Step(0xc0002089a0, {0x3, 0x9b0ed86c9976a482, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001a30f30, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*raft).Step(0xc000030b00, {0x3, 0xf2d60a7c86d2c4e1, 0xb078ae59d09754c4, 0x2, 0x2, 0x11c, {0xc001ca5c20, 0x1, 0x1}, ...})
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025 +0x1375
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): go.etcd.io/raft/v3.(*node).run(0xc0006942a0)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/raft.go:1025 +0x1375
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): go.etcd.io/raft/v3.(*node).run(0xc0006ae840)
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:357 +0x98a
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483): created by go.etcd.io/raft/v3.StartNode
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482): created by go.etcd.io/raft/v3.StartNode
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-2) (714482):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
/src/go.etcd.io/etcd/bin/etcd (TestRobustnessClusterOfSize3LowTraffic-test-0) (714483):        go.etcd.io/raft/v3@v3.0.0-20221201111702-eaa6808e1f7a/node.go:231 +0x65
pav-kv commented 1 year ago

The strange part is that this fires at the latest stage: on the follower upon receiving MsgApp. I added the assertions to the code that creates the MsgApp message, it is not triggered. So some corruption might be happening while the message is in transit?

pav-kv commented 1 year ago

@serathius What does the robustness "blackhole" mode do? Does it fiddle with the messages content, or drop them, or something else?

serathius commented 1 year ago

We run proxy before each members peer port and blackhole the traffic. https://github.com/etcd-io/etcd/blob/6a995d2d63091418b3fdfcbbeb56d3c705d6dd3d/pkg/proxy/server.go#L103-L115 It should just drop data.

pav-kv commented 1 year ago

Looking at the implementation here, I see that the modifiers are applied to a byte stream. Is it possible that, as a result of turning this thing on/off, some message is assembled from non-contiguous parts of the byte stream? Say, could this MsgApp message actually be the result of 2 messages truncated and concatenated in a weird way? Protobuf is pretty flexible w.r.t. duplicate tags etc, so in theory a "valid" message could be created this way.

pav-kv commented 1 year ago

@serathius Also, is it possible that this stream drops only a part of the message (which represents a prefix of Entries), so that the message gets assembled from partial bytes?

pav-kv commented 1 year ago

The problem with such an application-level "dropping" is that it doesn't look like a stream corruption from the TCP point of view. The stream is tampered with, but to the TCP receiver this is still a valid stream, so it just feeds a modified stream to the application layer. All sorts of application-level corruption can happen this way.

The real TCP-level packet loss would have been spotted by the receiver, and it wouldn't be able to proceed from the middle of the stream when it recovers.

UPD: It might be HTTP that gRPC relies on, not necessarily directly TCP (not sure what the actual stack is). The proxy that you linked tampers with the stream above HTTP, right? And this gets retransmitted over HTTP, and fed into gRPC at the receiver end.

serathius commented 1 year ago

I think you might be right, I have tested peer communication with TLS and I'm unable to reproduce. Need to investigatehow the proxy manages to execute man in the middle attack (I don't provide certs to it).

As for proxy layer, it looks like L4 proxy (TCP) and not L7 (http, grpc). However one thing I noticed that packates buffered and code that drops traffic runs on bytes and not whole packets, which means they could be split in the middle. That would be the incorrect behavior you mentioned that results in non-natural network failure.

serathius commented 1 year ago

Hmm, buffer used in proxy has 48KiB and not 64KiB as is the maximum TCP packet size. Maybe that's cause of packets being cut. I'm looking for alternative way to implement proxy to compare. If time permits, I would like to confirm that this bug is really caused by packet being cut in half. cc @aojea

serathius commented 1 year ago

Closing based on fact that TLS fixes the issue. Replacing pkg/proxy with proper network failure simulator should be tracked in separate issue.