Possibly! However, since they're different kinds of faults and yield different etcd crashes, I suspect they're different issues. I also wouldn't be surprised if #14102 turns out to encompass a half-dozen different issues, just based on the number of distinct ways I've seen it fail so far.
One WAL entry's size is 13563782407139376 bytes; see the log below. That's roughly 13,563 TB, so it obviously isn't correct.
2022/06/17 16:12:16 Failed reading WAL: wal: max entry size limit exceeded, recBytes: 13563782407139376, fileSize(64000000) - offset(196120) - padBytes(0) = entryLimit(63803880)
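For context, the numbers in that log line already spell out the sanity check that's tripping: a record can never be larger than what remains of the 64,000,000-byte segment after the current offset. Illustrative only (the variable names are mine, not etcd's):

```go
package main

import "fmt"

func main() {
	// Values taken from the log line above.
	fileSize := int64(64000000)          // WAL segment size
	offset := int64(196120)              // offset where the record starts
	padBytes := int64(0)                 // padding claimed by the frame
	recBytes := int64(13563782407139376) // length decoded from the frame

	entryLimit := fileSize - offset - padBytes // 63803880
	if recBytes > entryLimit {
		fmt.Printf("max entry size limit exceeded: recBytes %d > entryLimit %d\n",
			recBytes, entryLimit)
	}
}
```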
There are two possible reasons:
I think the best thing to do for now is to let etcd fail to start in this situation (data files corrupted, including the WAL file), and that is exactly the current behavior.
In the future, we may deliver a way to recover the data files from a point in time.
So we've traced this behavior to (we think) an issue with lazyfs: truncated regions were filled with ASCII '0' characters (0x30) rather than 0x00. Etcd's WAL reader scanned for 0x00 to determine the end of the file, and in this case got 0x30 and... maybe interpreted those as part of the size field?
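If I'm reading the WAL frame format right (the record length lives in the lower 56 bits of a little-endian 64-bit word, with padding info in the top byte), that hypothesis lines up with the number in the earlier error: a run of 0x30 bytes decodes to exactly the bogus recBytes. A small sketch under that assumption:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Eight ASCII '0' bytes, as the buggy truncation path would leave behind.
	frame := []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}
	lenField := binary.LittleEndian.Uint64(frame)

	// Keep only the lower 56 bits of the length word (assumed frame layout;
	// the top byte would carry padding info).
	recBytes := lenField &^ (uint64(0xff) << 56)

	fmt.Println(recBytes) // 13563782407139376 -- the exact recBytes from the WAL error
}
```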
I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.
Instead, we get a new kind of crash! Here's an example:
{"level":"panic","ts":"2022-06-21T13:24:29.359-0400","logger":"raft","caller":"etcdserver/zap_raft.go:101","msg":"tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101\ngo.etcd.io/etcd/raft/v3.(*raftLog).commitTo\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237\ngo.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508\ngo.etcd.io/etcd/raft/v3.stepFollower\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434\ngo.etcd.io/etcd/raft/v3.(*raft).Step\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975\ngo.etcd.io/etcd/raft/v3.(*node).run\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356"}
{"level":"info","ts":"2022-06-21T13:24:29.359-0400","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"a1ffd5acd6a88a6a"}
panic: tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?
goroutine 167 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00021e480, 0x0, 0x0, 0x0)
/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*SugaredLogger).log(0xc00012c028, 0x4, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2, 0x0, 0x0, 0x0)
/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:227 +0x111
go.uber.org/zap.(*SugaredLogger).Panicf(...)
/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:159
go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc0000dc090, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101 +0x7d
go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc0001f6000, 0xdc7c)
/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237 +0x135
go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508 +0x54
go.etcd.io/etcd/raft/v3.stepFollower(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434 +0x478
go.etcd.io/etcd/raft/v3.(*raft).Step(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975 +0xa55
go.etcd.io/etcd/raft/v3.(*node).run(0xc000238180)
/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356 +0x798
created by go.etcd.io/etcd/raft/v3.RestartNode
/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:244 +0x330
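If I'm reading the stack right, this is the commit-index sanity check in raft/log.go: the leader's heartbeat asks this node to commit up to index 56444, but only 2894 log entries survived the lost writes, so the node refuses. A simplified paraphrase of that check (not etcd's verbatim code, and returning an error where the real code panics):

```go
package main

import "fmt"

// commitTo is a simplified paraphrase of the check that fires above: a node
// must never advance its commit index past the last entry it actually has.
func commitTo(committed, lastIndex, tocommit uint64) (uint64, error) {
	if committed < tocommit { // never decrease the commit index
		if lastIndex < tocommit {
			return committed, fmt.Errorf(
				"tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, lastIndex)
		}
		committed = tocommit
	}
	return committed, nil
}

func main() {
	// Values from the panic: the heartbeat asks for commit index 56444, but
	// only 2894 entries survived the lost un-fsynced writes.
	if _, err := commitTo(0, 2894, 56444); err != nil {
		fmt.Println(err)
	}
}
```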
Hey!
> I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.
That's right: when a file is extended by truncation, any filesystem should return null bytes (0x00) on reads of the newly added region. My apologies for the LazyFS bug; as @aphyr said, I was writing 0x30 (ASCII '0') instead of 0x00 because it helped me with debugging at the time. I also thought it wouldn't be a big deal, since I assumed applications relied on some kind of maximum readable offset.
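For anyone who wants to sanity-check that contract on a real filesystem: growing a file with truncate and reading it back should show 0x00 beyond the written data, never ASCII '0'. A throwaway check:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.CreateTemp("", "truncate-check-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// Write a few bytes, then grow the file past them with truncate.
	if _, err := f.WriteString("hello"); err != nil {
		panic(err)
	}
	if err := f.Truncate(16); err != nil {
		panic(err)
	}

	// Everything beyond the written data must read back as 0x00.
	buf := make([]byte, 16)
	if _, err := f.ReadAt(buf, 0); err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", buf) // 68 65 6c 6c 6f 00 00 ... on a conforming filesystem
}
```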
Feel free to report bugs or ask me anything about LazyFS; it will be a pleasure to help!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
I'm getting "error starting etcd: wal: max entry size limit" in sensu-go.
What is the solution for this? Thank you.
What happened?
The lazyfs filesystem lets us simulate the effects of a power failure by losing writes which were not explicitly fsync'ed to disk. When we run etcd 3.5.3 on lazyfs, killing etcd and then losing un-fsynced writes can reliably put etcd into an unbootable state. Every time we try to start the node, it complains:
We're still sanding bugs off of lazyfs, so it's possible this might be an issue in the filesystem itself. That said, this might also point to a problem with how etcd writes WAL files, so I'd like to check and see if this looks plausible to y'all. I know there have been some issues with data file corruption on process crash in the past; this approach might help find more bugs like that!
I've attached a full test run from Jepsen, which includes tarballs of the data directories for each node. Take a look at n1/ as an example: 20220607T150758.000-0400.zip.
This happens both with and without --experimental-initial-corrupt-check.
What did you expect to happen?
I expect that etcd ought to start up without crashing, even if we lose un-fsynced writes.
How can we reproduce it (as minimally and precisely as possible)?
Check out https://github.com/jepsen-io/etcd at adfc820826a947625c94d836b4017b4eaac7064d, and run:
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output