etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Possible WAL corruption with loss of un-fsynced writes: `max entry size limit exceeded` #14098

Closed aphyr closed 2 years ago

aphyr commented 2 years ago

What happened?

The lazyfs filesystem lets us simulate the effects of a power failure by losing writes which were not explicitly fsync'ed to disk. When we run etcd 3.5.3 on lazyfs, killing etcd and then losing un-fsynced writes can reliably put etcd into an unbootable state. Every time we try to start the node, it complains:

{"level":"fatal","ts":"2022-06-07T15:08:26.505-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

We're still sanding bugs off of lazyfs, so it's possible this might be an issue in the filesystem itself. That said, this might also point to a problem with how etcd writes WAL files, so I'd like to check and see if this looks plausible to y'all. I know there have been some issues with data file corruption on process crash in the past; this approach might help find more bugs like that!

I've attached a full test run from Jepsen, which includes tarballs of the data directories for each node. Take a look at n1/ as an example: 20220607T150758.000-0400.zip.

This happens both with and without --experimental-initial-corrupt-check.

What did you expect to happen?

I expect that etcd ought to start up without crashing, even if we lose un-fsynced writes.

How can we reproduce it (as minimally and precisely as possible)?

Check out https://github.com/jepsen-io/etcd at adfc820826a947625c94d836b4017b4eaac7064d, and run:

lein run test -w append --concurrency 2n --time-limit 300 --rate 500 --lazyfs --nemesis kill

Anything else we need to know?

No response

Etcd version (please run commands below)

```console
$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5
```

Etcd configuration (command line flags or environment variables)

etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started | n3   | http://192.168.122.103:2380 | http://192.168.122.103:2379 | false      |
| 4824313a421b2502 | started | n5   | http://192.168.122.105:2380 | http://192.168.122.105:2379 | false      |
| 4d6e27d122507e9c | started | n4   | http://192.168.122.104:2380 | http://192.168.122.104:2379 | false      |
| a1ffd5acd6a88a6a | started | n2   | http://192.168.122.102:2380 | http://192.168.122.102:2379 | false      |
| afa39e55dee6dc2e | started | n1   | http://192.168.122.101:2380 | http://192.168.122.101:2379 | false      |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 | 3.5.3   | 4.2 MB  | true      | false      | 20        | 60226      | 60226              |        |
| http://192.168.122.103:2379 | 1153c9690d2b2284 | 3.5.3   | 4.2 MB  | false     | false      | 20        | 60226      | 60226              |        |
| http://192.168.122.104:2379 | 4d6e27d122507e9c | 3.5.3   | 4.2 MB  | false     | false      | 20        | 60228      | 60228              |        |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a | 3.5.3   | 4.2 MB  | false     | false      | 20        | 60228      | 60228              |        |
| http://192.168.122.101:2379 | afa39e55dee6dc2e | 3.5.3   | 4.2 MB  | false     | false      | 20        | 60228      | 60228              |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

Relevant log output

{"level":"fatal","ts":"2022-06-07T15:13:18.464-0400","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
ahrtr commented 2 years ago

Is this issue the same as #14102?

aphyr commented 2 years ago

Possibly! However, since they're different kinds of faults and yield different etcd crashes, I suspect they're different issues. I also wouldn't be surprised if #14102 turns out to encompass a half-dozen different issues, just based on the number of distinct ways I've seen it fail so far.

ahrtr commented 2 years ago

This is also related to #14025.

ahrtr commented 2 years ago

One WAL entry's size is 13563782407139376 bytes; see the log below. That's about 13,563 TB, which is obviously incorrect.

2022/06/17 16:12:16 Failed reading WAL: wal: max entry size limit exceeded, recBytes: 13563782407139376, fileSize(64000000) - offset(196120) - padBytes(0) = entryLimit(63803880)

There are two possible reasons:

  1. It's a filesystem bug, and it persisted a wrong value for the WAL entry size;
  2. It's caused by a bit flip somehow.

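The numbers in that log line are internally consistent. A sketch of the limit check as the message describes it (the variable names are mine, taken from the log, not etcd's actual code):

```go
package main

import "fmt"

func main() {
	// Values copied from the log line above.
	var (
		recBytes int64 = 13563782407139376
		fileSize int64 = 64000000
		offset   int64 = 196120
		padBytes int64 = 0
	)

	// A record can never be larger than the bytes remaining in the
	// segment file, so any decoded length above this limit means the
	// length field itself is corrupt.
	entryLimit := fileSize - offset - padBytes
	fmt.Println(entryLimit)            // 63803880
	fmt.Println(recBytes > entryLimit) // true -> "max entry size limit exceeded"
}
```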
ahrtr commented 2 years ago

I think the best thing to do for now is to let etcd fail to start in this situation (data files corrupted, including the WAL file), and that is exactly the current behavior.

In the future, we may deliver a solution to recover the data file from a point in time.

aphyr commented 2 years ago

So we've traced this behavior to (we think) an issue with lazyfs: truncation filled the file with ASCII '0' characters (0x30) rather than 0x00. Etcd's WAL reader scanned for 0x00 to determine the end of the file, and in this case got 0x30 and... maybe interpreted those bytes as part of the size field?

I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.

Instead, we get a new kind of crash! Here's an example:

{"level":"panic","ts":"2022-06-21T13:24:29.359-0400","logger":"raft","caller":"etcdserver/zap_raft.go:101","msg":"tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101\ngo.etcd.io/etcd/raft/v3.(*raftLog).commitTo\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237\ngo.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508\ngo.etcd.io/etcd/raft/v3.stepFollower\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434\ngo.etcd.io/etcd/raft/v3.(*raft).Step\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975\ngo.etcd.io/etcd/raft/v3.(*node).run\n\t/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356"}
{"level":"info","ts":"2022-06-21T13:24:29.359-0400","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"a1ffd5acd6a88a6a"}
panic: tocommit(56444) is out of range [lastIndex(2894)]. Was the raft log corrupted, truncated, or lost?

goroutine 167 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00021e480, 0x0, 0x0, 0x0)
    /go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*SugaredLogger).log(0xc00012c028, 0x4, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2, 0x0, 0x0, 0x0)
    /go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:227 +0x111
go.uber.org/zap.(*SugaredLogger).Panicf(...)
    /go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:159
go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc0000dc090, 0x124ecb9, 0x5d, 0xc0012a4100, 0x2, 0x2)
    /go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101 +0x7d
go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc0001f6000, 0xdc7c)
    /go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237 +0x135
go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508 +0x54
go.etcd.io/etcd/raft/v3.stepFollower(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434 +0x478
go.etcd.io/etcd/raft/v3.(*raft).Step(0xc000716f20, 0x8, 0x4824313a421b2502, 0xa1ffd5acd6a88a6a, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975 +0xa55
go.etcd.io/etcd/raft/v3.(*node).run(0xc000238180)
    /go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356 +0x798
created by go.etcd.io/etcd/raft/v3.RestartNode
    /go/src/go.etcd.io/etcd/release/etcd/raft/node.go:244 +0x330
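This panic comes from raft's commit-index invariant: a follower must never be told to commit past the last entry in its own log. A simplified sketch of that check (not etcd's actual implementation, which panics rather than returning an error):

```go
package main

import "fmt"

// commitTo mirrors the invariant raft enforces: the leader's reported
// commit index can never exceed what this node has in its log. If it
// does, the local log must have been corrupted, truncated, or lost --
// here, because un-fsynced WAL entries vanished on the "power failure".
func commitTo(lastIndex, tocommit uint64) error {
	if tocommit > lastIndex {
		return fmt.Errorf("tocommit(%d) is out of range [lastIndex(%d)]", tocommit, lastIndex)
	}
	return nil
}

func main() {
	// The values from the panic above: the node lost WAL entries, so
	// its lastIndex (2894) is far behind the cluster's commit (56444).
	fmt.Println(commitTo(2894, 56444))
}
```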
devzizu commented 2 years ago

Hey!

I'm not exactly sure what the correct behavior here is, filesystem-wise (perhaps @devzizu could chime in?), but for the time being we've replaced truncated bytes with 0x00, and that seems to have eliminated this particular crash.

That's right: a filesystem should return null bytes (0x00) on read operations for the truncated region of a file (when truncation increases its size). My apologies for the LazyFS bug; as @aphyr said, I was writing 0x30 (ASCII '0') instead of 0x00 because that helped me with debugging at the time. I also thought it wouldn't be a huge deal, because I assumed applications relied on some kind of maximum readable offset.

Feel free to report bugs or ask me anything about LazyFS, it will be a pleasure to help!

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

serathius commented 2 years ago

Closing per https://github.com/etcd-io/etcd/issues/14098#issuecomment-1162113633

SELVAKUMARU commented 1 year ago

error starting etcd: wal: max entry size limit in sensu-go

What is the solution for this? Thank you.