Invalid db page layout

serathius commented 1 year ago

What happened?

etcd started crash-looping with the panic below.

From etcd logs:

panic: runtime error: slice bounds out of range [:8] with capacity 5

goroutine 111 [running]:
go.etcd.io/etcd/mvcc.bytesToRev(0x7f760418e0b0, 0x5, 0x5, 0x0, 0x17f7780)
        /go/src/go.etcd.io/etcd/mvcc/revision.go:58 +0x85
go.etcd.io/etcd/mvcc.restoreIntoIndex.func1(0xc000070000, 0xc000074720, 0x11edc08, 0xc00017d620, 0xc00024a060)
        /go/src/go.etcd.io/etcd/mvcc/kvstore.go:515 +0x287
created by go.etcd.io/etcd/mvcc.restoreIntoIndex
        /go/src/go.etcd.io/etcd/mvcc/kvstore.go:490 +0xaf

When analysing the db file I found an invalid etcd db file layout. Under the bucket branch page, in the keys bucket, there was a branch page linking to another bucket branch page. As a result, bbolt returned the key "alarm" when reading the whole keys bucket. This is a valid layout for bbolt, but not for etcd.

From etcd's point of view this is invalid, because it assumes that all keys in the keys bucket are revision numbers. The panic above comes from the bytesToRev function, which parses a revision. It failed because the main revision takes 8 bytes, while the key "alarm" has only 5 bytes.
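To make the failure mode concrete, here is a minimal sketch of a bounds-checked revision parser. It is not etcd's code: parseRev is a hypothetical name, and the key layout (8-byte big-endian main revision, a '_' separator, 8-byte sub revision, optionally followed by a one-byte tombstone marker) is an assumption based on my reading of mvcc/revision.go.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// revision mirrors the (main, sub) pair used to key the keys bucket.
type revision struct {
	main int64
	sub  int64
}

const (
	revBytesLen       = 8 + 1 + 8       // assumed layout: main, '_', sub
	markedRevBytesLen = revBytesLen + 1 // assumed: trailing tombstone marker
)

// parseRev is a hypothetical, bounds-checked alternative to bytesToRev.
// It returns an error instead of panicking on keys of the wrong shape.
func parseRev(b []byte) (revision, error) {
	if len(b) != revBytesLen && len(b) != markedRevBytesLen {
		return revision{}, fmt.Errorf("invalid revision key %q: got %d bytes, want %d or %d",
			b, len(b), revBytesLen, markedRevBytesLen)
	}
	if b[8] != '_' {
		return revision{}, fmt.Errorf("invalid revision key %q: missing '_' separator", b)
	}
	return revision{
		main: int64(binary.BigEndian.Uint64(b[0:8])),
		sub:  int64(binary.BigEndian.Uint64(b[9:17])),
	}, nil
}

func main() {
	// Build a well-formed key for revision (42, 1).
	good := make([]byte, revBytesLen)
	binary.BigEndian.PutUint64(good[0:8], 42)
	good[8] = '_'
	binary.BigEndian.PutUint64(good[9:17], 1)

	// "alarm" stands in for the stray key returned from the corrupted bucket.
	for _, key := range [][]byte{good, []byte("alarm")} {
		rev, err := parseRev(key)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("parsed: main=%d sub=%d\n", rev.main, rev.sub)
	}
}
```

A check like this would not repair the db file, but it would turn the crash into an explicit "unexpected key in keys bucket" error that names the offending key.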

This means that at some point bbolt either:

overwrote a page that was in use with a buckets page, or
incorrectly pointed a branch page at a buckets page

We can't exclude a hardware issue that resulted in memory corruption.

Providing the dump.txt for further investigation

What did you expect to happen?

I want to report the issue to start a discussion about how etcd handles potential memory corruption.

Assuming that this was indeed memory corruption, I expect that etcd should avoid writing a corrupted page to disk. Running with mmapped memory carries a risk of memory stomping, so etcd should have a mechanism that prevents corruption from being persisted.

I was discussing with @ptabor the idea of a protected mode for bbolt, in which it would verify every write to ensure that corruption is not persisted.
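As a rough illustration of what "verify every write" could mean, here is a sketch that checksums a page right after a legitimate modification and re-checks it just before the page is written out. Everything here is hypothetical: guardedPage, seal, writeTo and memDisk are made-up names, not bbolt types or APIs, and CRC32 is just one possible check.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
	"io"
)

var errCorruptPage = errors.New("page checksum mismatch, refusing to write")

// guardedPage is a hypothetical wrapper around an in-memory page buffer.
type guardedPage struct {
	data []byte
	sum  uint32
}

// seal records the checksum right after a legitimate modification.
func (p *guardedPage) seal() { p.sum = crc32.ChecksumIEEE(p.data) }

// writeTo re-verifies the checksum and writes only if it still matches.
func (p *guardedPage) writeTo(w io.Writer) error {
	if crc32.ChecksumIEEE(p.data) != p.sum {
		return errCorruptPage
	}
	_, err := w.Write(p.data)
	return err
}

// memDisk is a stand-in for the db file; it just accumulates bytes.
type memDisk struct{ buf []byte }

func (d *memDisk) Write(b []byte) (int, error) {
	d.buf = append(d.buf, b...)
	return len(b), nil
}

func main() {
	disk := &memDisk{}
	p := &guardedPage{data: []byte("leaf page contents")}
	p.seal()

	p.data[0] ^= 0xFF // simulate a stray write after the page was sealed

	if err := p.writeTo(disk); err != nil {
		fmt.Println("write blocked:", err)
	}
	fmt.Println("bytes persisted:", len(disk.buf))
}
```

This only catches changes that happen between sealing and writing. It would not help if the wrong content was produced by the code that builds the page, so a real protected mode would likely also need structural checks (for example, that a branch page in the keys bucket never points at a buckets page).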

How can we reproduce it (as minimally and precisely as possible)?

I don't think it can be reproduced.

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.4.21

Etcd configuration (command line flags or environment variables)

Nothing unusual

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

No response

serathius commented 1 year ago

cc @ahrtr @ptabor