etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

etcd fails to start after power failure #16596

Closed mj-ramos closed 1 year ago

mj-ramos commented 1 year ago

Bug report criteria

What happened?

After experiencing a power failure while an etcd server is bootstrapping, the server is no longer able to recover and restart again.

This issue occurs in both single-node and three-node clusters. The root cause is that some writes to the member/snap/db file are larger than a single page of the page cache. Because the pages of the page cache can be flushed out of order, this can result in a "torn write", where only part of the write's payload is persisted while the rest is not. There are several references about this problem.
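For illustration, here is a minimal sketch of the vulnerable pattern (the file name, flags, and 4 KiB page size are assumptions for the example; this is not etcd or bbolt code): a single write spanning several page-cache pages, which the kernel may persist in any order if power is lost before the flush completes.

```go
// Minimal sketch (not etcd/bbolt code): a 16384-byte write spans four 4 KiB
// page-cache pages, so after a power failure any subset of those pages may
// have reached disk, i.e. a "torn write".
package main

import "os"

func main() {
	f, err := os.OpenFile("db", os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 16384) // e.g. the initial pages of a database file

	// If power is lost after this write but before (or during) the flush,
	// the kernel may have persisted only some of the four pages, and not
	// necessarily the leading ones.
	if _, err := f.WriteAt(buf, 0); err != nil {
		panic(err)
	}
	// Durability is only guaranteed once Sync returns successfully; even
	// then, the 16 KiB write itself was never atomic.
	if err := f.Sync(); err != nil {
		panic(err)
	}
}
```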

What did you expect to happen?

The server where the power failure happened should restart correctly.

How can we reproduce it (as minimally and precisely as possible)?

This issue can be replicated with LazyFS, which can now simulate out-of-order persistence of file system pages on disk. The problematic write is a 16384-byte write to the file member/snap/db. LazyFS will persist portions of this write (in 8192-byte parts) out of order and then crash, simulating a power failure. To reproduce the problem, follow these steps:

  1. Mount LazyFS on a directory where the etcd data will be saved, with a specified root directory. Assuming the data path for etcd is /home/data/data.etcd and the root directory is /home/data-r/data.etcd, add the following lines to the default configuration file (located at config/default.toml):

    [[injection]]
    type="split_write"
    file="/home/data-r/data.etcd/member/snap/db"
    persist=[2]
    parts=2
    occurrence=1

    These lines define the fault to be injected. A power failure will be simulated after writing to the /home/data-r/data.etcd/member/snap/db file. Since this write is large (16384 bytes), it is split into 2 parts (each with 8192 bytes), and only the second part is persisted. The occurrence parameter specifies that the fault applies to the first write issued to this file.

  2. Start LazyFS with the following command: ./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/data.etcd -r /home/data-r/data.etcd -f

  3. Start etcd with the command ./etcd --data-dir '/home/data/data.etcd'.

Immediately after this step, etcd will shut down because LazyFS was unmounted, simulating the power failure. At this point, you can analyze the logs produced by LazyFS to see the system calls issued up to the moment of the fault. Here is a simplified version of the log:

{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/.touch', 'mode': 'O_TRUNC'}
{'syscall': 'release', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/.touch'}
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/.touch', 'mode': 'O_TRUNC'}
{'syscall': 'release', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/.touch'}
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'mode': 'O_RDWR'}
{'syscall': 'write', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'size': '16384', 'off': '0'}
{'syscall': 'fault'}
  4. Remove the fault from the configuration file and unmount the filesystem with fusermount -uz /home/data/data.etcd
  5. Mount LazyFS again with the command provided above.
  6. Attempt to start etcd (it fails).

By following these steps, you can replicate the issue and analyze the effects of the power failure on etcd's restart process.

The same problem (but with a different error) happens when only the first 8192 bytes of the write are persisted (for this, change the persist parameter to [1], as in the snippet below).
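For reference, the injection block for that variant differs only in the persist value (same assumed paths as in the steps above):

```toml
[[injection]]
type="split_write"
file="/home/data-r/data.etcd/member/snap/db"
persist=[1]     # persist only the first 8192-byte part instead of the second
parts=2
occurrence=1
```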

Note that no problem occurs when persist is changed to [1,2]: the whole write is persisted and etcd restarts successfully.

Anything else we need to know?

Here is the output produced by etcd on restart. The first file (persist_first_part.txt) contains the error reported after persisting only the first 8192 bytes of the member/snap/db file, and the second file (persist_second_part.txt) contains the error reported after persisting only the second 8192 bytes.

Etcd version (please run commands below)

```console
$ etcd --version
etcd Version: 3.4.25
Git SHA: 94593e63d
Go Version: go1.19.8
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.4.25
API version: 3.4
```

Etcd configuration (command line flags or environment variables)

--data-dir 'data/data.etcd'

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints= endpoint status -w table
# paste output here
```

Relevant log output

No response

serathius commented 1 year ago

Hey @mj-ramos, thanks for reporting the issue.

I would be really happy to collaborate on reproducing this and similar disk failure scenarios in etcd testing. We already have LazyFS integrated in our robustness testing suite. For now we only test a simple scenario (kill the etcd process, clear the LazyFS cache).

Would you be interested in contributing this and other disk failure scenarios to etcd robustness testing?

serathius commented 1 year ago

Note: the issue is not surprising to me, as I expect most etcd instances run in the cloud, where disk failure scenarios are expected to be different (and maybe less severe).

tjungblu commented 1 year ago

Thanks for the awesome write-up! Does this happen with a 3.5 release as well?

ahrtr commented 1 year ago

> This can result in a "torn write" scenario where only part of the write's payload is persisted while the rest is not

It happens in real production environments, but theoretically it shouldn't be a problem for bbolt, because bbolt writes/syncs the normal data first and the meta data later.
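For context, here is a simplified sketch (an illustration only, not bbolt's actual code; checksum validation is collapsed into a boolean) of why an already-initialized bbolt file tolerates a torn meta write: the two alternating meta pages each carry a transaction id and a checksum, and on open the newer copy that still validates is used, falling back to the older intact copy otherwise.

```go
// Simplified illustration (not bbolt's actual code) of double meta page
// recovery: a torn write can only corrupt the meta copy being updated, so the
// previous copy remains usable.
package main

import "fmt"

type meta struct {
	txid  uint64
	valid bool // stands in for "checksum verifies" in this sketch
}

// pickMeta chooses the newest meta page that still validates.
func pickMeta(m0, m1 meta) (meta, error) {
	switch {
	case m0.valid && m1.valid:
		if m0.txid > m1.txid {
			return m0, nil
		}
		return m1, nil
	case m0.valid:
		return m0, nil
	case m1.valid:
		return m1, nil
	default:
		return meta{}, fmt.Errorf("both meta pages invalid")
	}
}

func main() {
	// A torn write corrupted the copy being updated (txid 7), so recovery
	// falls back to the older but intact copy (txid 6).
	good := meta{txid: 6, valid: true}
	torn := meta{txid: 7, valid: false}
	m, err := pickMeta(good, torn)
	fmt.Println(m.txid, err) // 6 <nil>
}
```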

I will try to reproduce this issue per your steps. thx anyway.

tjungblu commented 1 year ago

To answer my own question, yes, it also reproduces on the 3.5 branch @ 7ed8970e2d43f9d44fdabe2d26f01869a36c6dd5.

{"level":"panic","ts":"2023-09-14T14:11:22.852848+0200","caller":"backend/backend.go:189","msg":"failed to open database","path":"/home/data/data.etcd/member/snap/db","error":"write /home/data/data.etcd/member/snap/db: software caused connection abort","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.newBackend\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/backend.go:189\ngo.etcd.io/etcd/server/v3/mvcc/backend.New\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/backend.go:163\ngo.etcd.io/etcd/server/v3/etcdserver.newBackend\n\tgo.etcd.io/etcd/server/v3/etcdserver/backend.go:55\ngo.etcd.io/etcd/server/v3/etcdserver.openBackend.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/backend.go:76"}
panic: failed to open database

goroutine 161 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000162300, {0xc000128600, 0x2, 0x2})
    go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x49b
go.uber.org/zap.(*Logger).Panic(0xc000049470?, {0x10c41b3?, 0x44f7f2?}, {0xc000128600, 0x2, 0x2})
    go.uber.org/zap@v1.17.0/logger.go:227 +0x59
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000049470, 0x23}, 0x5f5e100, 0x2710, {0x10b2e12, 0x7}, 0x280000000, 0xc0004f4000, 0x0, 0x0, ...})
    go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:189 +0x429
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
    go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0x10b2bdb, 0x7}, {0x0, 0x0}, {0x0, 0x0}, {0xc00055f560, 0x1, 0x1}, {0xc00055f290, ...}, ...}, ...)
    go.etcd.io/etcd/server/v3/etcdserver/backend.go:55 +0x399
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
    go.etcd.io/etcd/server/v3/etcdserver/backend.go:76 +0x78
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
    go.etcd.io/etcd/server/v3/etcdserver/backend.go:75 +0x18a

edit, 3.6 @ 9ec5f981894063d13fdc5cfc60933c515ab6c6c2 as well:

{"level":"panic","ts":"2023-09-14T14:20:35.934935+0200","caller":"backend/backend.go:186","msg":"failed to open database","path":"/home/data/data.etcd/member/snap/db","error":"write /home/data/data.etcd/member/snap/db: software caused connection abort","stacktrace":"go.etcd.io/etcd/server/v3/storage/backend.newBackend\n\tgo.etcd.io/etcd/server/v3/storage/backend/backend.go:186\ngo.etcd.io/etcd/server/v3/storage/backend.New\n\tgo.etcd.io/etcd/server/v3/storage/backend/backend.go:164\ngo.etcd.io/etcd/server/v3/storage.newBackend\n\tgo.etcd.io/etcd/server/v3/storage/backend.go:55\ngo.etcd.io/etcd/server/v3/storage.OpenBackend.func1\n\tgo.etcd.io/etcd/server/v3/storage/backend.go:76"}
panic: failed to open database

goroutine 143 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2?, 0xc00023a7d0?, {0x0?, 0x0?, 0xc000388000?})
    go.uber.org/zap@v1.25.0/zapcore/entry.go:196 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000382000, {0xc0004b8d80, 0x2, 0x2})
    go.uber.org/zap@v1.25.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000696a80?, {0x10e9d9b?, 0xc00011e7e0?}, {0xc0004b8d80, 0x2, 0x2})
    go.uber.org/zap@v1.25.0/logger.go:258 +0x59
go.etcd.io/etcd/server/v3/storage/backend.newBackend({{0xc000696a80, 0x23}, 0x5f5e100, 0x2710, {0x10d8594, 0x7}, 0x280000000, 0xc0000c0fc0, 0x0, 0x0, ...})
    go.etcd.io/etcd/server/v3/storage/backend/backend.go:186 +0x2ff
go.etcd.io/etcd/server/v3/storage/backend.New(...)
    go.etcd.io/etcd/server/v3/storage/backend/backend.go:164
go.etcd.io/etcd/server/v3/storage.newBackend({{0x10d834f, 0x7}, {0x0, 0x0}, {0x0, 0x0}, {{{0x19fd8d8, 0x0, 0x0}, 0x12a05f200, ...}, ...}, ...}, ...)
    go.etcd.io/etcd/server/v3/storage/backend.go:55 +0x3b8
go.etcd.io/etcd/server/v3/storage.OpenBackend.func1()
    go.etcd.io/etcd/server/v3/storage/backend.go:76 +0x58
created by go.etcd.io/etcd/server/v3/storage.OpenBackend
    go.etcd.io/etcd/server/v3/storage/backend.go:75 +0x18c
tjungblu commented 1 year ago

After a little debugging, it's in the init method:

https://github.com/etcd-io/bbolt/blob/master/db.go#L614-L617

and the write OP will fail with:

err = {error | *fs.PathError} 
 Op = {string} "write"
 Path = {string} "/home/data/data.etcd/member/snap/db"
 Err = {error | syscall.Errno} ECONNABORTED (103)

That's also where your 16 KB write fails:

{'syscall': 'write', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'size': '16384', 'off': '0'}

That's the initial allocation for the two meta pages that Ben was talking about. I don't know if that's very concerning; how likely is it that your database crashes on init, on the very first page write? There's no data in it anyway.

How about we move this issue to bbolt and try to make the init atomic w.r.t. multi-page writes?

The usual meta write seems safe: https://github.com/etcd-io/bbolt/blob/master/tx.go#L503-L524

Definitely worth testing this more in-depth....
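As a thought experiment for the "make init atomic" idea above, here is a hedged sketch of one possible approach (a temp-file-plus-rename scheme; this is an assumption for illustration, not what bbolt currently implements): write the initial pages to a temporary file, fsync it, rename it into place, then fsync the directory, so a crash leaves either no db file or a fully written one.

```go
// Hypothetical sketch only: one way a multi-page init could be made
// crash-atomic. bbolt does not currently do this; it is shown purely to
// illustrate the temp-file-plus-rename idea.
package main

import (
	"os"
	"path/filepath"
)

func atomicInit(path string, initialPages []byte) error {
	dir := filepath.Dir(path)

	tmp, err := os.CreateTemp(dir, ".db-init-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup; harmless after rename

	if _, err := tmp.Write(initialPages); err != nil {
		tmp.Close()
		return err
	}
	// Make the page contents durable before they become visible under the
	// final name.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// rename(2) is atomic on POSIX file systems: after a crash the final
	// path either does not exist yet or refers to the fully written file.
	if err := os.Rename(tmp.Name(), path); err != nil {
		return err
	}

	// Persist the directory entry so the rename itself survives power loss.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	pages := make([]byte, 16384) // stand-in for the four initial db pages
	if err := atomicInit("/tmp/example.db", pages); err != nil {
		panic(err)
	}
}
```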

mj-ramos commented 1 year ago

> Hey @mj-ramos, thanks for reporting the issue.
>
> I would be really happy to collaborate on reproducing this and similar disk failure scenarios in etcd testing. We already have LazyFS integrated in our robustness testing suite. For now we only test a simple scenario (kill the etcd process, clear the LazyFS cache).
>
> Would you be interested in contributing this and other disk failure scenarios to etcd robustness testing?

Hi! This bug is reproducible with a new version of LazyFS, where new types of faults were introduced. The main branch has the previous faults (clear cache) and new ones (crash, reorder and split write) that are simple to configure.

And yes, I am interested and available to contribute :)

ahrtr commented 1 year ago

Reproduced this "issue" as well. The reason should be the one @tjungblu mentioned above: the file is corrupted during the very first db initialisation phase. There is no data yet, so it shouldn't be a big problem; no need to spend too much effort to resolve it. Once the db has been successfully initialized, the issue can no longer be reproduced.

ahrtr commented 1 year ago

Won't fix.

Thanks @mj-ramos for raising this issue anyway. It's an interesting scenario, which is worth introducing into our tests. I may also spend some time introducing LazyFS into bbolt separately.

serathius commented 1 year ago

Ok, I understand that resiliency during db initialization is not a very impactful issue, but it's still an issue we might consider fixing. Let's still make sure we are tracking the follow-ups.

To make sure the followups are not lost, let's keep the issue open until we create issues for them.

ahrtr commented 1 year ago

> • Bbolt is resilient to power failure during db initialization

It isn't a problem that should be fixed at the code level; instead, we only need to update the documentation.

serathius commented 1 year ago

> • Bbolt is resilient to power failure during db initialization
>
> It isn't a problem that should be fixed at the code level; instead, we only need to update the documentation.

Ok, let's track that then.

ahrtr commented 1 year ago

All tracked.