mj-ramos closed this issue 1 year ago.
Hey @mj-ramos, thanks for reporting the issue.
I would be really happy to collaborate on reproducing this and similar disk failure scenarios in etcd testing. We already have LazyFS integrated in our robustness testing suite. For now we only test a simple scenario (kill the etcd process, clear the LazyFS cache).
Would you be interested in contributing this and other disk failure scenarios to etcd robustness testing?
Note: the issue is not surprising to me, as I expect most etcd instances run in the cloud, which is expected to see different (maybe less severe) disk failure scenarios.
Thanks for the awesome write-up! Does this happen with a 3.5 release as well?
This can result in a "torn write" scenario where only part of the write's payload is persisted while the rest is not
It happens in real production environments, but theoretically it shouldn't be a problem for bbolt (background: bbolt writes and syncs the normal data pages first, and the meta page afterwards).
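For illustration, the ordering being referred to is roughly the following (a simplified sketch, not the actual bbolt code):

```go
package sketch

import "os"

// commitSketch shows why the ordering matters: data pages are made durable
// before the meta page is updated, so after a crash the previous meta still
// points at a consistent tree even if the data-page write itself was torn.
func commitSketch(f *os.File, dataPages []byte, dataOff int64, metaPage []byte, metaOff int64) error {
	// 1. Write the dirty data pages at their new locations.
	if _, err := f.WriteAt(dataPages, dataOff); err != nil {
		return err
	}
	// 2. Sync so the data pages are durable before any meta refers to them.
	if err := f.Sync(); err != nil {
		return err
	}
	// 3. Write the new meta page (a single page, protected by a checksum).
	if _, err := f.WriteAt(metaPage, metaOff); err != nil {
		return err
	}
	// 4. Sync again; only now does the transaction become the recovery point.
	return f.Sync()
}
```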
I will try to reproduce this issue per your steps. thx anyway.
To answer my own question, yes, it also reproduces on the 3.5 branch @ 7ed8970e2d43f9d44fdabe2d26f01869a36c6dd5.
{"level":"panic","ts":"2023-09-14T14:11:22.852848+0200","caller":"backend/backend.go:189","msg":"failed to open database","path":"/home/data/data.etcd/member/snap/db","error":"write /home/data/data.etcd/member/snap/db: software caused connection abort","stacktrace":"go.etcd.io/etcd/server/v3/mvcc/backend.newBackend\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/backend.go:189\ngo.etcd.io/etcd/server/v3/mvcc/backend.New\n\tgo.etcd.io/etcd/server/v3/mvcc/backend/backend.go:163\ngo.etcd.io/etcd/server/v3/etcdserver.newBackend\n\tgo.etcd.io/etcd/server/v3/etcdserver/backend.go:55\ngo.etcd.io/etcd/server/v3/etcdserver.openBackend.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/backend.go:76"}
panic: failed to open database
goroutine 161 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000162300, {0xc000128600, 0x2, 0x2})
go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x49b
go.uber.org/zap.(*Logger).Panic(0xc000049470?, {0x10c41b3?, 0x44f7f2?}, {0xc000128600, 0x2, 0x2})
go.uber.org/zap@v1.17.0/logger.go:227 +0x59
go.etcd.io/etcd/server/v3/mvcc/backend.newBackend({{0xc000049470, 0x23}, 0x5f5e100, 0x2710, {0x10b2e12, 0x7}, 0x280000000, 0xc0004f4000, 0x0, 0x0, ...})
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:189 +0x429
go.etcd.io/etcd/server/v3/mvcc/backend.New(...)
go.etcd.io/etcd/server/v3/mvcc/backend/backend.go:163
go.etcd.io/etcd/server/v3/etcdserver.newBackend({{0x10b2bdb, 0x7}, {0x0, 0x0}, {0x0, 0x0}, {0xc00055f560, 0x1, 0x1}, {0xc00055f290, ...}, ...}, ...)
go.etcd.io/etcd/server/v3/etcdserver/backend.go:55 +0x399
go.etcd.io/etcd/server/v3/etcdserver.openBackend.func1()
go.etcd.io/etcd/server/v3/etcdserver/backend.go:76 +0x78
created by go.etcd.io/etcd/server/v3/etcdserver.openBackend
go.etcd.io/etcd/server/v3/etcdserver/backend.go:75 +0x18a
Edit: 3.6 @ 9ec5f981894063d13fdc5cfc60933c515ab6c6c2 as well:
{"level":"panic","ts":"2023-09-14T14:20:35.934935+0200","caller":"backend/backend.go:186","msg":"failed to open database","path":"/home/data/data.etcd/member/snap/db","error":"write /home/data/data.etcd/member/snap/db: software caused connection abort","stacktrace":"go.etcd.io/etcd/server/v3/storage/backend.newBackend\n\tgo.etcd.io/etcd/server/v3/storage/backend/backend.go:186\ngo.etcd.io/etcd/server/v3/storage/backend.New\n\tgo.etcd.io/etcd/server/v3/storage/backend/backend.go:164\ngo.etcd.io/etcd/server/v3/storage.newBackend\n\tgo.etcd.io/etcd/server/v3/storage/backend.go:55\ngo.etcd.io/etcd/server/v3/storage.OpenBackend.func1\n\tgo.etcd.io/etcd/server/v3/storage/backend.go:76"}
panic: failed to open database
goroutine 143 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x2?, 0xc00023a7d0?, {0x0?, 0x0?, 0xc000388000?})
go.uber.org/zap@v1.25.0/zapcore/entry.go:196 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000382000, {0xc0004b8d80, 0x2, 0x2})
go.uber.org/zap@v1.25.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000696a80?, {0x10e9d9b?, 0xc00011e7e0?}, {0xc0004b8d80, 0x2, 0x2})
go.uber.org/zap@v1.25.0/logger.go:258 +0x59
go.etcd.io/etcd/server/v3/storage/backend.newBackend({{0xc000696a80, 0x23}, 0x5f5e100, 0x2710, {0x10d8594, 0x7}, 0x280000000, 0xc0000c0fc0, 0x0, 0x0, ...})
go.etcd.io/etcd/server/v3/storage/backend/backend.go:186 +0x2ff
go.etcd.io/etcd/server/v3/storage/backend.New(...)
go.etcd.io/etcd/server/v3/storage/backend/backend.go:164
go.etcd.io/etcd/server/v3/storage.newBackend({{0x10d834f, 0x7}, {0x0, 0x0}, {0x0, 0x0}, {{{0x19fd8d8, 0x0, 0x0}, 0x12a05f200, ...}, ...}, ...}, ...)
go.etcd.io/etcd/server/v3/storage/backend.go:55 +0x3b8
go.etcd.io/etcd/server/v3/storage.OpenBackend.func1()
go.etcd.io/etcd/server/v3/storage/backend.go:76 +0x58
created by go.etcd.io/etcd/server/v3/storage.OpenBackend
go.etcd.io/etcd/server/v3/storage/backend.go:75 +0x18c
After a little debugging, it's in the init method:
https://github.com/etcd-io/bbolt/blob/master/db.go#L614-L617
and the write op fails with:
err = {error | *fs.PathError}
Op = {string} "write"
Path = {string} "/home/data/data.etcd/member/snap/db"
Err = {error | syscall.Errno} ECONNABORTED (103)
That's also where your 16k write fails:
{'syscall': 'write', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'size': '16384', 'off': '0'}
That's the initial allocation for the two meta pages that Ben was talking about. I don't know if that's very concerning; how likely is it that your database crashes during init, on the very first page write? There's no data in it anyway.
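For context, here is a simplified sketch of what that init path does (layout and names are illustrative, not the real code):

```go
package sketch

import "os"

// initSketch: the very first write to a new database file is a single buffer
// covering the two meta pages plus the freelist and an empty leaf page.
func initSketch(f *os.File, pageSize int) error {
	buf := make([]byte, pageSize*4) // 16384 bytes with a 4 KiB page size
	// page 0, page 1: the two meta pages (each carries a checksum)
	// page 2:         the freelist
	// page 3:         an empty leaf page for the root bucket
	// ... fill the four pages in buf ...

	// The whole buffer goes out as one write at offset 0. If a power failure
	// tears this write, neither meta page validates and the file can no
	// longer be opened.
	if _, err := f.WriteAt(buf, 0); err != nil {
		return err
	}
	return f.Sync() // fdatasync in bbolt
}
```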
How about we move this issue to bbolt and try to make the init atomic w.r.t. multi-page writes?
The usual meta write seems safe: https://github.com/etcd-io/bbolt/blob/master/tx.go#L503-L524
Definitely worth testing this more in-depth....
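To make the "atomic init" idea above a bit more concrete, one possible shape (purely illustrative, not what bbolt does today) is to build the initial pages in a temporary file, sync it, and rename it over the final path, so that a crash leaves either no db file at all or a complete one:

```go
package sketch

import (
	"os"
	"path/filepath"
)

// atomicInit writes the freshly initialized pages to a temp file and renames
// it into place; rename(2) is atomic within a filesystem, so readers never
// observe a torn, half-initialized database file.
func atomicInit(path string, pages []byte) error {
	dir := filepath.Dir(path)

	tmp, err := os.CreateTemp(dir, ".db-init-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename has succeeded

	if _, err := tmp.Write(pages); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make the new file's contents durable
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return err
	}

	// Sync the directory so the rename itself survives a power failure.
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}
```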
Hi! This bug is reproducible with a new version of LazyFS, where new types of faults were introduced. The main branch has the previous faults (clear cache) and new ones (crash, reorder and split write) that are simple to configure.
And yes, I am interested and available to contribute :)
Reproduced this "issue" as well. The reason should be what @tjungblu mentioned above: the file is corrupted during the very first db initialisation phase. There is no data yet, so it shouldn't be a big problem; no need to spend too much effort to resolve it. Once the db has been successfully initialized, the issue can't be reproduced any more.
Won't fix.
Thanks @mj-ramos for raising this issue anyway. It's an interesting scenario, which is worthwhile to introduce into our tests. I may also spend some time introducing LazyFS into bbolt separately.
Ok, I understand that resiliency during db initialization is not a very impactful issue, but it's still an issue we might consider fixing. Let's still make sure we are tracking the followups:
To make sure the followups are not lost, let's keep the issue open until we create issues for them.
- Bbolt is resilient to power failure during db initialization
It isn't a problem that should be fixed at the code level; instead, we only need to update the doc.
Ok, let's track that then.
all tracked.
Bug report criteria
What happened?
After experiencing a power failure while an etcd server is bootstrapping, the server is no longer able to recover and restart.
This issue occurs in both single-node and three-node clusters. The root cause of the problem is that some writes to the member/snap/db file exceed the common size of a page in the page cache. This can result in a "torn write" scenario where only part of the write's payload is persisted while the rest is not, since the pages of the page cache can be flushed out of order. There are several references about this problem.
What did you expect to happen?
That the server where the power failure happened restarts correctly.
How can we reproduce it (as minimally and precisely as possible)?
This issue can be replicated using LazyFS, which is now capable of simulating out-of-order persistence of file system pages at the disk. The main problem is a write to the file member/snap/db which is 16384 bytes long. LazyFS will persist portions (in chunks of 8192 bytes) of this write out of order and will then crash, simulating a power failure. To reproduce this problem, one can follow these steps:

1. Mount LazyFS on a directory where the etcd data will be saved, with a specified root directory. Assuming the data path for etcd is /home/data/data.etcd and the root directory is /home/data-r/data.etcd, add the following lines to the default configuration file (located in the config/default.toml directory):

These lines define a fault to be injected. A power failure will be simulated after writing to the /home/data-r/data.etcd/member/snap/db file. Since this write is large (16384 bytes), it is split into 2 parts (each of 8192 bytes), and only the second part is persisted. Specify that it is the first write issued to this file (with the occurrence parameter).

2. Start LazyFS with the following command:
./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/data.etcd -r /home/data-r/data.etcd -f

3. Start etcd with the command ./etcd --data-dir '/home/data/data.etcd'.

Immediately after this step, etcd will shut down because LazyFS was unmounted, simulating the power failure. At this point, you can analyze the logs produced by LazyFS to see the system calls issued up to the moment of the fault. Here is a simplified version of the log:
fusermount -uz /home/data/data.etcd
By following these steps, you can replicate the issue and analyze the effects of the power failure on etcd's restart process.
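As an optional sanity check (assuming the go.etcd.io/bbolt module is available), the torn member/snap/db file can also be opened directly with bbolt, without etcd, to see whether any meta page still validates after the simulated power failure:

```go
package main

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Path taken from the reproduction steps above.
	db, err := bolt.Open("/home/data/data.etcd/member/snap/db", 0600,
		&bolt.Options{Timeout: time.Second, ReadOnly: true})
	if err != nil {
		fmt.Println("open failed:", err) // e.g. an "invalid database" or checksum error
		return
	}
	defer db.Close()
	fmt.Println("database opened fine")
}
```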
The same problem (but with a different error) happens when we persist only the first 8192 bytes of the write (for this, change the persist parameter to [1]).

Note that no problem happens when persist is changed to [1,2]: the whole write is persisted and etcd restarts successfully.

Anything else we need to know?
Here is the output produced by etcd on restarting. The first file corresponds to the error reported after persisting only the first 8192 bytes of the member/snap/db file, and the second to the error reported after persisting only the second 8192 bytes:
persist_first_part.txt
persist_second_part.txt

Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response