etcd-io / bbolt

An embedded key/value database for Go.
https://go.etcd.io/bbolt
MIT License

DB cannot be opened after node hard reset #562

Closed fyfyrchik closed 9 months ago

fyfyrchik commented 1 year ago

Hello! I am not sure whether this is a bug in bolt, but I think you might find this interesting:

  1. Create an ext4 filesystem with the fast_commit feature enabled (we use a 5.10 kernel).
  2. Create a DB with default settings (i.e. no unsafe settings; my only custom parameters are batch size and batch delay, which seem unrelated).
  3. Perform a server hard reset under write load.
  4. The DB cannot be opened; `bbolt check` reports:
    
    panic: assertion failed: Page expected to be: 23092, but self identifies as 0

goroutine 6 [running]:
go.etcd.io/bbolt._assert(...)
        /repo/bbolt/db.go:1359
go.etcd.io/bbolt.(*page).fastCheck(0x7f0b95834000, 0x5a34)
        /repo/bbolt/page.go:57 +0x1d9
go.etcd.io/bbolt.(*Tx).page(0x7f0b909be000?, 0x4f9cc0?)
        /repo/bbolt/tx.go:534 +0x79
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f0b8fe11000?, {0xc000024140?, 0x3, 0xa}, 0xc00008db68)
        /repo/bbolt/tx.go:546 +0x5d
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f0b8fe19000?, {0xc000024140?, 0x2, 0xa}, 0xc00008db68)
        /repo/bbolt/tx.go:555 +0xc8
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x0?, {0xc000024140?, 0x1, 0xa}, 0xc00008db68)
        /repo/bbolt/tx.go:555 +0xc8
go.etcd.io/bbolt.(*Tx).forEachPage(...)
        /repo/bbolt/tx.go:542
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00013e000, 0xc00002a280, 0xc00008deb0, 0xc00008dee0, {0x54b920?, 0x62ec60}, 0xc00007c0c0)
        /repo/bbolt/tx_check.go:83 +0x111
go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f0b8fe3e0a6?, 0xc000104d28?, 0xc00013e000?})
        /repo/bbolt/tx_check.go:110 +0x90
go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc00008dd70)
        /repo/bbolt/bucket.go:403 +0x96
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00013e000, 0xc00013e018, 0xc000104eb0, 0xc000104ee0, {0x54b920?, 0x62ec60}, 0xc00007c0c0)
        /repo/bbolt/tx_check.go:108 +0x252
go.etcd.io/bbolt.(*Tx).check(0xc00013e000, {0x54b920, 0x62ec60}, 0x0?)
        /repo/bbolt/tx_check.go:61 +0x365
created by go.etcd.io/bbolt.(*Tx).CheckWithOptions in goroutine 1
        /repo/bbolt/tx_check.go:31 +0x118


The last 64 pages of the file seem to be filled with zeroes.

When the ext4 `fast_commit` feature is disabled, our tests pass and the DB can be opened.
I have reproduced this on both 1.3.6 and 1.3.7.

Here is the output for `bbolt pages` https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730

Meta pages

$ ./bbolt page ./1 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:
Freelist:
HWM:
Txn ID:     1198
Checksum:   6397aaef7230fab5

$ ./bbolt page ./1 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:
Freelist:
HWM:
Txn ID:     1199
Checksum:   8ec692c6ba15e06b
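For background on why there are two meta pages with adjacent transaction ids: on open, bbolt validates both meta pages and uses the valid one with the higher txid. A simplified sketch of that selection logic (the real meta struct also carries the magic number, root, freelist, and high-water mark, and the validity check is an FNV-64a checksum):

```go
package main

import "fmt"

// meta is a simplified stand-in for bbolt's meta page.
type meta struct {
	txid       uint64
	checksumOK bool // real bbolt validates a magic number and an FNV-64a checksum
}

// pick mirrors the idea behind bbolt's recovery on Open: prefer the valid
// meta with the highest transaction id, falling back to the other one if
// one is corrupted. Returns -1 if neither meta is usable.
func pick(m0, m1 meta) int {
	switch {
	case m0.checksumOK && m1.checksumOK:
		if m0.txid >= m1.txid {
			return 0
		}
		return 1
	case m0.checksumOK:
		return 0
	case m1.checksumOK:
		return 1
	}
	return -1
}

func main() {
	// Mirroring the dump above: txn ids 1198 and 1199, both checksums valid,
	// so meta page 1 (txid 1199) wins.
	fmt.Println(pick(meta{1198, true}, meta{1199, true})) // 1
}
```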

ahrtr commented 1 year ago

Here is the output for bbolt pages https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730

There are only 2754 lines, but there are 23157 pages (each page is one line). Did you manually remove some lines from the output of `bbolt pages path-2-db`?

Just to double check: can you easily reproduce this issue with fast_commit enabled, and have you never reproduced it with fast_commit disabled?

What's OS and Arch?

Could you share your test case? Does your test case write any huge objects (e.g. an object > 10M)?

fyfyrchik commented 1 year ago

Did you manually remove some lines from the output of `bbolt pages path-2-db`?

No, that's the full output.

Just to double check: can you easily reproduce this issue with fast_commit enabled, and have you never reproduced it with fast_commit disabled?

Yes

What's OS and Arch?

Linux (Debian), kernel 5.10.

Could you share your test case? Does your test case write any huge objects (e.g. an object > 10M)?

All objects were less than or equal to 128 KiB. We have 3 buckets (for 32, 64, and 128 KiB objects respectively). On a single disk there are about 100 identical DBs being written to; we start the load, do multiple hard reboots during the night, and see the result in the morning.
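The size-class layout described above can be sketched as a small routing function (the bucket names here are hypothetical; the issue does not give the real ones):

```go
package main

import "fmt"

// bucketFor routes an object into one of the three size-class buckets
// described above (32, 64, and 128 KiB). The bucket names are made up
// for illustration.
func bucketFor(size int) (string, bool) {
	switch {
	case size <= 0:
		return "", false
	case size <= 32<<10:
		return "objects-32k", true
	case size <= 64<<10:
		return "objects-64k", true
	case size <= 128<<10:
		return "objects-128k", true
	default:
		return "", false // larger than anything in this workload
	}
}

func main() {
	for _, sz := range []int{4 << 10, 48 << 10, 128 << 10} {
		name, _ := bucketFor(sz)
		fmt.Printf("%d KiB -> %s\n", sz>>10, name)
	}
}
```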

ahrtr commented 1 year ago

Thanks @fyfyrchik for the feedback.

Could you share the db file?

  1. Perform server hard-reset under write load.

Just to double check, do you mean you intentionally & manually power off/on the linux server on which your test was running?

fyfyrchik commented 1 year ago

Yes, I can: the file is about 100 MB. Where should I send it?

Just to double check, do you mean you intentionally & manually power off/on the linux server on which your test was running?

Yes, we poweroff the storage node, loader is on a different one.

ahrtr commented 1 year ago

Yes, I can: the file is about ~100 MB, where should I send it?

Is it possible for you to put the file somewhere I can download it? If yes, please send the link to wachao@vmware.com. If not, could you try to transfer the file to me via Slack? Please reach out in the K8s workspace (Benjamin Wang).

Yes, we poweroff the storage node, loader is on a different one.

Thanks for the info. Do you mount the storage node on the loader, with the loader accessing the storage node over NFS?

ahrtr commented 1 year ago

Thanks @fyfyrchik for sharing the db file offline. The db file is corrupted. Unfortunately the existing check command can't provide any useful info.

I will enhance the existing check command.

fuweid commented 11 months ago

Hi @fyfyrchik ,

For "Perform server hard-reset under write load", did you mean something like `echo b > /proc/sysrq-trigger` to force a reboot?

fyfyrchik commented 11 months ago

Hi @fuweid, yes (or like `echo c`?). In our case, the reboot was done via the BMC console.

fuweid commented 11 months ago

@fyfyrchik thanks! I will try to reproduce this with dm-flakey.

fuweid commented 10 months ago

Updated.

@fyfyrchik since my local env is using a v6.4 kernel, I built a v5.10 kernel and ran it with QEMU.

#!/bin/bash

set -exuo pipefail

readonly db_file=/var/lib/etcd/xxx.db
mkdir -p /var/lib/etcd

if [[ -f "${db_file}" ]]; then
  ls -alh $db_file
  bbolt check $db_file
fi

bbolt bench -work -path $db_file -count=1000000000 -batch-size=100 -value-size=4096 2> /dev/null &
sleep $(( ( RANDOM % 10 )  + 1 ))
echo b > /proc/sysrq-trigger

I can reproduce this issue with main branch.

panic: assertion failed: Page expected to be: 56763, but self identifies as 0

goroutine 6 [running]:
go.etcd.io/bbolt/internal/common.Assert(...)
        /home/fuwei/workspace/bbolt/internal/common/verify.go:65
go.etcd.io/bbolt/internal/common.(*Page).FastCheck(0x7f59cd5bb000, 0xddbb)
        /home/fuwei/workspace/bbolt/internal/common/page.go:83 +0x1d9
go.etcd.io/bbolt.(*Tx).page(0x7f59cd51e000?, 0x5dc120?)
        /home/fuwei/workspace/bbolt/tx.go:545 +0x79
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c61b6000?, {0xc0000200f0?, 0x4, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:557 +0x5d
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c6cfe000?, {0xc0000200f0?, 0x3, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c6d95000?, {0xc0000200f0?, 0x2, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc000104b38?, {0xc0000200f0?, 0x1, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPage(...)
        /home/fuwei/workspace/bbolt/tx.go:553
go.etcd.io/bbolt.(*Tx).checkBucket(0xc000146000, 0xc000024140, 0xc000131eb0, 0xc000131ee0, {0x67bd40?, 0x80b420}, 0xc00007e0c0)
        /home/fuwei/workspace/bbolt/tx_check.go:82 +0x111
go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f59c6d96020?, 0xc000146000?, 0xc000104d20?})
        /home/fuwei/workspace/bbolt/tx_check.go:109 +0x90
go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc000131d68)
        /home/fuwei/workspace/bbolt/bucket.go:451 +0x96
go.etcd.io/bbolt.(*Tx).checkBucket(0xc000146000, 0xc000146018, 0xc000104eb0, 0xc000104ee0, {0x67bd40?, 0x80b420}, 0xc00007e0c0)
        /home/fuwei/workspace/bbolt/tx_check.go:107 +0x252
go.etcd.io/bbolt.(*Tx).check(0xc000146000, {0x67bd40, 0x80b420}, 0x0?)
        /home/fuwei/workspace/bbolt/tx_check.go:60 +0x365
created by go.etcd.io/bbolt.(*Tx).Check in goroutine 1
        /home/fuwei/workspace/bbolt/tx_check.go:30 +0x118

56886    unknown<00> 0
56887    unknown<00> 0
56888    unknown<00> 0
56889    unknown<00> 0
56890    unknown<00> 0
56891    unknown<00> 0
56892    unknown<00> 0
56893    unknown<00> 0
56894    unknown<00> 0
56895    unknown<00> 0
56896    unknown<00> 0
56897    unknown<00> 0
56898    unknown<00> 0
56899    unknown<00> 0
56900    unknown<00> 0
56901    unknown<00> 0
56902    unknown<00> 0
56903    unknown<00> 0
56904    unknown<00> 0
56905    unknown<00> 0
56906    unknown<00> 0
56907    unknown<00> 0
56908    unknown<00> 0
56909    unknown<00> 0
56910    unknown<00> 0
56911    unknown<00> 0
56912    unknown<00> 0
56913    free
56914    free
56915    leaf       2      2
56918    leaf       2      2
56921    leaf       2      2
56924    leaf       2      2
56927    leaf       2      2
56930    leaf       2      2
56933    leaf       2      2
56936    leaf       2      2
56939    leaf       2      2
56942    leaf       2      2
56945    leaf       2      2
56948    leaf       2      2
56951    leaf       2      2
56954    leaf       2      2
56957    leaf       2      2

I am not sure it's related to fast-commit because I didn't find a related patch for this. Trying to test it without fast-commit.

cc @ahrtr

fuweid commented 10 months ago

Found a related patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully...

@fyfyrchik I would suggest upgrading the kernel or disabling the fast_commit option :joy:

NOTE:

Another patch is here: https://lore.kernel.org/lkml/202201091544.W5HHEXAp-lkp@intel.com/T/, which is covered by https://kernel.googlesource.com/pub/scm/fs/xfs/xfstests-dev/+/refs/heads/master/tests/generic/455

fuweid commented 10 months ago

I can't reproduce this issue on a newer kernel with the fast_commit option, or on v5.10 without it.

I talked to the patch's author, and I believe this is a kernel bug rather than a bbolt bug.

@fyfyrchik Thanks for the report!

ahrtr commented 10 months ago

Found a related patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully...

Great finding.

Just as we discussed earlier today, we have received around 18 data corruption cases in the past 2 years. Most likely fast commit wasn't enabled at all in most of those cases.

ahrtr commented 10 months ago

Also, let's add a known issue to the README to clarify this.

fuweid commented 10 months ago

Can we also keep running the test ("bbolt xxxx; echo b > /proc/sysrq-trigger") for a long time (e.g. > 4 weeks)?

Will do, and I'll think about how to submit it as code in the repo. Assign this issue to me.

/assign

fyfyrchik commented 10 months ago

@fuweid thanks for your research! In case anyone is wondering, I have tracked down the version in which this patch was added: the fix from the second link is present in 5.17 (handle parameter) https://elixir.bootlin.com/linux/v5.17/source/fs/ext4/fast_commit.c#L307 and in 5.15.144, 5.16.... with the latest fixes. So to fix the problem one might want to use a kernel >= 5.17.

fuweid commented 10 months ago

Thanks @fyfyrchik! Yeah, fast commit is a new feature, and when we enable it, we need to run failpoint test cases against it. That applies not just to bbolt but also to other applications which care about data persistence.

ahrtr commented 8 months ago

For anyone's reference: from the application's (bbolt's, in this case) perspective, the root cause of the issue is that the system call fdatasync returns success (err == nil) but the kernel hasn't actually synced the data to disk. This can lead to a situation where the meta page is persisted successfully while some data pages are not, so some page ids end up pointing to corrupted data pages.