Here is the output for bbolt pages: https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730
There are only 2754 lines, but there are 23157 pages (each page has one line). Did you manually remove some lines from the output of bbolt pages path-2-db?
Just to double check, can you easily reproduce this issue with fast_commit enabled, and have you never reproduced it with fast_commit disabled?
What's the OS and arch?
Could you share your test case? Does your test case write any huge objects (e.g. an object > 10M)?
Did you manually remove some lines from the output of bbolt pages path-2-db?
No, that's the full output.
Just to double check, can you easily reproduce this issue with fast_commit enabled, and have you never reproduced it with fast_commit disabled?
Yes
What's the OS and arch?
Linux (Debian), 5.10 arch
Could you share your test case? Does your test case write any huge objects (e.g. an object > 10M)?
All objects were 128 KiB or smaller. We have 3 buckets (for 32, 64, and 128 KiB objects respectively). On a single disk there are about 100 identical DBs being written to; we start the load, do multiple hard reboots during the night, and look at the result in the morning.
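For illustration, here is a minimal Go sketch of what such a write load could look like against a single DB. This is not our actual loader code; the bucket names, key scheme, and one-transaction-per-object loop are assumptions made for the example.

// Illustrative sketch only: a simplified bbolt write load with three
// size-class buckets. Bucket names, keys, and the loop are assumptions.
package main

import (
	"crypto/rand"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("objects.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One bucket per object size class, matching the setup described above.
	sizes := map[string]int{
		"objects-32k":  32 << 10,
		"objects-64k":  64 << 10,
		"objects-128k": 128 << 10,
	}

	for i := 0; ; i++ {
		for bucket, size := range sizes {
			val := make([]byte, size)
			if _, err := rand.Read(val); err != nil {
				log.Fatal(err)
			}
			key := []byte(fmt.Sprintf("obj-%d", i))
			// One read-write transaction per object, as a simple
			// approximation of the real load.
			if err := db.Update(func(tx *bolt.Tx) error {
				b, err := tx.CreateBucketIfNotExists([]byte(bucket))
				if err != nil {
					return err
				}
				return b.Put(key, val)
			}); err != nil {
				log.Fatal(err)
			}
		}
	}
}

The real test runs roughly 100 such loads against separate DB files on the same disk while the server is hard-rebooted.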
thx @fyfyrchik for the feedback.
Could you share the db file?
- Perform server hard-reset under write load.
Just to double check, do you mean you intentionally & manually power off/on the linux server on which your test was running?
Yes, I can: the file is ~100 MB; where should I send it?
Just to double check, do you mean you intentionally & manually power off/on the linux server on which your test was running?
Yes, we power off the storage node; the loader is on a different one.
Yes, I can: the file is ~100 MB; where should I send it?
Is it possible for you to put the file somewhere that I can download it from? If yes, please send the link to wachao@vmware.com. If not, could you try to transfer the file to me via Slack? Please reach out in the K8s workspace (Benjamin Wang).
Yes, we power off the storage node; the loader is on a different one.
Thanks for the info. You mount the storage node on the loader, and the loader accesses the storage node via NFS?
Thanks @fyfyrchik for sharing the db file offline. The db file is corrupted. Unfortunately the existing check command can't provide any useful info.
I will enhance the existing check command:
Hi @fyfyrchik ,
For the "Perform server hard-reset under write load" step, did you mean something like "echo b > /proc/sysrq-trigger" to force a reboot?
Hi @fuweid,
Yes (or like "echo c"?). In our case, the reboot was done via the BMC console.
@fyfyrchik thanks! I will try to reproduce this with dm-flakey.
Updated.
@fyfyrchik since my local env is using a v6.4 kernel, I built a v5.10 kernel and ran it with qemu.
#!/bin/bash
# Reproduction script: on each boot, first check the db file left over from
# the previous run, then start a bbolt write load and force an immediate
# reboot via sysrq while writes are in flight.
set -exuo pipefail

readonly db_file=/var/lib/etcd/xxx.db

mkdir -p /var/lib/etcd
if [[ -f "${db_file}" ]]; then
  ls -alh "${db_file}"
  bbolt check "${db_file}"
fi

# Start the write load in the background ...
bbolt bench -work -path "${db_file}" -count=1000000000 -batch-size=100 -value-size=4096 2> /dev/null &

# ... let it run for 1-10 seconds, then trigger an immediate reboot (sysrq-b).
sleep $(( ( RANDOM % 10 ) + 1 ))
echo b > /proc/sysrq-trigger
I can reproduce this issue with the main branch.
panic: assertion failed: Page expected to be: 56763, but self identifies as 0
goroutine 6 [running]:
go.etcd.io/bbolt/internal/common.Assert(...)
/home/fuwei/workspace/bbolt/internal/common/verify.go:65
go.etcd.io/bbolt/internal/common.(*Page).FastCheck(0x7f59cd5bb000, 0xddbb)
/home/fuwei/workspace/bbolt/internal/common/page.go:83 +0x1d9
go.etcd.io/bbolt.(*Tx).page(0x7f59cd51e000?, 0x5dc120?)
/home/fuwei/workspace/bbolt/tx.go:545 +0x79
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c61b6000?, {0xc0000200f0?, 0x4, 0xa}, 0xc000131b60)
/home/fuwei/workspace/bbolt/tx.go:557 +0x5d
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c6cfe000?, {0xc0000200f0?, 0x3, 0xa}, 0xc000131b60)
/home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c6d95000?, {0xc0000200f0?, 0x2, 0xa}, 0xc000131b60)
/home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc000104b38?, {0xc0000200f0?, 0x1, 0xa}, 0xc000131b60)
/home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPage(...)
/home/fuwei/workspace/bbolt/tx.go:553
go.etcd.io/bbolt.(*Tx).checkBucket(0xc000146000, 0xc000024140, 0xc000131eb0, 0xc000131ee0, {0x67bd40?, 0x80b420}, 0xc00007e0c0)
/home/fuwei/workspace/bbolt/tx_check.go:82 +0x111
go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f59c6d96020?, 0xc000146000?, 0xc000104d20?})
/home/fuwei/workspace/bbolt/tx_check.go:109 +0x90
go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc000131d68)
/home/fuwei/workspace/bbolt/bucket.go:451 +0x96
go.etcd.io/bbolt.(*Tx).checkBucket(0xc000146000, 0xc000146018, 0xc000104eb0, 0xc000104ee0, {0x67bd40?, 0x80b420}, 0xc00007e0c0)
/home/fuwei/workspace/bbolt/tx_check.go:107 +0x252
go.etcd.io/bbolt.(*Tx).check(0xc000146000, {0x67bd40, 0x80b420}, 0x0?)
/home/fuwei/workspace/bbolt/tx_check.go:60 +0x365
created by go.etcd.io/bbolt.(*Tx).Check in goroutine 1
/home/fuwei/workspace/bbolt/tx_check.go:30 +0x118
56886 unknown<00> 0
56887 unknown<00> 0
56888 unknown<00> 0
56889 unknown<00> 0
56890 unknown<00> 0
56891 unknown<00> 0
56892 unknown<00> 0
56893 unknown<00> 0
56894 unknown<00> 0
56895 unknown<00> 0
56896 unknown<00> 0
56897 unknown<00> 0
56898 unknown<00> 0
56899 unknown<00> 0
56900 unknown<00> 0
56901 unknown<00> 0
56902 unknown<00> 0
56903 unknown<00> 0
56904 unknown<00> 0
56905 unknown<00> 0
56906 unknown<00> 0
56907 unknown<00> 0
56908 unknown<00> 0
56909 unknown<00> 0
56910 unknown<00> 0
56911 unknown<00> 0
56912 unknown<00> 0
56913 free
56914 free
56915 leaf 2 2
56918 leaf 2 2
56921 leaf 2 2
56924 leaf 2 2
56927 leaf 2 2
56930 leaf 2 2
56933 leaf 2 2
56936 leaf 2 2
56939 leaf 2 2
56942 leaf 2 2
56945 leaf 2 2
56948 leaf 2 2
56951 leaf 2 2
56954 leaf 2 2
56957 leaf 2 2
I am not sure it's related to fast-commit because I didn't find a related patch for this. Trying to test it without fast-commit.
cc @ahrtr
Found a related patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully...
@fyfyrchik I would suggest upgrading the kernel or disabling the fc option :joy:
NOTE:
Another patch is here: https://lore.kernel.org/lkml/202201091544.W5HHEXAp-lkp@intel.com/T/, which is covered by https://kernel.googlesource.com/pub/scm/fs/xfs/xfstests-dev/+/refs/heads/master/tests/generic/455
I can't reproduce this issue on a newer kernel with the fc option, or on v5.10 without the fc option.
I also talked to the patch's author, and I believe this is a kernel bug rather than a bbolt bug.
@fyfyrchik Thanks for reporting that!
Found a related patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully...
Great finding.
Just as we discussed earlier today, we received around 18 data corruption cases in the past 2 years. Most likely fast commit wasn't enabled at all in most cases.
Can we also keep running the test "bbolt xxxx; echo b > /proc/sysrq-trigger" for a long time (e.g. > 4 weeks)? Also, please let's add a known issue in the readme to clarify.
Can we also keep running the test "bbolt xxxx; echo b > /proc/sysrq-trigger" for a long time (e.g. > 4 weeks)?
Will do, and I'll think about how to submit it as code in the repo. Assign this issue to me.
/assign
@fuweid thanks for your research!
In case anyone is wondering, I have tracked down the kernel version in which this patch was added:
The second link is present in 5.17 (the handle parameter): https://elixir.bootlin.com/linux/v5.17/source/fs/ext4/fast_commit.c#L307
And in 5.15.144, 5.16... with the latest fixes.
So to fix the problem, one might want to try a >= 5.17 kernel.
Thanks @fyfyrchik! Yeah, fast commit is a new feature, and when we enable it, we need to run failpoint test cases on it. It's not just bbolt, but also other applications which care about data persistence.
For anyone's reference, from the application's (bbolt in this case) perspective, the root cause of the issue is that the Fdatasync system call returns successfully (err == nil) while the kernel hasn't actually made the data durable yet. This can lead to a situation where the meta page is persisted successfully but some data pages are not, so eventually some pageIds point to corrupted data pages.
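To make the failure mode concrete, here is a much-simplified Go sketch (not bbolt's actual implementation) of the two-step commit ordering that depends on fdatasync being truthful; os.File.Sync is used here as a stand-in for fdatasync, and the page layout is purely illustrative.

// Simplified sketch of a two-phase page commit. If the first Sync reports
// success without actually making the data pages durable, a crash can leave
// a valid-looking meta page that references garbage data pages.
package main

import (
	"log"
	"os"
)

const pageSize = 4096

func commit(f *os.File, dataPages map[int64][]byte, metaPage []byte) error {
	// 1. Write the dirty branch/leaf/freelist pages at their offsets.
	for id, page := range dataPages {
		if _, err := f.WriteAt(page, id*pageSize); err != nil {
			return err
		}
	}
	// 2. Make those pages durable BEFORE touching the meta page.
	//    Correctness depends on this sync actually reaching disk.
	if err := f.Sync(); err != nil {
		return err
	}
	// 3. Only then write the meta page that points at the new pages...
	if _, err := f.WriteAt(metaPage, 0); err != nil {
		return err
	}
	// 4. ...and make the meta page durable as well.
	return f.Sync()
}

func main() {
	f, err := os.OpenFile("example.db", os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Hypothetical pages, for illustration only.
	data := map[int64][]byte{2: make([]byte, pageSize)}
	meta := make([]byte, pageSize)
	if err := commit(f, data, meta); err != nil {
		log.Fatal(err)
	}
}

If the kernel acknowledges the sync in step 2 without actually persisting the pages, a crash after step 4 can leave a meta page with a valid txid and checksum whose pageIds reference stale or zeroed pages, which matches the "Page expected to be ..., but self identifies as 0" assertion seen above.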
Hello! I am not sure whether this is a bug in bolt, but I think you might find this interesting:
- The DB is on an ext4 partition with the fast_commit feature enabled (we use a 5.10 kernel)
- Perform server hard-reset under write load.
- bbolt check reports:

goroutine 6 [running]:
go.etcd.io/bbolt._assert(...)
	/repo/bbolt/db.go:1359
go.etcd.io/bbolt.(*page).fastCheck(0x7f0b95834000, 0x5a34)
	/repo/bbolt/page.go:57 +0x1d9
go.etcd.io/bbolt.(*Tx).page(0x7f0b909be000?, 0x4f9cc0?)
	/repo/bbolt/tx.go:534 +0x79
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f0b8fe11000?, {0xc000024140?, 0x3, 0xa}, 0xc00008db68)
	/repo/bbolt/tx.go:546 +0x5d
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f0b8fe19000?, {0xc000024140?, 0x2, 0xa}, 0xc00008db68)
	/repo/bbolt/tx.go:555 +0xc8
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x0?, {0xc000024140?, 0x1, 0xa}, 0xc00008db68)
	/repo/bbolt/tx.go:555 +0xc8
go.etcd.io/bbolt.(*Tx).forEachPage(...)
	/repo/bbolt/tx.go:542
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00013e000, 0xc00002a280, 0xc00008deb0, 0xc00008dee0, {0x54b920?, 0x62ec60}, 0xc00007c0c0)
	/repo/bbolt/tx_check.go:83 +0x111
go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f0b8fe3e0a6?, 0xc000104d28?, 0xc00013e000?})
	/repo/bbolt/tx_check.go:110 +0x90
go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc00008dd70)
	/repo/bbolt/bucket.go:403 +0x96
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00013e000, 0xc00013e018, 0xc000104eb0, 0xc000104ee0, {0x54b920?, 0x62ec60}, 0xc00007c0c0)
	/repo/bbolt/tx_check.go:108 +0x252
go.etcd.io/bbolt.(*Tx).check(0xc00013e000, {0x54b920, 0x62ec60}, 0x0?)
	/repo/bbolt/tx_check.go:61 +0x365
created by go.etcd.io/bbolt.(*Tx).CheckWithOptions in goroutine 1
	/repo/bbolt/tx_check.go:31 +0x118
$ ./bbolt page ./1 0
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root:
Freelist:
HWM:
Txn ID: 1198
Checksum: 6397aaef7230fab5

$ ./bbolt page ./1 1
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root:
Freelist:
HWM:
Txn ID: 1199
Checksum: 8ec692c6ba15e06b