Open ahrtr opened 1 year ago
@fuweid are you interested in working on this? The goal is to try to reproduce the data corruption issues.
My high level thought is:
We can discuss the details on next meeting.
Note: the goal is to reproduce the data corruption issues instead of verify correctness.
Reference: https://github.com/dsrhaslab/lazyfs
cc @tjungblu
Sure! Let me take this one.
Hi @ahrtr , I tried to use dm-flakey device to drop_writes during block layer and cause data corruption. The test case is stable to reproduce the data corruption.
$ go test -c -o /tmp/test ./datacorruption/boltdb && sudo /tmp/test -test.v
=== RUN TestDropWritesDuringBench
main_test.go:56: Init empty bbolt database with 128 MiB
main_test.go:60: Ensure the empty boltdb data persisted in the flakey device
main_test.go:63: Start to run bbolt-bench
main_test.go:85: Drop all the write IOs after 3 seconds
main_test.go:89: Let bbolt-bench run with DropWrites mode for 3 seconds
main_test.go:92: Start to allow all the write IOs for 2 seconds
main_test.go:96: Kill the bbolt process and simulate power failure
main_test.go:101: Invoke bbolt check to verify data
main_test.go:103:
Error Trace: /home/fuwei/workspace/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:103
Error: Received unexpected error:
exit status 2
Test: TestDropWritesDuringBench
Messages: bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
goroutine 1 [running]:
go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
/home/fuwei/workspace/bbolt/freelist.go:270 +0x199
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/home/fuwei/workspace/bbolt/db.go:400 +0xc5
sync.(*Once).doSlow(0xc0001301c0?, 0x584020?)
/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/usr/local/go/src/sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000130000?)
/home/fuwei/workspace/bbolt/db.go:393 +0x47
go.etcd.io/bbolt.Open({0x7ffc60ce9530, 0x38}, 0x670060?, 0xc00005fc18)
/home/fuwei/workspace/bbolt/db.go:275 +0x44f
main.(*checkCommand).Run(0xc000137e58, {0xc0000161a0, 0x1, 0x1})
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:212 +0x1e5
main.(*Main).Run(0xc00005ff40, {0xc000016190?, 0xc0000061a0?, 0x200000003?})
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:124 +0x4d4
main.main()
/home/fuwei/workspace/bbolt/cmd/bbolt/main.go:62 +0xae
--- FAIL: TestDropWritesDuringBench (8.29s)
FAIL
I think if the lazyFS can drop writes silently, that will be perfect for simulation. We can discuss this in next meeting.
@fuweid thx for the test case.
We can have more discussion on Monday's meeting.
Confirmed that the file.WriteAt
never fails when running your test case, nor the fdatasync
.
I added log in https://github.com/etcd-io/bbolt/pull/616, also updated your test case in https://github.com/fuweid/go-dmflakey/pull/1. So it should be the test case's issue (specifically it should be flaky filesystem's issue) instead of bbolt's issue.
# go test -v
=== RUN TestDropWritesDuringBench
main_test.go:57: Init empty bbolt database with 128 MiB
main_test.go:61: Ensure the empty boltdb data persisted in the flakey device
main_test.go:64: Start to run bbolt-bench
main_test.go:92: Drop all the write IOs after 3 seconds
main_test.go:96: Let bbolt-bench run with DropWrites mode for 3 seconds
main_test.go:99: Start to allow all the write IOs for 2 seconds
main_test.go:103: Kill the bbolt process and simulate power failure
main_test.go:86: ####### bbolt output:
work: /tmp/TestDropWritesDuringBench3215672155/001/root/boltdb
starting write benchmark.
Completed 910 requests, 909/s
Completed 1805 requests, 894/s
Completed 2705 requests, 899/s
Completed 79925 requests, 77209/s
Completed 154430 requests, 74501/s
Completed 226100 requests, 71659/s
Completed 227845 requests, 1744/s
Completed 228710 requests, 864/s
main_test.go:115: Invoke bbolt check to verify data: "/tmp/boltdb"
main_test.go:117:
Error Trace: /root/go/src/github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:117
Error: Received unexpected error:
exit status 2
Test: TestDropWritesDuringBench
Messages: bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>
goroutine 1 [running]:
go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
/root/go/src/github.com/ahrtr/bbolt/freelist.go:270 +0x179
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
/root/go/src/github.com/ahrtr/bbolt/db.go:400 +0xb8
sync.(*Once).doSlow(0xc0001341c0?, 0x581580?)
/root/software/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
/root/software/go/src/sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000134000?)
/root/go/src/github.com/ahrtr/bbolt/db.go:393 +0x45
go.etcd.io/bbolt.Open({0x7fff4ee2c62a, 0xb}, 0x677960?, 0xc00005fbd0)
/root/go/src/github.com/ahrtr/bbolt/db.go:275 +0x425
main.(*checkCommand).Run(0xc00013be18, {0xc000016140, 0x1, 0x1})
/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:212 +0x1df
main.(*Main).Run(0xc00005ff00, {0xc000016130?, 0xc000006340?, 0x200000003?})
/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:124 +0x469
main.main()
/root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:62 +0xa6
--- FAIL: TestDropWritesDuringBench (8.35s)
FAIL
exit status 1
FAIL github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb 8.350s
Hi @ahrtr , The original case is used to simulate the data loss in the device level. The DropWrites failpoint is to drop all the data submitted by fdatasync
or fsync
. It's like power failure and cause data committed loss.
I was trying to use the following script in pod:
run-bench-test &
sleep random-seconds
echo b > /proc/sysrq-trigger
And run pod as daemonset in the kubernetes cluster. It can't reproduce the data corruption. So I inject failpoint in device level to drop the writes, which seems to change one byte in the data.
I think we can add failpoint between writeAt
and fdatasync
so that we can see the recovery after power failure.
We can sync the detail in tomorrow meeting.
@fuweid can you add the test case TestDropWritesDuringBench (Possibly need to change the name) into bbolt?
@ahrtr Hi, I am still working on this. will file pull request when it's ready
See https://github.com/etcd-io/etcd/issues/16596#issuecomment-1719519519