etcd-io / bbolt

An embedded key/value database for Go.
https://go.etcd.io/bbolt
MIT License

Enhance the test to reproduce data corruption issues #568

Open ahrtr opened 1 year ago

ahrtr commented 1 year ago

See https://github.com/etcd-io/etcd/issues/16596#issuecomment-1719519519

ahrtr commented 1 year ago

@fuweid are you interested in working on this? The goal is to try to reproduce the data corruption issues.

My high-level thought is:

We can discuss the details at the next meeting.

Note: the goal is to reproduce the data corruption issues, not to verify correctness.

Reference: https://github.com/dsrhaslab/lazyfs

cc @tjungblu

fuweid commented 1 year ago

Sure! Let me take this one.

fuweid commented 11 months ago

Hi @ahrtr, I tried using a dm-flakey device with drop_writes at the block layer to cause data corruption. The test case reproduces the data corruption reliably.

$ go test -c -o /tmp/test ./datacorruption/boltdb && sudo /tmp/test -test.v
=== RUN   TestDropWritesDuringBench
    main_test.go:56: Init empty bbolt database with 128 MiB
    main_test.go:60: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:63: Start to run bbolt-bench
    main_test.go:85: Drop all the write IOs after 3 seconds
    main_test.go:89: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:92: Start to allow all the write IOs for 2 seconds
    main_test.go:96: Kill the bbolt process and simulate power failure
    main_test.go:101: Invoke bbolt check to verify data
    main_test.go:103:
                Error Trace:    /home/fuwei/workspace/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:103
                Error:          Received unexpected error:
                                exit status 2
                Test:           TestDropWritesDuringBench
                Messages:       bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>

                                goroutine 1 [running]:
                                go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
                                        /home/fuwei/workspace/bbolt/freelist.go:270 +0x199
                                go.etcd.io/bbolt.(*DB).loadFreelist.func1()
                                        /home/fuwei/workspace/bbolt/db.go:400 +0xc5
                                sync.(*Once).doSlow(0xc0001301c0?, 0x584020?)
                                        /usr/local/go/src/sync/once.go:74 +0xc2
                                sync.(*Once).Do(...)
                                        /usr/local/go/src/sync/once.go:65
                                go.etcd.io/bbolt.(*DB).loadFreelist(0xc000130000?)
                                        /home/fuwei/workspace/bbolt/db.go:393 +0x47
                                go.etcd.io/bbolt.Open({0x7ffc60ce9530, 0x38}, 0x670060?, 0xc00005fc18)
                                        /home/fuwei/workspace/bbolt/db.go:275 +0x44f
                                main.(*checkCommand).Run(0xc000137e58, {0xc0000161a0, 0x1, 0x1})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:212 +0x1e5
                                main.(*Main).Run(0xc00005ff40, {0xc000016190?, 0xc0000061a0?, 0x200000003?})
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:124 +0x4d4
                                main.main()
                                        /home/fuwei/workspace/bbolt/cmd/bbolt/main.go:62 +0xae
--- FAIL: TestDropWritesDuringBench (8.29s)
FAIL

I think if LazyFS can drop writes silently, it would be perfect for this simulation. We can discuss this at the next meeting.
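For readers unfamiliar with dm-flakey: the drop_writes behaviour used in the test above is selected through the device-mapper table line. Below is a minimal Go sketch of how such a line could be assembled, assuming the standard dm-flakey table layout. The helper name flakeyTable, the device path, and the sector count are made up for illustration and are not go-dmflakey's API.

```go
package main

import (
	"fmt"
	"strings"
)

// flakeyTable builds one device-mapper table line for the dm-flakey target.
// Assumed layout:
//   <start> <sectors> flakey <dev> <offset> <up> <down> [<num_features> <features...>]
// Each element of features must be a single word (for example "drop_writes").
// This helper is hypothetical and only formats a string; it does not call dmsetup.
func flakeyTable(sectors uint64, dev string, upSec, downSec int, features ...string) string {
	line := fmt.Sprintf("0 %d flakey %s 0 %d %d", sectors, dev, upSec, downSec)
	if len(features) > 0 {
		// num_features is the number of feature argument words that follow.
		line += fmt.Sprintf(" %d %s", len(features), strings.Join(features, " "))
	}
	return line
}

func main() {
	// Hypothetical device: 128 MiB expressed in 512-byte sectors.
	const sectors = 262144
	// up=0, down=1 keeps the device permanently in its "down" state,
	// where drop_writes silently discards writes while reads still succeed.
	fmt.Println(flakeyTable(sectors, "/dev/loop0", 0, 1, "drop_writes"))
	// prints: 0 262144 flakey /dev/loop0 0 0 1 1 drop_writes
}
```

The resulting string is what would be handed to dmsetup when loading the table. Because the writes are dropped below the filesystem, the application still sees every write and sync succeed.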

ahrtr commented 11 months ago

@fuweid thx for the test case.

We can discuss this further at Monday's meeting.

ahrtr commented 11 months ago

Confirmed that neither file.WriteAt nor fdatasync ever fails when running your test case.

I added logging in https://github.com/etcd-io/bbolt/pull/616 and also updated your test case in https://github.com/fuweid/go-dmflakey/pull/1. So this should be an issue with the test case (specifically, with the flaky filesystem) rather than with bbolt.

# go test -v
=== RUN   TestDropWritesDuringBench
    main_test.go:57: Init empty bbolt database with 128 MiB
    main_test.go:61: Ensure the empty boltdb data persisted in the flakey device
    main_test.go:64: Start to run bbolt-bench
    main_test.go:92: Drop all the write IOs after 3 seconds
    main_test.go:96: Let bbolt-bench run with DropWrites mode for 3 seconds
    main_test.go:99: Start to allow all the write IOs for 2 seconds
    main_test.go:103: Kill the bbolt process and simulate power failure
    main_test.go:86: ####### bbolt output: 
         work: /tmp/TestDropWritesDuringBench3215672155/001/root/boltdb
        starting write benchmark.
        Completed 910 requests, 909/s 
        Completed 1805 requests, 894/s 
        Completed 2705 requests, 899/s 
        Completed 79925 requests, 77209/s 
        Completed 154430 requests, 74501/s 
        Completed 226100 requests, 71659/s 
        Completed 227845 requests, 1744/s 
        Completed 228710 requests, 864/s 
    main_test.go:115: Invoke bbolt check to verify data: "/tmp/boltdb"
    main_test.go:117: 
            Error Trace:    /root/go/src/github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb/main_test.go:117
            Error:          Received unexpected error:
                            exit status 2
            Test:           TestDropWritesDuringBench
            Messages:       bbolt check output: panic: invalid freelist page: 0, page type is unknown<00>

                            goroutine 1 [running]:
                            go.etcd.io/bbolt.(*freelist).read(0x0?, 0x0?)
                                /root/go/src/github.com/ahrtr/bbolt/freelist.go:270 +0x179
                            go.etcd.io/bbolt.(*DB).loadFreelist.func1()
                                /root/go/src/github.com/ahrtr/bbolt/db.go:400 +0xb8
                            sync.(*Once).doSlow(0xc0001341c0?, 0x581580?)
                                /root/software/go/src/sync/once.go:74 +0xbf
                            sync.(*Once).Do(...)
                                /root/software/go/src/sync/once.go:65
                            go.etcd.io/bbolt.(*DB).loadFreelist(0xc000134000?)
                                /root/go/src/github.com/ahrtr/bbolt/db.go:393 +0x45
                            go.etcd.io/bbolt.Open({0x7fff4ee2c62a, 0xb}, 0x677960?, 0xc00005fbd0)
                                /root/go/src/github.com/ahrtr/bbolt/db.go:275 +0x425
                            main.(*checkCommand).Run(0xc00013be18, {0xc000016140, 0x1, 0x1})
                                /root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:212 +0x1df
                            main.(*Main).Run(0xc00005ff00, {0xc000016130?, 0xc000006340?, 0x200000003?})
                                /root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:124 +0x469
                            main.main()
                                /root/go/src/github.com/ahrtr/bbolt/cmd/bbolt/main.go:62 +0xa6
--- FAIL: TestDropWritesDuringBench (8.35s)
FAIL
exit status 1
FAIL    github.com/fuweid/go-dmflakey/contrib/datacorruption/boltdb 8.350s

fuweid commented 11 months ago

Hi @ahrtr, the original case is meant to simulate data loss at the device level. The DropWrites failpoint drops all the data submitted by fdatasync or fsync. It's like a power failure that causes loss of committed data.

I was trying to use the following script in a pod:

run-bench-test &
sleep random-seconds
echo b > /proc/sysrq-trigger

I ran the pod as a DaemonSet in the Kubernetes cluster, but it couldn't reproduce the data corruption. So I injected a failpoint at the device level to drop the writes, which seems to change one byte in the data.

I think we can add a failpoint between writeAt and fdatasync so that we can see how recovery behaves after a power failure.
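To make the idea concrete, here is a rough Go sketch of a failpoint placed between the write and the sync. It is only an illustration under simplifying assumptions, not bbolt's code: memFile, commit, and errPowerFailure are hypothetical names, and the in-memory "file" assumes a power cut loses exactly the unsynced data, whereas a real device may persist part of it.

```go
package main

import (
	"errors"
	"fmt"
)

// errPowerFailure is a hypothetical sentinel returned by the injected failpoint.
var errPowerFailure = errors.New("simulated power failure")

// memFile is a toy stand-in for a file: WriteAt lands in a volatile cache,
// and only Sync copies the cache to durable storage.
type memFile struct {
	cache   []byte
	durable []byte
}

func (f *memFile) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("negative offset")
	}
	end := int(off) + len(p)
	if end > len(f.cache) {
		f.cache = append(f.cache, make([]byte, end-len(f.cache))...)
	}
	copy(f.cache[off:], p)
	return len(p), nil
}

func (f *memFile) Sync() error {
	f.durable = append(f.durable[:0], f.cache...)
	return nil
}

// crash drops everything that was never synced (simplified power-cut model).
func (f *memFile) crash() {
	f.cache = append([]byte(nil), f.durable...)
}

// commit writes data and then syncs it. beforeSync is the failpoint:
// a non-nil error aborts the commit after the write but before the sync.
func commit(f *memFile, data []byte, off int64, beforeSync func() error) error {
	if _, err := f.WriteAt(data, off); err != nil {
		return err
	}
	if beforeSync != nil {
		if err := beforeSync(); err != nil {
			return err
		}
	}
	return f.Sync()
}

func main() {
	f := &memFile{}
	_ = commit(f, []byte("meta0"), 0, nil) // written and synced
	err := commit(f, []byte("meta1"), 0, func() error { return errPowerFailure })
	f.crash()
	fmt.Printf("err=%v durable=%q\n", err, f.durable)
	// prints: err=simulated power failure durable="meta0"
}
```

In a real test the hook would kill the process or cut the device instead of returning an error, and the check afterwards would be whether the database still opens and passes bbolt check.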

We can sync on the details in tomorrow's meeting.

ahrtr commented 11 months ago

@fuweid can you add the test case TestDropWritesDuringBench (possibly under a different name) to bbolt?

fuweid commented 11 months ago

@ahrtr Hi, I am still working on this. I will file a pull request when it's ready.