Crash when trying to open corrupted database

tmm1 commented 6 years ago

I have an app that takes regular backups of boltdb databases. Sometimes, for unknown reasons, the backups are corrupted.

I also have a restore UI that lets me browse and read from backups. Trying to open and read from these corrupted databases crashes my process. I'm using 4f5275f4ebbf6fe7cb772de987fa96ee674460a7

unexpected fault address 0x8a6b008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8a6b008 pc=0x42e0e2f]

goroutine 12 [running]:
runtime.throw(0x4a487e4, 0x5)
    /usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc4206eee00 sp=0xc4206eede0 pc=0x402d5b1
runtime.sigpanic()
    /usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc4206eee50 sp=0xc4206eee00 pc=0x4042de1
github.com/coreos/bbolt.(*Cursor).search(0xc4206eefe0, 0xc4206ef118, 0x6, 0x20, 0x63)
    .go/src/github.com/coreos/bbolt/cursor.go:255 +0x5f fp=0xc4206eef08 sp=0xc4206eee50 pc=0x42e0e2f
github.com/coreos/bbolt.(*Cursor).seek(0xc4206eefe0, 0xc4206ef118, 0x6, 0x20, 0x0, 0x0, 0x4063d84, 0x614e000, 0x0, 0x48d8300, ...)
    .go/src/github.com/coreos/bbolt/cursor.go:159 +0xa5 fp=0xc4206eef58 sp=0xc4206eef08 pc=0x42e0725
github.com/coreos/bbolt.(*Bucket).Bucket(0xc4204976d8, 0xc4206ef118, 0x6, 0x20, 0xc4206ef118)
    .go/src/github.com/coreos/bbolt/bucket.go:105 +0xde fp=0xc4206ef010 sp=0xc4206eef58 pc=0x42dc66e
github.com/coreos/bbolt.(*Tx).Bucket(0xc4204976c0, 0xc4206ef118, 0x6, 0x20, 0x6)
    .go/src/github.com/coreos/bbolt/tx.go:101 +0x4f fp=0xc4206ef048 sp=0xc4206ef010 pc=0x42ebbef

test.db.gz

tmm1 commented 6 years ago

I tried to use tx.Check() but it also blows up. Perhaps because I'm using ReadOnly: true?

unexpected fault address 0xaf41008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xaf41008 pc=0x42e6aa7]

goroutine 90 [running]:
runtime.throw(0x4a48764, 0x5)
    /usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc4205e0be0 sp=0xc4205e0bc0 pc=0x402d2e1
runtime.sigpanic()
    /usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc4205e0c30 sp=0xc4205e0be0 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4200bf500, 0xaf41000)
    .go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc4205e0ce0 sp=0xc4205e0c30 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
    .go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc4205e0d30 sp=0xc4205e0ce0 pc=0x42ef22b
sync.(*Once).Do(0xc42032f050, 0xc420055d78)
    /usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc4205e0d68 sp=0xc4205e0d30 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42032ef00)
    .go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc4205e0d98 sp=0xc4205e0d68 pc=0x42e201e
github.com/coreos/bbolt.(*Tx).check(0xc420384380, 0xc42039a600)
    .go/src/github.com/coreos/bbolt/tx.go:399 +0x47 fp=0xc4205e0fd0 sp=0xc4205e0d98 pc=0x42ed2c7
runtime.goexit()
    /usr/local/Cellar/go/1.10.2/libexec/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4205e0fd8 sp=0xc4205e0fd0 pc=0x405b871
created by github.com/coreos/bbolt.(*Tx).Check
    .go/src/github.com/coreos/bbolt/tx.go:393 +0x67

tmm1 commented 6 years ago

Without ReadOnly, Open() crashes right away on a different backup:

unexpected fault address 0x8bf2008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8bf2008 pc=0x42e6aa7]

goroutine 79 [running]:
runtime.throw(0x4a48764, 0x5)
    /usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc42047f0d8 sp=0xc42047f0b8 pc=0x402d2e1
runtime.sigpanic()
    /usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc42047f128 sp=0xc42047f0d8 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4205cf320, 0x8bf2000)
    .go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc42047f1d8 sp=0xc42047f128 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
    .go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc42047f228 sp=0xc42047f1d8 pc=0x42ef22b
sync.(*Once).Do(0xc42038d050, 0xc42047f270)
    /usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc42047f260 sp=0xc42047f228 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42038cf00)
    .go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc42047f290 sp=0xc42047f260 pc=0x42e201e
github.com/coreos/bbolt.Open(0xc4200edc20, 0x41, 0x180, 0xc42047f388, 0xc4206446b8, 0x0, 0x0)
    .go/src/github.com/coreos/bbolt/db.go:260 +0x38e fp=0xc42047f330 sp=0xc42047f290 pc=0x42e1c4e

test2.db.gz

tmm1 commented 6 years ago

Here's my repro code:

func readBackup(file string) error {
    db, err := bolt.Open(file, 0600, &bolt.Options{Timeout: 1 * time.Second, ReadOnly: true})
    if err != nil {
        return err
    }
    defer db.Close()

    return db.View(func(tx *bolt.Tx) error {
        if groups := tx.Bucket([]byte("groups")); groups != nil {
            num := groups.Stats().KeyN
            log.Printf("num: %v", num)
        }
        return nil
    })
}

Would be really nice if there was some way I could check to see if the backup was consistent before trying to read it. Ideally bbolt would be able to deal with truncated/corrupted files itself and not crash the entire process.
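One way to keep a crashing check from taking down the restore UI is to run it in a separate process. The sketch below shells out to the bbolt command-line tool's check subcommand. It assumes a bbolt binary is available on PATH and that it exits non-zero on failure; checkInChild and backup.db are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// checkInChild runs "<bin> check <path>" in a child process. If the checker
// crashes on a corrupted file, only the child dies and the caller gets an error.
// This is an illustrative helper, not part of bbolt.
func checkInChild(bin, path string) error {
	out, err := exec.Command(bin, "check", path).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s check %s failed: %v: %s", bin, path, err, out)
	}
	return nil
}

func main() {
	// Assumed binary name and backup path; adjust to the local setup.
	if err := checkInChild("bbolt", "backup.db"); err != nil {
		fmt.Println("skipping backup:", err)
	}
}
```

A segfault in the child shows up in the parent as an ordinary error, so the backup can be skipped or flagged in the UI.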

subbu05 commented 5 years ago

defer func() {
    if err := recover(); err != nil {
        fmt.Printf("Corrupted or invalid boltDB file\n")
    }
}()

Add code to recover.

benma commented 1 year ago

I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.

panics-on-check.db.zip

cc @serathius - I saw you recently committed to the repo - who to ping? Is this repo still maintained?

Edit: the address fault is a segmentation fault, not a panic, so this can't even be recovered with recover(). This seems to require a bugfix in this library, as it can't really be worked around.

serathius commented 1 year ago

@benma The etcd project still has maintainers, but we are very stretched with work on etcd. We can review PRs and fix bugs, but there is no active development on bbolt.

cenkalti commented 1 year ago

With https://pkg.go.dev/runtime/debug#SetPanicOnFault , segmentation faults can be turned into panics.
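A minimal sketch of that approach, assuming the faulting read happens on the calling goroutine: guard is a hypothetical helper (not a bbolt API) that enables debug.SetPanicOnFault, runs a callback, and converts any resulting panic into an error. The callback in main is a stand-in for bolt.Open / db.View on a corrupted file.

```go
package main

import (
	"errors"
	"fmt"
	"runtime/debug"
)

// guard runs fn with fault-to-panic conversion enabled and turns any panic
// into an ordinary error. Hypothetical helper for illustration only.
func guard(fn func() error) (err error) {
	// SetPanicOnFault applies only to the current goroutine; restore the
	// previous setting when we are done.
	old := debug.SetPanicOnFault(true)
	defer debug.SetPanicOnFault(old)

	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered while reading database: %v", r)
		}
	}()
	return fn()
}

func main() {
	// Simulated failure standing in for a read from a corrupted file.
	fmt.Println(guard(func() error { panic("simulated fault") }))

	// Ordinary errors pass through unchanged.
	fmt.Println(guard(func() error { return errors.New("plain error") }))
}
```

Because the setting is per goroutine, a fault raised in a goroutine that the library starts itself would presumably not be covered; the Check() trace earlier in this thread shows the failing code running in a goroutine created by Tx.Check.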

ahrtr commented 1 year ago

Check() should definitely return an error instead of panicking.

Agreed.

Fixing corrupted db files has been my top priority recently. The most important thing is to figure out how to reproduce the issue. It would be great if anyone could provide clues on this. Please do not hesitate to ping me if you have any thoughts. Thanks.

FYI, we recently added a bbolt surgery clear-page-elements command as a workaround for fixing a corrupted db file; see https://github.com/etcd-io/bbolt/pull/417.

ahrtr commented 1 year ago

I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.

panics-on-check.db.zip

The DB (panics-on-check.db) was somehow corrupted during the last transaction. The corrupted db can be easily fixed by reverting the meta page (this effectively rolls back the last transaction).

$ ./bbolt surgery revert-meta-page /tmp/panics-on-check.db --output ./new.db
The meta page is reverted.
$ ./bbolt check ./new.db 
OK

I am almost sure that the corruption isn't caused by bbolt. The db file has 6 pages in total, but the bucket's root page is somehow a huge value 7631988 (0x747474). Most likely it's caused by other issues, e.g. hardware or OS issue?

@benma Do you still remember how the corrupt file was generated? Was there anything unusual (e.g. power off, OS crash, etc.) when the corrupt file was being generated? BTW, what's the bbolt version?

$ ./bbolt  page /tmp/panics-on-check.db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=4>
Freelist:   <pgid=5>
HWM:        <pgid=6>
Txn ID:     2
Checksum:   eef96d7a2c1b336e

$ ./bbolt  page /tmp/panics-on-check.db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=3>
Freelist:   <pgid=2>
HWM:        <pgid=4>
Txn ID:     1
Checksum:   264c351a5179480f

$ ./bbolt  page /tmp/panics-on-check.db 4
Page ID:    4
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 1

"bucket": <pgid=7631988,seq=0>
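The mismatch in the dump above can be expressed as a simple bounds check: a page ID referenced from a leaf should lie below the high water mark (HWM) recorded in the active meta page. This is only a sketch of that comparison using the values from the dump; pgidInRange is a hypothetical name, not something bbolt exposes.

```go
package main

import "fmt"

// pgidInRange reports whether a page ID lies below the high water mark
// recorded in the meta page. Illustrative helper, not a bbolt API.
func pgidInRange(pgid, hwm uint64) bool {
	return pgid < hwm
}

func main() {
	const hwm = 6        // HWM from meta page 0 in the dump
	const root = 7631988 // bucket root found on leaf page 4

	fmt.Println("root in range:", pgidInRange(root, hwm))
	// Each byte of 0x747474 is 0x74, the ASCII code for 't'.
	fmt.Printf("root as hex: %#x\n", root)
}
```

A check like this before following a page reference would let the reader return an error instead of dereferencing memory past the mapped file.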

ahrtr commented 1 year ago

test.db.gz

The corrupted file provided by @tmm1 seems like a potential bbolt bug. What's your bbolt version?

The freelist page (108) was somehow reset (all fields have zero value).

What's confusing is that the two meta pages have exactly the same Root (99), Freelist (108) and HWM (482). Meta 0 has TXN 64921, while meta 1 has TXN 64920; this indicates that the last RW transaction did not change anything. But the freelist should change anyway (this is a potential improvement point: we shouldn't sync the freelist if the RW TXN changes nothing).

$ ./bbolt page /tmp/test.db  0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=99>
Freelist:   <pgid=108>
HWM:        <pgid=482>
Txn ID:     64921
Checksum:   aab8d660770b88f7

$ ./bbolt page /tmp/test.db  1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=99>
Freelist:   <pgid=108>
HWM:        <pgid=482>
Txn ID:     64920
Checksum:   929bdcc802b6f642

ahrtr commented 1 year ago

test.db.gz

There isn't even a way to fix the corrupted db file. The file is only 204800 bytes, so it's 50 pages (204800/4096). Obviously the root page ID (99), Freelist (108) and HWM (482) exceed the file size. I can't even find the root page in the available 50 pages. It seems that the file was somehow truncated, and the root was in the truncated part.

$ ls -lrt test.db
-rw-r--r-- 1 wachao wheel 204800 May 26 15:15 test.db
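The arithmetic above can be sketched as a truncation check that compares the page IDs from the meta page against the number of pages the file can actually hold. The helper names are hypothetical, and the values are the ones quoted in this comment.

```go
package main

import "fmt"

// pageCount returns how many whole pages fit in a file of the given size.
func pageCount(fileSize, pageSize int64) int64 {
	return fileSize / pageSize
}

// beyondFile reports whether page ID pgid would lie past the end of the file.
// Illustrative helper, not a bbolt API.
func beyondFile(pgid, fileSize, pageSize int64) bool {
	return pgid >= pageCount(fileSize, pageSize)
}

func main() {
	const fileSize, pageSize = 204800, 4096

	fmt.Println("pages in file:", pageCount(fileSize, pageSize)) // 50
	fmt.Println("root 99 beyond file:", beyondFile(99, fileSize, pageSize))
	fmt.Println("freelist 108 beyond file:", beyondFile(108, fileSize, pageSize))
	// HWM is the next page ID to allocate, so it may equal the page count
	// but should not exceed it.
	fmt.Println("HWM 482 exceeds page count:", 482 > pageCount(fileSize, pageSize))
}
```

Running a check like this against a backup's meta page and file size before opening it would flag truncated files such as this one.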

etcd-io / bbolt

Crash when trying to open corrupted database #105