tmm1 opened 6 years ago
I tried to use tx.Check(), but it also blows up. Perhaps because I'm using ReadOnly: true?
unexpected fault address 0xaf41008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xaf41008 pc=0x42e6aa7]
goroutine 90 [running]:
runtime.throw(0x4a48764, 0x5)
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc4205e0be0 sp=0xc4205e0bc0 pc=0x402d2e1
runtime.sigpanic()
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc4205e0c30 sp=0xc4205e0be0 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4200bf500, 0xaf41000)
.go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc4205e0ce0 sp=0xc4205e0c30 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
.go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc4205e0d30 sp=0xc4205e0ce0 pc=0x42ef22b
sync.(*Once).Do(0xc42032f050, 0xc420055d78)
/usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc4205e0d68 sp=0xc4205e0d30 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42032ef00)
.go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc4205e0d98 sp=0xc4205e0d68 pc=0x42e201e
github.com/coreos/bbolt.(*Tx).check(0xc420384380, 0xc42039a600)
.go/src/github.com/coreos/bbolt/tx.go:399 +0x47 fp=0xc4205e0fd0 sp=0xc4205e0d98 pc=0x42ed2c7
runtime.goexit()
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4205e0fd8 sp=0xc4205e0fd0 pc=0x405b871
created by github.com/coreos/bbolt.(*Tx).Check
.go/src/github.com/coreos/bbolt/tx.go:393 +0x67
Without ReadOnly, Open() crashes right away on a different backup:
unexpected fault address 0x8bf2008
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8bf2008 pc=0x42e6aa7]
goroutine 79 [running]:
runtime.throw(0x4a48764, 0x5)
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81 fp=0xc42047f0d8 sp=0xc42047f0b8 pc=0x402d2e1
runtime.sigpanic()
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/signal_unix.go:395 +0x211 fp=0xc42047f128 sp=0xc42047f0d8 pc=0x4042b11
github.com/coreos/bbolt.(*freelist).read(0xc4205cf320, 0x8bf2000)
.go/src/github.com/coreos/bbolt/freelist.go:236 +0x37 fp=0xc42047f1d8 sp=0xc42047f128 pc=0x42e6aa7
github.com/coreos/bbolt.(*DB).loadFreelist.func1()
.go/src/github.com/coreos/bbolt/db.go:290 +0x12b fp=0xc42047f228 sp=0xc42047f1d8 pc=0x42ef22b
sync.(*Once).Do(0xc42038d050, 0xc42047f270)
/usr/local/Cellar/go/1.10.2/libexec/src/sync/once.go:44 +0xbe fp=0xc42047f260 sp=0xc42047f228 pc=0x406379e
github.com/coreos/bbolt.(*DB).loadFreelist(0xc42038cf00)
.go/src/github.com/coreos/bbolt/db.go:283 +0x4e fp=0xc42047f290 sp=0xc42047f260 pc=0x42e201e
github.com/coreos/bbolt.Open(0xc4200edc20, 0x41, 0x180, 0xc42047f388, 0xc4206446b8, 0x0, 0x0)
.go/src/github.com/coreos/bbolt/db.go:260 +0x38e fp=0xc42047f330 sp=0xc42047f290 pc=0x42e1c4e
Similar issue: https://github.com/boltdb/bolt/issues/698
Here's my repro code:
func readBackup(file string) error {
	db, err := bolt.Open(file, 0600, &bolt.Options{Timeout: 1 * time.Second, ReadOnly: true})
	if err != nil {
		return err
	}
	defer db.Close()
	return db.View(func(tx *bolt.Tx) error {
		if groups := tx.Bucket([]byte("groups")); groups != nil {
			num := groups.Stats().KeyN
			log.Printf("num: %v", num)
		}
		return nil
	})
}
It would be really nice if there were some way to check whether the backup is consistent before trying to read it. Ideally bbolt would handle truncated/corrupted files itself instead of crashing the entire process.
Add code to recover:
defer func() {
	if err := recover(); err != nil {
		fmt.Printf("Corrupted or invalid boltDB file\n")
	}
}()
I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.
cc @serathius - I saw you recently committed to the repo - who should I ping? Is this repo still maintained?
Edit: the address fault is a segmentation fault, not a panic, so this can't even be recovered with recover(). This seems to require a bugfix in the library, as it can't really be worked around.
@benma The etcd project still has maintainers, however we are very stretched with work on etcd. We can review PRs and fix bugs, but there is no active development on bbolt.
With https://pkg.go.dev/runtime/debug#SetPanicOnFault , segmentation faults can be turned into panics.
Check() should definitely return an error instead of panicking.

Agreed.
Fixing corrupted db files is my top priority recently. The most important thing is to figure out how to reproduce the issue. It would be great if anyone could provide clues on this. Please do not hesitate to ping me if you have any thoughts. Thanks.
FYI, recently we added a bbolt surgery clear-page-elements command as a workaround to fix corrupt db files; see https://github.com/etcd-io/bbolt/pull/417.
I am also running into the issue that Check() on a corrupt DB crashes. Check() should definitely return an error instead of panicking.
The DB (panics-on-check.db) was somehow corrupted during the last transaction. The corrupted db can easily be fixed by reverting the meta page (which effectively rolls back the last transaction).
$ ./bbolt surgery revert-meta-page /tmp/panics-on-check.db --output ./new.db
The meta page is reverted.
$ ./bbolt check ./new.db
OK
I am almost sure that the corruption wasn't caused by bbolt. The db file has 6 pages in total, but the bucket's root page is somehow the huge value 7631988 (0x747474). Most likely it was caused by something else, e.g. a hardware or OS issue?
@benma Do you still remember how the corrupt file was generated? Was there anything unusual (e.g. power off, OS crash, etc.) when the corrupt file was generated? BTW, what's the bbolt version?
$ ./bbolt page /tmp/panics-on-check.db 0
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=4>
Freelist: <pgid=5>
HWM: <pgid=6>
Txn ID: 2
Checksum: eef96d7a2c1b336e
$ ./bbolt page /tmp/panics-on-check.db 1
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=3>
Freelist: <pgid=2>
HWM: <pgid=4>
Txn ID: 1
Checksum: 264c351a5179480f
$ ./bbolt page /tmp/panics-on-check.db 4
Page ID: 4
Page Type: leaf
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 1
"bucket": <pgid=7631988,seq=0>
The corrupted file provided by @tmm1 seems like a potential bbolt bug. What's your bbolt version?
The freelist page (108) was somehow reset (all fields have zero value).
What's confusing is that the two meta pages have exactly the same Root (99), Freelist (108) and HWM (482). Meta 0 has TXN 64921, while meta 1 has TXN 64920; that indicates the last RW transaction did not change anything. But the freelist should change anyway (this is a potential improvement point: we shouldn't sync the freelist if the RW TXN changes nothing).
$ ./bbolt page /tmp/test.db 0
Page ID: 0
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=99>
Freelist: <pgid=108>
HWM: <pgid=482>
Txn ID: 64921
Checksum: aab8d660770b88f7
$ ./bbolt page /tmp/test.db 1
Page ID: 1
Page Type: meta
Total Size: 4096 bytes
Overflow pages: 0
Version: 2
Page Size: 4096 bytes
Flags: 00000000
Root: <pgid=99>
Freelist: <pgid=108>
HWM: <pgid=482>
Txn ID: 64920
Checksum: 929bdcc802b6f642
There isn't even a way to fix this corrupted db file. The file is only 204800 bytes, so it has 50 pages (204800/4096). Obviously the root page ID (99), Freelist (108) and HWM (482) exceed the file size. I can't even find the root page among the available 50 pages. It seems the file was somehow truncated, and the root was in the truncated part.
$ ls -lrt test.db
-rw-r--r-- 1 wachao wheel 204800 May 26 15:15 test.db
I have an app that takes regular backups of boltdb databases. Sometimes, for unknown reasons, the backups are corrupted.
I also have a restore UI that lets me browse and read from backups. Trying to open and read from these corrupted databases crashes my process. I'm using commit 4f5275f4ebbf6fe7cb772de987fa96ee674460a7.
test.db.gz