etcd-io / bbolt

An embedded key/value database for Go.
https://go.etcd.io/bbolt
MIT License

panic: invalid freelist page: 0, page type is unknown<00> #446

Open gandarez opened 1 year ago

gandarez commented 1 year ago

I've been using bbolt for two years (already updated to the latest version, v1.3.7) and recently started getting a panic when opening the database file. I can't debug it or get the db file to test with, since I distribute my application as a standalone client. Why does it panic instead of returning an error? Does this error happen because the db is corrupted?

func (f *freelist) read(p *page) {
    if (p.flags & freelistPageFlag) == 0 {
        panic(fmt.Sprintf("invalid freelist page: %d, page type is %s", p.id, p.typ()))
    }
....
}

https://github.com/etcd-io/bbolt/blob/da2f2a53f6e2f25b215b79db2cd417488ef8e955/freelist.go#L265

https://github.com/wakatime/wakatime-cli/issues/848

cenkalti commented 1 year ago

Looks like the db file is corrupted. To skip the error, @gandarez could try passing PreLoadFreelist: false but it is always loaded in RW mode. Can this restriction be removed? https://github.com/etcd-io/bbolt/blob/3e560dbae20dcb078d50f928ef7d17f1a56a4413/db.go#L182-L183

ahrtr commented 1 year ago

Thanks @gandarez for raising this issue and sorry for the inconvenience. Copied the call stack from https://github.com/wakatime/wakatime-cli/issues/848 below.

The error message indicates that the meta page 0 might be corrupted (but the checksum is somehow correct). Is it possible to provide the db file? (I saw your note that you can't get the db file, but I still want to double-check.)

Do you have detailed steps to reproduce this issue?

goroutine 1 [running]:
runtime/debug.Stack()
 /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/debug/stack.go:24 +0x65
github.com/wakatime/wakatime-cli/cmd.runCmd.func1()
 /home/runner/work/wakatime-cli/wakatime-cli/cmd/run.go:272 +0xd3
panic({0x9a5540, 0xc00060b980})
 /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/panic.go:884 +0x212
go.etcd.io/bbolt.(*freelist).read(0x0?, 0x11bfa0c2000)
 /home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.7/freelist.go:267 +0x22e
go.etcd.io/bbolt.(*DB).loadFreelist.func1()
 /home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.7/db.go:415 +0xb8
sync.(*Once).doSlow(0xc000123608?, 0x10?)
 /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
 /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:65
go.etcd.io/bbolt.(*DB).loadFreelist(0xc000123440?)
 /home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.7/db.go:408 +0x47
go.etcd.io/bbolt.Open({0xc0002fd260, 0x1a}, 0x0?, 0xc000378c20)
 /home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.7/db.go:290 +0x40c

ahrtr commented 1 year ago

Or execute the commands below if you can't provide the db file:

$ ./bbolt check <db-file>

$ ./bbolt pages <db-file>

$ ./bbolt page <db-file> 0

$ ./bbolt page <db-file> 1

ahrtr commented 1 year ago

@gandarez could try passing PreLoadFreelist: false but it is always loaded in RW mode

Note that bbolt always loads the freelist in write mode, no matter what value is set for PreLoadFreelist.

EDIT:

Can this restriction be removed?

No, we can't. Freelist management is the most crucial part of bbolt, and the freelist must always be loaded in write mode.

cenkalti commented 1 year ago

Sorry, I didn't tell the whole thing. What I meant was, if the user switches NoFreelistSync from false to true, db.Open() still loads the freelist.

I'm proposing changing: https://github.com/etcd-io/bbolt/blob/3e560dbae20dcb078d50f928ef7d17f1a56a4413/db.go#L253-L255

to

    if db.PreLoadFreelist && !db.NoFreelistSync {
        db.loadFreelist()
    }

ahrtr commented 1 year ago

That isn't correct. db.NoFreelistSync == true only means the freelist is not synced to disk in this transaction; it doesn't mean the freelist is not loaded. We still need to load the freelist, even if no freelist was synced in the previous transaction (in that case bbolt scans the whole db to reconstruct it).

cenkalti commented 1 year ago

Sorry for my misunderstanding. Currently, there is no way to skip loading freelist from the disk if meta page points to an existing freelist. Is that correct?

ahrtr commented 1 year ago

Currently, there is no way to skip loading freelist from the disk if meta page points to an existing freelist. Is that correct?

Correct. In write mode, bbolt always reads from disk to get the freelist, either from the synced freelist page or by scanning the whole db to reconstruct it.
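
The scan-based path can be pictured with a toy model: every page ID between the two meta pages and the high-water mark that is not reachable from the bucket tree is treated as free. This is a simplified sketch of the idea, not bbolt's implementation; `freePages` and its inputs are hypothetical.

```go
package main

import "fmt"

// freePages returns the page IDs in [2, highWaterMark) that are not reachable.
// Pages 0 and 1 are the meta pages, so they are never considered free.
func freePages(reachable map[uint64]bool, highWaterMark uint64) []uint64 {
	var free []uint64
	for id := uint64(2); id < highWaterMark; id++ {
		if !reachable[id] {
			free = append(free, id)
		}
	}
	return free
}

func main() {
	// Hypothetical result of walking every bucket from the root.
	reachable := map[uint64]bool{3: true, 4: true, 7: true}
	fmt.Println(freePages(reachable, 8)) // prints [2 5 6]
}
```

Because the reachable set requires walking the whole tree, this path costs time proportional to the size of the database, which is the latency trade-off discussed in this thread.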

The most important thing for now is to reproduce the issue ourselves. It would be great if @gandarez can provide some clues.

gandarez commented 1 year ago

As I said, it runs on our users' machines, so I can't promise anything, but I'll try to get a copy of it.

cenkalti commented 1 year ago

With NoFreelistSync: false, the freelist is saved to a page and referenced from the meta page. With NoFreelistSync: true, the freelist is not saved to the file and a special marker is put into the meta page.

The current freelist loading logic does not take the NoFreelistSync option into account. https://github.com/etcd-io/bbolt/blob/e6563eef17d87c7e96e96fbb2b78be3e93d67ff1/db.go#L371-L383

By setting it to true, the user of the library accepts that the freelist will not be saved to disk and accepts the latency of scanning the whole db.

The loading behavior currently depends only on whether a freelist exists in the db file.

I have a proposal for adding NoFreelistSync to the decision:

    if !db.hasSyncedFreelist() || db.NoFreelistSync {
        // Reconstruct free list by scanning the DB.
        db.freelist.readIDs(db.freepages())
    } else {
        // Read free list from freelist page.
        db.freelist.read(db.page(db.meta().Freelist()))
    }

This may help to open the database by changing an option if the corruption is just in the freelist.
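
The difference between the current and proposed behavior can be laid out as a small truth table. The following is a toy model of the decision only, not bbolt code; `currentSource` and `proposedSource` are hypothetical names.

```go
package main

import "fmt"

type freelistSource string

const (
	fromSyncedPage freelistSource = "read synced freelist page"
	fromFullScan   freelistSource = "reconstruct by scanning the db"
)

// currentSource models the existing behavior described in this thread:
// only the presence of a synced freelist in the file matters.
func currentSource(hasSyncedFreelist bool) freelistSource {
	if !hasSyncedFreelist {
		return fromFullScan
	}
	return fromSyncedPage
}

// proposedSource models the proposal: NoFreelistSync also forces a scan.
func proposedSource(hasSyncedFreelist, noFreelistSync bool) freelistSource {
	if !hasSyncedFreelist || noFreelistSync {
		return fromFullScan
	}
	return fromSyncedPage
}

func main() {
	for _, synced := range []bool{false, true} {
		for _, noSync := range []bool{false, true} {
			fmt.Printf("synced=%-5v NoFreelistSync=%-5v current=%q proposed=%q\n",
				synced, noSync, currentSource(synced), proposedSource(synced, noSync))
		}
	}
}
```

The two functions differ in exactly one case: a file with a synced freelist opened with NoFreelistSync set to true. Under the proposal, that case bypasses the possibly corrupted freelist page.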

ahrtr commented 1 year ago

@gandarez is there any update on this? thx

gandarez commented 11 months ago

I haven't heard anything from anybody. Is this issue still on track?

ahrtr commented 11 months ago

I haven't heard anything from anybody. Is this issue still on track?

Based on all the info we have so far, the db file is most likely corrupted. The only suggestion I can think of for now is to back up the db file regularly, since your application is a standalone client. In distributed systems, a single point of failure is usually tolerated.

It would be great if you could provide the db file the next time you run into a similar issue, so that I can double-check. I can also try to fix the corrupted db file using the surgery commands.

BTW, how many times have you run into this kind of corruption in your application?

gandarez commented 11 months ago

Because it runs as a standalone application, it's hard to say how many users were affected, but it seems only one is still running into this issue. I tried to contact them but didn't get any reply.