etcd-io / bbolt

An embedded key/value database for Go.
https://go.etcd.io/bbolt
MIT License
8.24k stars 639 forks source link

Remove panics in the library code #447

Open cenkalti opened 1 year ago

cenkalti commented 1 year ago

A library should not panic at all. Panics are usually not handled in Go programs and it makes the whole program crashes when a library function panics.

Currently there are 12 open issues involving panic, especially working with corrupted databases. BoltDB should return the error as an nice message.

Overview of panics from boltdb code:

% ag 'panic\(.*\)' | grep -v _test.go
freelist_hmap.go:167:           panic("pgids not sorted")
db.go:1067:     panic("bolt.DB.meta(): invalid meta pages")
db.go:1151:                     panic("freepages: failed to rollback tx")
db.go:1155:             panic("freepages: failed to open read only tx")
db.go:1163:                     panic(fmt.Sprintf("freepages: failed to get all reachable pages (%v)", e))
internal/common/page.go:136:                    panic(fmt.Sprintf("leading element count %d overflows int", c))
internal/common/page.go:361:            panic(fmt.Errorf("mergepgids bad len %d < %d + %d", len(dst), len(a), len(b)))
internal/common/meta.go:42:             panic(fmt.Sprintf("root bucket pgid (%d) above high water mark (%d)", m.root.root, m.pgid))
internal/common/meta.go:45:             panic(fmt.Sprintf("freelist pgid (%d) above high water mark (%d)", m.freelist, m.pgid))
freelist.go:119:                        panic(fmt.Sprintf("invalid page allocation: %d", id))
freelist.go:157:                panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.Id()))
freelist.go:177:                        panic(fmt.Sprintf("page %d already freed", id))
freelist.go:269:                panic(fmt.Sprintf("invalid freelist page: %d, page type is %s", p.Id(), p.Typ()))
cursor.go:276:          panic(fmt.Sprintf("invalid page type: %d: %x", p.Id(), p.Flags()))
internal/common/utils.go:11:            panic(fmt.Sprintf("assertion failed: "+msg, v...))
internal/btesting/btesting.go:110:              panic("Please call Close() before MustReopen()")
bucket.go:563:                  panic(fmt.Sprintf("misplaced bucket header: %x -> %x", []byte(name), k))
bucket.go:566:                  panic(fmt.Sprintf("unexpected bucket header flag: %x", flags))
bucket.go:584:          panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", b.rootNode.pgid, b.tx.meta.Pgid()))
bucket.go:717:                  panic(fmt.Sprintf("inline bucket non-zero page access(2): %d != 0", id))
tx.go:212:                      panic("check fail: " + strings.Join(errs, "\n"))
mlock_windows.go:5:     panic("mlock is supported only on UNIX systems")
mlock_windows.go:10:    panic("munlock is supported only on UNIX systems")
node.go:76:             panic(fmt.Sprintf("invalid childAt(%d) on a leaf node", index))
node.go:119:            panic(fmt.Sprintf("pgId (%d) above high water mark (%d)", pgId, n.bucket.tx.meta.Pgid()))
node.go:121:            panic("put: zero-length old key")
node.go:123:            panic("put: zero-length new key")
node.go:190:            panic(fmt.Sprintf("inode overflow: %d (pgid=%d)", len(n.inodes), p.Id()))
node.go:331:                    panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", p.Id(), tx.meta.Pgid()))
infdahai commented 1 year ago

I fix this.

ptabor commented 1 year ago

A library should not panic at all.

I don't agree with this statement in this specific context:

Looking at the error list above... most mean that the underlying bbolt file is corrupted or inconsistent. The chances of recovery of such file are big if the file is not subject to continous mutations and the problem will be surfaced soon after the corruption happens.

Moreover panic in golang is not 'terminal'. If the user of the library is determined to continue (e.g. because of hosting multiple instances in a single process), the panic can get explicitly recovered.

To summarise, I recommend:

  1. Keep it panic whenever there is chance we just corrupted DB
  2. Keep it error when it's user fault (btesting.go:110: panic("Please call Close() before MustReopen()") ? )
  3. There is middle-ground for this rules: When you open in RO mode already corrupted file and the check fails: There is slim chance you just corrupted the file, so it's a reason for error. At the same time keeping all code to be consistent around panicking on data inconsistency is easier to maintain. What we can do, is to warn callers that the DB.Open operation might panic when the file is corrupted and the user needs to 'recovery' if they need.
infdahai commented 1 year ago

I try this in https://github.com/infdahai/bbolt/commit/7d07d9fde19c30a399b445dac2c4fd6a6c077a94. And I agree with @ptabor said. We can replace some panics with error messages, such as panic(fmt.Sprintf("page %d already freed", id)). But other panics are good indications of the DB crash.

ptabor commented 1 year ago

"page %d already freed" is also indicator of either bug in freelist implementation or memory stomping or ram/cpu corruption. It shouldn't happen regardless of bbolt lib user actions.

cenkalti commented 1 year ago

keeping all code to be consistent around panicking on data inconsistency is easier to maintain

I definitely agree on this. However, I think this should not be the default behavior. Think about these 2 cases:

Panicking takes the whole program down which is not an ideal decision for a library in my opinion.

I propose this:


There is also a special case. Recover only works when called from the same goroutine as the panic is called in. For example:

https://github.com/etcd-io/bbolt/blob/56e033be3380958b4f20b4201d9f77a7d5c029b4/db.go#L1178-L1182

This code block executes when opening a db and it's not possible to recover that panic.


In overall, I see corruption errors as non-fatal. I think it's nicer to let the user decide by returning regular errors.