etcd-io / bbolt

An embedded key/value database for Go.
https://go.etcd.io/bbolt
MIT License
7.98k stars 626 forks source link

Refactor freelist management & add more dedicated tests #789

Open ahrtr opened 2 weeks ago

ahrtr commented 2 weeks ago

Backgroud

There are some long standing data corruption issues, which indicate that there might be potential bug(s) in freelist management.

I am not satisfied with the bbolt freelist management for a long time. It's hard to understand and also tightly coupled with bbolt. I have been thinking to refactor it to improve the understandability & testability.

Refactor

The high level idea is to

What we have done and are going to do:

We also need to continue to refactor & simplify the interface from user (bbolt) perspective, in other words, we should have a clear understanding on how the interface will & should be used by bbolt. The motivation is to improve testablity.

The freelist management is the most sensitive & important part. So let's do it step by step.

Test

Any unit tests are welcome.

But more importantly, we should add dedicated randomized test case to simulate concurrent (multiple) read TXNs and (single) writing TXN.

ahrtr commented 2 weeks ago

A good news is that I kept the concurrent test case running (and randomly kill & restart the test) for 22 days and no any issue occurred. Read https://github.com/etcd-io/bbolt/pull/769#issuecomment-2211678851

tjungblu commented 2 weeks ago

Here's one obvious addition I also wrote down during #775.

ReadWriter

Read / Write / EstimatedWritePageSize implementations are already well decoupled through interface functions and thus a very low hanging fruit to refactor.

I propose to split this into another struct, allowing us to embed/compose it in the freelist itself.

Nice side-effect: Copyall can go in there as a private implementation function, removed entirely from the interface. tx_checker.go requires only a list of pending and free pages, which can be handled with other functions + sorting.

Benefit: four exported functions from the freelist interface less. Test cases become very simple serialization/deserialization of integer slices. We can improve this later on more easily, e.g. with vint compression to reduce the serialized size. We should definitely add a checksum hash there too...


decouple it with the bbolt TXN workflow

Yep, I think this is the biggest one. The main responsibility we still need to rethink seems to track when a page was allocated and when it was freed again (when in terms of transaction ID) and having strong assertions around those. The semantic leaking of "pending" pages also irks me a bit, this should be an implementation detail.

The remaining parts that care about different allocation styles and span merging seem OK to me. Even though there seems to be a lot of double-accounting and ensuring all data structures stay consistent at all times.

tjungblu commented 2 weeks ago

As for the testing, we might also want to consider digging up the lazyfs work in #567. I didn't get very far with getting meaningful tests up last year, but torn writes scare me a lot more now :)

ahrtr commented 1 week ago

There is a flaw on the freelist management.

The allocs is a mapping of Pgid -> Txid, see below, https://github.com/etcd-io/bbolt/blob/efc3eb649b55ac9ab47cd681fc74f4272a7aa088/internal/freelist/shared.go#L22

It only records the page IDs which are allocated from the freelist. But if the freelist can't meet the allocation requirement, then bbolt will try to resize the db file and allocate more disk space. In such case, the new allocated page IDs aren't recorded in the allocs.

https://github.com/etcd-io/bbolt/blob/efc3eb649b55ac9ab47cd681fc74f4272a7aa088/db.go#L1159-L1165

Usually it isn't a big problem. Because in such case, bbolt needs to re-map the db file, so it needs to wait for all the existing readonly TXNs to finish. But there is still a very small windows, in which new readonly TXNs may be created after re-mapping but before the writing TXN finishes committing.

The new allocated pages in such case are actually invisible to such readonly TXNs. But since the Txid (allocating the page ID) isn't recorded in such case, when the pages are freed in following writing TXNs, they can't be released as long as the readonly TXNs created in the small windows are still running.

It isn't a big problem,

ahrtr commented 1 week ago

I am planning to simplify & update the logic related to the pending released free pages, please take a look at doc below and feedback if anyone is interested, https://docs.google.com/spreadsheets/d/1T7ORCbv3mguq1hQoZkUg0XEzuUXFrNFFj1sUfQysZ2k/edit?gid=0#gid=0

tjungblu commented 5 days ago

It isn't a big problem,

I think the more general problem here is that we have tracking of transactions and their pages, but the allocation is also somewhere else entirely. I don't want to open another can of worms, but maybe we can also find a good way to incoperate the transaction and page allocation management along with it.

I am planning to simplify & update the logic related to the pending released free pages

the doc looks good, I see the code has some short-cuts around the txids to avoid looping more than necessary, we should start to simplify first though.