cberner / redb

An embedded key-value database in pure Rust
https://www.redb.org
Apache License 2.0
3.33k stars 154 forks source link

Writes don't reuse freed pages from the previous transaction #829

Open mconst opened 3 months ago

mconst commented 3 months ago

Thanks for writing redb!

A sequence of large write transactions seems to consume twice as much disk space as I was expecting. If each transaction modifies 1 GB of existing pages (and there are no live reads), I was expecting 1 GB of overhead: the old data needs to be kept around until the transaction has committed.

But in fact this has 2 GB of overhead. The first transaction allocates 1 GB of new pages, and adds 1 GB of old pages to the freed tree, as expected. The second transaction correctly frees those old pages -- but they're freed at the end of the transaction, rather than the beginning. So even though the pages are no longer referenced by anything and could be reused, the second transaction doesn't use them; instead, it allocates a fresh 1 GB of pages, doubling the disk usage. The original pages don't get reused until the third transaction[*].

It's possible to work around this, at the cost of some extra latency, by adding an extra empty write transaction after each real write. But it's kind of surprising that that's necessary, and it would be nice if it weren't needed!

This also interacts with another issue, where live reads are blocking page freeing for one transaction longer than they should. I'll submit that one separately.

[*] A few pages do actually get reused in the second transaction, to store the updated freed tree. But they don't get reused for data.

cberner commented 3 months ago

Ah yes, I need to think through whether it's possible to do this free'ing at the end of the commit, after the last fsync()

I think I implemented it the way it is for a couple reasons: 1) I wanted begin_write() to be constant time 2) I wanted to avoid doing extra book keeping to support rolling back these frees during a call to abort()

However, this is a pretty significant downside

mconst commented 3 months ago

Ah, yeah, doing the frees after the fsync() seems like a nice approach! You'd also have to redo them when opening the database, because they wouldn't get persisted, right? But that seems fine. And it's definitely nice not to have to worry about rolling back the frees in abort().

I think there might be a bit of complexity with savepoints, because the allocator state at the time the snapshot is taken will no longer match the freed tree that's saved. But that's not too hard to deal with, and in fact I believe the current code there isn't quite right anyway (I'll submit another issue for that in a few minutes).

cberner commented 3 months ago

You'd also have to redo them when opening the database, because they wouldn't get persisted, right?

There's a Drop implementation on TransactionalMemory which persists outstanding non-durable commits. I think it can persist the frees, so that they didn't need to be replayed when the database is opened

in fact I believe the current code there isn't quite right anyway

That wouldn't surprise me at all! Savepoints were the last big feature I added, and they created a lot of bugs. Sometimes I regret adding them, haha, but they're a rather convenient feature

mconst commented 3 months ago

There's a Drop implementation on TransactionalMemory which persists outstanding non-durable commits. I think it can persist the frees, so that they didn't need to be replayed when the database is opened

Oh nice! Yeah, that's even better.