Open mconst opened 3 months ago
Ah yes, I need to think through whether it's possible to do this freeing at the end of the commit, after the last fsync().
I think I implemented it the way it is for a couple of reasons:
1) I wanted begin_write() to be constant time
2) I wanted to avoid doing extra bookkeeping to support rolling back these frees during a call to abort()
However, this is a pretty significant downside.
Ah, yeah, doing the frees after the fsync() seems like a nice approach! You'd also have to redo them when opening the database, because they wouldn't get persisted, right? But that seems fine. And it's definitely nice not to have to worry about rolling back the frees in abort().
I think there might be a bit of complexity with savepoints, because the allocator state at the time the snapshot is taken will no longer match the freed tree that's saved. But that's not too hard to deal with, and in fact I believe the current code there isn't quite right anyway (I'll submit another issue for that in a few minutes).
You'd also have to redo them when opening the database, because they wouldn't get persisted, right?
There's a Drop implementation on TransactionalMemory which persists outstanding non-durable commits. I think it can persist the frees, so that they don't need to be replayed when the database is opened.
in fact I believe the current code there isn't quite right anyway
That wouldn't surprise me at all! Savepoints were the last big feature I added, and they created a lot of bugs. Sometimes I regret adding them, haha, but they're a rather convenient feature.
There's a Drop implementation on TransactionalMemory which persists outstanding non-durable commits. I think it can persist the frees, so that they don't need to be replayed when the database is opened
Oh nice! Yeah, that's even better.
Thanks for writing redb!
A sequence of large write transactions seems to consume twice as much disk space as I was expecting. If each transaction modifies 1 GB of existing pages (and there are no live reads), I was expecting 1 GB of overhead: the old data needs to be kept around until the transaction has committed.
But in fact this has 2 GB of overhead. The first transaction allocates 1 GB of new pages, and adds 1 GB of old pages to the freed tree, as expected. The second transaction correctly frees those old pages -- but they're freed at the end of the transaction, rather than the beginning. So even though the pages are no longer referenced by anything and could be reused, the second transaction doesn't use them; instead, it allocates a fresh 1 GB of pages, doubling the disk usage. The original pages don't get reused until the third transaction[*].
It's possible to work around this, at the cost of some extra latency, by adding an extra empty write transaction after each real write. But it's kind of surprising that that's necessary, and it would be nice if it weren't needed!
This also interacts with another issue, where live reads are blocking page freeing for one transaction longer than they should. I'll submit that one separately.
[*] A few pages do actually get reused in the second transaction, to store the updated freed tree. But they don't get reused for data.
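The behavior described above can be modeled with a toy allocator (illustrative only, not redb's actual internals) in which freed pages are released only at the end of each commit. Page counts stand in for the 1 GB of data; the `ToyAllocator` and `footprint` names are made up for this sketch.

```rust
// Toy model of an allocator that only releases freed pages at the
// *end* of each commit, rather than making them reusable up front.
#[derive(Default)]
struct ToyAllocator {
    allocated: usize,  // total pages backed by disk
    free_list: usize,  // pages available for reuse
    freed_tree: usize, // pages freed but not yet reusable
}

impl ToyAllocator {
    /// A write transaction that rewrites `n` live pages.
    fn commit(&mut self, n: usize) {
        // Pages queued by earlier transactions: they only become
        // reusable at the end of this commit.
        let queued = self.freed_tree;
        // Allocate n new pages, reusing the free list where possible.
        let reused = n.min(self.free_list);
        self.free_list -= reused;
        self.allocated += n - reused;
        // The n old page versions go into the freed tree...
        self.freed_tree = n;
        // ...and the queued pages are released only now, too late for
        // this transaction's own allocations to have used them.
        self.free_list += queued;
    }
}

/// Disk footprint (in pages) after two 1000-page rewrites of a
/// 1000-page database, optionally with an empty transaction between.
fn footprint(empty_txn_between: bool) -> usize {
    let mut a = ToyAllocator { allocated: 1000, ..Default::default() };
    a.commit(1000);
    if empty_txn_between {
        a.commit(0); // the workaround: drains the freed tree
    }
    a.commit(1000);
    a.allocated
}

fn main() {
    assert_eq!(footprint(false), 3000); // 3x the live data: 2 GB overhead
    assert_eq!(footprint(true), 2000); // with the workaround: 1 GB overhead
}
```

The `footprint(true)` case corresponds to the empty-write-transaction workaround: the empty commit does no allocation but still drains the freed tree, so the next real write reuses those pages instead of growing the file.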