cberner / redb

An embedded key-value database in pure Rust
https://www.redb.org
Apache License 2.0

Slow to open large database #386

Closed: casey closed this 1 year ago

casey commented 2 years ago

We recently restarted ord, and the database was pretty slow to open, on the order of a minute or two. I'm not sure it was the culprit, but I think this was the first time we used a WriteStrategy::Checksum database on mainnet. Under what circumstances will redb check all the checksums for a database using WriteStrategy::Checksum? Is it every time the database opens, or only after a crash?

cberner commented 2 years ago

It should only do that after a crash. I opened a PR to add a log message so that you can verify whether that's the issue.

casey commented 2 years ago

We switched from WriteStrategy::Checksum to WriteStrategy::TwoPhase, and we're still encountering this. Opening a fully synced index takes a minute or two. RUST_LOG=trace doesn't print anything out on open, and it always happens, so I think this isn't related to crash recovery.

The slow call is this commit, which is just making sure the expected tables exist, so it isn't doing any actual work.
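For reference, that commit is roughly the following shape (a minimal sketch, not ord's actual code; the table name is a made-up placeholder and the API details vary a bit between redb versions):

```rust
use redb::{Database, TableDefinition};

// Placeholder table standing in for ord's real tables.
const HEIGHT_TO_HASH: TableDefinition<u64, &[u8]> = TableDefinition::new("HEIGHT_TO_HASH");

fn ensure_tables(db: &Database) -> Result<(), redb::Error> {
    let txn = db.begin_write()?;
    // Opening a table in a write transaction creates it if it doesn't already exist.
    txn.open_table(HEIGHT_TO_HASH)?;
    // Nothing was inserted, so this commit should only touch metadata and fsync,
    // yet it is the call that takes a minute or two.
    txn.commit()?;
    Ok(())
}
```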

cberner commented 2 years ago

Does it still happen if you use a Durability::None transaction? Also, what OS are you running this on? Assuming those tables already exist, nothing should happen in that commit except some metadata changes and then an fsync.
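For the experiment, the durability can be dropped on the write transaction before committing; a minimal sketch, assuming the same no-op commit as above:

```rust
use redb::{Database, Durability};

fn commit_without_fsync(db: &Database) -> Result<(), redb::Error> {
    let mut txn = db.begin_write()?;
    // Durability::None skips the fsync at commit time, so the commit may not survive
    // a crash; here it only serves to check whether the fsync is the slow part.
    txn.set_durability(Durability::None);
    txn.commit()?;
    Ok(())
}
```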

casey commented 2 years ago

This is on macOS, and it doesn't happen when Durability::None is used for the write transaction.

cberner commented 2 years ago

Interesting, so the only big difference with a Durability::None commit is that it skips the fsync. Do you have access to a Linux machine you can test it on? macOS has enough weird mmap behavior that this might just be a macOS issue.

casey commented 2 years ago

I just ran it on the ordinals.com server, which is Linux:

$ time /usr/local/bin/ord --bitcoin-data-dir /var/lib/bitcoind --data-dir /var/lib/ord --height-limit 0 index
real  0m45.156s
user  0m6.556s
sys 0m21.695s

And here's macOS:

$ time ./target/release/ord --height-limit 0 index
./target/release/ord --height-limit 0 index  2.78s user 84.62s system 30% cpu 4:45.39 total

So it's way slower on Mac, 4 minutes and 45 seconds vs 45 seconds on Linux, but they're both pretty slow.

Is there anything about the way the mmap is being opened, like something in the options, that could cause the system to do extra work when opening it?

cberner commented 2 years ago

Certainly nothing obvious. The code for that is here, and it just opens the mapping with the SHARED flag. The fact that it happens at commit time makes me think that fsync/msync has to do some kind of setup the first time it flushes an mmap'ed region.
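For anyone following along, the syscall pattern under discussion looks roughly like this (an illustrative sketch using the libc crate, not redb's actual code): the file is mapped with MAP_SHARED, and a durable commit ends with msync over the mapping plus fsync of the file descriptor.

```rust
use std::{fs::OpenOptions, os::unix::io::AsRawFd, ptr};

// Illustrative only: map a file shared, then flush it the way a durable commit would.
fn map_and_flush(path: &str, len: usize) -> std::io::Result<()> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    let fd = file.as_raw_fd();

    let addr = unsafe {
        libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED, // writes through the mapping go back to the file
            fd,
            0,
        )
    };
    assert_ne!(addr, libc::MAP_FAILED);

    // ... writes through the mapping would happen here ...

    unsafe {
        // Flush dirty pages of the mapping, then the file itself via the fd.
        assert_eq!(libc::msync(addr, len, libc::MS_SYNC), 0);
        assert_eq!(libc::fsync(fd), 0);
        libc::munmap(addr, len);
    }
    Ok(())
}
```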

casey commented 2 years ago

Should we leave this open? Maybe nothing can be done, but leaving it open might attract someone who knows what the issue is.

cberner commented 2 years ago

Let's re-open, at least until I have time to profile it and see whether it's a redb bug or just poor performance in the syscall.

casey commented 2 years ago

It would also be good to do a comparison with LMDB. If LMDB is slow, it's likely just the syscall; if it's not slow, then it's either a bug in redb, or maybe LMDB has some kind of trick to make it fast.

casey commented 2 years ago

I asked a friend of mine who's working on a project with a 300 GiB LMDB database, and he said that it opens instantly, so there must be some difference in the syscalls.

cberner commented 1 year ago

While looking at #436 I thought of something you could try. Try commenting out the fsync in Mmap https://github.com/cberner/redb/blob/52c3b481daa3c26a42673e3e91c01147a7a8ab8b/src/tree_store/page_store/mmap.rs#L128-L148 and see if that resolves your slowness. That's the only significant difference I could find in the syscalls that redb and lmdb make.

casey commented 1 year ago

While looking at #436 I thought of something you could try. Try commenting out the fsync in Mmap

Isn't that the fsync that actually flushes data to disk?

cberner commented 1 year ago

Yes. I just mean as a way to determine whether that's the problem, not as an actual fix.

casey commented 1 year ago

I've been digging into this:

So it looks like for some reason, on opening and commit, redb thinks that it has to zero out a whole bunch of pages.

It seems to look at the oldest live read and then base what it needs to free on that, so maybe something isn't getting updated properly in a prior transaction, like an old read transaction.

I've been experimenting with a single 96GiB ord index snapshot taken at block 700,000. There could be something wrong with this index.redb, although I believe we've seen this issue multiple times with other index.redbs. I'm going to update redb to 0.9.0, try building another snapshot at the same height, and see if I can reproduce it with that.

casey commented 1 year ago

Also, after the slow open and commit, re-opening is instant, and the dbg!(page) in free doesn't print anything. So I guess the question now is: Did we do something weird on our end to leave index.redb in this state? I'll report back after I rebuild the snapshot with the latest redb.

cberner commented 1 year ago

Oh, this makes complete sense actually, and is another data point in favor of not zeroing out memory on free. So here's what is almost certainly happening:

  1. You build the index, which involves a large number of large transactions. Pages that were committed before a transaction starts and are deleted during that transaction are only freed when the next transaction commits.
  2. Therefore, when your index finishes building, there is a large list of pending pages to free.
  3. These are freed when you next open the database and perform the first commit().

Try adding an extra empty commit at the end of your indexing, and be sure there are no open read transactions. That should clear out the pending frees, so that when you re-open the database file there is no pending work to do.
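A minimal sketch of that extra commit, assuming indexing has finished and no read transactions are open:

```rust
use redb::Database;

fn flush_pending_frees(db: &Database) -> Result<(), redb::Error> {
    // An empty durable commit gives redb a chance to process the pages left
    // pending by the previous (large) commit, so the next open of the file
    // has no backlog of frees to work through.
    let txn = db.begin_write()?;
    txn.commit()?;
    Ok(())
}
```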

casey commented 1 year ago

I'm not sure I understand 1. Just to clarify, is this the sequence of events?

  1. WTX 1 allocates page A and commits.
  2. WTX 2 deletes page A and commits.
  3. WTX 3 does nothing and commits. Page A is freed here.

Why isn't page A freed by WTX 2's commit?

When we're building the index, is the list of pending pages from the last commit cleared after every commit? Or does it build up until there's a final empty commit?

We can definitely add an extra commit after indexing, although in general, we can't be sure that there are no open read transactions, since the server could be serving requests concurrently.

Not zeroing out pages is definitely fine for ord. Maybe remove it and figure out how to make it configurable when someone wants it?

cberner commented 1 year ago

Just to clarify, is this the sequence of events?

Yes, that's right.

Why isn't page A freed by WTX 2's commit?

Because WTX 2 could crash before it commits, so it's only safe to free A after WTX 2 commits

The pending free list is cleared, as much as possible, at the end of every commit. Read transactions will block pages from being freed if it's possible for the read to reach that page, i.e. if the read was created before the page was added to the pending free list.
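Put concretely, the WTX 1/2/3 sequence from the question maps onto redb calls roughly like this (an illustrative sketch; the table, key, and value are made up):

```rust
use redb::{Database, TableDefinition};

const TABLE: TableDefinition<u64, &[u8]> = TableDefinition::new("example");

fn freeing_sequence(db: &Database) -> Result<(), redb::Error> {
    // WTX 1: writes a value, allocating page A.
    let wtx1 = db.begin_write()?;
    {
        let mut t = wtx1.open_table(TABLE)?;
        let value = vec![0u8; 4096];
        t.insert(&1u64, value.as_slice())?;
    }
    wtx1.commit()?;

    // WTX 2: deletes the value. After this commit page A is unreachable, but it
    // only goes onto the pending free list; it can't be reclaimed earlier because
    // a crash before WTX 2 commits would still need the old data.
    let wtx2 = db.begin_write()?;
    {
        let mut t = wtx2.open_table(TABLE)?;
        t.remove(&1u64)?;
    }
    wtx2.commit()?;

    // WTX 3: an empty commit. Provided no read transaction old enough to see
    // page A is still open, this is where A actually gets freed (and, at the
    // time of this issue, zeroed).
    let wtx3 = db.begin_write()?;
    wtx3.commit()?;
    Ok(())
}
```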

Not zeroing out pages is definitely fine for ord. Maybe remove it and figure out how to make it configurable when someone wants it?

Ya. The current state of affairs gives a false sense of security, since, as you discovered, a large amount of deleted data can still be in the database file.

casey commented 1 year ago

Because WTX 2 could crash before it commits, so it's only safe to free A after WTX 2 commits

Ah, okay, that makes sense.

The pending free list is cleared, as much as possible, at the end of every commit. Read transactions will block pages from being freed if it's possible for the read to reach that page, i.e. if the read was created before the page was added to the pending free list.

Gotcha. We're doing single massive commits of 5k blocks at a time, so it makes sense for each to have a massive list of pages to free, and thus zero out.

Ya. The current state of affairs gives a false sense of security, since, as you discovered, a large amount of deleted data can still be in the database file.

Right, definitely.