cberner / redb

An embedded key-value database in pure Rust
https://www.redb.org
Apache License 2.0

Table::last performance cliff for low cache size #870

Closed · marvin-j97 closed this issue 3 weeks ago

marvin-j97 commented 4 weeks ago

I don't know if this is expected performance behaviour, but I found it during benchmarking, and it may have pretty serious implications for the data-to-cache ratio and read performance of specific workloads.

Basically, the benchmark just writes small items in monotonic key order and reads the last one (Table::last). The same probably happens with other operations such as Table::first.

Depending on the item count and cache size, at some point read performance absolutely falls off a cliff.

The cache is set to 4 MB (pretty low), and after 140k (small!) items, the read performance tanks by 6x.
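
Roughly, the benchmarked access pattern looks like the following (a sketch only, not the reproduction gist linked further down; the table name, value size, and write batching are made up, and the exact builder/last() signatures depend on the redb version):

```rust
use redb::{Database, ReadableTable, TableDefinition};

// Hypothetical table layout; the real benchmark is in the linked gist.
const TABLE: TableDefinition<u64, &[u8]> = TableDefinition::new("data");

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 4 MB cache, as described above.
    let db = Database::builder()
        .set_cache_size(4 * 1024 * 1024)
        .create("bench.redb")?;

    for i in 0u64..250_000 {
        // Write small items in monotonically increasing key order.
        let txn = db.begin_write()?;
        {
            let mut table = txn.open_table(TABLE)?;
            table.insert(&i, [0u8; 32].as_slice())?;
        }
        txn.commit()?;

        // Periodically read the newest entry; this is the call that hits the cliff.
        if i % 10_000 == 0 {
            let read_txn = db.begin_read()?;
            let table = read_txn.open_table(TABLE)?;
            let _last = table.last()?;
        }
    }
    Ok(())
}
```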

Excerpt:

db len now 100000
did 1075700 reads in 1s
db len now 110000
did 1088200 reads in 1s
db len now 120000
db len now 130000
did 1086900 reads in 1s
db len now 140000
did 1089600 reads in 1s
db len now 150000
db len now 160000
did 723000 reads in 1s
db len now 170000
did 206700 reads in 1s
db len now 180000
db len now 190000
did 167100 reads in 1s
db len now 200000
did 169500 reads in 1s
db len now 210000
db len now 220000
did 169900 reads in 1s
db len now 230000
did 173000 reads in 1s
db len now 240000
did 184500 reads in 1s
did 219800 reads in 1s
-- at this point writes are done --
Ok(DatabaseStats { tree_height: 4, allocated_pages: 4477, leaf_pages: 4385, branch_pages: 86, stored_leaf_bytes: 8000000, metadata_bytes: 1196540, fragmented_bytes: 9120772, page_size: 4096 })
did 220000 reads in 1s
did 220600 reads in 1s
...

Even after all 250k items have been written, read performance does not recover, even though only the very last part of the keyspace is being read. I wouldn't expect that: per the DatabaseStats above, the tree height is only 4, so realistically only a handful of 4K pages should be needed to reach any leaf page, right? To prove the point, if I wait for everything to be written and replace db.last() with db.get(LAST_ITEM_ID), reads are 20x faster. That seems to confirm that this cache size is large enough for fast tree traversal, and that range-style operations specifically are what degrades.
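
The two read variants being compared look roughly like this (a sketch against redb's public API, reusing the hypothetical table from the sketch above; LAST_ITEM_ID is just the largest key written):

```rust
use redb::{Database, ReadableTable, TableDefinition};

const TABLE: TableDefinition<u64, &[u8]> = TableDefinition::new("data");
const LAST_ITEM_ID: u64 = 249_999; // largest key written in the benchmark

fn read_last_two_ways(db: &Database) -> Result<(), Box<dyn std::error::Error>> {
    let read_txn = db.begin_read()?;
    let table = read_txn.open_table(TABLE)?;

    let _slow = table.last()?;             // range-style read that hits the cliff
    let _fast = table.get(&LAST_ITEM_ID)?; // point lookup; ~20x faster here
    Ok(())
}
```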

Increasing the cache to 8 MB pushes the cliff out to maybe 400k items, 16 MB to about 800k, and so on, even though the read pattern is not random at all.

[figure: memory usage comparison between the tested engines]

And here's a comparative benchmark, even though it's hard to compare directly (I ran redb multiple times just to be sure). Sled uses tons of memory, as it doesn't seem to obey the configured cache size. LMDB is memory-mapped and leans on the OS page cache, so it's basically cheating too. Fjall is an LSM-tree and can jump straight to the correct block, so its cache can be very small (probably under 1 MB) in this scenario for a basically unbounded number of items.

Here's a reproduction script: https://gist.github.com/marvin-j97/8f3ec281bfeeebec2532580a51d741b3

cberner commented 3 weeks ago

Ah yes, the implementation needs to be optimized. Creating a range does two lookups (one for the beginning and one for the end of the range), which is why it's less efficient than get().
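
As a toy illustration of that cost difference (this is not redb's internal code; it just counts simulated root-to-leaf walks over a sorted array):

```rust
use std::cell::Cell;

// Toy model: resolving a range touches both of its endpoints, so last()-via-range
// costs two root-to-leaf searches where a point get() costs one.
struct Tree {
    keys: Vec<u64>,
    descents: Cell<usize>, // number of simulated root-to-leaf walks
}

impl Tree {
    fn search(&self, key: u64) -> Option<usize> {
        self.descents.set(self.descents.get() + 1);
        self.keys.binary_search(&key).ok()
    }

    fn get(&self, key: u64) -> Option<u64> {
        self.search(key).map(|i| self.keys[i]) // one descent
    }

    fn last_via_range(&self) -> Option<u64> {
        let _start = self.search(*self.keys.first()?); // descent #1: range start
        let end = self.search(*self.keys.last()?)?;    // descent #2: range end
        Some(self.keys[end])
    }
}

fn main() {
    let tree = Tree { keys: (0..1_000).collect(), descents: Cell::new(0) };
    let _ = tree.get(999);
    println!("get():            {} descent(s)", tree.descents.get()); // 1
    tree.descents.set(0);
    let _ = tree.last_via_range();
    println!("last() via range: {} descent(s)", tree.descents.get()); // 2
}
```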

cberner commented 3 weeks ago

To prove the point, if I wait for everything to be written and replace db.last() with db.get(LAST_ITEM_ID), reads are 20x faster

Did you close the database and re-open it when you ran this test? And if so, did you do that for both last() and get()? I wasn't able to reproduce your 20x number when comparing them.

I think there are two issues. One is cache management, but that affects both last() and get(): when the database is closed and re-opened, the cache is cleared, and that would explain most of the 20x difference you observed.
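
(A cold-cache comparison would mean dropping and reopening the database between the write and the read phase; a minimal sketch, with a placeholder path:)

```rust
use redb::Database;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "bench.redb";
    let db = Database::create(path)?;
    // ... write phase ...
    drop(db); // closing the handle discards the in-process page cache

    // Reopen so that both last() and get() start from a cold cache.
    let db = Database::open(path)?;
    // ... read phase ...
    let _ = db;
    Ok(())
}
```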

The second issue is that last() does two lookups, like I mentioned. This one I have a fix for.

cberner commented 3 weeks ago

https://github.com/cberner/redb/pull/871

marvin-j97 commented 3 weeks ago

Did you close the database and re-open it when you ran this test? And if so, did you do that for both last() and get()? I wasn't able to reproduce your 20x number when comparing them.

Clean database, though this seems to depend heavily on how you read. If you read while inserting, the performance cliff happens. If you instead let it write all 250k items first and only then do first(), last(), get(0), and get(LAST), it performs much better:

Read while insert:
first(): 240k reads per second after all inserts are done
last(): 220k reads per second after all inserts are done
get(0): 540k reads per second after all inserts are done
get(LAST): 420k reads per second after all inserts are done

Only read after insert:
first(): 890k reads per second
last(): 900k reads per second
get(0): 1.1M reads per second
get(LAST): 2.4M reads per second

cberner commented 3 weeks ago

Ah, I found the issue. There was a race in the tracking of the remaining read cache space. Thanks for the report!