tbg opened this issue 1 week ago
After a compaction, the first time we read a block from a newly written file, the read has to go through the OS; typically it is served from the OS page cache rather than the disk.
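As a conceptual sketch of why that first read still costs a syscall even when the page cache has the data (hypothetical types, not Pebble's actual code: the real block cache is keyed by cache ID, file number, and offset):

```go
// Blocks from a freshly compacted file always miss the block cache on first
// access, because the cache key includes the (new) file's identity. The miss
// falls back to an OS read, which the page cache may serve from memory.
package sketch

import "os"

// blockCache is a stand-in for Pebble's block cache.
type blockCache map[string][]byte

func readBlock(c blockCache, f *os.File, key string, off int64, n int) ([]byte, error) {
	if b, ok := c[key]; ok {
		return b, nil // block cache hit: no syscall
	}
	// Miss: issue a pread(2). If the OS page cache still holds the data
	// (e.g. it was just written by the compaction), this is cheap, but it
	// still crosses the syscall boundary and pays decompression cost.
	b := make([]byte, n)
	if _, err := f.ReadAt(b, off); err != nil {
		return nil, err
	}
	c[key] = b // populate the cache for subsequent reads
	return b, nil
}
```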
https://github.com/cockroachdb/pebble/issues/2543 tracks improving this.
I also have an idea about copying entire data blocks when possible during compactions, and keeping them in the block cache if they were already there. I thought we had an issue tracking that, but I can't seem to find it right now.
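Roughly, the idea looks like the sketch below (hypothetical names and interfaces, purely illustrative):

```go
// When a compaction copies an input data block verbatim into the output file,
// and that block is resident in the block cache under the input file's key,
// re-insert the same bytes under the output file's key so the first read of
// the new file does not have to go back to the OS.
package sketch

type blockKey struct {
	fileNum uint64
	offset  uint64
}

type cache interface {
	Get(k blockKey) ([]byte, bool)
	Set(k blockKey, v []byte)
}

// retainAcrossCompaction would be called by the compaction after it has
// copied a block unchanged from inKey to outKey.
func retainAcrossCompaction(c cache, inKey, outKey blockKey) {
	if b, ok := c.Get(inKey); ok {
		// The block was hot before the compaction; keep it hot under the
		// new file's identity instead of forcing a cold read later.
		c.Set(outKey, b)
	}
}
```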
We can look at bandwidth metrics to see how much data we actually read from the disk.
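For example, one low-tech way to sample read bandwidth on Linux is to difference the cumulative sectors-read counter in /proc/diskstats over a window (the device name below is an assumption; substitute the disk backing /mnt/data1):

```go
// Sample cumulative sectors read (field 6 of /proc/diskstats, 512-byte
// sectors) twice and compute the average read bandwidth in between.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

func readBytes(device string) (uint64, error) {
	f, err := os.Open("/proc/diskstats")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 5 && fields[2] == device {
			sectors, err := strconv.ParseUint(fields[5], 10, 64)
			if err != nil {
				return 0, err
			}
			return sectors * 512, nil // diskstats sectors are 512 bytes
		}
	}
	return 0, fmt.Errorf("device %s not found", device)
}

func main() {
	const dev = "nvme0n1" // assumption: the disk behind /mnt/data1
	before, _ := readBytes(dev)
	time.Sleep(10 * time.Second)
	after, _ := readBytes(dev)
	fmt.Printf("read bandwidth: %.1f MB/s\n", float64(after-before)/10/1e6)
}
```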
Thanks! So we are reading around 10 MB/s. Note that the OS should be using the free memory for the page cache; moving some of that memory to Pebble would actually decrease the total amount of cached data (the Pebble block cache stores uncompressed blocks, while the page cache holds them compressed as they are on disk).
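To make that trade-off concrete, here is a back-of-the-envelope comparison; the 2x compression ratio and the 8 GiB figure are assumptions for illustration, not measurements:

```go
// A byte of OS page cache holds compressed blocks, a byte of Pebble block
// cache holds uncompressed blocks, so shifting memory from the former to the
// latter shrinks the total logical data cached. The upside of block cache
// hits is that they avoid both the syscall and the decompression.
package main

import "fmt"

func main() {
	const (
		memGiB           = 8.0 // memory being considered for reassignment
		compressionRatio = 2.0 // assumed uncompressed:compressed ratio
	)
	asPageCache := memGiB * compressionRatio // logical GiB cached if left to the OS
	asBlockCache := memGiB * 1.0             // logical GiB cached by Pebble (uncompressed)
	fmt.Printf("as OS page cache:      ~%.0f GiB of logical data\n", asPageCache)
	fmt.Printf("as Pebble block cache: ~%.0f GiB of logical data\n", asBlockCache)
}
```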
In sysbench oltp_read_write (10x10m rows), we see ~0.5% of CPU time in syscalls reading SSTs, presumably due to block cache misses. The dataset at this point is ~39 GB per node (per `df -h /mnt/data1`), which is not much larger than available memory. At the same time, actual memory usage on these nodes is only around 30-40%. It stands to reason that performance could improve with a larger block cache. We should validate this and, if so, understand why we're not using more memory on this workload out of the box.
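One way to validate this is simply to rerun the workload with a larger block cache. A minimal sketch of what that looks like at the Pebble level is below; in CockroachDB the analogous knob is the `--cache` flag on `cockroach start`. The 8 GiB size is illustrative, not a recommendation:

```go
// Open a Pebble store with an explicitly sized block cache. Reads that hit
// the block cache are served from memory, already uncompressed, without a
// syscall; misses fall through to the filesystem.
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	c := pebble.NewCache(8 << 30) // 8 GiB block cache; refcounted, release below
	defer c.Unref()

	db, err := pebble.Open("demo-db", &pebble.Options{Cache: c})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```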
Jira issue: CRDB-43618