
storage: sstable readahead seems absent or ineffective #92869

Open · jbowens opened this issue 1 year ago

jbowens commented 1 year ago

The Cockroach Cloud telemetry cluster experiences large spikes in read IOPs during full backups. @sumeerbhola observed and calculated that at one data point, a node was performing 3.5K IOPS to read 65MB/s, which works out to be 19KB per operation. He points out that 19KB suspiciously matches the size of a single 32KB block post-compression.

Our expectation is that sstable-level readahead (explicitly through readahead syscalls or implicitly through fadvise(POSIX_FADV_SEQUENTIAL)) should cause a sequential scan like a full backup to perform much larger reads, up to 256KB.
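
For reference, a minimal sketch of the two hinting mechanisms mentioned above, using golang.org/x/sys/unix on Linux. This is not CockroachDB/Pebble code, and the file path is made up; it only illustrates the syscalls themselves:

```go
// Sketch of the two Linux readahead hints: fadvise(FADV_SEQUENTIAL) and an
// explicit readahead(2). Illustrative only; the path is hypothetical.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/mnt/data1/example.sst") // hypothetical sstable path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fd := int(f.Fd())

	// Implicit: tell the kernel we'll read sequentially, so its own
	// readahead window can grow instead of staying small.
	if err := unix.Fadvise(fd, 0, 0, unix.FADV_SEQUENTIAL); err != nil {
		log.Printf("fadvise: %v", err)
	}

	// Explicit: ask the kernel to populate the page cache for a specific
	// range ahead of the reads we're about to issue.
	if err := unix.Readahead(fd, 0, 256<<10); err != nil {
		log.Printf("readahead: %v", err)
	}

	// Subsequent small reads should then be served from the page cache.
	buf := make([]byte, 32<<10)
	if _, err := f.ReadAt(buf, 0); err != nil {
		log.Printf("read: %v", err)
	}
}
```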

internal Slack link

Jira issue: CRDB-22012

jbowens commented 1 year ago

This did not reproduce with the backup/2TB roachtest on GCP or AWS:

[screenshot: 2022-12-09 4:32 PM]

On GCP with local SSD, at one data point, a node read 352.81 MB via 1,547 read operations, which nets out to 228 KB/op. On AWS with EBS, at one data point, a node read 124.93 MB via 558 read operations, which nets out to 224 KB/op.

I'm suspicious that the difference is in the block cache effectiveness. In these roachtests, there'd been no prior workload, and the backup scans consistently missed the block cache.

[screenshot: 2022-12-09 4:45 PM]

During the telemetry cluster's backups, the block cache hit ratio is unperturbed.

[screenshots: 2022-12-09 4:49 PM]

The sstable iterator has two different readahead-accounting code paths, one for block cache misses and one for hits. Maybe the hit code path has a bug that loses its knowledge of sequential access?
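
For context, a sequential-readahead heuristic typically keeps a small amount of per-iterator state: the offset a strictly sequential reader would touch next, and the current readahead window, which ramps up while reads stay contiguous. A simplified sketch of that state (not Pebble's actual readaheadState; the names are illustrative) shows why skipping the update on one of the two paths would quietly break the ramp-up:

```go
// Illustrative sketch of a sequential-readahead heuristic; not Pebble's
// implementation. The key property: recordRead must be called on every
// block read, block cache hit or miss, or the sequential signal is lost.
package sstable

const (
	initialReadahead = 64 << 10  // 64KB starting window
	maxReadahead     = 256 << 10 // 256KB ceiling
)

type readaheadState struct {
	nextExpectedOffset int64 // offset a strictly sequential reader would hit next
	size               int64 // current readahead window
}

// recordRead notes a block read at [offset, offset+length) and returns how
// much readahead to request (0 when the access pattern looks random).
func (rs *readaheadState) recordRead(offset, length int64) int64 {
	if rs.size == 0 {
		rs.size = initialReadahead
	}
	sequential := offset == rs.nextExpectedOffset
	rs.nextExpectedOffset = offset + length
	if !sequential {
		rs.size = initialReadahead // random access: reset the window
		return 0
	}
	ra := rs.size
	if rs.size < maxReadahead {
		// Sequential access: double the window up to the ceiling.
		rs.size *= 2
		if rs.size > maxReadahead {
			rs.size = maxReadahead
		}
	}
	return ra
}
```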

The backupTPCC roachtest runs a tpcc workload before and during the backup, so maybe it will show the smaller reads?

jbowens commented 1 year ago

I'm taking a second look at this. I'm looking at one node during one recent backup.

[screenshots: 2023-02-16 1:28-1:34 PM]

It looks like the spikes in export requests are incremental backups running concurrently with the long-running full backup. When there's not a concurrent backup, the size of the reads seems reasonable: ~50KB-110KB. During incremental backups, the average size of reads tanks.

Interestingly, the magnitude of the incremental backup's read IOPS spikes is disproportionately high during full backups:

[screenshot: 2023-02-16 1:53 PM]

I checked the LSM stats of a few nodes, and the size of the table cache is about equal to the number of files in the LSM, so the full backup isn't trashing the table cache.

I suspect we're seeing the effect of the full backup trashing the OS cache. Ordinarily the sstables we read during incremental backup are ones we recently wrote and are likely in the OS cache. During the full backup, reading the entire LSM replaces the OS's cache with a ~random set of files' blocks.

To reduce the magnitude of the IOPS spikes, I think we'd need to throttle the incremental backup requests more aggressively.

sumeerbhola commented 1 year ago

> I suspect we're seeing the effect of the full backup trashing the OS cache. Ordinarily the sstables we read during incremental backup are ones we recently wrote and are likely in the OS cache. During the full backup, reading the entire LSM replaces the OS's cache with a ~random set of files' blocks.

Just trying to confirm the basic math and interpretation.

> To reduce the magnitude of the IOPS spikes, I think we'd need to throttle the incremental backup requests more aggressively.

It may be better to rethink the reliance on the page cache for reads entirely. The behavior is unpredictable even with a local filesystem, and it is irrelevant with a remote object storage. I think we will need to implement a local in-memory cache of compressed blocks as a second caching tier.
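
To sketch what that second tier could look like: a compressed-block cache would sit between the existing uncompressed block cache and the filesystem, so a block-cache miss can often be served by an in-memory decompression rather than a disk read, which is roughly the role the OS page cache plays today. The names below are hypothetical and this is not a proposal for Pebble's actual API, just an illustration of the lookup path:

```go
// Hypothetical two-tier block lookup: uncompressed block cache first, then a
// compressed-block cache, then the filesystem. Illustrative only.
package cache

type blockKey struct {
	fileNum uint64
	offset  uint64
}

type twoTierReader struct {
	uncompressed map[blockKey][]byte // stand-in for the existing block cache
	compressed   map[blockKey][]byte // stand-in for the proposed second tier
	readFromDisk func(blockKey) ([]byte, error)
	decompress   func([]byte) ([]byte, error)
}

func (r *twoTierReader) readBlock(k blockKey) ([]byte, error) {
	// Tier 1: decompressed block already cached.
	if b, ok := r.uncompressed[k]; ok {
		return b, nil
	}
	// Tier 2: compressed bytes cached in memory; decompressing is cheap
	// relative to a disk read.
	if cb, ok := r.compressed[k]; ok {
		b, err := r.decompress(cb)
		if err != nil {
			return nil, err
		}
		r.uncompressed[k] = b
		return b, nil
	}
	// Miss in both tiers: go to disk, then populate both tiers.
	cb, err := r.readFromDisk(k)
	if err != nil {
		return nil, err
	}
	r.compressed[k] = cb
	b, err := r.decompress(cb)
	if err != nil {
		return nil, err
	}
	r.uncompressed[k] = b
	return b, nil
}
```

Whether a disk miss should populate both tiers, and how memory gets split between them, would be the real design questions.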

jbowens commented 1 year ago

> Can we trust those thin spikes in IOPS to 2.5K/s? That is, are we sure that the read bandwidth is not also spiking similarly (by ~4x and not just ~2x)?

Good question. I'm not sure how much precision we have.

> Full backup should presumably be able to hit the maximum readahead of 256KB.

It looks like in practice during a full backup (w/o any concurrent incremental backup) we're seeing on average ~75 MB / 600 iops = 125KB reads.

Another potential factor here: I calculated the average on-disk size of a replica on this node to be ~80MB. If reading .9 * 80 MB = 72 MB from L6 only used the 256KB reads, that'd be 281 reads. If the remaining 8 MB all used 32KB reads, they'd take an additional 250 reads.
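
Carrying that arithmetic one step further (a quick back-of-the-envelope check using the numbers above): ~281 + ~250 ≈ 531 reads for ~80 MB works out to roughly 150KB per read on average, even with readahead behaving as intended.

```go
// Back-of-the-envelope check of the read counts above (decimal MB/KB).
package main

import "fmt"

func main() {
	const (
		replicaMB   = 80.0
		largeReadKB = 256.0 // max readahead
		smallReadKB = 32.0  // single block
	)
	largeReads := 0.9 * replicaMB * 1000 / largeReadKB // ~281
	smallReads := 0.1 * replicaMB * 1000 / smallReadKB // 250
	total := largeReads + smallReads                   // ~531
	fmt.Printf("reads: %.0f + %.0f = %.0f\n", largeReads, smallReads, total)
	fmt.Printf("avg read size: %.0f KB\n", replicaMB*1000/total) // ~150 KB
}
```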

> Are you saying that we may be issuing the readahead, but by the time we get around to reading what was read ahead, it may be gone from the page cache? This sounds plausible. Do you know the replacement policy for the page cache?

That wasn't what I was saying, but I think that's a good question. I'm not sure, especially with readahead. If a readahead page has never been accessed, is its last access when the readahead was requested, or never?

I was saying that I thought usually incremental backups infrequently hit the disk because the OS page cache contains all the recently flushed+compacted data. I would guess that the telemetry cluster is read-little, write-heavy. But a full backup is a deviation from that status quo, and the incremental backups find that recently flushed+compacted data has been replaced with pages loaded during the full backup scan.

> It may be better to rethink the reliance on the page cache for reads entirely. The behavior is unpredictable even with a local filesystem, and it is irrelevant with a remote object storage. I think we will need to implement a local in-memory cache of compressed blocks as a second caching tier.

Makes sense

sumeerbhola commented 1 year ago

> Another potential factor here: I calculated the average on-disk size of a replica on this node to be ~80MB. If reading .9 * 80 MB = 72 MB from L6 only used the 256KB reads, that'd be 281 reads. If the remaining 8 MB all used 32KB reads, they'd take an additional 250 reads.

Good observation.

jbowens commented 1 year ago

I'm going to move this to backlog for tracking the continued investigation of our reliance on the OS page cache.

joshimhoff commented 1 year ago

> It may be better to rethink the reliance on the page cache for reads entirely. The behavior is unpredictable even with a local filesystem, and it is irrelevant with a remote object storage. I think we will need to implement a local in-memory cache of compressed blocks as a second caching tier.

Some thoughts: What we want specifically may be read-ahead without relying on the OS page cache. We rely on read-ahead to keep our spend of iops down, but currently we only get read-ahead through the OS page cache. The OS page cache is not in our control, and its other goal of caching frequently used blocks is different from enabling read-ahead, and one that the pebble block cache is arguably better equipped for.

Not totally sure about this, but... the thing to do is not necessarily to add another layer of in-memory caching, but instead to think about how to enable read-ahead in a way that plays nice with typical workloads & the existing pebble block cache. The best solution may be adding another cache, as then cache A's goal can be holding frequently accessed blocks & cache B's goal can be enabling read-ahead. OTOH, do we need a cache for read-ahead? Or do we just need a buffer instead? Perhaps the meaning of "cache" is not worth worrying about...
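
One way to picture the buffer-not-cache option: a per-iterator prefetch buffer that does one large read and serves the following sequential block reads from memory, with no shared state or eviction policy at all. A rough sketch under those assumptions (hypothetical types, not an existing Pebble interface):

```go
// Hypothetical per-iterator prefetch buffer: one large read serves many
// sequential block reads. No shared cache, no eviction; the buffer dies
// with the iterator. Illustrative only.
package prefetch

import "io"

type prefetchReader struct {
	r       io.ReaderAt
	bufSize int64 // e.g. 256 << 10
	buf     []byte
	bufOff  int64 // file offset of buf[0]
}

// ReadBlock returns length bytes at offset, refilling the buffer with a
// single large read whenever the request falls outside it.
func (p *prefetchReader) ReadBlock(offset, length int64) ([]byte, error) {
	if offset < p.bufOff || offset+length > p.bufOff+int64(len(p.buf)) {
		if length > p.bufSize {
			p.bufSize = length
		}
		buf := make([]byte, p.bufSize)
		n, err := p.r.ReadAt(buf, offset)
		if err != nil && err != io.EOF {
			return nil, err
		}
		p.buf, p.bufOff = buf[:n], offset
		if int64(n) < length {
			return nil, io.ErrUnexpectedEOF
		}
	}
	start := offset - p.bufOff
	return p.buf[start : start+length], nil
}
```

The trade-off versus a shared cache is that nothing is retained across iterators, which is fine for a backup scan but doesn't help repeated point reads.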

> it is irrelevant with a remote object storage

It is a little unclear to me how read-ahead works conceptually here. With the GCS / S3 clients, do we currently make any attempt to do read-ahead, or at least to read in large blocks when doing big scans? I am not sure how thick those clients are, either in our code or in the GCS / S3 client code we are wrapping. If no one has a good mental model here, I can dig.
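
For what it's worth, the Go GCS client does let the caller pick the request size via ranged reads, so read-ahead over object storage could in principle be done client-side by reading large ranges; whether any of our wrappers actually do this is exactly the open question above. A minimal sketch using cloud.google.com/go/storage (bucket and object names are hypothetical, and this is not what our backup code currently does):

```go
// Minimal sketch of issuing one large ranged read against GCS rather than
// many small ones. Bucket/object names are hypothetical.
package main

import (
	"context"
	"io"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	obj := client.Bucket("example-backup-bucket").Object("data/000123.sst")

	// One 8MB ranged GET; a sequential consumer can slice blocks out of
	// this range instead of paying per-request latency for each 32KB block.
	r, err := obj.NewRangeReader(ctx, 0, 8<<20)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	n, err := io.Copy(io.Discard, r)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d bytes in a single ranged request", n)
}
```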

jbowens commented 1 year ago

> What we want specifically may be read-ahead without relying on the OS page cache.

I think this problem exists beyond the scope of read-ahead too. I believe what we're seeing here on the telemetry cluster is not really read-ahead specific. It's that our current performance relies on cheapish block cache misses enabled by the OS page cache. During a full backup, the OS page cache gets trashed and block cache misses become expensive. Conceptually, the OS page cache is already a secondary cache of compressed blocks, so shifting the OS page cache's memory into a secondary cache of compressed blocks should be roughly equivalent. I think it's an open question whether that memory would be better spent enlarging the block cache or holding a secondary compressed block cache. I imagine it has a lot to do with the compressibility of blocks.

joshimhoff commented 1 year ago

That makes sense. There are two sources of unnecessarily high disk usage in the ops & bandwidth sense. Thrashed OS page cache means losing read-ahead means higher iops usage than is ideal. Thrashed OS page cache means more misses in general so more disk usage in general, both iops & bandwidth. I think the experiments I am doing with O_DIRECT right now may help us quantify this, tho it is sort of hard to distinguish between the two inefficiencies. I guess with my O_DIRECT experimentation we will have strictly no read ahead. I guess if we see iops elevated compared to a control cluster but bandwidth steady, we can prob assume the main problem is losing read-ahead. Actually, I think that applies to this production issue too! Do we see elevated iops & bandwidth or just elevated iops?
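
For context on what the O_DIRECT experiment removes: opening a file with O_DIRECT bypasses the page cache entirely, so the kernel performs no readahead and every read goes to the device, with buffer/offset/length alignment requirements pushed onto the caller. A minimal Linux-only sketch (the path is hypothetical, and this is not the actual experiment code):

```go
// Sketch of a direct-I/O read on Linux: O_DIRECT skips the page cache, so
// there is no kernel readahead and the buffer/offset/length must be aligned
// (4096 used here). Path is hypothetical; illustrative only.
package main

import (
	"io"
	"log"
	"os"
	"unsafe"

	"golang.org/x/sys/unix"
)

const alignment = 4096

// alignedBuf returns a size-byte slice whose backing array starts on an
// alignment boundary, as required for O_DIRECT reads.
func alignedBuf(size int) []byte {
	raw := make([]byte, size+alignment)
	off := alignment - int(uintptr(unsafe.Pointer(&raw[0]))%alignment)
	return raw[off : off+size]
}

func main() {
	f, err := os.OpenFile("/mnt/data1/example.sst", os.O_RDONLY|unix.O_DIRECT, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := alignedBuf(256 << 10) // 256KB read at an aligned offset
	n, err := f.ReadAt(buf, 0)
	if err != nil && err != io.EOF {
		log.Fatal(err)
	}
	log.Printf("read %d bytes directly from the device", n)
}
```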

joshimhoff commented 1 year ago

I see both read iops & bandwidth were elevated up above. I guess it's hard because we expect both to be elevated at full BACKUP time; the issue is that the thrashed OS page cache can lead to even higher usage than we expect. Anyway, I shall run the experiments now. Hopefully we learn something.

joshimhoff commented 1 year ago

Doing some related experimentation with O_DIRECT at https://github.com/cockroachdb/cockroach/issues/98345, tho the goals there are a bit different from understanding the issue documented in this ticket in more detail.