cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv, storage: rebalance replicas when disk throughput / IOPS drops #62168

Open itsbilal opened 3 years ago

itsbilal commented 3 years ago

We occasionally see instances of large production clusters where one node inexplicably got a slower disk (often an AWS/GCP local SSD), and the replicas on that node kept falling further and further behind the rest of the cluster in writes. And since storage-level compactions also consume disk write throughput, the most obvious symptom of this is often compactions backing up and Pebble read amplification increasing.

When a node is disproportionately slower at committing to disk than other nodes, replicas on that node need to be rebalanced away so that its disk doesn't continue to be overloaded with writes.

One metric that can be observed to identify disk slowness is command commit latency; since a LogData has to wait for batches ahead of it to be written to the WAL, an increase in the latency of a LogData call would signal a slow disk. We already leverage LogData as part of node liveness heartbeats: before a node heartbeats its liveness record, it does a LogData to each store's engine. Here's the associated comment from liveness.go, which suggests that leases already move off a node when this latency becomes excessive:

            // We synchronously write to all disks before updating liveness because we
            // don't want any excessively slow disks to prevent leases from being
            // shifted to other nodes. A slow/stalled disk would block here and cause
            // the node to lose its leases.
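
For concreteness, here is a minimal Go sketch of that idea, timing a synchronous no-op write per store and flagging an outsized latency; the engine interface, the 500ms cutoff, and all names here are hypothetical, not the actual liveness code.

    package main

    import (
        "fmt"
        "time"
    )

    // engine stands in for a per-store storage engine; in the real code the
    // durable no-op write is a Pebble batch commit / LogData.
    type engine interface {
        logData(data []byte) error
    }

    // slowDiskThreshold is an invented cutoff, purely for illustration.
    const slowDiskThreshold = 500 * time.Millisecond

    // worstSyncLatency issues a synchronous no-op write to every store's
    // engine and returns the slowest observed latency.
    func worstSyncLatency(engines []engine) (time.Duration, error) {
        var worst time.Duration
        for _, eng := range engines {
            start := time.Now()
            if err := eng.logData(nil); err != nil {
                return 0, err
            }
            if d := time.Since(start); d > worst {
                worst = d
            }
        }
        return worst, nil
    }

    // fakeEngine simulates a disk whose writes take a fixed amount of time.
    type fakeEngine struct{ latency time.Duration }

    func (f fakeEngine) logData([]byte) error { time.Sleep(f.latency); return nil }

    func main() {
        engines := []engine{fakeEngine{1 * time.Millisecond}, fakeEngine{5 * time.Millisecond}}
        worst, _ := worstSyncLatency(engines)
        fmt.Println("slow disk?", worst > slowDiskThreshold)
    }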

Other candidate metrics to react to could include changes in disk write ops, or cross-node differences in disk write ops; the production instances of this issue that we've observed tend to show significantly lower disk write ops on the affected node than on other nodes (in what is still an IO-bound workload).
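
As a rough illustration of that cross-node comparison (the metric plumbing and the 0.5 cutoff below are assumptions, not anything that exists in the codebase):

    package main

    import (
        "fmt"
        "sort"
    )

    // writeOpsOutliers returns the node IDs whose disk write-ops rate is below
    // `fraction` of the cluster median; under an IO-bound workload such a node
    // is a candidate slow disk.
    func writeOpsOutliers(writeOpsByNode map[int]float64, fraction float64) []int {
        rates := make([]float64, 0, len(writeOpsByNode))
        for _, r := range writeOpsByNode {
            rates = append(rates, r)
        }
        sort.Float64s(rates)
        median := rates[len(rates)/2] // rough median is fine for a sketch

        var outliers []int
        for node, r := range writeOpsByNode {
            if r < fraction*median {
                outliers = append(outliers, node)
            }
        }
        sort.Ints(outliers)
        return outliers
    }

    func main() {
        // Node 3 is doing far fewer write ops than its peers, the symptom
        // described above.
        ops := map[int]float64{1: 900, 2: 950, 3: 220, 4: 880}
        fmt.Println(writeOpsOutliers(ops, 0.5)) // [3]
    }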

gz#9005

Jira issue: CRDB-2831

Epic CRDB-39952

itsbilal commented 3 years ago

cc @nvanbenschoten, @aayushshah15, and @sumeerbhola - would be happy to hear your thoughts on this

aayushshah15 commented 3 years ago

Something like this came up in a KV weekly many moons ago, and it was received coldly due to the challenges of properly validating such a heuristic under realistic scenarios and of making sure there are no false positives that could lead to thrashing.

We probably want to err on the side of caution and pick a metric that can reliably signal disk slowness on a particular node in the cluster. A big hurdle with anything like this is that the allocator needs to be able to simulate the resulting state of the “sender” (the store that’s relinquishing a replica) after a purported rebalancing decision. With the current set of signals that we rebalance on (the store’s range count, total QPS across all replicas on the store, etc.), this simulation is easy to do, but it isn’t really feasible for metrics like log-commit / command-commit latency or read amplification: it’s hard to quantify how much of a difference one rebalancing decision will make to the metric, and the ramifications of the rebalance won’t be visible until much later.

Consider that without some way to simulate the resulting state, the allocator has no way of knowing when to stop moving replicas / leases off of a node[1].

A notable exception to what I just said is "disk fullness". However, since we eagerly remove replicas that have been moved away, the feedback loop is tight enough that [1] is not a problem.
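
To make the simulation point concrete, here is a toy Go sketch (the types and the 10% margin are invented): for a signal like QPS the allocator can subtract the candidate replica's contribution from the sender and add it to the receiver before committing to the move; there is no analogous arithmetic for log-commit latency or read amplification.

    package main

    import "fmt"

    type storeLoad struct {
        RangeCount int
        QPS        float64
    }

    // simulateRebalance returns the sender and receiver load as they would
    // look after moving a replica that serves replicaQPS.
    func simulateRebalance(sender, receiver storeLoad, replicaQPS float64) (storeLoad, storeLoad) {
        sender.RangeCount--
        sender.QPS -= replicaQPS
        receiver.RangeCount++
        receiver.QPS += replicaQPS
        return sender, receiver
    }

    // shouldRebalance reports whether the move narrows the QPS gap without
    // overshooting (the 0.9 margin is a hypothetical stand-in for the
    // allocator's real convergence checks). Nothing like this can be computed
    // for log-commit latency or read amplification: moving one replica changes
    // those metrics by an unknown amount, and only after a delay.
    func shouldRebalance(sender, receiver storeLoad, replicaQPS float64) bool {
        newSender, newReceiver := simulateRebalance(sender, receiver, replicaQPS)
        return newSender.QPS >= newReceiver.QPS*0.9
    }

    func main() {
        s := storeLoad{RangeCount: 120, QPS: 2000}
        r := storeLoad{RangeCount: 100, QPS: 1200}
        fmt.Println(shouldRebalance(s, r, 300)) // true: the gap narrows
    }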

nvanbenschoten commented 3 years ago

@itsbilal in the original description here, you mentioned basing liveness decisions on the latency of LogData as a way to detect a slow disk. You also mentioned that slow disks can cause an increase in read amplification.

Is there a reason you didn't propose using read amplification itself as a signal to cut a node's liveness? In some sense, this may actually be a better indicator of health, because it changes on a much longer timescale than the latency of a single LogData operation. In recent issues, we have seen that read amplification can severely impact a node's ability to serve traffic. For the sake of the workload, it would be better for a node with high read amplification to shed all of its leases. This would probably also be better for the node itself, as it would free up some headroom to compact more aggressively.

So one straightforward mechanism to add here would be a read amplification threshold (defaulting to 200?) above which a node will not heartbeat node liveness.
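
A minimal sketch of that gate, assuming the read amplification figure comes from each store's LSM metrics; the function names, wiring, and error handling here are hypothetical, only the threshold of 200 is from the suggestion above.

    package main

    import (
        "errors"
        "fmt"
    )

    // readAmpHeartbeatThreshold is the suggested default above which a node
    // would stop heartbeating node liveness.
    const readAmpHeartbeatThreshold = 200

    // storeMetrics is a stand-in for per-store LSM metrics (in the real code
    // this would come from Pebble's Metrics()).
    type storeMetrics struct {
        ReadAmplification int
    }

    var errTooUnhealthy = errors.New("refusing to heartbeat: read amplification too high")

    // maybeHeartbeatLiveness skips the heartbeat while any store on the node
    // is above the threshold; letting liveness lapse sheds leases and frees
    // up headroom to compact.
    func maybeHeartbeatLiveness(stores []storeMetrics, heartbeat func() error) error {
        for _, m := range stores {
            if m.ReadAmplification > readAmpHeartbeatThreshold {
                return errTooUnhealthy
            }
        }
        return heartbeat()
    }

    func main() {
        stores := []storeMetrics{{ReadAmplification: 35}, {ReadAmplification: 240}}
        err := maybeHeartbeatLiveness(stores, func() error { return nil })
        fmt.Println(err)
    }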

itsbilal commented 3 years ago

@nvanbenschoten That's a great point and I don't have a counterpoint to it; read amp on its own could be a great indicator of storage engine health. I was just trying to think of something that would also catch "slow disks" more directly, before they cause compactions to back up enough to increase read amp, but just going with read amp might be the simplest and most concrete thing here, even if it's a bit of a lagging indicator.

And would "not heartbeating liveness" for long enough result in replicas being proactively rebalanced elsewhere? Or will the node just remain "suspect" but not dead throughout?

sumeerbhola commented 3 years ago

Is there a reason you didn't propose using read amplification itself as a signal to cut a node's liveness? ... So one straightforward mechanism to add here would be a read amplification threshold (default to 200?) above which a node will not heartbeat node liveness.

One thing to keep in mind is that when admission control is enabled, the ioLoadListener will use a sub-level count threshold of 10 (l0SubLevelCountOverloadThreshold) to start throttling traffic, so read amplification may never become high enough. I think we need a low-level measure of slowness and not an emergent signal. This is related to my comment on the "suspect nodes" PR https://github.com/cockroachdb/cockroach/pull/63999#issuecomment-834400058.
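
To illustrate that interaction, here is a toy sketch; only the threshold of 10 corresponds to l0SubLevelCountOverloadThreshold, and the token arithmetic is invented. Once throttling kicks in at 10 sub-levels, read amplification is unlikely to ever approach a node-level cutoff like 200.

    package main

    import "fmt"

    const l0SubLevelOverloadThreshold = 10

    // admissionTokens returns how many write tokens to hand out in the next
    // interval. Once the store crosses the sub-level threshold the rate is
    // cut, which caps how far read amplification can climb.
    func admissionTokens(l0SubLevels, baseTokens int) int {
        if l0SubLevels < l0SubLevelOverloadThreshold {
            return baseTokens // unthrottled
        }
        // Hypothetical: shrink tokens in proportion to the overload factor.
        return baseTokens * l0SubLevelOverloadThreshold / (2 * l0SubLevels)
    }

    func main() {
        for _, sub := range []int{4, 10, 20, 40} {
            fmt.Printf("sub-levels=%d tokens=%d\n", sub, admissionTokens(sub, 1000))
        }
    }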

tbg commented 3 years ago

We also have https://github.com/cockroachdb/cockroach/blob/0e0082d16bd9a19b5c11662c3d83cdfe54c5e506/pkg/server/node_engine_health.go#L54-L76 and https://github.com/cockroachdb/cockroach/blob/90f4522a15d9d328bbcea55d6f45599154fd4fa6/pkg/storage/pebble.go#L637-L657, so we could entertain enabling that cluster setting with a more aggressive WAL sync interval.
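
A compact sketch of that kind of check, with a placeholder engine interface and made-up interval/timeout values rather than the real cluster setting: sync the WAL periodically and report any sync that outruns a deadline as a slow or stalled disk.

    package main

    import (
        "fmt"
        "time"
    )

    type syncer interface {
        // Sync durably flushes the WAL.
        Sync() error
    }

    // monitorWALSync runs until stop is closed, reporting every sync that
    // takes longer than maxSync (including syncs that are still outstanding
    // when the deadline fires, i.e. a stalled disk).
    func monitorWALSync(eng syncer, interval, maxSync time.Duration, stop <-chan struct{}, report func(time.Duration)) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-stop:
                return
            case <-ticker.C:
                done := make(chan time.Duration, 1)
                go func() {
                    start := time.Now()
                    _ = eng.Sync()
                    done <- time.Since(start)
                }()
                select {
                case d := <-done:
                    if d > maxSync {
                        report(d)
                    }
                case <-time.After(maxSync):
                    report(maxSync) // sync still outstanding
                }
            }
        }
    }

    type fakeDisk struct{ latency time.Duration }

    func (f fakeDisk) Sync() error { time.Sleep(f.latency); return nil }

    func main() {
        stop := make(chan struct{})
        time.AfterFunc(350*time.Millisecond, func() { close(stop) })
        monitorWALSync(fakeDisk{120 * time.Millisecond}, 100*time.Millisecond, 50*time.Millisecond, stop,
            func(d time.Duration) { fmt.Println("slow WAL sync; waited", d) })
    }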

We don't have good testing here, but I think even when liveness is gone there are still ways for this node to cause issues, whether it's blocking out the UI or leaving requests hanging that arrived before liveness went dark and leases were shed.

edit: checking read amp before the liveness heartbeat, as Nathan suggested, is a great start though.

sumeerbhola commented 3 years ago

Adding some discussion (with tiny edits) from an internal thread.

@nvanbenschoten: I don’t have a feel for how much we actually expect a given LSM to be able to ingest without getting overloaded and losing the read amplification game. We see an imbalance, but that alone doesn’t mean that we should expect one store to get unhealthy. If one store were ingesting 1KB/s and another 1MB/s, we’d be massively imbalanced but still healthy; we’d just be suboptimal in terms of latency. Meanwhile, if all stores are balanced and ingesting 1GB/s, we’re going to be in trouble. So where is the cutoff? Do we have a ballpark understanding of the ingest data rate that Pebble can sustain? Is 20MB/s (from here) above this threshold? Is it approaching it? We’ve seen that the disks that redacted is running can push well over 500MB/s, but of course write amplification will prevent us from pushing that hard, so I’d expect the limit to be some fraction of that.

@sumeerbhola:

aayushshah15 commented 3 years ago

So one straightforward mechanism to add here would be a read amplification threshold (default to 200?) above which a node will not heartbeat node liveness.

Before a node heartbeats its liveness record, today it will do a no-op commit here: https://github.com/cockroachdb/cockroach/blob/78688ea3a0be50ffd5136416134927a54215b2d5/pkg/kv/kvserver/liveness/liveness.go#L1239-L1247

As that comment describes, this is mostly meant to guard against an excessively slow disk, but in most of the cases we actually see in the wild it ends up effectively guarding against high read amplification. Was your suggestion that we would check read amplification in addition to this check?

And would "not heartbeating liveness" for long enough result in replicas being proactively rebalanced elsewhere? Or will the node just remain "suspect" but not dead throughout?

If the node weren't able to heartbeat for the duration specified by the server.time_until_store_dead cluster setting (which defaults to 5m), it would be classified as dead by the allocator and we'd start moving replicas off of it. @nvanbenschoten, this seems a bit undesirable to me. What do you think?
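
As a sketch of that classification (names are illustrative; only the 5m default of server.time_until_store_dead is from the comment above): a store becomes dead once its last successful heartbeat is older than the setting, while a recently failed heartbeat only makes it suspect.

    package main

    import (
        "fmt"
        "time"
    )

    // timeUntilStoreDead mirrors the default of the server.time_until_store_dead
    // cluster setting.
    const timeUntilStoreDead = 5 * time.Minute

    type storeStatus int

    const (
        storeLive storeStatus = iota
        storeSuspect // excluded from new leases/replicas, nothing moved off
        storeDead    // replicas are proactively rebalanced away
    )

    // classifyStore decides how the allocator should treat a store given when
    // it last heartbeat liveness and whether a recent heartbeat failed.
    func classifyStore(now, lastHeartbeat time.Time, recentlyFailedHeartbeat bool) storeStatus {
        if now.Sub(lastHeartbeat) > timeUntilStoreDead {
            return storeDead
        }
        if recentlyFailedHeartbeat {
            return storeSuspect
        }
        return storeLive
    }

    func main() {
        now := time.Now()
        fmt.Println(classifyStore(now, now.Add(-6*time.Minute), true) == storeDead)     // true
        fmt.Println(classifyStore(now, now.Add(-30*time.Second), true) == storeSuspect) // true
    }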

aayushshah15 commented 3 years ago

I also wanted to gather thoughts on how people feel about the status quo after @lunevalex's change to mark nodes as "suspect" if they have failed any of their liveness heartbeats in the recent past. Nodes that are marked suspect are considered invalid by the allocator for the purposes of receiving lease transfers and new replicas. However, we don't proactively move anything off of suspect nodes.

This patch, combined with the no-op Pebble commit I mentioned in my comment above, effectively puts us in a place where a node with excessively high read amplification will somewhat regularly struggle to heartbeat its liveness record. Thus, such nodes will lose all their currently held leases, and we won't proactively transfer any new leases to them for as long as they're marked "suspect". With this in mind, how do we feel about the current state of affairs?
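
A tiny sketch of that status quo from the allocator's side (names are hypothetical): suspect and dead stores are skipped when picking lease-transfer or rebalance targets, but replicas already sitting on a suspect store are left where they are.

    package main

    import "fmt"

    type candidateStore struct {
        ID      int
        Suspect bool
        Dead    bool
    }

    // validTargets keeps only the stores that may receive new leases or
    // replicas. Nothing here moves existing replicas off of suspect stores,
    // which is exactly the open question in this comment.
    func validTargets(stores []candidateStore) []int {
        var ids []int
        for _, s := range stores {
            if s.Suspect || s.Dead {
                continue
            }
            ids = append(ids, s.ID)
        }
        return ids
    }

    func main() {
        stores := []candidateStore{{ID: 1}, {ID: 2, Suspect: true}, {ID: 3}}
        fmt.Println(validTargets(stores)) // [1 3]
    }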

github-actions[bot] commented 9 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!