cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: bump snapshot rates even higher #74695

Open irfansharif opened 2 years ago

irfansharif commented 2 years ago

Describe the problem

We recently bumped our snapshot rates from 8 MB/s to 32 MB/s after observing that the defaults were too conservative for production clusters (and after cluster-years of experience with higher rates). It's possible that 32 MB/s is still too conservative; @a-entin reports from the field that 256 MB/s is our standard first-step recommendation for new clusters. This issue tracks bumping the default above 32 MB/s.

Additional context

Higher rates may amplify the problems described in https://github.com/cockroachdb/cockroach/issues/74694, which should perhaps be fixed semi-independently.

Jira issue: CRDB-12215

a-entin commented 2 years ago

A study done in our lab for a customer has relevant insights. Search the internal wiki for "Cluster Expansion", from September 2021 (sorry, I cannot include a link, since it would reveal proprietary info).

erikgrinaker commented 2 years ago

We shouldn't do this until we implement throttling of snapshot ingestion (e.g. #77491 or #73720). Otherwise, if the transfer rate is higher than about half of the disk throughput, then we can end up ingesting snapshot SSTs faster than the recipient can compact them away. We've seen this cause read amp blowup in several incidents.
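As a rough illustration of the rule of thumb above, here is a minimal Go sketch that flags a snapshot rate exceeding half of a disk's sustained throughput. The function name `exceedsCompactionBudget` and the 400 MB/s disk figure are hypothetical and chosen only for the example; this is not code from CockroachDB, and the "about half" threshold is an approximation, not a hard limit.

```go
package main

import "fmt"

// exceedsCompactionBudget reports whether a snapshot transfer rate is above
// the rule-of-thumb limit of half the disk's sustained throughput, beyond
// which snapshot SSTs may be ingested faster than they can be compacted.
// Both arguments are in MB/s. The helper is hypothetical, for illustration.
func exceedsCompactionBudget(snapshotRateMBps, diskThroughputMBps float64) bool {
	return snapshotRateMBps > diskThroughputMBps/2
}

func main() {
	// Assumed disk throughput; substitute a measured value for a real node.
	const diskMBps = 400.0

	// The three rates mentioned in this issue.
	for _, rate := range []float64{8, 32, 256} {
		fmt.Printf("rate=%v MB/s disk=%v MB/s risky=%t\n",
			rate, diskMBps, exceedsCompactionBudget(rate, diskMBps))
	}
}
```

With the assumed 400 MB/s disk, only the 256 MB/s rate crosses the half-throughput line; on a faster disk, the same rate would not.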

a-entin commented 2 years ago

> We've seen this cause read amp blowup in several incidents.

In these incidents, what was the rate set to? Was it at or above 256 MB/s? The field needs to know, because I proactively instruct all my customers to use 256 MB/s. It has multiple material benefits and I have never heard of any negatives, so I'm asking for evidence of when it causes undesirable side effects and what they are.

erikgrinaker commented 2 years ago

256 MB/s.

a-entin commented 2 years ago

Good to know, thx! Sobering. Can you share pointers to these issues?

erikgrinaker commented 2 years ago

You can see an example incident in this internal escalation: cockroachlabs/support#1507.

FYI, snapshot rates were a red herring in the above escalation. I do believe we've seen them cause read amp in other escalations, especially with underprovisioned disks, so we may want to hold off on this until we implement some sort of throttling. But to be clear, 256 MB/s is fine as long as the disks can handle it.

github-actions[bot] commented 11 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!