Open irfansharif opened 2 years ago
A study done in our lab for a customer has relevant insights. Search the internal wiki for "Cluster Expansion", from September 2021 (sorry, I cannot include a link, as it would reveal proprietary info).
We shouldn't do this until we implement throttling of snapshot ingestion (e.g. #77491 or #73720). Otherwise, if the transfer rate is higher than about half of the disk throughput, then we can end up ingesting snapshot SSTs faster than the recipient can compact them away. We've seen this cause read amp blowup in several incidents.
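To make the rule of thumb above concrete, here is a minimal sketch of the check it implies. The function names and the 0.5 headroom factor are illustrative assumptions taken from the "about half of the disk throughput" estimate, not anything in CockroachDB itself, and the disk throughput figures are hypothetical.

```python
# Hypothetical helpers: the names and the default 0.5 factor are assumptions
# based on the "about half of the disk throughput" rule of thumb in this thread.

def max_safe_snapshot_rate(disk_throughput_mb_s: float, headroom: float = 0.5) -> float:
    """Return the highest snapshot rate (MB/s) allowed by the rule of thumb."""
    if disk_throughput_mb_s <= 0:
        raise ValueError("disk throughput must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return disk_throughput_mb_s * headroom


def snapshot_rate_is_risky(rate_mb_s: float, disk_throughput_mb_s: float,
                           headroom: float = 0.5) -> bool:
    """True if the snapshot rate exceeds the rule-of-thumb budget for the disk."""
    return rate_mb_s > max_safe_snapshot_rate(disk_throughput_mb_s, headroom)


if __name__ == "__main__":
    # Hypothetical disk throughputs (MB/s), checked against the two rates
    # discussed in this issue.
    for disk in (125, 400, 1000):
        for rate in (32, 256):
            verdict = "risky" if snapshot_rate_is_risky(rate, disk) else "ok"
            print(f"disk={disk} MB/s rate={rate} MB/s -> {verdict}")
```

Under these assumptions, 256 MB/s only passes the check on disks that sustain roughly 512 MB/s or more, which is why underprovisioned disks are the concern.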
In these incidents, what was the rate set to? Was it at or above 256 MB/s? The field needs to know, because I proactively instruct all my customers to use 256 MB/s. It has multiple material benefits and I have never heard of any negatives, so I am asking for evidence of when it causes undesirable side effects and what they are.
256 MB/s.
Good to know, thx! Sobering. Can you share pointers to these issues?
You can see an example incident in this internal escalation: cockroachlabs/support#1507.
FYI, snapshot rates were a red herring in the above escalation. I do believe we've seen them cause read amp in other escalations, especially with underprovisioned disks, so we may want to hold off on this until we implement some sort of throttling. But to be clear, 256 MB/s is fine as long as the disks can handle it.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Describe the problem
We recently bumped our snapshot rates from 8 MB/s to 32 MB/s after observing that the defaults were too conservative for production clusters (and after cluster-years of experience with higher rates). It's possible that 32 MB/s is still too conservative; @a-entin reports from the field that 256 MB/s is our standard first-step recommendation for new clusters. This issue tracks bumping the default higher still.
Additional context
It's possible that higher rates may amplify the problems described in https://github.com/cockroachdb/cockroach/issues/74694, which should perhaps be fixed semi-independently.
Jira issue: CRDB-12215