Closed tbg closed 1 year ago
cc @cockroachdb/replication
Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@cockroachdb/disaster-recovery @msbutler What limits the rate of writes in the restore tests? Can we reduce it a bit so that we don't eat up all the disk throughput, as on this graph? Or are we interested in running these tests at the edge of capacity, for max-performance tracking purposes?
(This is as an alternative to provisioning throughput beyond 125 MB/s, which is not free.)
@pavelkalinnikov At the sql level, nothing limits the write rate of restore. At the kvserver level, there are a few things that I'm less familiar with. I know Irfan recently set the concurrentAddSStable limiter to default off on master (not on 23.1) https://github.com/cockroachdb/cockroach/pull/104861.
These tests were written in part to answer the following question: "given a hardware configuration and workload, what's our restore throughput?". Since this test seems to be bottlenecked on hardware, it seems reasonable to use beefier hardware. I'm going to chat with my team about it tomorrow.
On the flipside, it would be a bad look to tell customers "restore needs to be run on beefier machines or else a node could oom".
Comparing performance of the `restore/tpce/8TB/aws/nodes=10/cpus=8` test with 125 MB/s and 250 MB/s disks:

| | 125 MB/s | 250 MB/s |
| --- | --- | --- |
| CPU | ~30-50% | ~60% |
| Mem | ~6 GB / 16 GB | ~6 GB / 16 GB |
| Read | <5 MB/s | 2-10 MB/s |
| Write | <125 MB/s | 70-170 MB/s |
| Time | 3h40m / 5h | 3h10m / 5h |
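For context, a back-of-the-envelope bound (my own arithmetic, not from the test logs; it assumes replication factor 3 and ignores compression and compaction write amplification) shows how the disk tier sets the floor on restore time:

```go
package main

import "fmt"

func main() {
	// Assumptions (mine, not from the test logs): 8 TB of logical data,
	// replication factor 3, 10 nodes, and no compression or compaction
	// write amplification.
	const logicalMB = 8e6 // 8 TB expressed in MB
	const replicas = 3
	const nodes = 10
	perNodeMB := float64(logicalMB) * replicas / nodes // MB each node must write

	for _, mbps := range []float64{125, 250} {
		hours := perNodeMB / mbps / 3600
		fmt.Printf("%.0f MB/s disks: >= %.1fh if every write hit the disk at full rate\n", mbps, hours)
	}
}
```

Compression shrinks the actual on-disk volume, which is why the observed 3h40m beats this naive 125 MB/s bound; the takeaway is only that the throughput floor halves when moving to 250 MB/s disks.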
@msbutler The prototype for provisioning extra throughput is in #108427. I've tested it and compared 125 MB/s and 250 MB/s disks, see the message above. At 125 MB/s we're maxing out the throughput; at 250 MB/s we have some leeway, so I think we should be good with 250?
Agreed with your point that we should fix the OOMs rather than require beefier machines. That is probably what we'll do next, but we want to avoid unnecessary test flakes in the meantime.
@pavelkalinnikov thanks for experimenting with this here! Here's what the DR team thinks:
Some follow-up questions:
Thanks @msbutler. The plan makes sense.
I will design a restore roachtest (which can stay skipped) that attempts to saturate disk bandwidth and reliably produce OOMs. My plan is to run a 400GB restore on a single-node cluster.
The OOMs in this test manifest in the interaction between raft nodes, e.g. when many leaders (on the same node) queue up too many / too large log entries going to followers [#73376], or when a slow follower receives and buffers too many / too large incoming updates. I think it's best to repro this with 3 nodes/replicas; see some ideas that @tbg noted here.
When do you expect raft memory monitoring to land?
I think we will likely be considering it for 24.1.
Reopening this issue, to bump throughput in other tests too.
Currently, other restore tests (even small 400GB ones) max out the 125 MB/s throughput, e.g. see `restore/tpce/400GB/aws/nodes=4/cpus=8` at https://github.com/cockroachdb/cockroach/issues/106248#issuecomment-1673632320 and `restore/tpce/8TB/aws/nodes=10/cpus=8` at https://github.com/cockroachdb/cockroach/issues/107609#issuecomment-1671415670.
On GCE, a test of the same scale, `restore/tpce/400GB/gce/nodes=4/cpus=8`, gets disks with higher throughput, which is usually not maxed out:
@msbutler I think we should bump throughput to 250 MB/s in all AWS restore tests, as this issue originally suggested. This would both reduce the likelihood of OOMs and bring some parity between the tests.
Describe the problem
A number of restore tests are failing on AWS with OOMs in the replication layer.
While we improve the memory accounting in the replication layer, it's not useful for these tests to keep failing.
We are fairly confident that the OOMs can be traced to disk overload. Default AWS gp3 volumes come with a combined read+write throughput of 125 MB/s, which is routinely maxed out in these tests. Combined with load imbalances in the cluster, this can lead to OOMs in the replication layer today.
We should double the provisioned write bandwidth and make sure that this is the default on AWS for the restore tests. This should be easy, as these tests have their own little framework (i.e. all tests go through a central code path).
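As a minimal sketch of what that central code path could look like (the type, option, and function names below are illustrative, not the actual roachtest API):

```go
package main

import "fmt"

// ClusterSpec is a simplified stand-in for the restore tests' shared cluster
// configuration; the real roachtest spec type differs.
type ClusterSpec struct {
	Cloud              string
	VolumeThroughputMB int // provisioned disk throughput, MB/s
}

// Option mutates a ClusterSpec; a hypothetical functional-options pattern.
type Option func(*ClusterSpec)

// WithVolumeThroughput is a hypothetical per-test override.
func WithVolumeThroughput(mbps int) Option {
	return func(s *ClusterSpec) { s.VolumeThroughputMB = mbps }
}

// makeRestoreSpec models the central code path all restore tests go through:
// on AWS, default gp3 volumes give 125 MB/s, so we provision 250 MB/s unless
// a test overrides it.
func makeRestoreSpec(cloud string, opts ...Option) ClusterSpec {
	s := ClusterSpec{Cloud: cloud, VolumeThroughputMB: 125}
	if cloud == "aws" {
		s.VolumeThroughputMB = 250
	}
	for _, o := range opts {
		o(&s)
	}
	return s
}

func main() {
	fmt.Println(makeRestoreSpec("aws").VolumeThroughputMB) // 250
	fmt.Println(makeRestoreSpec("gce").VolumeThroughputMB) // 125
}
```

A default applied in one shared constructor like this keeps the 250 MB/s choice in a single place while still letting an individual test opt out.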
Holistically addressing the OOMs is tracked in CRDB-25503.
Jira issue: CRDB-30126