cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: bump AWS provisioned write bandwidth for all restore tests #107609

Closed tbg closed 1 year ago

tbg commented 1 year ago

Describe the problem

A number of restore tests are failing on AWS with OOMs in the replication layer.

While we improve the memory accounting in the replication layer, it's not useful for these tests to keep failing.

We are fairly confident that the OOMs can be traced to disk overload. Default AWS gp3 volumes come provisioned with a combined read+write throughput of 125 MB/s, which these tests routinely max out. Combined with load imbalances in the cluster, this can lead to OOMs in the replication layer today.

We should double the provisioned write bandwidth and make sure that this is the default on AWS for the restore tests. This should be easy, as these tests have their own little framework (i.e. all tests go through a central code path).
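
For illustration, the change could look something like the sketch below, applied in the restore tests' shared spec code path. All names here (hardwareSpecs, ebsThroughputMBPerSec, makeRestoreClusterSpec) are hypothetical stand-ins rather than the actual roachtest API; the real change would set the equivalent knob wherever that framework builds its AWS cluster spec.

```go
// Illustrative only: a central helper, in the spirit of the restore roachtest
// framework's shared code path, that bumps provisioned EBS throughput for AWS
// runs. Type and field names are hypothetical, not the real roachtest spec API.
package main

import "fmt"

type cloud string

const (
	aws cloud = "aws"
	gce cloud = "gce"
)

// hardwareSpecs stands in for the per-test hardware description that every
// restore test feeds through one central constructor.
type hardwareSpecs struct {
	nodes                 int
	cpus                  int
	volumeSizeGB          int
	ebsThroughputMBPerSec int // AWS gp3 provisioned throughput; 0 means the 125 MB/s default
}

// makeRestoreClusterSpec is the single choke point all restore tests would use,
// so raising the default here raises it for every AWS restore test at once.
func makeRestoreClusterSpec(c cloud, hw hardwareSpecs) hardwareSpecs {
	if c == aws && hw.ebsThroughputMBPerSec == 0 {
		// gp3 volumes default to 125 MB/s combined throughput; double it.
		hw.ebsThroughputMBPerSec = 250
	}
	return hw
}

func main() {
	hw := makeRestoreClusterSpec(aws, hardwareSpecs{nodes: 10, cpus: 8, volumeSizeGB: 2000})
	fmt.Printf("nodes=%d cpus=%d throughput=%d MB/s\n", hw.nodes, hw.cpus, hw.ebsThroughputMBPerSec)
}
```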

Holistically addressing the OOMs is tracked in CRDB-25503.

Jira issue: CRDB-30126

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

blathers-crl[bot] commented 1 year ago

Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

pav-kv commented 1 year ago

@cockroachdb/disaster-recovery @msbutler What limits the rate of writes in restore tests? Can we reduce it a bit so that we don't eat up all the disk throughput as on this graph? Or are we interested in running these tests at the edge of capacity for max performance tracking purposes?

(as an alternative to increasing the throughput beyond 125 MB/s, which is not free)

msbutler commented 1 year ago

@pavelkalinnikov At the SQL level, nothing limits the write rate of restore. At the kvserver level, there are a few things that I'm less familiar with. I know Irfan recently set the concurrentAddSStable limiter to default off on master (not on 23.1): https://github.com/cockroachdb/cockroach/pull/104861
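
Not CockroachDB's actual limiter code, but for intuition, a concurrency limiter of this kind is essentially a counting semaphore gating the expensive requests. A minimal Go sketch, with a hypothetical sendAddSSTable stand-in for the real request path:

```go
// Illustrative sketch of concurrency-based throttling: a counting semaphore
// (buffered channel) caps how many AddSSTable-style requests are in flight
// per node at once.
package main

import (
	"fmt"
	"sync"
	"time"
)

// sendAddSSTable is a hypothetical stand-in for shipping one SST to KV.
func sendAddSSTable(id int) {
	time.Sleep(50 * time.Millisecond) // pretend this is the expensive write
	fmt.Printf("sst %d ingested\n", id)
}

func main() {
	const maxInFlight = 2 // the knob a limiter like this exposes
	sem := make(chan struct{}, maxInFlight)

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire: blocks once maxInFlight sends are active
			defer func() { <-sem }() // release
			sendAddSSTable(id)
		}(i)
	}
	wg.Wait()
}
```

Lowering maxInFlight in a scheme like this trades restore throughput for less pressure on the disk and the replication layer.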

These tests were written in part to answer the following question: "given a hardware configuration and workload, what's our restore throughput?" Since this test seems to be bottlenecked on hardware, it seems reasonable to use beefier hardware. I'm going to chat with my team about it tomorrow.

On the flip side, it would be a bad look to tell customers "restore needs to be run on beefier machines or else a node could OOM".

pav-kv commented 1 year ago

Comparing performance of the restore/tpce/8TB/aws/nodes=10/cpus=8 test with 125 MB/s and 250 MB/s disks:

|       | 125 MB/s      | 250 MB/s      |
|-------|---------------|---------------|
| CPU   | ~30-50%       | ~60%          |
| Mem   | ~6 GB / 16 GB | ~6 GB / 16 GB |
| Read  | <5 MB/s       | 2-10 MB/s     |
| Write | <125 MB/s     | 70-170 MB/s   |
| Time  | 3h40m / 5h    | 3h10m / 5h    |
[Screenshots: cluster metrics with provisioned 125 MB/s]

[Screenshots: cluster metrics with provisioned 250 MB/s]

pav-kv commented 1 year ago

@msbutler The prototype for provisioning extra throughput is in #108427. I've tested it, comparing 125 and 250 MB/s; see the message above. At 125 we're maxing out the throughput, while at 250 we have some leeway, so I think we should be good with 250?

Agreeing with your point that we should fix the OOMs rather than require beefier machines. We are probably doing it next, but we want to avoid unnecessary test flakes in the meantime.

msbutler commented 1 year ago

@pavelkalinnikov thanks for experimenting with this here! Here's what the DR team thinks:

  1. Since this 8TB test is clearly disk-bandwidth constrained, and since these tests attempt to find software bottlenecks rather than hardware bottlenecks, we think it makes sense to bump disk bandwidth on this test for now. I believe bumping disk bandwidth will only cost us an extra $10 a month, assuming the same test runtime (pessimistic), but do check my math (see the rough arithmetic after this list)! I don't think it makes sense to bump the machine size as well.
  2. Once your prototype lands, I can open a new restore/400GB test that also bumps the disk bandwidth to 250 MB/s and keep the existing tests as-is. This OOM only surfaces about once a month, so I don't think there's much investigative cost (open a heap profile, see that raft OOMed).
  3. I will design a restore roachtest (which can stay skipped) that attempts to saturate disk bandwidth and reliably produces OOMs. My plan is to run a 400GB restore on a single-node cluster. This test could help us tune various knobs that could avoid OOMs while we wait for raft memory monitoring to land. I lied about SQL-level knobs: restore can limit the number of workers per node that send AddSSTable requests.
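
Since item 1 invites a math check, here is a back-of-envelope version under assumed pricing: AWS bills gp3 throughput above the free 125 MB/s baseline at roughly $0.04 per provisioned MB/s-month (us-east-1 list price, worth verifying for the actual region), and the test volumes only exist for the duration of each nightly run.

```go
// Back-of-envelope cost check for item 1 above, under the assumed pricing
// described in the text; verify the per-MB/s rate for the actual region.
package main

import "fmt"

func main() {
	const (
		extraMBps      = 250 - 125 // provisioned throughput above the free 125 MB/s baseline
		pricePerMBpsMo = 0.04      // assumed $/MB/s-month for extra gp3 throughput
		volumes        = 10        // one volume per node in the 8TB test
		hoursPerDay    = 5.0       // approximate nightly runtime while the volumes exist
	)

	fullTime := extraMBps * pricePerMBpsMo * volumes // if the volumes existed 24/7
	prorated := fullTime * hoursPerDay / 24          // volumes only live during the run

	fmt.Printf("full-time: ~$%.0f/month, prorated for nightly runs: ~$%.0f/month\n", fullTime, prorated)
	// Prints roughly $50/month full-time and ~$10/month prorated, which lands in
	// the same ballpark as the estimate in item 1.
}
```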

Some follow-up questions:

pav-kv commented 1 year ago

Thanks @msbutler. The plan makes sense.

> I will design a restore roachtest (which can stay skipped) that attempts to saturate disk bandwidth and reliably produces OOMs. My plan is to run a 400GB restore on a single-node cluster.

The OOMs in this test manifest in the interaction between raft nodes, e.g. when many leaders (on the same node) queue up too many / too-large log entries going to followers (#73376), or when a follower is slow and receives and buffers too many / too-large incoming updates. I think it's best to repro with 3 nodes/replicas; see some ideas that @tbg noted here.
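
For context on where such queues can be bounded: upstream etcd/raft exposes flow-control knobs that cap per-follower in-flight append data and the leader's uncommitted-entry footprint. The sketch below uses the upstream go.etcd.io/etcd/raft/v3 API purely for illustration; CockroachDB configures its raft instances internally, and the concrete values here are arbitrary.

```go
package main

import "go.etcd.io/etcd/raft/v3"

// exampleRaftConfig shows the upstream flow-control fields that bound how much
// entry data a leader queues up: per-message size, per-follower in-flight
// message count, and total uncommitted entry bytes on the leader.
func exampleRaftConfig() raft.Config {
	return raft.Config{
		ID:            1,
		ElectionTick:  10,
		HeartbeatTick: 1,
		Storage:       raft.NewMemoryStorage(),
		// Cap the payload size of a single outgoing append message.
		MaxSizePerMsg: 32 << 10, // 32 KiB
		// Cap how many append messages may be in flight to one follower.
		MaxInflightMsgs: 128,
		// Cap the total size of proposed-but-uncommitted entries on the leader.
		MaxUncommittedEntriesSize: 1 << 30, // 1 GiB
	}
}

func main() {
	_ = exampleRaftConfig()
}
```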

> When do you expect raft memory monitoring to land?

I think we will likely be considering it for 24.1.

pav-kv commented 1 year ago

Reopening this issue to bump the throughput in other tests too.

Currently, other restore tests (even small 400GB ones) max out the 125 MB/s throughput, e.g. see restore/tpce/400GB/aws/nodes=4/cpus=8 at https://github.com/cockroachdb/cockroach/issues/106248#issuecomment-1673632320 and restore/tpce/8TB/aws/nodes=10/cpus=8 at https://github.com/cockroachdb/cockroach/issues/107609#issuecomment-1671415670.

On GCE, the same-scale test restore/tpce/400GB/gce/nodes=4/cpus=8 gets disks with higher throughput, which is not usually maxed out:

[Screenshot: disk throughput for the GCE 400GB restore test]

@msbutler I think we should bump the throughput to 250 MB/s on all AWS restore tests, as this issue originally suggested. This would both reduce the likelihood of OOMs and bring some parity between the tests.