cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: wait for system ranges to finish upreplication before running workloads #101532

Open aliher1911 opened 1 year ago

aliher1911 commented 1 year ago

Is your feature request related to a problem? Please describe.

We have had a number of test failures (for example #101438) where a test fails while initializing or starting a workload because the cluster is still upreplicating system ranges. This can slow down access to liveness, metrics, ID generators, etc., and in turn cause timeouts and test failures.

The problem is more pronounced in multi-region clusters, where latency is higher and replication bandwidth could be lower.

At the same time, this behaviour doesn't represent a realistic situation: in practice, load is not applied seconds after a cluster starts.

Describe the solution you'd like

We have similar issues with TestCluster in integration tests, where we use WaitForFullReplication() to explicitly ensure that the cluster is ready before performing any actions.

We can apply a similar approach when starting a cluster with roachprod. If starting a node triggers init because we think it is a new cluster, then we should also wait for the cluster to upreplicate all ranges. We could, for example, use WaitForReplication() before proceeding. Care must be taken in clusters with more than 3 nodes to ensure that zone configs have propagated before the replication checks are performed, so that 5x replication is actually achieved.

Tests that specifically want an underreplicated cluster, or that restart some of the nodes, should then opt out of this behaviour. We can use option.StartOpts for the opt-out.

Jira issue: CRDB-26994

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/test-eng

erikgrinaker commented 1 year ago

The main problem here is that splits will inherit the replication factor of the LHS. If we start splitting off ranges for the workload (e.g. an import) before the LHS system ranges have finished upreplicating, then they'll start off with RF=1 and have to individually upreplicate to RF=3 instead of just starting out with RF=3 in the first place. This additional work can severely impact the workload.
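As a rough illustration of the cost, the toy model below counts replica additions in the two orderings. It is a back-of-the-envelope sketch, not CockroachDB code: the function names are made up, and it assumes every range split off inherits the LHS replication factor at split time and that each missing replica costs one addition.

```go
package main

// AdditionsIfSplitFirst returns the number of replica additions needed when
// `splits` ranges are split off while the LHS is still at rfAtSplit. The LHS
// and every new range must each upreplicate to targetRF on their own.
func AdditionsIfSplitFirst(splits, rfAtSplit, targetRF int) int {
	if targetRF <= rfAtSplit {
		return 0
	}
	return (splits + 1) * (targetRF - rfAtSplit)
}

// AdditionsIfWaitFirst returns the number of replica additions needed when
// the LHS upreplicates to targetRF before any split happens. Ranges split off
// afterwards start at targetRF, so only the LHS does any work.
func AdditionsIfWaitFirst(initialRF, targetRF int) int {
	if targetRF <= initialRF {
		return 0
	}
	return targetRF - initialRF
}
```

With an illustrative 100 splits, an LHS at RF=1, and a target of RF=3, splitting first costs (100 + 1) × 2 = 202 replica additions in this model, while waiting first costs 2.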