cockroachdb / cockroach


roachtest: excessive `sync` during nightly runs #130082

Open renatolabs opened 1 month ago

renatolabs commented 1 month ago

Every time a roachprod cluster is created, we perform a Sync operation two times:

Most importantly, these Sync operations involve fetching data for all VMs, across all clouds. This happens even if the cluster created is hosted exclusively on a single cloud provider (which is true ~100% of the time today).

This excessive syncing is bad for a few reasons:

The last point is especially important because it can lead to very cryptic roachtest failures: VMs may seem to have disappeared, or may have wrong metadata (if there's a blip on the provider API).

In roachtest nightly runs, a lot of this extra work is completely unnecessary: roachtest ensures that cluster names are unique (so the first Sync can theoretically be skipped entirely). The second Sync could also be entirely skipped (if we created a cached entry from a cluster spec) or, alternatively, we could add the option to Sync only a specific cluster on a particular provider, eliminating the impact on unrelated clusters.
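A minimal sketch of the "Sync only a specific cluster on a particular provider" option might look like the following. None of this is roachprod's actual API: `Provider`, `VM`, `Cache`, `SyncCluster`, and `fakeProvider` are hypothetical names made up for illustration. The idea it tries to show is that only the named provider is queried and only the named cluster's cache entry is rewritten, so a provider API blip cannot alter unrelated clusters.

```go
package main

import "fmt"

// VM is a hypothetical, simplified view of a virtual machine.
type VM struct {
	Cluster string
	Name    string
}

// Provider is a hypothetical stand-in for a cloud provider client.
type Provider interface {
	Name() string
	// ListVMs returns the VMs of every cluster hosted on this provider.
	ListVMs() ([]VM, error)
}

// Cache is a hypothetical local cache: cluster name -> VMs.
type Cache map[string][]VM

// SyncCluster refreshes the cache entry for a single cluster, querying
// only the provider that hosts it. Entries for other clusters are never
// touched, and on error the cache is left unchanged.
func SyncCluster(cache Cache, providers []Provider, providerName, cluster string) error {
	for _, p := range providers {
		if p.Name() != providerName {
			continue // skip providers that cannot host this cluster
		}
		vms, err := p.ListVMs()
		if err != nil {
			return fmt.Errorf("listing VMs on %s: %w", providerName, err)
		}
		var mine []VM
		for _, vm := range vms {
			if vm.Cluster == cluster {
				mine = append(mine, vm)
			}
		}
		cache[cluster] = mine
		return nil
	}
	return fmt.Errorf("unknown provider %q", providerName)
}

// fakeProvider is an in-memory provider used only for this example.
type fakeProvider struct {
	name  string
	vms   []VM
	calls int // number of ListVMs calls, to show which providers were hit
}

func (p *fakeProvider) Name() string { return p.name }

func (p *fakeProvider) ListVMs() ([]VM, error) {
	p.calls++
	return p.vms, nil
}

func main() {
	aws := &fakeProvider{name: "aws", vms: []VM{{Cluster: "other", Name: "other-1"}}}
	gce := &fakeProvider{name: "gce", vms: []VM{
		{Cluster: "test-a", Name: "test-a-1"},
		{Cluster: "test-b", Name: "test-b-1"},
	}}
	// test-b already has a cached entry that must survive the sync.
	cache := Cache{"test-b": {{Cluster: "test-b", Name: "cached"}}}

	if err := SyncCluster(cache, []Provider{aws, gce}, "gce", "test-a"); err != nil {
		fmt.Println("sync failed:", err)
		return
	}
	fmt.Println("aws calls:", aws.calls, "gce calls:", gce.calls)
	fmt.Println("test-a:", cache["test-a"])
	fmt.Println("test-b:", cache["test-b"])
}
```

In this sketch the unrelated provider is never contacted and the existing `test-b` entry is left as it was. A real implementation would presumably also want the provider call itself to be scoped to the cluster (for example by filtering on a label), which is not shown here.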

Clusters are created (concurrently) several hundred times during a nightly run. Each time a cluster is created, there's a chance a completely unrelated cluster/test might be impacted due to the behaviour described here.

Jira issue: CRDB-41869

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/test-eng