Every time a roachprod cluster is created, we perform a Sync operation twice:
- Before the VMs are created, to gather all roachprod cluster names and verify that the requested cluster name does not clash with an existing cluster.
- After the VMs are created, presumably to create a cache entry for the newly created cluster.
Most importantly, these Sync operations involve fetching data for all VMs, across all clouds. This happens even if the cluster created is hosted exclusively on a single cloud provider (which is true ~100% of the time today).
This excessive syncing is harmful for a few reasons:
- It makes cluster creation a lot slower than it needs to be.
- It increases the chance of hitting rate limits on cloud provider APIs.
- It opens up the possibility of wrong data being written to the roachprod cache.
The last point is especially important because it can lead to very cryptic roachtest failures: VMs may seem to have disappeared, or may have wrong metadata (if there's a blip on the provider API).
In roachtest nightly runs, much of this extra work is unnecessary. roachtest already ensures that cluster names are unique, so the first Sync could in theory be skipped entirely. The second Sync could also be skipped if we created the cache entry directly from the cluster spec. Alternatively, we could add an option to Sync only a specific cluster on a particular provider, eliminating the impact on unrelated clusters.
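To make the provider-scoped alternative concrete, here is a minimal sketch of what syncing a single cluster on a single provider could look like. Everything in it is hypothetical and for illustration only: the `VM`, `Provider`, `Cache`, and `fakeProvider` types and the `SyncCluster` function are assumptions, not roachprod's actual API.

```go
package main

import (
	"fmt"
	"sort"
)

// VM is a simplified, hypothetical stand-in for the per-machine metadata
// that roachprod keeps in its cache.
type VM struct {
	Name     string
	Cluster  string
	Provider string
}

// Provider is a hypothetical interface; the real roachprod provider
// abstraction is different.
type Provider interface {
	Name() string
	List() ([]VM, error) // every VM this provider knows about
}

// Cache maps a cluster name to the VMs that make it up.
type Cache map[string][]VM

// fakeProvider is an in-memory provider that counts how often it is queried.
type fakeProvider struct {
	name  string
	vms   []VM
	calls int
}

func (p *fakeProvider) Name() string { return p.name }

func (p *fakeProvider) List() ([]VM, error) {
	p.calls++
	return p.vms, nil
}

// SyncCluster refreshes the cache entry for one cluster by querying only the
// provider that hosts it. Entries for other clusters are never touched, and
// the cache is left unchanged if the provider call fails.
func SyncCluster(cache Cache, p Provider, cluster string) error {
	vms, err := p.List()
	if err != nil {
		return fmt.Errorf("listing VMs on %s: %w", p.Name(), err)
	}
	var found []VM
	for _, vm := range vms {
		if vm.Cluster == cluster {
			found = append(found, vm)
		}
	}
	if len(found) == 0 {
		return fmt.Errorf("cluster %q not found on %s", cluster, p.Name())
	}
	sort.Slice(found, func(i, j int) bool { return found[i].Name < found[j].Name })
	cache[cluster] = found
	return nil
}

func main() {
	gce := &fakeProvider{name: "gce", vms: []VM{
		{Name: "new-0002", Cluster: "new", Provider: "gce"},
		{Name: "new-0001", Cluster: "new", Provider: "gce"},
	}}
	aws := &fakeProvider{name: "aws"}

	// An unrelated cluster that is already cached.
	cache := Cache{"other": {{Name: "other-0001", Cluster: "other", Provider: "aws"}}}

	if err := SyncCluster(cache, gce, "new"); err != nil {
		fmt.Println("sync failed:", err)
		return
	}
	fmt.Println("VMs cached for new:", len(cache["new"]))
	fmt.Println("calls to aws:", aws.calls)
}
```

In this sketch, a provider error while syncing one cluster cannot alter the cached entries of unrelated clusters, and providers that do not host the cluster are never queried, so they do not count against API rate limits.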
Clusters are created (concurrently) several hundred times during a nightly run. Each time a cluster is created, there's a chance a completely unrelated cluster/test might be impacted due to the behaviour described here.
Jira issue: CRDB-41869