Open jeffswenson opened 7 months ago
cc @cockroachdb/disaster-recovery
This isn't specific to PCR, or even multi-tenancy, is it? A backup of a database or cluster on one cloud restored on a cluster on another cloud could easily encounter similarly different node locality values in the restoring cluster right?
> A backup of a database or cluster on one cloud restored on a cluster on another cloud could easily encounter similarly different node locality values in the restoring cluster right?
Yes. This is also a challenge for using backups to migrate between physical deployments. The mitigating factor for backups is that restoring is already a procedure that comes with a large amount of downtime, so you could use SQL to fix up the cluster before trying to serve from it. Whereas the whole point of PCR is that cutover can be done with minimal downtime.
From a product standpoint, part of why PCR makes this more important is we want to use PCR to make it easy to migrate between regions, clouds, and serverless<->dedicated. So the locality labels have always been a pain, but the problem is becoming more acute.
We are constraining the scope of planned migrations to be from one cluster to another cluster with an identical multi-region (MR) configuration.
Physical cluster replication (PCR) replicates a virtual cluster's key space as-is. If the application uses multi-region primitives or controls placement via zone configurations, the replicated keyspace will contain the names of the source cluster's locality labels.
For example: adding a region to a database depends on the region locality label in the source cluster, and zone configs may depend on arbitrary labels.
Historically, we have encouraged locality labels that describe the underlying physical topology of the deployment. Using labels that describe the physical deployment doesn't work with PCR because an organization may want to use PCR to replicate a multi-region cluster to a different set of physical regions or to a different cloud.
For greenfield single-tenant deployments of CRDB, one option is to switch from picking locality labels based on the cloud region to picking logical label names that can stay consistent between two physical deployments. But this is insufficient for existing deployments that use locality labels based on physical locations, and for multi-tenant deployments where the locality labels are shared by multiple different virtual clusters.
PCR would be more flexible if we broke the link between what values can be specified via SQL and the values that are applied as locality labels. There are a few possible designs for breaking this link.
Virtual Cluster Configuration
We could lean into the separation between the physical cluster and virtual clusters. Virtual clusters would be allowed to use arbitrary region names and locality labels within the virtual cluster. When creating a virtual cluster, we would specify a mapping between the physical cluster locality labels and the virtual cluster region names.
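A minimal sketch of this idea, assuming a hypothetical mapping supplied at virtual-cluster creation time (all names below are illustrative, not an actual CRDB API):

```python
# Hypothetical sketch: translate a node's physical locality label into
# the logical region name the virtual cluster sees. The mapping would
# be specified when the virtual cluster is created.
PHYSICAL_TO_VIRTUAL = {
    "gcp-us-east1": "region-a",
    "gcp-us-west1": "region-b",
}

def virtual_region(physical_locality: str) -> str:
    """Resolve a physical region label to the virtual cluster's region name."""
    try:
        return PHYSICAL_TO_VIRTUAL[physical_locality]
    except KeyError:
        raise ValueError(f"no virtual region mapped for {physical_locality!r}")

# A PCR cutover to a different cloud would only need a new mapping on
# the destination cluster; the replicated key space is unchanged:
AWS_MAPPING = {
    "aws-us-east-1": "region-a",
    "aws-us-west-2": "region-b",
}
```

The key property is that the virtual cluster's SQL-visible names (`region-a`, `region-b`) stay stable across clouds, while only the per-cluster mapping changes.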
SQL Mapping
An alternative to defining the mapping at virtual cluster creation time is to define a mapping within the virtual cluster's key space. This could work if one logical name could map to one of several physical names.
Jira issue: CRDB-36398