cockroachdb / cockroach


backupccl: prepare RESTORE router for multitenancy #81989

Open | msbutler opened this issue 2 years ago

msbutler commented 2 years ago

In a multitenant cluster, RESTORE's distSQL processors are assigned to SQL instances using the sqlInstanceID. Currently, the splitAndScatterProcessor routes a scattered range to the SQL instance running the restoreProcessor using the nodeID returned by the adminScatterRequest, which actually identifies a KV node. In other words, to route ranges for restore ingestion after scatter, we currently assume that the list of sqlInstanceIDs from planning is identical to the nodeIDs returned by split and scatter during execution, which is certainly not the case, implying that multitenant restore could be significantly slower. If there are fewer KV nodes than planned SQL instances, for example, a subset of SQL instances would never be sent any ranges to ingest!
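
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical Go types, not the actual backupccl code) of the assumption described above: the KV node ID returned by the scatter request is used directly as the routing key for a restore processor's SQL instance, so with fewer KV nodes than planned SQL instances, some instances never receive work.

```go
// A simplified illustration (hypothetical types, not the actual backupccl
// code) of the assumption described above: the KV node ID returned by the
// scatter request is used directly as the key for routing to a restore
// processor's SQL instance.
package main

import "fmt"

type nodeID int
type sqlInstanceID int

func main() {
	// SQL instances chosen at planning time in a multitenant cluster.
	planned := []sqlInstanceID{1, 2, 3, 4}

	// KV node IDs returned by scatter during execution; only two KV nodes here.
	scatteredTo := []nodeID{1, 2, 1, 2}

	seen := map[sqlInstanceID]bool{}
	for _, n := range scatteredTo {
		// The problematic assumption: treat the KV nodeID as a sqlInstanceID.
		dst := sqlInstanceID(n)
		seen[dst] = true
		fmt.Printf("range scattered to kv node %d -> routed to sql instance %d\n", n, dst)
	}
	for _, inst := range planned {
		if !seen[inst] {
			fmt.Printf("sql instance %d never receives a range to ingest\n", inst)
		}
	}
}
```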

In a non-multiregion multitenant cluster, we don't know (or even care) which SQL instance is "closest" to a given KV node; thus, we ought to route ranges for ingestion so that load is balanced across all available SQL instances.
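
A minimal sketch of that idea, assuming a simple round-robin policy and hypothetical stand-in types (this is not CockroachDB's router implementation): scattered ranges are handed out across all available SQL instances regardless of which KV node they were scattered to.

```go
// A round-robin sketch (hypothetical types, not CockroachDB's router code)
// of balancing scattered ranges across all available SQL instances instead
// of assuming nodeID == sqlInstanceID.
package main

import "fmt"

type nodeID int
type sqlInstanceID int

type scatteredRange struct {
	startKey    string
	leaseholder nodeID // KV node returned by the scatter request
}

// balancedRouter hands ranges out round-robin so every SQL instance
// receives work even when there are fewer KV nodes than SQL instances.
type balancedRouter struct {
	instances []sqlInstanceID
	next      int
}

func (r *balancedRouter) route(rng scatteredRange) sqlInstanceID {
	dst := r.instances[r.next%len(r.instances)]
	r.next++
	return dst
}

func main() {
	r := &balancedRouter{instances: []sqlInstanceID{1, 2, 3, 4}}
	ranges := []scatteredRange{
		{startKey: "/Table/100/1", leaseholder: 1},
		{startKey: "/Table/100/2", leaseholder: 1},
		{startKey: "/Table/100/3", leaseholder: 2},
		{startKey: "/Table/100/4", leaseholder: 2},
	}
	// Only two KV nodes, but all four SQL instances still get a range.
	for _, rng := range ranges {
		fmt.Printf("range %s -> sql instance %d\n", rng.startKey, r.route(rng))
	}
}
```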

In a multiregion multitenant cluster, we will likely want to route a range to a SQL instance that is "close" to the range's leaseholder (or at least to a follower?). Solution: apply the approach above, but per region.
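
One possible shape of that per-region routing, again as a hedged sketch with hypothetical types rather than a definitive design: ranges are balanced across the SQL instances in the leaseholder's region, falling back to all instances when that region has no SQL instance.

```go
// A per-region variant of the sketch above (again hypothetical types, not a
// definitive design): balance ranges across the SQL instances that share a
// region with the range's leaseholder, and fall back to all instances when
// that region has no SQL instance.
package main

import "fmt"

type region string

type sqlInstance struct {
	id     int
	locale region
}

type scatteredRange struct {
	startKey          string
	leaseholderRegion region // assumed to be derivable from the scatter response
}

type regionAwareRouter struct {
	byRegion map[region][]sqlInstance
	all      []sqlInstance
	next     map[region]int // per-region round-robin cursors
	nextAll  int
}

func newRegionAwareRouter(instances []sqlInstance) *regionAwareRouter {
	r := &regionAwareRouter{
		byRegion: map[region][]sqlInstance{},
		all:      instances,
		next:     map[region]int{},
	}
	for _, inst := range instances {
		r.byRegion[inst.locale] = append(r.byRegion[inst.locale], inst)
	}
	return r
}

func (r *regionAwareRouter) route(rng scatteredRange) sqlInstance {
	if insts := r.byRegion[rng.leaseholderRegion]; len(insts) > 0 {
		// Round-robin within the leaseholder's region.
		dst := insts[r.next[rng.leaseholderRegion]%len(insts)]
		r.next[rng.leaseholderRegion]++
		return dst
	}
	// No SQL instance in that region: balance across all instances instead.
	dst := r.all[r.nextAll%len(r.all)]
	r.nextAll++
	return dst
}

func main() {
	rtr := newRegionAwareRouter([]sqlInstance{
		{id: 1, locale: "us-east1"},
		{id: 2, locale: "us-east1"},
		{id: 3, locale: "us-west1"},
	})
	for _, rng := range []scatteredRange{
		{startKey: "/Table/100/1", leaseholderRegion: "us-east1"},
		{startKey: "/Table/100/2", leaseholderRegion: "us-west1"},
		{startKey: "/Table/100/3", leaseholderRegion: "eu-west1"}, // no local instance
	} {
		fmt.Printf("range %s -> sql instance %d\n", rng.startKey, rtr.route(rng).id)
	}
}
```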

Jira issue: CRDB-16375

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/bulk-io

github-actions[bot] commented 12 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

msbutler commented 11 months ago

I'm not working on this, but it is still a problem.