Open msbutler opened 2 years ago
cc @cockroachdb/bulk-io
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Not working on this, but this is still a problem.
In a multi-tenant cluster, Restore's distSQL processors are assigned to SQL instances using the `sqlInstanceID`. Currently, the `splitAndScatterProcessor` routes a scattered range to a SQL instance running the `restoreProcessor` using the `nodeID` returned by the `adminScatterRequest`, which actually identifies a KV instance. In other words, to route ranges for restore ingestion after scatter, we currently assume the list of `sqlInstanceID`s from planning is identical to the `nodeID`s returned by split and scatter during execution, which is certainly not the case, implying multi-tenant restore could be significantly slower. If there are fewer KV instances than planned SQL instances, for example, a subset of SQL instances would never be sent any ranges to ingest!
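To make the mismatch concrete, here is a minimal, hypothetical sketch (plain Go, not CockroachDB's actual router or processor code; the `NodeID`/`SQLInstanceID` types and the stream map are illustrative assumptions) of what happens when the KV node ID returned by scatter is interpreted as a SQL instance ID:

```go
package main

import "fmt"

type NodeID int        // identifies a KV node (illustrative)
type SQLInstanceID int // identifies a SQL instance (illustrative)

func main() {
	// Planning assigned restore processors to these SQL instances.
	plannedInstances := []SQLInstanceID{1, 2, 3, 4}

	// Streams to restore processors are keyed by the planned SQL instance IDs.
	streams := map[SQLInstanceID][]string{}
	for _, id := range plannedInstances {
		streams[id] = nil
	}

	// AdminScatter returns KV node IDs; in a multi-tenant cluster there may be
	// fewer KV nodes than SQL instances, and the two ID spaces are unrelated.
	scatteredTo := map[string]NodeID{
		"range-a": 1, "range-b": 2, "range-c": 1, "range-d": 2,
	}

	// Current behavior: treat the KV node ID as if it were a SQL instance ID.
	for rng, nodeID := range scatteredTo {
		streams[SQLInstanceID(nodeID)] = append(streams[SQLInstanceID(nodeID)], rng)
	}

	// SQL instances 3 and 4 never receive any ranges to ingest.
	for _, id := range plannedInstances {
		fmt.Printf("sql instance %d ingests %v\n", id, streams[id])
	}
}
```

With four planned SQL instances but only two KV nodes, instances 3 and 4 never receive work, while 1 and 2 absorb the entire restore.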
In a non-multiregion multi-tenant cluster, we don't know (or even care) which SQL instance is "closest" to a given KV instance; thus, we ought to route ranges for ingestion such that we balance load across all available SQL instances. Solution: use a `hashRouter` as opposed to a `rangeRouter`. During planning, map each available KV node to a set of SQL instances. If the restore job detects significant churn of SQL instances, the job should be replanned.

In a multiregion multi-tenant cluster, we will likely want to route a range to a SQL instance that is "close" to the range's leaseholder (or at least a follower?). Solution: apply the solution above, by region.
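As a rough sketch of the hash-routing idea (the `pickInstance` helper and FNV hashing are illustrative assumptions, not CockroachDB's actual `hashRouter` implementation), each scattered range could be hashed onto the planned SQL instances so that ingestion load is spread regardless of how many KV nodes exist:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type SQLInstanceID int // identifies a SQL instance (illustrative)

// pickInstance hashes a range's key (here, a string identifier) onto the list
// of SQL instances planned for the restore, spreading load evenly instead of
// concentrating it on whichever KV node IDs scatter happened to return.
func pickInstance(rangeKey string, instances []SQLInstanceID) SQLInstanceID {
	h := fnv.New32a()
	h.Write([]byte(rangeKey))
	return instances[h.Sum32()%uint32(len(instances))]
}

func main() {
	planned := []SQLInstanceID{1, 2, 3, 4}
	for _, r := range []string{"range-a", "range-b", "range-c", "range-d", "range-e"} {
		fmt.Printf("%s -> sql instance %d\n", r, pickInstance(r, planned))
	}
}
```

For the multiregion case, the same idea could apply per region: partition the planned instances by region and hash only within the bucket for the region of the range's leaseholder; replanning on instance churn would amount to rebuilding the planned list.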
Jira issue: CRDB-16375