Open msbutler opened 2 years ago
cc @cockroachdb/bulk-io
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Not working on this, but this is still a problem.
In a multi-tenant cluster, Restore's distSQL processors are assigned to SQL instances using the `sqlInstanceID`. Currently, the `splitAndScatterProcessor` routes a scattered range to a SQL instance running the `restoreProcessor` using the `nodeID` returned by the `adminScatterRequest`, which actually identifies a KV instance. In other words, to route ranges for restore ingestion after scatter, we currently assume the list of `sqlInstanceID`s from planning is identical to the `nodeID`s returned by split and scatter during execution, which is certainly not the case, implying multi-tenant restore could be significantly slower. If there are fewer KV instances than planned SQL instances, for example, a subset of SQL instances would never be sent any ranges to ingest!
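To make the mismatch concrete, here is a minimal, hypothetical sketch (plain Go, not CockroachDB's actual router or processor code; the `NodeID`/`SQLInstanceID` types and the stream map are illustrative assumptions) of what happens when the KV node ID returned by scatter is interpreted as a SQL instance ID:

```go
package main

import "fmt"

type NodeID int        // identifies a KV node (illustrative)
type SQLInstanceID int // identifies a SQL instance (illustrative)

func main() {
	// Planning assigned restore processors to these SQL instances.
	plannedInstances := []SQLInstanceID{1, 2, 3, 4}

	// Streams to restore processors are keyed by the planned SQL instance IDs.
	streams := map[SQLInstanceID][]string{}
	for _, id := range plannedInstances {
		streams[id] = nil
	}

	// AdminScatter returns KV node IDs; in a multi-tenant cluster there may be
	// fewer KV nodes than SQL instances, and the two ID spaces are unrelated.
	scatteredTo := map[string]NodeID{
		"range-a": 1, "range-b": 2, "range-c": 1, "range-d": 2,
	}

	// Current behavior: treat the KV node ID as if it were a SQL instance ID.
	for rng, nodeID := range scatteredTo {
		streams[SQLInstanceID(nodeID)] = append(streams[SQLInstanceID(nodeID)], rng)
	}

	// SQL instances 3 and 4 never receive any ranges to ingest.
	for _, id := range plannedInstances {
		fmt.Printf("sql instance %d ingests %v\n", id, streams[id])
	}
}
```

With four planned SQL instances but only two KV nodes, instances 3 and 4 never receive work, while 1 and 2 absorb the entire restore.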
In a non-multiregion multi-tenant cluster, we don't know (or even care) which SQL instance is "closest" to a given KV instance; thus, we ought to route ranges for ingestion such that we balance load across all available SQL instances. Solution: use a `hashRouter` as opposed to a `rangeRouter`. During planning, map each available KV node to a set of SQL instances. If the restore job detects significant churn of SQL instances, the job should be replanned.

In a multiregion multi-tenant cluster, we will likely want to route a range to a SQL instance that is "close" to the range's leaseholder (or at least a follower?). Solution: apply the solution above, by region.
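As a rough sketch of the hash-routing idea (the `pickInstance` helper and FNV hashing are illustrative assumptions, not CockroachDB's actual `hashRouter` implementation), each scattered range could be hashed onto the planned SQL instances so that ingestion load is spread regardless of how many KV nodes exist:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type SQLInstanceID int // identifies a SQL instance (illustrative)

// pickInstance hashes a range's key (here, a string identifier) onto the list
// of SQL instances planned for the restore, spreading load evenly instead of
// concentrating it on whichever KV node IDs scatter happened to return.
func pickInstance(rangeKey string, instances []SQLInstanceID) SQLInstanceID {
	h := fnv.New32a()
	h.Write([]byte(rangeKey))
	return instances[h.Sum32()%uint32(len(instances))]
}

func main() {
	planned := []SQLInstanceID{1, 2, 3, 4}
	for _, r := range []string{"range-a", "range-b", "range-c", "range-d", "range-e"} {
		fmt.Printf("%s -> sql instance %d\n", r, pickInstance(r, planned))
	}
}
```

For the multiregion case, the same idea could apply per region: partition the planned instances by region and hash only within the bucket for the region of the range's leaseholder; replanning on instance churn would amount to rebuilding the planned list.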
Jira issue: CRDB-16375