cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

spanconfigmanager: a newly restored or cutover tenant waits 10 minutes before reconciliation job #109771

Open msbutler opened 1 year ago

msbutler commented 1 year ago

After a tenant comes online following a c2c cutover or a tenant restore, its span config reconciliation job does not run for 10 minutes, which is the default value of spanconfig.reconciliation_job.check_interval. To understand why, consider the following timeline:

It's worth noting this bug does not affect vanilla cluster restores, as we test here, because the restoring cluster begins a reconciliation job before the restore begins.

To get rid of this 10-minute wait, I propose adding a new manager.start method here which checks for and cancels any span config job started on a previous cluster.

UPDATE: after chatting with @dt, we don't think scanning the job table at tenant startup is a great idea. If there's a fix that avoids that, like applying a constant jobID to the span config job, that would be preferable.

Jira issue: CRDB-31096

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/disaster-recovery

dt commented 1 year ago

another option would be to just put the job at a well-known, constant ID so we can create-if-not-exists without a scan in start, and then get rid of the exit-on-wrong-id check?
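
To make the create-if-not-exists idea concrete, here is a minimal Go sketch. It is not CockroachDB code: `jobRegistry`, `createIfNotExists`, and the value of `reconciliationJobID` are all hypothetical, and the in-memory map only stands in for the jobs table. The point it illustrates is that with a constant ID, startup can do a single point lookup instead of scanning for an existing reconciliation job.

```go
package main

import (
	"fmt"
	"sync"
)

// reconciliationJobID is a hypothetical well-known ID, chosen here only for
// illustration. The real value would have to be picked by the job's owners.
const reconciliationJobID int64 = 1

// jobRegistry is a toy in-memory stand-in for the jobs table.
type jobRegistry struct {
	mu   sync.Mutex
	jobs map[int64]string // job ID -> description
}

func newJobRegistry() *jobRegistry {
	return &jobRegistry{jobs: make(map[int64]string)}
}

// createIfNotExists records a job at id unless one is already present.
// It reports whether a new job was created. Because the ID is known up
// front, this is a point lookup rather than a scan over all jobs.
func (r *jobRegistry) createIfNotExists(id int64, desc string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.jobs[id]; ok {
		return false
	}
	r.jobs[id] = desc
	return true
}

func main() {
	reg := newJobRegistry()
	// First startup: no job exists at the well-known ID, so one is created.
	fmt.Println(reg.createIfNotExists(reconciliationJobID, "span config reconciliation"))
	// Later startup: the job is already there, so the call is a no-op.
	fmt.Println(reg.createIfNotExists(reconciliationJobID, "span config reconciliation"))
}
```

In this sketch a second call at the same ID leaves the existing record untouched, which is the property that would let every startup call it unconditionally.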

msbutler commented 1 year ago

Rumor has it that @cockroachdb/sql-foundations owns the reconciliation job now. Tagging them. I'd prioritize this as a "really nice to have" for 23.2.

msbutler commented 5 months ago

Unassigning myself as I don't plan to work on this.