Open msbutler opened 1 year ago
cc @cockroachdb/disaster-recovery
another option would be to just put the job at a well-known, constant ID so we can create-if-not-exists without a scan in start, and then get rid of the exit-on-wrong-id check?
Rumor has it that @cockroachdb/sql-foundations owns the reconciliation job now. Tagging them. I'd prioritize this as a "really nice to have" for 23.2.
unassigning myself as i don't plan to work on this
Once a tenant comes online after c2c cutover or a tenant restore, its reconciliation job does not run for 10 minutes, the default value of spanconfig.reconciliation_job.check_interval. To understand why, consider the following timeline:
It's worth noting this bug does not affect vanilla cluster restores, as we test here, because the restoring cluster starts a reconciliation job before the restore begins.
To get rid of this 10-minute wait, I propose adding a new manager.start method here that checks for and cancels any span config job started on a previous cluster.
UPDATE: after chatting with @dt, we don't think scanning the job table at tenant startup is a great idea. If there's a fix that avoids that, like applying a constant jobID to the span config job, that would be preferable.
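For illustration, here is a minimal sketch of the constant-ID, create-if-not-exists idea. It assumes an in-memory map as a stand-in for the jobs table. Every name in it (jobRegistry, createIfNotExists, reconciliationJobID and its value) is hypothetical and is not a CockroachDB API. It only shows that a keyed lookup on a well-known ID can replace a scan at startup.

```go
package main

import "sync"

// reconciliationJobID is a hypothetical well-known ID chosen for this sketch;
// the issue does not specify an actual value.
const reconciliationJobID int64 = 1

// jobRegistry is an in-memory stand-in for the jobs table (an assumption
// made for illustration only).
type jobRegistry struct {
	mu   sync.Mutex
	jobs map[int64]string // job ID -> description
}

func newJobRegistry() *jobRegistry {
	return &jobRegistry{jobs: make(map[int64]string)}
}

// createIfNotExists stores a job under id unless one is already present.
// It reports whether this call created the job. The check is a single keyed
// lookup, so no scan over existing jobs is needed.
func (r *jobRegistry) createIfNotExists(id int64, desc string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.jobs[id]; ok {
		return false // a job already occupies the well-known ID
	}
	r.jobs[id] = desc
	return true
}
```

Calling createIfNotExists(reconciliationJobID, ...) twice on the same registry creates the job on the first call and is a no-op on the second. Whether a job record carried over from a previous cluster can simply be adopted at that ID is a separate question that this sketch does not address.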
Jira issue: CRDB-31096