spanconfigmanager: a newly restored or cutover tenant waits 10 minutes before reconciliation job

msbutler commented 1 year ago

Once a tenant comes online after a c2c cutover or a tenant restore, its reconciliation job does not run for 10 minutes, which is the default value of spanconfig.reconciliation_job.check_interval. To understand why, consider the following timeline:

t0: the restore completes, restoring the tenant's jobs table (which includes the backed-up reconciliation job)
t1: the spanConfig manager observes a running reconciliation job on the first pass of its run loop, so it doesn't start a new reconciliation job
t2: a node attempts to start the backed-up reconciliation job, which immediately succeeds without doing any work, because the job began on a different cluster
t_10_minutes_later: the spanConfig manager realizes no reconciliation job is running and spins up a new one
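
To make the timeline concrete, here is a minimal simulation in Go. It is only a sketch: the `job` type, `resume`, and `firstReconcileMinute` are hypothetical names made up for illustration, not the actual spanconfigmanager code, and time is modeled as whole minutes rather than a real timer.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a row in the tenant's jobs table.
type job struct {
	createdOnCluster string // cluster that originally created the job
	running          bool
}

// resume models a node adopting the job (step t2). A job that began on a
// different cluster finishes immediately without reconciling anything.
func (j *job) resume(currentCluster string) (didWork bool) {
	if j.createdOnCluster != currentCluster {
		j.running = false
		return false
	}
	return true
}

func anyRunning(jobs []*job) bool {
	for _, j := range jobs {
		if j.running {
			return true
		}
	}
	return false
}

// firstReconcileMinute returns the simulated minute at which a reconciliation
// job that actually does work is running. The manager only checks once every
// checkInterval minutes, mirroring the check interval setting.
func firstReconcileMinute(jobs []*job, currentCluster string, checkInterval int) int {
	for minute := 0; ; minute += checkInterval {
		// Manager check (t1, and again at t_10_minutes_later): start a new
		// job only if none appears to be running.
		if !anyRunning(jobs) {
			return minute // the manager would create a fresh job here
		}
		// Between checks, nodes adopt whatever is marked as running.
		for _, j := range jobs {
			if j.running && j.resume(currentCluster) {
				return minute // a valid job is already reconciling
			}
		}
	}
}

func main() {
	restored := []*job{{createdOnCluster: "source", running: true}}
	fmt.Println("restored tenant, first reconcile at minute:",
		firstReconcileMinute(restored, "dest", 10))
	fmt.Println("fresh tenant, first reconcile at minute:",
		firstReconcileMinute(nil, "dest", 10))
}
```

The restored job counts as "running" at the first check, so the manager skips creating a new one, and the gap is only noticed at the next check.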

It's worth noting this bug does not affect vanilla cluster restores, as we test here, because the restoring cluster begins a reconciliation job before the restore begins.

To get rid of this 10-minute wait, I propose adding a new manager.start method here which checks for and cancels any span config job started on a previous cluster.
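
A rough sketch of what such a startup check could look like, written against an in-memory slice rather than the real jobs table. Everything here (`jobRecord`, its fields, `cancelStaleReconciliationJobs`) is hypothetical and only illustrates the idea. Note that it has to look at every job row, which is the cost of this approach.

```go
package main

import "fmt"

// jobRecord is a hypothetical, simplified view of a jobs table row.
type jobRecord struct {
	id        int64
	jobType   string // e.g. "reconciliation"
	clusterID string // cluster on which the job was created
	status    string // "running" or "canceled"
}

// cancelStaleReconciliationJobs marks every running reconciliation job that
// was created on a different cluster as canceled, and returns how many jobs
// it canceled. It scans the whole slice.
func cancelStaleReconciliationJobs(jobs []jobRecord, currentClusterID string) int {
	canceled := 0
	for i := range jobs {
		j := &jobs[i]
		if j.jobType == "reconciliation" && j.status == "running" &&
			j.clusterID != currentClusterID {
			j.status = "canceled"
			canceled++
		}
	}
	return canceled
}

func main() {
	jobs := []jobRecord{
		{id: 1, jobType: "reconciliation", clusterID: "source", status: "running"},
		{id: 2, jobType: "backup", clusterID: "source", status: "running"},
	}
	n := cancelStaleReconciliationJobs(jobs, "dest")
	fmt.Println("canceled:", n, "job 1 status:", jobs[0].status)
}
```

With the stale job canceled at startup, the manager's first check would find no running reconciliation job and could start a new one right away.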

UPDATE: after chatting with @dt, we don't think scanning the jobs table at tenant startup is a great idea. If there's a fix that avoids that, like assigning a constant job ID to the span config job, that would be preferable.

Jira issue: CRDB-31096

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/disaster-recovery

dt commented 1 year ago

another option would be to just put the job at a well-known, constant ID so we can create-if-not-exists without a scan in start, and then get rid of the exit-on-wrong-id check?
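
A minimal sketch of the constant-ID idea, using a map as a stand-in for the jobs table. The ID value, `jobTable`, and `ensureReconciliationJob` are all assumptions for illustration; nothing here reflects the real jobs API.

```go
package main

import "fmt"

// reconciliationJobID is a hypothetical well-known ID; the real value would
// have to be chosen so it cannot collide with normally generated job IDs.
const reconciliationJobID int64 = 1

// jobTable is a stand-in for the jobs table: job ID -> status.
type jobTable map[int64]string

// ensureReconciliationJob creates the reconciliation job at the well-known ID
// if no row exists there. It is a single point lookup, so no scan is needed.
// It reports whether a new row was created.
func ensureReconciliationJob(t jobTable) (created bool) {
	if _, ok := t[reconciliationJobID]; ok {
		return false
	}
	t[reconciliationJobID] = "running"
	return true
}

func main() {
	fresh := jobTable{}
	fmt.Println("fresh tenant created job:", ensureReconciliationJob(fresh))

	// A restored tenant already has the row at the same ID from the backup.
	restored := jobTable{reconciliationJobID: "running"}
	fmt.Println("restored tenant created job:", ensureReconciliationJob(restored))
}
```

In the restored case the existing row is simply reused, which only helps if the job no longer exits when it finds itself on a different cluster, hence the second half of the suggestion.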

msbutler commented 1 year ago

Rumor has it that @cockroachdb/sql-foundations owns the reconciliation job now. Tagging them. I'd prioritize this as a "really nice to have" for 23.2.

msbutler commented 5 months ago

Unassigning myself as I don't plan to work on this.

cockroachdb / cockroach #109771