sql,jobs: large index backfill with short GC TTL gets stuck during backfill

irfansharif commented 1 year ago

Describe the problem

We see the following when trying to create an index on the stock TPC-E dataset, which uses a GC TTL of 300s. The job keeps retrying but never succeeds: it retries with the same batch timestamp even as the replica GC threshold is raised higher and higher.

W221230 18:47:13.084001 9426 sql/schemachanger/scrun/scrun.go:193 ⋮    [n1,job=‹NEW SCHEMA CHANGE id=826915012625170433›] 298  failed executing declarative schema change PostCommitPhase stage 1 of 6 with 1 BackfillType op (rollback=false) for CREATE INDEX with error: failed to construct index entries during backfill: ‹batch timestamp 1672424835.082708902,0 must be after replica GC threshold 1672425215.129026617,0›
W221230 18:48:26.697789 1039706 sql/schemachanger/scrun/scrun.go:193 ⋮ [n1,job=‹NEW SCHEMA CHANGE id=826915012625170433›] 323  failed executing declarative schema change PostCommitPhase stage 1 of 6 with 1 BackfillType op (rollback=false) for CREATE INDEX with error: failed to construct index entries during backfill: ‹batch timestamp 1672424835.082708902,0 must be after replica GC threshold 1672425343.157133886,0›
W221230 18:58:40.287142 3312921 sql/schemachanger/scrun/scrun.go:193 ⋮ [n1,job=‹NEW SCHEMA CHANGE id=826915012625170433›] 369  failed executing declarative schema change PostCommitPhase stage 1 of 6 with 1 BackfillType op (rollback=false) for CREATE INDEX with error: failed to construct index entries during backfill: ‹batch timestamp 1672424835.082708902,0 must be after replica GC threshold 1672425823.166890410,0›
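
To make the gap in the log lines above concrete, here is a small Go sketch that subtracts the (unchanging) batch timestamp from each reported GC threshold. The helper names `wallTime` and `lag` are made up for illustration; only the timestamps come from the logs. The batch timestamp trails the threshold by roughly 380s, 508s, and 988s, each already beyond the 300s GC TTL.

```go
package main

import (
	"fmt"
	"time"
)

// wallTime converts a "seconds.nanoseconds" pair copied from the log into a time.Time.
// (Hypothetical helper, not a CockroachDB API.)
func wallTime(sec, nsec int64) time.Time { return time.Unix(sec, nsec) }

// lag reports how far the batch timestamp trails the replica GC threshold.
func lag(batch, threshold time.Time) time.Duration { return threshold.Sub(batch) }

func main() {
	// The batch timestamp is identical in all three log lines.
	batch := wallTime(1672424835, 82708902)
	thresholds := []time.Time{
		wallTime(1672425215, 129026617),
		wallTime(1672425343, 157133886),
		wallTime(1672425823, 166890410),
	}
	for _, th := range thresholds {
		fmt.Printf("batch timestamp trails GC threshold by %s\n",
			lag(batch, th).Round(time.Second))
	}
}
```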

To Reproduce

Run the roachtest from https://github.com/cockroachdb/cockroach/pull/89324.

Expected behavior

Looks like recently we started using PTS records during the validation phase (https://github.com/cockroachdb/cockroach/pull/89540). Do we want to use PTS records for the backfill phase?
@ajwerner suggested either resetting the backoff in the jobs layer or retrying internally on these errors. It should be fine for the job to have to resume and scan at a later point since it's robust to that.
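
A rough sketch of the "retry internally" option, assuming the backfiller can pick a fresh read timestamp for each attempt. Everything here is hypothetical: `errBelowGCThreshold`, `backfillChunk`, and `runWithFreshTimestamp` are stand-ins for illustration and are not actual CockroachDB identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// errBelowGCThreshold is a hypothetical stand-in for the
// "batch timestamp ... must be after replica GC threshold" error.
var errBelowGCThreshold = errors.New("batch timestamp must be after replica GC threshold")

// backfillChunk is a hypothetical callback that scans one chunk at readTS.
type backfillChunk func(readTS int64) error

// runWithFreshTimestamp retries the chunk with a newly chosen read timestamp
// whenever the previous attempt failed because its timestamp was below the
// GC threshold. Any other error is returned immediately.
func runWithFreshTimestamp(now func() int64, chunk backfillChunk, maxAttempts int) (int64, error) {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		readTS := now() // pick a new timestamp instead of reusing the stale one
		if err = chunk(readTS); err == nil {
			return readTS, nil
		}
		if !errors.Is(err, errBelowGCThreshold) {
			return 0, err
		}
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	clock := int64(100)
	now := func() int64 { clock++; return clock }
	// Simulated replica: rejects reads at or below a GC threshold of 102.
	chunk := func(ts int64) error {
		if ts <= 102 {
			return errBelowGCThreshold
		}
		return nil
	}
	ts, err := runWithFreshTimestamp(now, chunk, 5)
	fmt.Println(ts, err)
}
```

The point is only that each attempt re-reads the clock, unlike the retries in the logs above, which reuse the original batch timestamp.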

Additional data / screenshots

Some internal discussion here.

Jira issue: CRDB-25185

blathers-crl[bot] commented 1 year ago

Hi @irfansharif, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

ajwerner commented 1 year ago

In general, it seems like, on some level, installing a blanket protected timestamp is the wrong answer. A backfill can go on for a long time, and it does checkpoint regularly. So long as the checkpoint interval is shorter than the GC TTL, the backfill should make progress. Perhaps the right compromise is to install a protected timestamp that we hoist on each checkpoint to now minus one or two checkpoint intervals. That's relatively complex.

In practice, I don't expect most GC TTLs to be shorter than a checkpoint interval (60s), but I could be wrong.
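
The hoisting idea above could look roughly like the following. This is a minimal sketch of the arithmetic only, using plain Unix seconds; `hoistProtectedTimestamp` is a hypothetical name and nothing here touches the real protected timestamp subsystem.

```go
package main

import "fmt"

// hoistProtectedTimestamp returns the protected timestamp to use after a
// checkpoint: now minus lagIntervals checkpoint intervals, but never earlier
// than the timestamp that is already protected. All values are Unix seconds.
// (Hypothetical helper for illustration.)
func hoistProtectedTimestamp(current, now, checkpointInterval, lagIntervals int64) int64 {
	candidate := now - lagIntervals*checkpointInterval
	if candidate > current {
		return candidate
	}
	return current
}

func main() {
	const interval = 60 // checkpoint interval in seconds, as mentioned above
	start := int64(1672424835)
	protected := start
	for i := int64(1); i <= 5; i++ {
		now := start + i*interval
		protected = hoistProtectedTimestamp(protected, now, interval, 2)
		fmt.Printf("checkpoint %d: protected timestamp trails now by %ds\n", i, now-protected)
	}
}
```

With a lag of two intervals, the protected timestamp never trails the current time by more than 120s once the backfill is under way, so old versions are held back only briefly rather than for the whole backfill.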

ajwerner commented 1 year ago

The band-aid I'd propose here is to clear the backoff on the index backfill after some number of minutes of running.
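
As a sketch of that band-aid, assuming an exponential backoff between job retries: if the previous attempt ran for at least some threshold before failing, start the backoff over instead of doubling it. The function name, the doubling policy, and the durations are all assumptions for illustration, not the jobs layer's actual behavior.

```go
package main

import (
	"fmt"
	"time"
)

// nextBackoff returns the delay before the next retry. If the previous
// attempt ran for at least resetAfter, the backoff is cleared back to
// initial; otherwise it doubles, capped at max.
// (Hypothetical helper for illustration.)
func nextBackoff(prev, initial, max, ranFor, resetAfter time.Duration) time.Duration {
	if ranFor >= resetAfter {
		return initial
	}
	next := prev * 2
	if next < initial {
		next = initial
	}
	if next > max {
		next = max
	}
	return next
}

func main() {
	backoff := 30 * time.Second
	// An attempt that failed quickly keeps backing off.
	fmt.Println(nextBackoff(backoff, time.Second, 5*time.Minute, 10*time.Second, 10*time.Minute))
	// An attempt that ran for 15 minutes before failing clears the backoff.
	fmt.Println(nextBackoff(backoff, time.Second, 5*time.Minute, 15*time.Minute, 10*time.Minute))
}
```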

cockroachdb / cockroach

sql,jobs: large index backfill with short GC TTL gets stuck during backfill #98311