cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

sql: schema change repeatedly retries with gcttl error #126260

Open itsbilal opened 4 months ago

itsbilal commented 4 months ago

On the drt-chaos test cluster running V24.2.0-ALPHA.00000000-DEV-5AFD790501E946EF306ABE2B592C5798C29C342F, a schema change for ALTER TABLE cct_tpcc.public.order_line DROP COLUMN add_column_op_2902590426 CASCADE has been running nonstop and is being repeatedly retried.

[Image: Screenshot 2024-06-26 at 7 56 40 PM]

Link to the job

Looking at the logs, we see the job failing with the error below. For reference, the GC TTL on this database/table is 4 hours.

job 979031533120225281: running execution encountered retriable error: failed to construct index entries during backfill: batch timestamp 1718847123.942402651,0 must be after replica GC threshold 1719379269.625591541,0
(1) forced error mark
  | ‹"retriable job error"›
  | github.com/cockroachdb/errors/withstack/*withstack.withStack::
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).runBackfill.func1
  |     github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:319
  | github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).runBackfill.Group.GoCtx.func3
  |     github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168
  | golang.org/x/sync/errgroup.(*Group).Go.func1
  |     golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
  | runtime.goexit
  |     src/runtime/asm_amd64.s:1695
Wraps: (3) failed to construct index entries during backfill
Wraps: (4) batch timestamp 1718847123.942402651,0 must be after replica GC threshold 1719379269.625591541,0
Error types: (1) *markers.withMark (2) *withstack.withStack (3) *errutil.withPrefix (4) *kvpb.BatchTimestampBeforeGCError

Jira issue: CRDB-39823

blathers-crl[bot] commented 4 months ago

Hi @itsbilal, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

fqazi commented 4 months ago

We are running into two problems in this scenario:

1) We always clear the protected timestamp, even if a retryable error is hit. See: https://github.com/cockroachdb/cockroach/blob/c5522cee53952df1558d77b9a4bd830c3cfbe821/pkg/sql/index_backfiller.go#L86-L91
2) The readAsOf timestamp does not properly take the current time into account. If a retry happens, it will assume that GC TTL * 0.8 has to elapse again: https://github.com/cockroachdb/cockroach/blob/c5522cee53952df1558d77b9a4bd830c3cfbe821/pkg/jobs/jobsprotectedts/jobs_protected_ts_manager.go#L129

rafiss commented 4 months ago

@Dedej-Bergin I'll assign this to you as a bugfix/improvement that would be nice to land, but it's not highly urgent.