cockroachdb / cockroach

Changefeed failures seen on drt-chaos internal CRDB cluster #127839

Open nameisbhaskar opened 3 months ago

nameisbhaskar commented 3 months ago

We have seen test changefeeds fail on our internal drt-chaos cluster. The failure is:

CHANGEFEED job 979215929488343050: stepping through state reverting with unexpected error: received boundary timestamp 1721711057.579153634,0 < 1721716400.060881844,0 of type ‹BACKFILL› before reaching existing boundary of type ‹RESTART›
(1) assertion failure
Wraps: (2) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*schemaChangeFrontier).ForwardResolvedSpan
  |     github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:1932
  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*changeFrontier).forwardFrontier
  |     github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:1542
  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*changeFrontier).noteAggregatorProgress
  |     github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:1533
  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*changeFrontier).Next
  |     github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:1494
  | github.com/cockroachdb/cockroach/pkg/sql/execinfra.Run
  |     github.com/cockroachdb/cockroach/pkg/sql/execinfra/base.go:197
  | github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).Run
  |     github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:732
  | github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run
  |     github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:579
  | github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run
  |     github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:928
  | github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.startDistChangefeed.func1
  |     github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_dist.go:328
  | github.com/cockroachdb/cockroach/pkg/util/ctxgroup.GoAndWait.Group.GoCtx.func1
  |     github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168
  | golang.org/x/sync/errgroup.(*Group).Go.func1
  |     golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
  | runtime.goexit
  |     src/runtime/asm_amd64.s:1695
Wraps: (3) received boundary timestamp 1721711057.579153634,0 < 1721716400.060881844,0 of type ‹BACKFILL› before reaching existing boundary of type ‹RESTART›
Error types: (1) *assert.withAssertionFailure (2) *withstack.withStack (3) *errutil.leafError
HINT: ‹You have encountered an unexpected error.›
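
For anyone correlating this with cluster logs, the two boundary timestamps in the error can be turned into wall-clock times. The snippet below is a minimal sketch, assuming the printed form is `<seconds>.<nanoseconds>,<logical>` with a nine-digit nanosecond part; `parseHLC` is a hypothetical helper written for this comment, not a function from the cockroach repository. By plain subtraction the BACKFILL boundary is roughly 89 minutes older than the RESTART boundary.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseHLC (hypothetical helper) converts a timestamp printed as
// "<seconds>.<nanos>,<logical>" into a UTC wall-clock time plus the
// logical counter. The nine-digit nanosecond part is an assumption
// based on the values shown in the error message.
func parseHLC(s string) (time.Time, int, error) {
	wall, logicalStr, ok := strings.Cut(s, ",")
	if !ok {
		return time.Time{}, 0, fmt.Errorf("missing logical component in %q", s)
	}
	secStr, nanoStr, ok := strings.Cut(wall, ".")
	if !ok {
		return time.Time{}, 0, fmt.Errorf("missing nanosecond component in %q", s)
	}
	sec, err := strconv.ParseInt(secStr, 10, 64)
	if err != nil {
		return time.Time{}, 0, err
	}
	nanos, err := strconv.ParseInt(nanoStr, 10, 64)
	if err != nil {
		return time.Time{}, 0, err
	}
	logical, err := strconv.Atoi(logicalStr)
	if err != nil {
		return time.Time{}, 0, err
	}
	return time.Unix(sec, nanos).UTC(), logical, nil
}

func main() {
	// Both values are copied from the assertion failure above.
	backfill, _, err := parseHLC("1721711057.579153634,0")
	if err != nil {
		panic(err)
	}
	restart, _, err := parseHLC("1721716400.060881844,0")
	if err != nil {
		panic(err)
	}
	fmt.Println("BACKFILL boundary:", backfill.Format(time.RFC3339Nano))
	fmt.Println("RESTART boundary: ", restart.Format(time.RFC3339Nano))
	fmt.Println("gap:", restart.Sub(backfill))
}
```

The printed times give the window in which to look for schema change events in the logs.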

More details can be found in the Slack thread: https://cockroachlabs.slack.com/archives/C05FHJJ0MD0/p1721805027341599

Jira issue: CRDB-40631

Epic CRDB-41785

blathers-crl[bot] commented 3 months ago

Hi @nameisbhaskar, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 3 months ago

cc @cockroachdb/cdc

rharding6373 commented 2 months ago

@nameisbhaskar do you know if there were any schema changes around this time on the DRT cluster and what they were? Is there somewhere the logs are still available that you can point us to?

Our hypothesis is that there were two schema changes: one for a primary index change (which triggered the restart boundary) and one for a column change (an add or a delete, which triggered the backfill boundary). For some reason the kvFeed processed these in a different order than the timestamps of the schema change events, so the backfill boundary, which has the earlier timestamp, arrived after the restart boundary had already been recorded.
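
To make the hypothesis concrete, here is a minimal sketch of the kind of ordering check that would produce this error. It is not the actual `schemaChangeFrontier` code: the `frontier`, `resolved`, and `replay` names, and the rule that a boundary may never be older than the one already recorded, are simplifying assumptions for illustration. The timestamps are the two values from the error, expressed in nanoseconds.

```go
package main

import "fmt"

// boundaryType is a simplified stand-in for the boundary kinds named in
// the error message.
type boundaryType int

const (
	boundaryNone boundaryType = iota
	boundaryBackfill
	boundaryRestart
)

func (b boundaryType) String() string {
	switch b {
	case boundaryBackfill:
		return "BACKFILL"
	case boundaryRestart:
		return "RESTART"
	default:
		return "NONE"
	}
}

// resolved is a hypothetical resolved-span event carrying a boundary.
type resolved struct {
	tsNanos  int64
	boundary boundaryType
}

// frontier (hypothetical) remembers the most recent boundary it has seen.
type frontier struct {
	boundaryTS int64
	boundary   boundaryType
}

// forward records a boundary and rejects one that is older than the
// boundary already recorded (assumed rule, for illustration only).
func (f *frontier) forward(r resolved) error {
	if r.boundary == boundaryNone {
		return nil
	}
	if f.boundary != boundaryNone && r.tsNanos < f.boundaryTS {
		return fmt.Errorf(
			"received boundary timestamp %d < %d of type %s before reaching existing boundary of type %s",
			r.tsNanos, f.boundaryTS, r.boundary, f.boundary)
	}
	f.boundaryTS, f.boundary = r.tsNanos, r.boundary
	return nil
}

// replay feeds events to a fresh frontier and returns the first error.
func replay(events []resolved) error {
	f := &frontier{}
	for _, e := range events {
		if err := f.forward(e); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	backfill := resolved{tsNanos: 1721711057579153634, boundary: boundaryBackfill}
	restart := resolved{tsNanos: 1721716400060881844, boundary: boundaryRestart}

	// Events delivered in timestamp order are accepted.
	fmt.Println("timestamp order:", replay([]resolved{backfill, restart}))
	// Events delivered in the hypothesized order trip the check.
	fmt.Println("reversed order: ", replay([]resolved{restart, backfill}))
}
```

In this sketch only the delivery order differs between the two runs, which is why the schema change history and logs for that window would help confirm or rule out the hypothesis.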