cockroachdb / cockroach

changefeedccl: checkpoint could fail to grow during long backfills #133492

Open andyyang890 opened 1 month ago

andyyang890 commented 1 month ago

Currently, when a change frontier receives a set of resolved spans from a change aggregator, it processes each resolved span individually. As part of that processing, it checks whether at least changefeed.frontier_checkpoint_frequency has elapsed since we last wrote a checkpoint. Because this check happens on a per-span basis, a checkpoint is usually written after processing the first backfilled span in the set, and it is then likely that not enough time will have passed for any of the remaining backfilled spans in the set to trigger another checkpoint. In the rare scenario where a changefeed undergoing a long backfill restarts at intervals shorter than changefeed.frontier_checkpoint_frequency, the checkpoint could fail to grow.

Possible solutions:

  1. If the checkpoint hasn't changed since the last write, don't write it again (this will ensure we can at least make incremental progress)
  2. Don't write a checkpoint until we have processed an entire batch of resolved spans (this could still be problematic since updates from different aggregators will arrive in different batches)
  3. Do nothing (this scenario should remedy itself in a stable cluster, since eventually either the backfill completes and the highwater is updated, or enough time passes before the next restart that we can write a new, larger checkpoint)

This issue was discovered while investigating https://github.com/cockroachdb/cockroach/issues/132548.

Jira issue: CRDB-43645

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/cdc