Currently, when a change frontier receives a set of resolved spans from a change aggregator, it processes each resolved span individually. As part of that processing, it checks whether at least `changefeed.frontier_checkpoint_frequency` has elapsed since the last checkpoint was written. Because the check happens per span, a checkpoint is usually written right after the first backfilled span in the set is processed, and it is then unlikely that enough time will have passed for any of the remaining backfilled spans in the set to trigger another checkpoint. In the rare scenario where a changefeed undergoing a long backfill restarts at intervals shorter than `changefeed.frontier_checkpoint_frequency`, the checkpoint could fail to grow.
Possible solutions:
- If the checkpoint hasn't changed since the last write, don't write it again (this ensures we can at least make incremental progress).
- Don't write a checkpoint until we have processed an entire batch of resolved spans (this could still be problematic, since updates from different aggregators arrive in different batches).
- Do nothing (this scenario should remedy itself in a stable cluster: either the backfill eventually completes and the highwater is updated, or enough time passes before the next restart for us to write a new, larger checkpoint).
This issue was discovered while investigating https://github.com/cockroachdb/cockroach/issues/132548.
Jira issue: CRDB-43645