As observed in the drt-scale test, changefeeds which watch many ranges are unable to catch up from lag and fall further and further behind. This is largely related to the catchup scan semaphore which protects crdb from expensive catchup scans.
Some of the feed's ranges will catch up, but most will be blocked on that semaphore. Rangefeed restarts due to transient issues or slow consumers are likely, and since we don't emit any checkpoints during catchup, the progress made before a restart is lost.
This situation makes changefeeds incredibly unstable, as when they restart due to transient errors, they cannot recover.
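To illustrate the failure mode described above, here is a minimal toy model in Go. It is not CockroachDB code: `simulate`, its parameters, and the round-based model are all hypothetical, and the numbers are made up. Each round, at most `limit` ranges complete a catchup scan (standing in for the semaphore), and every `restartEvery` rounds the feed restarts. Without checkpoints, a restart discards all catchup progress.

```go
package main

import "fmt"

// simulate is a hypothetical toy model, not the real rangefeed code.
// It returns how many of numRanges ranges are caught up after `rounds`.
//   limit        - max catchup scans that can finish per round (the semaphore)
//   restartEvery - the feed restarts after every restartEvery-th round
//   checkpoint   - whether catchup progress survives a restart
func simulate(numRanges, limit, restartEvery, rounds int, checkpoint bool) int {
	done := make([]bool, numRanges)
	for r := 1; r <= rounds; r++ {
		// Only `limit` ranges may make progress in this round.
		slots := limit
		for i := range done {
			if slots == 0 {
				break
			}
			if !done[i] {
				done[i] = true
				slots--
			}
		}
		// A restart without checkpoints throws away everything so far.
		if r%restartEvery == 0 && !checkpoint {
			for i := range done {
				done[i] = false
			}
		}
	}
	caughtUp := 0
	for _, d := range done {
		if d {
			caughtUp++
		}
	}
	return caughtUp
}

func main() {
	fmt.Println("without checkpoints:", simulate(100, 8, 5, 20, false))
	fmt.Println("with checkpoints:   ", simulate(100, 8, 5, 20, true))
}
```

In this model, a feed whose ranges cannot all get through the semaphore between restarts never finishes catching up, while the same feed with progress preserved across restarts does.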
Related issues/PRs:
Jira issue: CRDB-44440