cockroachdb / cockroach


changefeedccl: feeds with many ranges can't recover from lag #135294

Open asg0451 opened 2 hours ago

asg0451 commented 2 hours ago

As observed in the drt-scale test, changefeeds that watch many ranges are unable to catch up from lag and instead fall further and further behind. This is largely related to the catchup scan semaphore, which limits concurrent catchup scans to protect crdb from their cost.

Some of the feeds' ranges will catch up, while most remain blocked on that semaphore. Rangefeed restarts due to transient issues or slow consumers are likely, and since we don't emit any checkpoints during catchup, the progress made so far is lost on restart.

This makes changefeeds incredibly unstable: once they restart due to a transient error, they cannot recover.

Related issues/PRs:

Jira issue: CRDB-44440

blathers-crl[bot] commented 2 hours ago

cc @cockroachdb/cdc