cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

changefeedccl: under recommended config changefeeds fall behind on large clusters #135285

Open asg0451 opened 2 weeks ago

asg0451 commented 2 weeks ago

we recommend that customers set the changefeed options min_checkpoint_frequency='0s' and resolved='0s' to keep checkpoints as recent as possible, which reduces duplicate data when backfilling. however, we've observed in escalations and on the drt-scale cluster that this configuration causes another issue: the feed's coordinator spends all its time doing checkpoint accounting, and the feed lags as a result.

feed statement: CREATE CHANGEFEED FOR TABLE cct_tpcc.public.order_line INTO 'webhook-https://milesfrankel-big-webhook-0001.roachprod.crdb.io:9090?insecure_tls_skip_verify=true' WITH OPTIONS (initial_scan = 'no', metrics_label = 'webhook_minchk_resolved_0', updated, schema_change_policy = 'nobackfill', min_checkpoint_frequency='0s',resolved='0s')
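
To give a rough sense of why a '0s' setting can swamp the coordinator, here is a toy model of how many progress flushes a single aggregator would send. It is only a sketch, not CockroachDB code: the function name, the 200 ms advance interval, and the 30 s comparison value are assumptions chosen for illustration, and the model ignores everything except the minimum spacing between flushes.

```python
def count_progress_flushes(duration_ms: int,
                           advance_interval_ms: int,
                           min_checkpoint_frequency_ms: int) -> int:
    """Count flushes one aggregator sends to the coordinator in a toy model.

    The aggregator's local high-water mark advances every advance_interval_ms.
    On each advance it flushes only if at least min_checkpoint_frequency_ms
    has passed since the previous flush (0 means flush on every advance).
    """
    flushes = 0
    last_flush_ms = None
    for step in range(1, duration_ms // advance_interval_ms + 1):
        now_ms = step * advance_interval_ms
        if last_flush_ms is None or now_ms - last_flush_ms >= min_checkpoint_frequency_ms:
            flushes += 1
            last_flush_ms = now_ms
    return flushes


# One simulated minute, with the high-water mark advancing every 200 ms (assumed).
unthrottled = count_progress_flushes(60_000, 200, 0)       # 300 flushes
throttled = count_progress_flushes(60_000, 200, 30_000)    # 2 flushes
print(unthrottled, throttled)
```

In this model every flush is a message the coordinator has to account for, so the per-aggregator count is multiplied by the number of aggregators feeding the same coordinator.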

this is a screenshot from a runtime trace of one such coordinator: image

compare to a coordinator for a similar feed but without those options set: image

Jira issue: CRDB-44437

blathers-crl[bot] commented 2 weeks ago

cc @cockroachdb/cdc

rharding6373 commented 2 weeks ago

We've seen the coordinator bottleneck while forwarding spans in a support issue (https://github.com/cockroachlabs/support/issues/3122) with non-zero (but lower than default) min_checkpoint_frequency and resolved options. In that case the changefeed was watching a large table with hundreds of thousands of spans.
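
The span-forwarding cost mentioned above can be illustrated with a small frontier model: each span carries its own resolved timestamp, and the feed's frontier is the minimum across all spans. This is a minimal sketch under stated assumptions, not the actual CockroachDB span frontier; `ToyFrontier` and its methods are hypothetical names, and spans are modeled as plain integers rather than key ranges.

```python
import heapq


class ToyFrontier:
    """Tracks a resolved timestamp per span; the frontier is the minimum."""

    def __init__(self, spans):
        self.resolved = {span: 0 for span in spans}
        # Min-heap of (timestamp, span); stale entries are discarded lazily.
        self.heap = [(0, span) for span in self.resolved]
        heapq.heapify(self.heap)
        self.forward_calls = 0

    def frontier(self):
        # Pop entries whose timestamp no longer matches the span's current value.
        while self.heap[0][0] != self.resolved[self.heap[0][1]]:
            heapq.heappop(self.heap)
        return self.heap[0][0]

    def forward(self, span, ts):
        """Advance one span; return True only if the overall frontier moved."""
        self.forward_calls += 1
        if ts <= self.resolved[span]:
            return False
        before = self.frontier()
        self.resolved[span] = ts
        heapq.heappush(self.heap, (ts, span))
        return self.frontier() > before


# One round of updates over 100,000 spans (the scale described above).
num_spans = 100_000
toy = ToyFrontier(range(num_spans))
advances = sum(toy.forward(span, 1) for span in range(num_spans))
print(toy.forward_calls, advances)
```

In this model the coordinator handles one forward call per span per round, yet the frontier itself moves only when the last lagging span catches up. The more often each span reports, the more of that per-span work lands on the single coordinator.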