changefeedccl: under recommended config changefeeds fall behind on large clusters

asg0451 commented 2 weeks ago

we recommend that customers set the changefeed options min_checkpoint_frequency='0s', resolved='0s' to keep checkpoint updates as recent as possible and so reduce duplicate data when backfilling. however, we've observed in escalations and on the drt-scale cluster that this configuration causes another issue: the feed's coordinator spends all its time doing checkpoint accounting, and the feed falls behind.
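to make the scaling concern concrete, here is a toy model in Python. it is only a sketch and not how changefeedccl is implemented: `ToyCoordinator`, `simulate`, and the cost model (each checkpoint walks every tracked span) are all hypothetical and exist purely for illustration. it shows how a zero minimum interval lets every resolved-span update trigger checkpoint accounting, while a non-zero interval bounds that work.

```python
from dataclasses import dataclass


@dataclass
class ToyCoordinator:
    """Toy model only; NOT CockroachDB's actual changefeed coordinator."""

    min_checkpoint_interval_s: float
    checkpoints: int = 0       # number of checkpoint passes performed
    spans_scanned: int = 0     # assumed cost: one unit per span per checkpoint
    last_checkpoint_s: float = float("-inf")

    def on_resolved(self, now_s: float, resolved: dict, span: int, ts: float) -> None:
        # Record the newest resolved timestamp reported for this span.
        resolved[span] = max(resolved[span], ts)
        # Checkpoint only if the minimum interval has elapsed since the last one.
        if now_s - self.last_checkpoint_s >= self.min_checkpoint_interval_s:
            # Assumption: checkpoint accounting touches every tracked span.
            self.spans_scanned += len(resolved)
            self.checkpoints += 1
            self.last_checkpoint_s = now_s


def simulate(num_spans: int, duration_s: int, interval_s: float) -> ToyCoordinator:
    """Every span reports a resolved timestamp once per simulated second."""
    coord = ToyCoordinator(min_checkpoint_interval_s=interval_s)
    resolved = {span: 0.0 for span in range(num_spans)}
    for t in range(duration_s):
        for span in range(num_spans):
            coord.on_resolved(float(t), resolved, span, float(t))
    return coord


if __name__ == "__main__":
    eager = simulate(num_spans=100, duration_s=60, interval_s=0.0)
    throttled = simulate(num_spans=100, duration_s=60, interval_s=30.0)
    print("interval=0s :", eager.checkpoints, "checkpoints,", eager.spans_scanned, "spans scanned")
    print("interval=30s:", throttled.checkpoints, "checkpoints,", throttled.spans_scanned, "spans scanned")
```

in this model the work at a zero interval grows with the square of the span count (one checkpoint per update, each touching every span), which is the kind of growth that would hurt on a table with many spans. whether the real coordinator's cost has that shape is something the runtime traces would need to confirm.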

feed statement: CREATE CHANGEFEED FOR TABLE cct_tpcc.public.order_line INTO 'webhook-https://milesfrankel-big-webhook-0001.roachprod.crdb.io:9090?insecure_tls_skip_verify=true' WITH OPTIONS (initial_scan = 'no', metrics_label = 'webhook_minchk_resolved_0', updated, schema_change_policy = 'nobackfill', min_checkpoint_frequency='0s',resolved='0s')

this is a screenshot from a runtime trace of one such coordinator:

compare to a coordinator for a similar feed but without those options set:

Jira issue: CRDB-44437

blathers-crl[bot] commented 2 weeks ago

cc @cockroachdb/cdc

rharding6373 commented 2 weeks ago

We've seen the coordinator become a bottleneck while forwarding spans in a support issue (https://github.com/cockroachlabs/support/issues/3122) with non-zero (but lower than default) min_checkpoint_frequency and resolved options. In that case the changefeed was watching a large table with on the order of hundreds of thousands of spans.

cockroachdb/cockroach#135285