Open asg0451 opened 2 weeks ago
cc @cockroachdb/cdc
We've seen the coordinator bottleneck while forwarding spans in a support issue (https://github.com/cockroachlabs/support/issues/3122) with non-zero (but lower than default) min_checkpoint_frequency and resolved options. In that case the changefeed was watching a large table on the order of 100Ks of spans.
We recommend customers use the changefeed settings min_checkpoint_frequency='0s',resolved='0s' to make checkpoint updates as recent as possible, which reduces duplicate data when backfilling. However, we've observed on escalations and in the drt-scale cluster that this configuration causes another issue: it makes the feed's coordinator spend all its time doing checkpoint accounting, resulting in the feed lagging.
Feed statement:
CREATE CHANGEFEED FOR TABLE cct_tpcc.public.order_line INTO 'webhook-https://milesfrankel-big-webhook-0001.roachprod.crdb.io:9090?insecure_tls_skip_verify=true' WITH OPTIONS (initial_scan = 'no', metrics_label = 'webhook_minchk_resolved_0', updated, schema_change_policy = 'nobackfill', min_checkpoint_frequency='0s',resolved='0s')
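As a rough illustration of why zero-interval settings can increase coordinator load, the toy model below counts how many progress updates get past a minimum-interval throttle. It is not CockroachDB's implementation: the function name `throttled_update_count`, the 100 ms event spacing, and the 30 s comparison interval are all assumptions chosen for the example.

```python
def throttled_update_count(event_times_ms, min_interval_ms):
    """Count events forwarded when at most one is allowed per min_interval_ms.

    Hypothetical model: each forwarded event stands in for one progress
    update that the receiving side (the coordinator) has to account for.
    """
    updates = 0
    last_sent = None
    for t in event_times_ms:
        # Forward the first event, then only events that arrive at least
        # min_interval_ms after the previously forwarded one.
        if last_sent is None or t - last_sent >= min_interval_ms:
            updates += 1
            last_sent = t
    return updates


if __name__ == "__main__":
    # Assumed workload: one resolved event every 100 ms for 60 seconds.
    events = list(range(0, 60_000, 100))
    print("0s interval: ", throttled_update_count(events, 0), "updates")       # 600
    print("30s interval:", throttled_update_count(events, 30_000), "updates")  # 2
```

In this model a zero interval forwards every event, so the work on the receiving side scales with the raw event rate rather than with the configured interval.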
This is a screenshot from a runtime trace of one such coordinator:
Compare to a coordinator for a similar feed but without those options set:
Jira issue: CRDB-44437