cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

changefeedccl: improve observability into aggregator sink flushes #130019

Open andyyang890 opened 1 month ago

andyyang890 commented 1 month ago

We currently have a changefeed.flushes metric that counts aggregator sink flushes and a changefeed.flush_hist_nanos metric that builds a histogram of the duration of each flush, but we don't have any logs or metrics that tell us why a flush happened.

From a quick skim of the code, it seems like there are three main reasons we flush:

  1. The aggregator is sending a progress update to the frontier: https://github.com/cockroachdb/cockroach/blob/2f8519c1ae5020614ee1616c829e1d5b3702f942/pkg/ccl/changefeedccl/changefeed_processors.go#L869
  2. The blocking buffer is blocked: https://github.com/cockroachdb/cockroach/blob/2f8519c1ae5020614ee1616c829e1d5b3702f942/pkg/ccl/changefeedccl/changefeed_processors.go#L785-L786 and https://github.com/cockroachdb/cockroach/blob/359593525b285317e8eb35b1b385c98352faaa3d/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go#L135
  3. The aggregator is preparing to send its shutdown checkpoint: https://github.com/cockroachdb/cockroach/blob/2f8519c1ae5020614ee1616c829e1d5b3702f942/pkg/ccl/changefeedccl/changefeed_processors.go#L732

We should add more metrics (or logs) to help us distinguish these and any other reasons we flush. (Maybe something like changefeed.flush.<reason>.)
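
To make the changefeed.flush.<reason> idea concrete, here is a rough, standalone sketch of per-reason flush counters. It uses only the Go standard library, not CockroachDB's metric package, and every name in it (FlushReason, FlushCounters, the reason strings) is hypothetical and chosen to mirror the three cases listed above; a real change would presumably register counters through the existing changefeed metrics struct instead.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FlushReason labels why an aggregator sink flush was triggered.
// These values are hypothetical and mirror the three cases in the issue.
type FlushReason int

const (
	FlushReasonProgressUpdate FlushReason = iota
	FlushReasonBufferBlocked
	FlushReasonShutdownCheckpoint
	numFlushReasons // sentinel: number of known reasons
)

// String returns the short name used as the <reason> suffix.
func (r FlushReason) String() string {
	switch r {
	case FlushReasonProgressUpdate:
		return "progress_update"
	case FlushReasonBufferBlocked:
		return "buffer_blocked"
	case FlushReasonShutdownCheckpoint:
		return "shutdown_checkpoint"
	default:
		return fmt.Sprintf("unknown_%d", int(r))
	}
}

// MetricName builds a name of the form changefeed.flush.<reason>.
func (r FlushReason) MetricName() string {
	return "changefeed.flush." + r.String()
}

// FlushCounters holds one atomic counter per flush reason, so it can be
// updated from multiple goroutines without a lock.
type FlushCounters struct {
	counts [numFlushReasons]atomic.Int64
}

// Record increments the counter for r. Unknown reasons are ignored.
func (c *FlushCounters) Record(r FlushReason) {
	if r < 0 || r >= numFlushReasons {
		return
	}
	c.counts[r].Add(1)
}

// Count returns the number of flushes recorded for r (0 if r is unknown).
func (c *FlushCounters) Count(r FlushReason) int64 {
	if r < 0 || r >= numFlushReasons {
		return 0
	}
	return c.counts[r].Load()
}

// Total sums all per-reason counters; with every flush attributed to a
// reason, it should track what changefeed.flushes reports today.
func (c *FlushCounters) Total() int64 {
	var total int64
	for i := range c.counts {
		total += c.counts[i].Load()
	}
	return total
}

// Snapshot returns the current counts keyed by metric name.
func (c *FlushCounters) Snapshot() map[string]int64 {
	out := make(map[string]int64, int(numFlushReasons))
	for i := range c.counts {
		out[FlushReason(i).MetricName()] = c.counts[i].Load()
	}
	return out
}
```

Each flush call site would pass its reason, e.g. `counters.Record(FlushReasonBufferBlocked)` right before flushing the sink. Keeping the reasons in a closed enum rather than free-form strings keeps the set of metric names fixed, and any reason we discover later becomes a new constant.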

Jira issue: CRDB-41843

Epic CRDB-37337

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/cdc