
changefeedccl: adaptive throttling for webhook sink #120854

Open amruss opened 5 months ago

amruss commented 5 months ago

Throttle the webhook sink so that we do not overrun replicator or other webhook consumers.

Jira issue: CRDB-36903

Epic CRDB-37334

blathers-crl[bot] commented 5 months ago

cc @cockroachdb/cdc

jayshrivastava commented 5 months ago

I don't think throttling is a good idea. This state where a changefeed cannot emit as fast as incoming writes is not very good. I don't think rangefeeds were designed to be in this state for very long (i.e., have all ranges in catchup scan mode, where there's no checkpointing until the catchup scan completes). I think throttling here causes the changefeed to fall behind, and to fall further and further behind over time. Ultimately, the latency between a write on one cluster and a write on the other via replicator will be unacceptable.

What we should do instead is consume events faster. For example, why is replicator slow? Is it because of decoding json? Maybe we should emit crdb datums using a binary protocol instead. Or, if it's slow because of re-ordering, can we try to checkpoint more often to reduce the reordering?

Alternatively, can we get the events out of cockroach as fast as possible (emitting them to s3 for example)? This way we avoid having to backpressure cockroach and just spill to s3.

If the problem is bursty traffic, rather than our sustained throughput being too high, we shouldn't throttle all events. Replicator should have some buffer to absorb sudden bursts of events.
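For illustration only, here's a rough sketch of the kind of bounded buffer I mean, assuming the receiver is written in Go. The names are hypothetical and this isn't Replicator's actual code; the idea is just that short bursts are absorbed in memory and backpressure only kicks in once the buffer is full:

```go
package buffer

import "context"

// event stands in for a decoded changefeed message.
type event struct {
	payload []byte
}

// burstBuffer is a hypothetical fixed-capacity buffer: short bursts are
// absorbed in memory, and backpressure is only applied once it fills up.
type burstBuffer struct {
	ch chan event
}

func newBurstBuffer(capacity int) *burstBuffer {
	return &burstBuffer{ch: make(chan event, capacity)}
}

// Offer enqueues an event. It returns immediately while there is spare
// capacity and blocks (i.e. backpressures the caller) once the buffer is full.
func (b *burstBuffer) Offer(ctx context.Context, ev event) error {
	select {
	case b.ch <- ev:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Drain is consumed by the slower downstream apply loop.
func (b *burstBuffer) Drain() <-chan event {
	return b.ch
}
```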

bobvawter commented 5 months ago

This state where a changefeed cannot emit as fast as incoming writes is not very good.

Agreed, but this is occasionally a reality when building distributed systems.

I think throttling here causes the changefeed to fall behind, and to fall further and further behind over time. Ultimately, the latency between a write on one cluster and a write on the other via replicator will be unacceptable.

The point is exactly to cause it to fall behind, temporarily, to escape the metastable failure modes that changefeeds, as implemented today, will create.

It's a truism that as utilization of a communication channel approaches 100%, latency tends towards infinity. In the steady state, the changefeed receiver must be able to consume messages at least as fast as they are emitted.
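As a back-of-the-envelope illustration, using a plain M/M/1 queueing model (an assumption; a changefeed pipeline is obviously not a single queue), mean latency is 1/(mu - lambda) and blows up as the arrival rate lambda approaches the service rate mu:

```go
package main

import "fmt"

func main() {
	// Toy M/M/1 model (an assumption, not a model of the actual webhook
	// sink): a consumer that can service mu units of work per second.
	const mu = 100.0 // e.g. MiB/s of sustainable receiver throughput

	// Mean time in system per unit of work: W = 1 / (mu - lambda).
	for _, utilization := range []float64{0.5, 0.9, 0.99, 0.999} {
		lambda := utilization * mu // offered load
		w := 1.0 / (mu - lambda)
		fmt.Printf("utilization %5.1f%% -> mean latency %7.3fs\n", 100*utilization, w)
	}
}
```

The precise model doesn't matter; the point is that running the channel flat out leaves no headroom at all to absorb a burst or a temporary slowdown.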

What the current design cannot accommodate, in a graceful fashion, is cases where the workload has a spike that exceeds the ability of the consumer to keep up, or cases where the consumer has a temporary reduction in capacity (e.g. due to downstream receiver maintenance). The current design creates repeated spikes of attempted deliveries, instead of finding a sustainable rate to spread the burst out over time.

Let's say that a changefeed receiver can sustain 100MiB/s of traffic and the steady-state of the workload is 50MiB/s. Now, let's say that someone goes and deletes ~10M rows from the source (this part is real). What we see is that the changefeed will emit at a rate well in excess of the sustainable rate.

The only way to create backpressure today is to "be slow" and we've seen that doesn't play nicely with HTTP load-balancer timeouts. When cockroach receives a 504 Gateway Timeout, it'll back off and re-deliver the same messages again, at full speed, which prolongs the delivery outage. The fixed retry behavior in changefeeds also produces excessively long periods of idleness on the channel.

If it were possible to (dynamically) limit that bursty workload to the sustainable rate of 100MiB/s, the backlog could be delivered far sooner than by hoping that some subset of messages incrementally succeeds during each redelivery attempt. Think about the area under the curve of the number of successfully-acknowledged messages.
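To make the shape of this concrete, here's a sketch of the kind of adaptive throttle I have in mind, built on golang.org/x/time/rate with an AIMD adjustment. The type, thresholds, and wiring are all made up for illustration; I'm not proposing this as the implementation:

```go
package throttle

import (
	"context"
	"net/http"

	"golang.org/x/time/rate"
)

// adaptiveThrottle is a hypothetical AIMD controller sitting in front of the
// webhook sink: it backs off multiplicatively when the receiver signals
// distress (429/504) and probes upward additively while deliveries succeed.
type adaptiveThrottle struct {
	limiter *rate.Limiter
	floor   rate.Limit // minimum emission rate (bytes/s)
	ceiling rate.Limit // maximum emission rate (bytes/s)
	step    rate.Limit // additive increase per successful delivery
}

func newAdaptiveThrottle(start, floor, ceiling, step rate.Limit) *adaptiveThrottle {
	return &adaptiveThrottle{
		// Burst sized to roughly one second of traffic at the starting rate.
		limiter: rate.NewLimiter(start, int(start)),
		floor:   floor,
		ceiling: ceiling,
		step:    step,
	}
}

// Acquire blocks until n bytes may be emitted at the current rate.
func (t *adaptiveThrottle) Acquire(ctx context.Context, n int) error {
	return t.limiter.WaitN(ctx, n)
}

// Observe adjusts the rate based on the receiver's HTTP response.
func (t *adaptiveThrottle) Observe(status int) {
	cur := t.limiter.Limit()
	switch {
	case status == http.StatusTooManyRequests || status == http.StatusGatewayTimeout:
		// Multiplicative decrease: hunt for a rate the receiver can sustain.
		t.limiter.SetLimit(max(cur/2, t.floor))
	case status >= 200 && status < 300:
		// Additive increase: probe back toward the ceiling.
		t.limiter.SetLimit(min(cur+t.step, t.ceiling))
	}
}
```

AIMD is just the usual way to converge on an unknown sustainable rate without coordination; the exact control law matters far less than reacting to what the receiver actually reports.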

Is it because of decoding json?

Yes, when I pprof cdc-sink, it spends most of its time decoding json. There's a whole parallel-decode utility devoted to maximizing core utilization.
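Roughly the shape of that parallelism, as a generic worker-pool sketch in Go (not cdc-sink's actual parallel-decode code):

```go
package decode

import (
	"encoding/json"
	"runtime"
	"sync"
)

// decodeParallel fans raw webhook payloads out across the available cores,
// since JSON decoding is CPU-bound. Ordering is deliberately ignored here;
// a real pipeline has to re-sequence rows afterwards.
func decodeParallel(raw <-chan []byte) <-chan map[string]any {
	out := make(chan map[string]any, runtime.GOMAXPROCS(0))
	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for buf := range raw {
				var row map[string]any
				if err := json.Unmarshal(buf, &row); err != nil {
					continue // a real implementation would surface the error
				}
				out <- row
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}
```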

Maybe we should emit crdb datums using a binary protocol instead.

Yes, it would be great if, say, Cockroach could dial out over gRPC and we could reuse the existing datum protobuf specifications and the flow-control mechanisms built into gRPC.

can we try to checkpoint more often to reduce the reordering?

This would also improve the commit-to-commit latency in any replication scenario and is desirable.

Replicator should have some buffer to absorb sudden bursts of events.

The design of changefeeds forces the receiver to have durable storage because there's no way for the receiver to control when the high-water mark advances: once we respond with a 200 OK, there's no going back. If we were to buffer in memory or on local disk and that receiver tips over, there's no way for the system to rewind. Hence last year's https://cockroachlabs.atlassian.net/browse/CRDB-26439 request.
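In other words, the receiver has to make the payload durable before it acks. A minimal sketch of that contract, with a hypothetical stage function standing in for whatever durable store the receiver uses:

```go
package receiver

import (
	"io"
	"net/http"
)

// handler sketches the contract: the changefeed treats a 2xx response as
// permission to advance past these messages, so the payload must be made
// durable *before* we acknowledge it.
type handler struct {
	stage func(body []byte) error // hypothetical durable write (disk, S3, staging table, ...)
}

func (h *handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if err := h.stage(body); err != nil {
		// A non-2xx response keeps the high-water mark where it is, at the
		// cost of cockroach retrying the delivery later.
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK) // only now is it safe for the mark to advance
}
```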

bobvawter commented 5 months ago

Anything that we can do to increase maximum throughput is great, but until there's a changefeed architecture that can adaptively shed load during operational hiccups, the risk of metastable failure modes remains.