cdc: avoid duplicate schema registrations

HonoreDB commented 1 year ago

When we have a large CRDB cluster starting up an Avro changefeed, each processor will post the same schema registration to the same endpoint at about the same time. Users should have an alternative to provisioning a schema registry endpoint that can handle such spiky traffic.

Fixes:

- https://github.com/cockroachdb/cockroach/pull/99833
  - https://github.com/cockroachdb/cockroach/pull/100844
  - https://github.com/cockroachdb/cockroach/pull/100843
  - Reduces duplicate calls to ~the number of nodes, so in a cluster with high amounts of parallelism per node that's about an 8x reduction.
  - Will be in v22.2.9.
  - Will be in v23.1.0.
- https://github.com/cockroachdb/cockroach/pull/98135
  - https://github.com/cockroachdb/cockroach/pull/98349
  - https://github.com/cockroachdb/cockroach/pull/98392
  - https://github.com/cockroachdb/cockroach/pull/98396
  - Reduces retries that were driven by spurious errors.
  - Released in v22.2.7.
  - Will be in v23.1.0.
- https://github.com/cockroachdb/cockroach/pull/99077
  - https://github.com/cockroachdb/cockroach/pull/99300
  - https://github.com/cockroachdb/cockroach/pull/99505
  - Makes the timeout for schema registry calls longer and more configurable. It doesn’t address the root cause, but offers a mitigation.
  - Released in v22.2.8.
  - Will be in v23.1.0.

We can implement this by having the processor that registers the schema tell the others what the ID is. https://github.com/cockroachdb/cockroach/pull/99059 is the beginning of an attempt to implement this by synchronizing using the job_info table.

Another approach (or complementary to above) could be to register schemas before distributing the job so that the IDs can be serialized into the processor specs, but that doesn't prevent potential spikes when a schema change occurs.

Jira issue: CRDB-25769

gz#16064

gz#16384

Epic CRDB-25039

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/cdc

miretskiy commented 1 year ago

We should make sure that at the very least, schema registration happens once per node, and not once per parallel processor.

shermanCRL commented 1 year ago

@HonoreDB Have we gone as far as we intend on this, or more to do?

And, are my updated descriptions in the top comment accurate?

HonoreDB commented 1 year ago

Short term, and for stuff we intend to backport, I think we're done--the number of schema registrations is now O(number of nodes * number of table schema versions). Medium term we should still find a way to get rid of that first factor.

HonoreDB commented 1 year ago

Updated the description slightly.

shermanCRL commented 1 year ago

@HonoreDB Thanks. Remind me, what was the big-O before these changes?

HonoreDB commented 1 year ago

The main factor we took out in #99833 was nprocs, number of parallel encoding workers per node, which defaults to number of cpus per node / 4, to a max of 8. We also had duplicate registrations if there were multiple changefeeds on the same table, or a changefeed was restarted, which #99833 mitigates.

HonoreDB commented 1 year ago

So before you could say O(number of processors * number of table schema versions * number of changefeeds).
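To put rough numbers on the two bounds, here is a small sketch. It assumes "number of processors" means nodes times encoders per node, and it uses a made-up cluster. The function names are hypothetical, and the floor of one encoder per node is an assumption.

```go
package main

// encodersPerNode follows the default described above:
// CPUs per node / 4, capped at 8. The floor of 1 is an assumption.
func encodersPerNode(cpus int) int {
	n := cpus / 4
	if n > 8 {
		n = 8
	}
	if n < 1 {
		n = 1
	}
	return n
}

// registrationsBefore is the old upper bound:
// processors * table schema versions * changefeeds.
func registrationsBefore(nodes, cpus, versions, changefeeds int) int {
	return nodes * encodersPerNode(cpus) * versions * changefeeds
}

// registrationsAfter is the new upper bound: nodes * table schema versions.
func registrationsAfter(nodes, versions int) int {
	return nodes * versions
}
```

For a hypothetical cluster of 10 nodes with 32 CPUs each, 3 table schema versions, and 2 changefeeds on the table, the old bound is 10 * 8 * 3 * 2 = 480 registrations and the new bound is 10 * 3 = 30.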

amruss commented 1 year ago

Can this be closed?

cdc: avoid duplicate schema registrations #99221