MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com

storage/sources/kafka: centralize source "partition" discovery in one worker #15698

Open aljoscha opened 1 year ago

aljoscha commented 1 year ago

Currently, each timely worker in the source ingestion pipeline periodically queries the external system to discover new "partitions" (the term is heavily influenced by Kafka, but other systems have a similar concept, such as Kinesis shards, and we call them all partitions internally). This is problematic when we scale to very large worker counts, because we are essentially DDoS'ing the external system.

Instead, we could add another operator to the ingestion pipeline whose sole purpose is to discover new partitions. This operator would be active on only a single timely worker and would forward the discovered partitions to the SourceReader operators for reading.
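
For illustration, here is a minimal sketch of that shape in plain timely dataflow, not Materialize's actual source pipeline: a `source` operator that keeps its capability only on worker 0, "discovers" partitions there (the discovery logic is stubbed out instead of calling Kafka), and broadcasts them so every worker's reader logic can see them. The operator name, the `u64` partition IDs, and the fixed four-partition cut-off are all made up for the example.

```rust
use std::collections::HashSet;

use timely::dataflow::operators::generic::operator::source;
use timely::dataflow::operators::{Broadcast, Inspect};
use timely::scheduling::Scheduler;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let index = worker.index();

        worker.dataflow::<u64, _, _>(|scope| {
            let discovered = source(scope, "PartitionDiscovery", |capability, info| {
                let activator = scope.activator_for(&info.address[..]);

                // Only worker 0 keeps the capability; the other workers build
                // the operator but never emit anything from it.
                let mut cap = if index == 0 { Some(capability) } else { None };
                let mut known: HashSet<u64> = HashSet::new();

                move |output| {
                    let mut done = false;
                    if let Some(cap) = cap.as_mut() {
                        // Stand-in for a metadata request against the external
                        // system: pretend one new partition appears per
                        // activation, up to four.
                        let next = known.len() as u64;
                        known.insert(next);
                        output.session(&cap).give(next);

                        let time = *cap.time();
                        cap.downgrade(&(time + 1));
                        done = known.len() >= 4;
                    }

                    if done {
                        // Dropping the capability shuts the operator down.
                        cap = None;
                    } else if cap.is_some() {
                        // A real implementation would re-activate on the
                        // configured metadata refresh interval instead.
                        activator.activate();
                    }
                }
            });

            // Broadcast so every worker learns about every partition and can
            // decide which ones it is responsible for reading.
            discovered
                .broadcast()
                .inspect(move |pid| println!("worker {index} saw partition {pid}"));
        });
    })
    .unwrap();
}
```

The broadcast at the end is just one way to get the discovered partitions to the downstream readers; the actual wiring to the SourceReader operators would depend on how partition-to-worker assignment is done today.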

benesch commented 1 year ago

Just a quick note on priority: I believe the default topic metadata refresh interval here is 5m, so even at the largest cluster replica size (32 threads? 64 threads?) I doubt this will be a problem in production! The way this has shown up to date is in CI, where you can use the unsafe TOPIC METADATA REFRESH INTERVAL MS option to dial the refresh interval way down. AFAIU that option is meant to be permanently unsafe, because it's a bit in the weeds and has the footgun that it will cause storaged to fall over if you set it too low for the number of workers in the process.

aljoscha commented 1 year ago

agreed! just wanted to write it down so we have it somewhere