Daesgar opened 5 days ago
I've been thinking about this some more and I do think having a separate process that lives outside of CH and inserts parquet files into S3 would go a long way here.
My proposal is to add this helm chart to our deployed charts and use it to dump parquet files directly from kafka to S3. This would allow us to store events from kafka very cheaply, query them for metrics and for debugging, and leverage them for recovery if we need to.
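To make the proposal concrete, a minimal sketch of registering such a sink through the Kafka Connect REST API could look like this. The class names assume Confluent's S3 sink connector plugin; the connector name, topic, bucket, region, and flush settings are all placeholders:

```shell
# Hypothetical config: register an S3 sink that writes Parquet files.
# Assumes the Confluent S3 sink connector plugin is installed on the
# Connect workers; all names here are placeholders.
curl -X PUT http://localhost:8083/connectors/events-s3-sink/config \
  -H 'Content-Type: application/json' \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "our-events-archive",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "10000",
    "rotate.interval.ms": "600000"
  }'
```

One caveat worth checking: the Parquet format class needs schema-aware data (e.g. Avro with a schema registry), so this depends on how our events are currently serialized.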
Benefits:

* Cheap long-term storage of Kafka events in S3
* Events can be queried for metrics and for debugging
* Events can be leveraged for recovery if we need to
* No dependency on the `Kafka` table engine

Cons:
* Requires a new service, probably Kafka Connect
  * Not a huge problem though, since this might actually be something we need in the future for ByConity or for CH in the long run
To reduce the operational load, we could use MSK Connect. It handles the hardware under the hood and automatically scales with throughput (we can set a maximum number of workers for the scaling). I haven't tested it, but it should work fine.
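For reference, the autoscaled capacity in MSK Connect is expressed roughly like this when creating a connector. The values are illustrative and I haven't verified the full set of required flags, so treat this as a sketch rather than a working command:

```shell
# Hypothetical sketch of MSK Connect autoscaling: scale between 1 and
# 4 workers based on CPU utilization. The other required arguments
# (plugins, service execution role, the connector config itself) are
# deliberately elided here.
aws kafkaconnect create-connector \
  --connector-name events-s3-sink \
  --capacity '{
    "autoScaling": {
      "minWorkerCount": 1,
      "maxWorkerCount": 4,
      "mcuCount": 1,
      "scaleInPolicy":  {"cpuUtilizationPercentage": 20},
      "scaleOutPolicy": {"cpuUtilizationPercentage": 80}
    }
  }' \
  ...
```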
One con of using MSK Connect is that I don't think it
* [clickhouse.com/docs/en/integrations/kafka/clickhouse-kafka-connect-sink](https://clickhouse.com/docs/en/integrations/kafka/clickhouse-kafka-connect-sink) <- like you said, this is very mature
It also delivers exactly-once semantics, which would be nice to have when we consider switching to the connector.
Create a Kafka to ClickHouse dashboard to track relevant metrics and alert us when there is an anomaly in consumption or a big gap between a topic and the `events` table.
A WIP version of the dashboard exists, but it still needs refinement.
Interesting metrics to track:

* Consumption rate per topic (to detect anomalies)
* Gap between each topic and the `events` and `raw_events` tables
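As a starting point for the dashboard's data sources, the two sides of the gap metric could be pulled like this. The consumer group name, table name, and `timestamp` column are assumptions about our setup:

```shell
# Kafka side: per-partition lag of the ClickHouse consumer group
# (group name is a placeholder for whatever our consumers use).
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group clickhouse-events

# ClickHouse side: how far behind the newest ingested event is
# (assumes the events table has a `timestamp` column).
clickhouse-client --query "
  SELECT now() - max(timestamp) AS ingestion_delay_seconds
  FROM events
"
```

These are the raw numbers; the dashboard would chart them over time and alert on thresholds.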