Daesgar opened 5 days ago
I've been thinking about this some more and I do think having a separate process that lives outside of CH and inserts parquet files into S3 would go a long way here.
My proposal is to add this helm chart to our deployed charts and use it to dump parquet files directly from kafka to S3. This would allow us to store events from kafka very cheaply, query them for metrics and for debugging, and leverage them for recovery if we need to.
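To make the proposal concrete, a minimal sketch of registering such a sink through the Kafka Connect REST API could look like this. The class names assume Confluent's S3 sink connector plugin; the connector name, topic, bucket, region, and flush settings are all placeholders:

```shell
# Hypothetical config: register an S3 sink that writes Parquet files.
# Assumes the Confluent S3 sink connector plugin is installed on the
# Connect workers; all names here are placeholders.
curl -X PUT http://localhost:8083/connectors/events-s3-sink/config \
  -H 'Content-Type: application/json' \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "our-events-archive",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "10000",
    "rotate.interval.ms": "600000"
  }'
```

One caveat worth checking: the Parquet format class needs schema-aware data (e.g. Avro with a schema registry), so this depends on how our events are currently serialized.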
Benefits:

* Cheap long-term storage of Kafka events in S3
* Events can be queried for metrics and for debugging
* Events can be leveraged for recovery if we need to
* No dependency on the `Kafka` table engine

Cons:
* Requires a new service, probably Kafka Connect
  * Not a huge problem though, since this might actually be something we need in the future for ByConity or for CH in the long run
To reduce the operational load, we could use MSK Connect. It handles the hardware under the hood and automatically scales with throughput (we can set a maximum number of workers for the scaling). I haven't tested it, but it should work fine.
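For reference, the autoscaled capacity in MSK Connect is expressed roughly like this when creating a connector. The values are illustrative and I haven't verified the full set of required flags, so treat this as a sketch rather than a working command:

```shell
# Hypothetical sketch of MSK Connect autoscaling: scale between 1 and
# 4 workers based on CPU utilization. The other required arguments
# (plugins, service execution role, the connector config itself) are
# deliberately elided here.
aws kafkaconnect create-connector \
  --connector-name events-s3-sink \
  --capacity '{
    "autoScaling": {
      "minWorkerCount": 1,
      "maxWorkerCount": 4,
      "mcuCount": 1,
      "scaleInPolicy":  {"cpuUtilizationPercentage": 20},
      "scaleOutPolicy": {"cpuUtilizationPercentage": 80}
    }
  }' \
  ...
```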
One con of using MSK Connect is that I don't think it
* [clickhouse.com/docs/en/integrations/kafka/clickhouse-kafka-connect-sink](https://clickhouse.com/docs/en/integrations/kafka/clickhouse-kafka-connect-sink) <- like you said, this is very mature
It also delivers exactly-once semantics, which would be nice to have when we consider switching to the connector.
Create a Kafka to ClickHouse dashboard to track relevant metrics and alert us when there is an anomaly in consumption or a big gap between a topic and the `events` table.
A WIP version of the dashboard exists, but it still needs refinement.
Interesting metrics to track:

* Consumption rate per topic (to detect anomalies)
* Gap between each topic and the `events` and `raw_events` tables
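As a starting point for the dashboard's data sources, the two sides of the gap metric could be pulled like this. The consumer group name, table name, and `timestamp` column are assumptions about our setup:

```shell
# Kafka side: per-partition lag of the ClickHouse consumer group
# (group name is a placeholder for whatever our consumers use).
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group clickhouse-events

# ClickHouse side: how far behind the newest ingested event is
# (assumes the events table has a `timestamp` column).
clickhouse-client --query "
  SELECT now() - max(timestamp) AS ingestion_delay_seconds
  FROM events
"
```

These are the raw numbers; the dashboard would chart them over time and alert on thresholds.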