LucaCanali / sparkMeasure

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Apache License 2.0

Add KafkaSink to support shipping metrics into a kafka stream #36

Closed hoaihuongbk closed 2 years ago

LucaCanali commented 2 years ago

Hi, thanks for submitting this PR. It seems potentially useful. Could you please provide additional context?

hoaihuongbk commented 2 years ago

Oops, I missed the description for this PR.

A little about my project: we have a large number of Spark jobs submitted to our cluster every day, and as the number of requests grows, so does the pressure to optimize and save costs.

To fit our current infrastructure (Spark deployed in one cluster, the monitoring dashboard in another, plus our internal Kafka service), I cloned this repo and added a Kafka sink. Metrics are sent to a Kafka topic and then ingested into our internal InfluxDB (managed by the dbops team). Finally, the metrics are displayed on Grafana, so our team can monitor them in near real time. Super cool!


Actually, this sink is not very different from the InfluxDB sink you already implemented, except that it sends metric messages to a queue instead of writing directly to the database. We have been using it for a while, and it occurred to me that it might be useful to many other people as well. That's the main reason I submitted this PR.
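The idea described above (serialize each metrics record as a message and hand it to a producer, rather than writing it to a database) can be sketched roughly as follows. This is an illustrative sketch in Python, not the PR's Scala implementation; the field names (`applicationId`, `timestamp`, `metrics`) and the `ship` helper are assumptions, and the actual sink defines its own message schema.

```python
import json
import time


def metrics_to_kafka_payload(app_id, metrics):
    """Serialize one metrics snapshot as a JSON message body.

    `metrics` is a plain dict of metric name -> value, as a sink
    might collect per stage or per task. Field names here are
    illustrative, not the sink's actual schema.
    """
    record = {
        "applicationId": app_id,
        "timestamp": int(time.time() * 1000),  # epoch millis
        "metrics": metrics,
    }
    return json.dumps(record).encode("utf-8")


def ship(payload, send):
    """Hand the payload to a producer callable.

    In a real sink, `send` would be something like
    KafkaProducer.send(topic, payload); injecting it keeps this
    sketch testable without a broker.
    """
    send(payload)
```

The key design point, as the comment above notes, is that only the output step changes relative to the InfluxDB sink: the metrics collection and serialization stay the same, and the payload goes to a queue instead of a database client.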

Regarding your question about supporting Kafka with an authentication protocol enabled: in fact, we deploy our infrastructure on AWS and rely on a whitelist at the network layer (security groups), so we don't need a username and password.
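For readers who do run Kafka with authentication enabled, supporting it would mostly be a matter of passing the standard Kafka client security properties through to the producer. These are stock Kafka client settings (shown here for SASL/PLAIN over TLS); how a sink would expose them as configuration is up to the implementation.

```properties
# Standard Kafka client properties for SASL/PLAIN over TLS.
# A sink would need a way to pass these through to its producer.
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<user>" \
  password="<password>";
```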

LucaCanali commented 2 years ago

Thanks for the additional explanations and for the work.