apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Prometheus support for Spark clusters on Kubernetes - proposed feature #384

Open matyix opened 7 years ago

matyix commented 7 years ago

Hello maintainers

Is there interest in a Prometheus sink for the metrics system, in order to monitor Spark clusters on (not just) Kubernetes? We use Prometheus extensively for monitoring our K8s clusters and applications, and Spark is among them.

If there is interest, I can send a pull request based on this commit: it adds a new Prometheus sink, along with the required pom.xml changes and a usage example in the metrics template.
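To give an idea of the usage example, the metrics template entry would look roughly like this (the class name, gateway address and option names below are placeholders until the actual PR is out):

```properties
# Hypothetical configuration for the proposed Prometheus sink;
# the real property names will be defined in the pull request.
*.sink.prometheus.class=org.apache.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address=prometheus-pushgateway:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
```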

Looking forward to your answer.

kimoonkim commented 7 years ago

Hi @matyix, thanks for opening this issue.

I am personally interested in using Prometheus for monitoring Spark on K8s, and I wonder whether a Prometheus sink would make it easier. I think this is definitely worth discussing.

In the past, I have used the graphite sink on the Spark side and the graphite exporter on the Prometheus side to relay metrics to Prometheus, inspired by this blog post. FYI, the graphite sink is one of the few built-in sinks in Spark core, so I didn't have to add any new code. This approach worked with a bit of workaround on the configuration side. (See #162)
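For reference, the graphite relay was configured roughly like this in conf/metrics.properties (the host and port here are placeholders for wherever the graphite_exporter runs in the cluster):

```properties
# Send all Spark metrics to the Prometheus graphite_exporter,
# which re-exposes them on its /metrics endpoint for Prometheus to scrape.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-exporter.monitoring.svc.cluster.local
*.sink.graphite.port=9109
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark
```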

So again, I wonder whether a direct Prometheus sink would make this easier to use, especially the configuration part.

matyix commented 7 years ago

Hi @kimoonkim

I think there are a few reasons why a native Prometheus sink might be a better solution than using the Graphite one through the Prometheus graphite exporter. While Graphite collects only key-value pairs (and aggregates them), Prometheus uses metric vectors (metrics with label dimensions), so the granularity is much higher. With Graphite you need to plan your keys ahead of time (the job name has to go into the key for uniqueness) if you are running multiple jobs in the same cluster, and that also produced files in the range of thousands. Our alerting queries are very granular and we may need the higher, multidimensional resolution, e.g. we scale the cluster based on SLAs set on these alerts.
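To illustrate the difference (the metric and application names below are made up), the same executor metric ends up encoded like this in each system; with Graphite the job identity is baked into the key itself, while with Prometheus it stays queryable as labels:

```
# Graphite: the job identity has to be encoded in the dotted key
spark-pi-20171107.driver.BlockManager.memory.memUsed_MB

# Prometheus: a single metric name, with the identity carried as label dimensions
spark_block_manager_memory_used_mb{spark_app="spark-pi-20171107", instance="driver"}
```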

The current solution with Prometheus is not perfect either, mostly because of how the sink system works in Spark: we need to use the PushGateway (whereas the Prometheus ideology is built around polling), and the PushGateway never forgets series pushed into it, so they have to be deleted through the gateway's API (though this is a Prometheus limitation, not a Spark one). We are working on a PR against Prometheus to make this configurable (time, size, elapsed time, etc.).
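For now the cleanup has to be done against the PushGateway's HTTP API; a minimal sketch is below, assuming a gateway at pushgateway:9091 and a job group named spark-pi (both are just examples):

```scala
import java.net.{HttpURLConnection, URL}

object PushGatewayCleanup {
  def main(args: Array[String]): Unit = {
    // DELETE /metrics/job/<job> removes every series that was pushed under
    // that job grouping, so Prometheus stops scraping the stale values.
    val url = new URL("http://pushgateway:9091/metrics/job/spark-pi")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("DELETE")
    println(s"PushGateway returned HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}
```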

Regarding the issue you highlighted above - I assume it's the same, but we never encountered it because the metrics file is baked into our base Spark Docker image, so the executors have it before the metrics system is initialized. I will try removing it and submitting a job to see whether it happens - I will get back to you soon with the results.
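For context, this is roughly how the metrics file gets into our image (the paths are those of our own base image, so treat this only as an example):

```dockerfile
# Bake the metrics configuration into the base Spark image so every
# driver and executor pod has it before the metrics system initializes.
COPY conf/metrics.properties /opt/spark/conf/metrics.properties
```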

kimoonkim commented 7 years ago

@matyix Thanks for your points. Yes, I have seen the complication around the Graphite configuration myself and didn't love it. If the direct Prometheus exporter alleviates the issue, I think that's a welcome change.

matyix commented 7 years ago

@kimoonkim I will create a PR - should this now go against the new 2.2 branch?

kimoonkim commented 7 years ago

Probably. @foxish? Is the 2.2 branch ready to receive new PRs?

foxish commented 7 years ago

We'll cut a new release on 2.2 tomorrow, and then it should be ready for PRs.

matyix commented 6 years ago

Hello @foxish and @kimoonkim - sorry for the delay, it took a bit longer than expected, but the PR is now ready. I just submitted PR #531.