apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0
612 stars 118 forks source link

Add support for publishing Spark metrics into Prometheus #531

Closed matyix closed 6 years ago

matyix commented 6 years ago

What changes were proposed in this pull request?

Publishing Spark metrics into Prometheus - as discussed earlier in #384. Implemented a metrics sink that publishes Spark metrics into Prometheus via Prometheus Pushgateway. Metrics data published by Spark is based on Dropwizard. The format of Spark metrics is not supported natively by Prometheus thus these are converted using DropwizardExports prior pushing metrics to the pushgateway.

Also the default Prometheus pushgateway client API implementation does not support metrics timestamp thus the client API has been ehanced to enrich metrics data with timestamp.

How was this patch tested?

This PR is not affecting the existing code base and not altering the functionality. Nevertheless, I have executed all unit and integration tests. Also this setup has been deployed with Helm and running for a few weeks now for several long and short-running workloads on a K8S cluster (on AWS) and been monitored via Prometheus (Prometheus 1.7.1 + Pushgateway 0.3.1).

Manual testing through deploying a Spark cluster, Prometheus server, Pushgateway and ran SparkPi.

tnachen commented 6 years ago

This looks like should be submitted to upstream apache git repo instead.

foxish commented 6 years ago

Thanks for the PR @matyix. I agree with @tnachen, this doesn't seem k8s specific and might as well be an upstream PR

matyix commented 6 years ago

Hello @tnachen and @foxish - reason I submitted here is that Prometheus is usually the default monitoring solution on Kubernetes and this is more welcomed/useful by/for the K8S community. Also I had the assumption that the PR is good to land here, as suggested by @foxish and @kimoonkim in the original issue I opened a while ago #384.

Additionally this is more convenient for me as I am using the Spark-on-K8S fork only and all the tests, etc I did was on K8S (as Prometheus, Pushgateway, etc).

Is there any chance it can land here? Ultimately this fork (hopefully) will be merged upstream so the Prometheus based monitoring will be available for all Spark users.

foxish commented 6 years ago

I'm okay with letting it rest here for the time being, as long as we have a plan to upstream it along with the other parts.

matyix commented 6 years ago

Yes @foxish I totally agree with you, makes only sense if this gets upstreamed. Meanwhile will fix the checkstyle violation which causes the integration test to fail.

ash211 commented 6 years ago

Yep, since this is orthogonal to the k8s changes made to Spark this should ideally go upstream. @matyix can you please open an issue upstream at https://issues.apache.org/jira/browse/SPARK so we can make prometheus available to all Spark users, not just k8s users?

matyix commented 6 years ago

Hello @ash211 - I've created the issue in the Apache Jira - https://issues.apache.org/jira/browse/SPARK-22343 - please let me know what are the next steps.

aloiscochard commented 6 years ago

Hello @matyix, this is a great work!

I wanted to use it, but our pushgateway use https could you make this a parameter? that would be awesome.

Thanks in advance.

aloiscochard commented 6 years ago

FYI: I did "monkey patch" by simply replacing the two hardcoded occurrence of the "http" string with "https" and it works!

matyix commented 6 years ago

Hello @aloiscochard - I have updated the PR with support for changing protocols.

@ash211 @foxish @tnachen - the check-style issue is fixed as well, so now all checks are passing. Let me know if anything else needed to have the PR merged.

erikerlandson commented 6 years ago

I'm a bit confused from the thread above - I agree with the philosophy that this ought to go upstream independent from spark-on-kube. If so, I'd suggest that this PR be transferred to be directly against apache/spark instead, referencing SPARK-22343

matyix commented 6 years ago

@erikerlandson @foxish I've pushed the same PR (without K8S specifics upstream), please find it here.