apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0
612 stars 118 forks source link

Support MetricNameSpace in executors and include shuffle to shuffleservice metrics #532

Closed matyix closed 6 years ago

matyix commented 6 years ago

What changes were proposed in this pull request?

Support MetricNameSpace in shuffle to shuffleservice metrics.

Pass spark.metrics.namespace configuration to shuffle such as the shuffleservice can publish metrics with custom namespace. In case of shuffleservice the shuffle prefix is used in the metrics.

How was this patch tested?

Executed all unit and integration tests.

Manual testing through deploying a Spark cluster, Prometheus server, Pushgateway and ran SparkPi - and checking for the shuffle metrics in Pushgateway/Grafana.

liyinan926 commented 6 years ago

Unit test build for this PR seems stuck for almost 3 days.

foxish commented 6 years ago

@ssuchter @kimoonkim, should the unit test build timeout eventually?

liyinan926 commented 6 years ago

Some test failed due to OOM. Can someone who has admin access to the Jenkins instance kill the build?

kimoonkim commented 6 years ago

@foxish I think we can time out the Jenkins jobs, say, after 8 hours. I'll see if i can make this change to Jenkins jobs.

@liyinan926 Yes, let me find and kill the hanging build.

kimoonkim commented 6 years ago

I just checked. The build was killed already by Jenkins. The build log console says:

...
Build was aborted
Aborted by Jenkins Admin
kimoonkim commented 6 years ago

Hmm, also the unit test had set up timeout already. The config page says:

Abort the build if it's stuck
Time-out strategy Absolute
    Timeout minutes: 60

I guess it doesn't always work :-(

cvpatel commented 6 years ago

@liyinan926 @kimoonkim @foxish Sorry forgot to update, but I killed the test earlier and added a 60 minute timeout to the configuration around noon today.

cvpatel commented 6 years ago

rerun all tests please

mccheah commented 6 years ago

I'm a little confused as to what this property actually does. I see the configuration of spark.metrics.namespace being set, but I don't see it being used in this pull request itself?

matyix commented 6 years ago

Hello @mccheah - thanks for the feedback and review. The starting base for the PR was an earlier version and the custom metrics namespace in executors was missing back then. I have reverted the changes and updated the PR text as well reflecting the latest commit.

erikerlandson commented 6 years ago

Is the idea here to backport the spark support for custom metric namespaces to this fork? I ask because I'd expect this to be resolved either via the upstreaming process or via our next rebase.

matyix commented 6 years ago

@erikerlandson After rebasing with latest branch-2.2-kubernetes the original PR has reduced the scope to support custom metrics namespace for external shuffle service only (missing from master).

erikerlandson commented 6 years ago

@matyix Oh I see - I was referring to the next time we rebase against an upstream release (spark-2.2.1 or spark-2.3, etc)

matyix commented 6 years ago

@erikerlandson shall I push this PR upstream as well?

matyix commented 6 years ago

Closing this as it's redundant, this PR https://github.com/apache/spark/pull/19775 fixes this one as well.