apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Reset many spark.kubernetes.* configurations when restarting from a Checkpoint #516

Closed: ssaavedra closed this 6 years ago

ssaavedra commented 7 years ago

What changes were proposed in this pull request?

Due to the changes made in spawning the service attached to the driver pod, the spark.driver.bindAddress property now needs to be reset when restarting a workload. In addition, many spark.kubernetes.* properties change with each spark-submit, because of how the ConfigMap, secrets, and other related resources get uploaded and resolved. This change enables checkpoint restoration for streaming workloads.
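
A minimal sketch of the reload idea, roughly in the spirit of Spark Streaming's Checkpoint.createSparkConf: properties that describe the current submission are re-read from the new environment instead of keeping the checkpointed values. The property list and helper names below are illustrative, not the exact ones touched by this PR.

```scala
import org.apache.spark.SparkConf

object CheckpointConfReload {
  // Properties that must reflect the current run, not the checkpointed one.
  // The exact list used by Spark differs; this one is illustrative.
  val propertiesToReload: Seq[String] = Seq(
    "spark.driver.host",
    "spark.driver.bindAddress",
    "spark.driver.port",
    "spark.master")

  def createSparkConf(checkpointedPairs: Seq[(String, String)]): SparkConf = {
    // Start from the properties saved in the checkpoint...
    val restored = new SparkConf(loadDefaults = false).setAll(checkpointedPairs)
    // ...then overwrite the ones that must come from the new submission
    // (picked up from the system properties that spark-submit sets).
    val current = new SparkConf(loadDefaults = true)
    propertiesToReload.foreach { key =>
      current.getOption(key).foreach(value => restored.set(key, value))
    }
    restored
  }
}
```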

How was this patch tested?

This patch was tested with the twitter-streaming example on AWS, using checkpoints stored in S3 via the s3a:// protocol, as supported by Hadoop.
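
For context, here is a minimal sketch of the kind of streaming workload exercised by this test, with a hypothetical bucket name and batch interval; StreamingContext.getOrCreate is what triggers checkpoint restoration on restart.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedStreamingApp {
  // Hypothetical checkpoint location; the real test used the twitter-streaming example.
  val checkpointDir = "s3a://my-bucket/spark/checkpoints/streaming-demo"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("streaming-checkpoint-demo")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define the streaming computation here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate restores the context (and its SparkConf)
    // from the checkpoint instead of calling createContext() again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```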

mccheah commented 7 years ago

How does this impact non-K8s jobs?

ifilonenko commented 7 years ago

Shouldn't this be a PR for Spark Core in the main repo? How is this k8s specific?

ssaavedra commented 7 years ago

This was not an issue in Kubernetes before either, because no services were involved. When spark.driver.bindAddress is not set, it defaults to the value of spark.driver.host. I am trying to run a streaming example reliably, and this is a regression I found between v0.3.0 and v0.4.0.
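
A small sketch of that fallback (the real resolution happens inside Spark's config handling, not in user code; the helper name is hypothetical):

```scala
import org.apache.spark.SparkConf

object BindAddressFallback {
  // bindAddress falls back to spark.driver.host, which in the k8s back-end
  // points at the driver service and therefore changes across submissions.
  def resolveBindAddress(conf: SparkConf): String =
    conf.getOption("spark.driver.bindAddress")
      .orElse(conf.getOption("spark.driver.host"))
      .getOrElse("localhost") // Spark itself would fall back to the local hostname
}
```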

mccheah commented 7 years ago

Can this be fixed in a way that is specific to Kubernetes?

ssaavedra commented 7 years ago

The first version of this PR indeed touched code in Spark core rather than the Kubernetes integration. However, after digging deeper, these Kubernetes-related properties also need to be reloaded.

The upstream issue with bindAddress is https://issues.apache.org/jira/browse/SPARK-22294. There is also the start of a discussion on whether to propose a more general framework in https://github.com/apache/spark/pull/19469.

ssaavedra commented 6 years ago

Is anyone reviewing this?

foxish commented 6 years ago

Sorry @ssaavedra, most of us are busy with the ongoing upstreaming effort. @mccheah @ifilonenko do you have some cycles to review this?

ifilonenko commented 6 years ago

Can an integration test be added?

ssaavedra commented 6 years ago

Sorry, I haven't had time to spin up my testing environment lately.

@mccheah that will not be enough. I tried a build with those changes and, as I expected, it did not work. The reason is that the spark-submit process publishes a new ConfigMap in Kubernetes (with a different name than the first one). When the driver pod starts in the cluster, it first restores the data from the checkpoint (thereby erasing the SparkConf properties set by the latest spark-submit and restoring the original ones, except those explicitly discarded by this pull request) and only then starts the Spark Context. With the new Spark Context up, executors are brought up, but they are configured with the saved SparkConf properties and therefore look for the wrong ConfigMap name (see the sketch below).
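
A sketch of the prefix-based reset this argues for, with assumed helper names: any spark.kubernetes.* value recovered from the checkpoint is replaced by the value from the new spark-submit, so stale ConfigMap or secret names never reach the executors.

```scala
import org.apache.spark.SparkConf

object KubernetesConfReset {
  // Hypothetical helper illustrating the idea; the real change hooks into
  // the checkpoint's SparkConf reconstruction rather than a standalone function.
  def dropStaleKubernetesProps(restored: SparkConf, current: SparkConf): SparkConf = {
    // Drop every spark.kubernetes.* property recovered from the checkpoint
    // (ConfigMap name, secrets, driver service settings, ...).
    restored.getAll
      .filter { case (key, _) => key.startsWith("spark.kubernetes.") }
      .foreach { case (key, _) => restored.remove(key) }
    // Re-apply whatever the new submission set for those keys.
    current.getAll
      .filter { case (key, _) => key.startsWith("spark.kubernetes.") }
      .foreach { case (key, value) => restored.set(key, value) }
    restored
  }
}
```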

So we do need to make this change. I'm attaching a screenshot of what happens with your proposed changes.

As for an integration test of this caliber, I don't know how to set up the machinery for it. I could take a look, but I think this would also be relevant to the upstreaming process, so that streaming contexts can be resumed when using Kubernetes as the cluster manager.

[Screenshot attached: screenshot from 2018-01-02 11-07-55]

ssaavedra commented 6 years ago

Ping?

foxish commented 6 years ago

@ssaavedra, this looks good in general, but we're planning a major rebase of this fork onto the upstream apache/spark work, after which we can merge. For now, we could propose it upstream. I just filed https://issues.apache.org/jira/browse/SPARK-23200; can you propose a PR there under apache/spark? We'll get some eyes on it that way.

ssaavedra commented 6 years ago

I'm closing this, since it's already upstream.