apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

[WIP] Submission client redesign to use a step-based builder pattern #363

Closed mccheah closed 7 years ago

mccheah commented 7 years ago

Applies changes assuming the PySpark support from #364 is present.

This change overhauls the underlying architecture of the submission client, but it is intended to preserve the existing behavior of Spark applications entirely. Users should therefore find this to be an invisible change.

The philosophy behind this design is to reconsider the breakdown of the submission process. It is built around the abstraction of "submission steps": transformation functions that take the previous state of the driver and return the new state of the driver. The driver's state includes its Spark configuration and the Kubernetes resources that will be used to deploy it.
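A minimal sketch of that abstraction, under assumed names: `KubernetesSubmissionStep` is the name used later in this discussion, while `KubernetesDriverSpec` and its fields are illustrative approximations rather than the PR's exact code.

```scala
import io.fabric8.kubernetes.api.model.{Container, HasMetadata, Pod}
import org.apache.spark.SparkConf

// The driver's state: the pod and container being built, any additional
// Kubernetes resources to deploy alongside them, and the accumulated
// Spark configuration.
case class KubernetesDriverSpec(
    driverPod: Pod,
    driverContainer: Container,
    otherKubernetesResources: Seq[HasMetadata],
    driverSparkConf: SparkConf)

// A submission step is a transformation function: it takes the previous
// state of the driver and returns the new state of the driver.
trait KubernetesSubmissionStep {
  def configureDriver(driverSpec: KubernetesDriverSpec): KubernetesDriverSpec
}
```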

Such a refactor moves away from a features-first API design, in which different components each serve a set of features. The previous design, for example, had a container files resolver API object that returned different resolutions of the dependencies added by the user. However, it was up to the main Client to know how to invoke all of those APIs intelligently. As a result, the API surface area of the file resolver became untenably large, and it was not intuitive how it was meant to be used or extended.

This design changes the encapsulation layout: every module is now responsible for changing the driver specification directly. An orchestrator builds the correct chain of steps and hands it to the client, which applies the chain verbatim. The main client then makes any final modifications that put the different pieces of the driver together, in particular attaching the driver container itself to the pod and applying the Spark configuration as command-line arguments.
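A hedged sketch of that flow, building on the trait above; the orchestrator and client names and method signatures here are assumptions for illustration, not the PR's actual API.

```scala
import org.apache.spark.SparkConf

// Builds the ordered chain of steps appropriate for this submission.
class SubmissionStepsOrchestrator(sparkConf: SparkConf) {
  def getAllSteps(): Seq[KubernetesSubmissionStep] = {
    // ... inspect the submission (e.g. sparkConf) and return the applicable
    // steps in order; elided in this sketch ...
    Seq.empty
  }
}

// The client applies the chain verbatim, threading the spec through it.
class Client(steps: Seq[KubernetesSubmissionStep]) {
  def buildDriverSpec(initialSpec: KubernetesDriverSpec): KubernetesDriverSpec = {
    // Each step receives the previous driver state and returns the new one.
    val resolvedSpec = steps.foldLeft(initialSpec) { (spec, step) =>
      step.configureDriver(spec)
    }
    // Final assembly (attaching the driver container to the pod and applying
    // the SparkConf as command-line arguments) would follow here.
    resolvedSpec
  }
}
```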

The current steps are:

  1. BaseSubmissionStep: Baseline configurations such as the Docker image and resource requests.
  2. DriverKubernetesCredentialsStep: Resolves Kubernetes credentials configuration in the driver pod. Mounts a secret if necessary.
  3. InitContainerBootstrapStep: Attaches the init-container, if necessary, to the driver pod. This step is optional and won't be loaded if all URIs are "local" or there are no URIs at all.
  4. DependencyResolutionStep: Sets the classpath, spark.jars, and spark.files properties. This step is not fully isolated, since it assumes that remote or locally submitted files will be downloaded to a given location. Unit tests should verify that this contract holds.
  5. PythonStep: Configures Python environment variables if using PySpark (a sketch of such a step follows this list).
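
As an illustration of how small an individual step can be, here is a sketch loosely modeled on the PythonStep in item 5; the constructor fields and environment-variable names are assumptions for the sketch, not the PR's exact code.

```scala
import io.fabric8.kubernetes.api.model.ContainerBuilder

class PythonStep(primaryPyFile: String, otherPyFiles: Seq[String])
    extends KubernetesSubmissionStep {

  override def configureDriver(driverSpec: KubernetesDriverSpec): KubernetesDriverSpec = {
    // Add PySpark environment variables to the driver container, leaving
    // the rest of the driver spec untouched.
    val withPythonEnv = new ContainerBuilder(driverSpec.driverContainer)
      .addNewEnv()
        .withName("PYSPARK_PRIMARY") // hypothetical variable name
        .withValue(primaryPyFile)
        .endEnv()
      .addNewEnv()
        .withName("PYSPARK_FILES") // hypothetical variable name
        .withValue(otherPyFiles.mkString(","))
        .endEnv()
      .build()
    driverSpec.copy(driverContainer = withPythonEnv)
  }
}
```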
mccheah commented 7 years ago

Unit tests are being rewritten. This passes integration tests in my development environment.

ifilonenko commented 7 years ago

Thanks for this! (y) @ash211 @foxish PTAL

ifilonenko commented 7 years ago

I think this refactor promotes a reasonably structured architecture that we can easily adapt and introduce to future developers. The ability to just write a KubernetesSubmissionStep to apply a new modification to the driver pod, container, resources, etc. makes extension seemingly painless. The timing is especially good given the integration of secure HDFS, where pod modifications will be important; having that work offloaded to a step wrapped in an Option() [ depending on whether it is necessary ] seems quite organized.
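
For instance, an optional step along these lines might be wired into the orchestrator as sketched below; `HadoopSecurityStep`, the config key, and `baseSteps` (the mandatory steps) are all hypothetical names for illustration.

```scala
// Include the secure-HDFS step only when it is needed; None simply means
// the step is absent from the chain.
val maybeHadoopStep: Option[KubernetesSubmissionStep] =
  if (sparkConf.getBoolean("spark.kubernetes.hdfs.secure", false)) { // hypothetical key
    Some(new HadoopSecurityStep(sparkConf)) // hypothetical step
  } else {
    None
  }

val allSteps: Seq[KubernetesSubmissionStep] = baseSteps ++ maybeHadoopStep.toSeq
```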

mccheah commented 7 years ago

This was not supposed to merge yet, unfortunately - we'll be working on a revert on the base branch and re-opening this in a new push.