apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Submission client redesign to use a step-based builder pattern #365

Closed: mccheah closed this 7 years ago

mccheah commented 7 years ago

Applies changes assuming PySpark is present from #364.

This change overhauls the underlying architecture of the submission client, but it is intended to entirely preserve existing behavior of Spark applications. Therefore users will find this to be an invisible change.

The philosophy behind this design is to reconsider the breakdown of the submission process. It is built around the abstraction of "submission steps": transformation functions that take the previous state of the driver and return its new state. The driver's state includes its Spark configuration and the Kubernetes resources that will be used to deploy it.
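
Concretely, a minimal Scala sketch of this abstraction might look like the following. The type names (`KubernetesDriverSpec`, `DriverConfigurationStep`) and the use of the fabric8 model classes are illustrative assumptions, not necessarily the exact types in this PR:

```scala
import io.fabric8.kubernetes.api.model.{Container, HasMetadata, Pod}

// Illustrative driver state: the pod and container being built up, any
// additional Kubernetes resources to create alongside them, and the
// resolved Spark properties.
case class KubernetesDriverSpec(
    driverPod: Pod,
    driverContainer: Container,
    otherKubernetesResources: Seq[HasMetadata],
    driverSparkConf: Map[String, String])

// A submission step is a transformation function: it takes the previous
// state of the driver and returns the new state.
trait DriverConfigurationStep {
  def configureDriver(previousSpec: KubernetesDriverSpec): KubernetesDriverSpec
}
```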

Such a refactor moves away from a features-first API design, in which different components are organized around the set of features they serve. The previous design, for example, had a container files resolver API object that returned different resolutions of the dependencies added by the user; however, it was up to the main Client to know how to intelligently invoke all of those APIs. As a result, the API surface area of the file resolver became untenably large, and it was not intuitive how it should be used or extended.

This design changes the encapsulation layout; every module is now responsible for changing the driver specification directly. An orchestrator builds the correct chain of steps and hands it to the client, which then calls it verbatim. The main client then makes any final modifications that put the different pieces of the driver together, particularly to attach the driver container itself to the pod and to apply the Spark configuration as command-line arguments.
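
Continuing the sketch above, the orchestrator/client split could be expressed as a fold over the step chain. The class and method names here are hypothetical:

```scala
// Hypothetical orchestrator: decides which steps apply to this submission
// and returns them in order. The client never inspects individual steps.
class DriverConfigurationStepsOrchestrator(steps: Seq[DriverConfigurationStep]) {
  def getAllConfigurationSteps(): Seq[DriverConfigurationStep] = steps
}

// The client applies the chain verbatim. Final assembly (attaching the
// driver container to the pod, passing the Spark conf as command-line
// arguments) would happen after the fold.
class Client(orchestrator: DriverConfigurationStepsOrchestrator) {
  def run(initialSpec: KubernetesDriverSpec): KubernetesDriverSpec = {
    orchestrator.getAllConfigurationSteps()
      .foldLeft(initialSpec) { (spec, step) => step.configureDriver(spec) }
  }
}
```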

The current steps are as follows (a sketch of how they might be assembled appears after the list):

  1. BaseSubmissionStep: Baseline configurations such as the Docker image and resource requests.
  2. DriverKubernetesCredentialsStep: Resolves Kubernetes credentials configuration in the driver pod. Mounts a secret if necessary.
  3. InitContainerBootstrapStep: Attaches the init-container, if necessary, to the driver pod. This is optional and won't be loaded if all URIs are "local" or there are no URIs at all.
  4. DependencyResolutionStep: Sets the classpath, spark.jars, and spark.files properties. This step is not fully isolated, since it assumes that remote or locally submitted files will have been downloaded to a given location. Unit tests should verify that this contract holds.
  5. PythonStep: Configures Python environment variables if using PySpark.
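
As a rough illustration of how the optional steps might be selected, the sketch below assembles the chain. The `local:` scheme check, the parameter names, and the `buildSteps` helper are assumptions based on the descriptions above, not the PR's exact logic:

```scala
// Hypothetical assembly of the step chain. The init-container step is
// included only when some jar/file URI actually needs downloading, and
// the Python step only for PySpark applications.
def buildSteps(
    baseStep: DriverConfigurationStep,
    credentialsStep: DriverConfigurationStep,
    initContainerStep: DriverConfigurationStep,
    dependencyStep: DriverConfigurationStep,
    pythonStep: DriverConfigurationStep,
    sparkJars: Seq[String],
    sparkFiles: Seq[String],
    isPySpark: Boolean): Seq[DriverConfigurationStep] = {
  val allUris = sparkJars ++ sparkFiles
  // "local" URIs already live on the Docker image, so the init-container
  // is skipped when every URI is local or there are no URIs at all.
  val needsInitContainer = allUris.exists(uri => !uri.startsWith("local:"))
  Seq(baseStep, credentialsStep) ++
    (if (needsInitContainer) Seq(initContainerStep) else Nil) ++
    Seq(dependencyStep) ++
    (if (isPySpark) Seq(pythonStep) else Nil)
}
```
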
mccheah commented 7 years ago

Rebuilt from #363.

mccheah commented 7 years ago

Conflicts are likely from the style changes in the base PR. I expect that unless we have significant functional deviations in the Python implementation, we can resolve most if not all of the conflicts by just taking this branch.

mccheah commented 7 years ago

I resolved the merge conflicts with the "ours" strategy, since the only differences in the latest push that caused the conflicts were style changes. I'll rebase this branch shortly so that it also points to branch-2.1-kubernetes.

mccheah commented 7 years ago

rerun unit tests please

mccheah commented 7 years ago

rerun unit tests please

mccheah commented 7 years ago

Still have a few unit tests to finish, but the idea is there now. Sorry the review is so large; I thought about how it could be broken down into smaller parts, but it's hard to have the old implementation coexist with a partial new one.

mccheah commented 7 years ago

rerun integration tests please