apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Python Bindings for launching PySpark Jobs from the JVM (v1) #351

Closed ifilonenko closed 7 years ago

ifilonenko commented 7 years ago

What changes were proposed in this pull request?

This pull request proposes the following changes:

  1. Separate docker images, built on top of spark-base, for PySpark jobs. These images differ in the inclusion of Python and PySpark-specific environment variables. The entry point for driver-py also differs, since you must include the location of the primary PySpark file and the distributed py-files in addition to the driver args.
  2. A new FileMountingTrait that is generic enough for both SparkR and PySpark to handle passing the proper arguments to PythonRunner and RRunner. This file mounter uses the filesDownloadPath resolved in the DriverInitComponent to ensure that file paths are correct. These file paths are stored as environment variables mounted on the driver pod.
  3. Integration tests for PySpark (TODO: build an environment identical to the distribution Python environment to run the tests).
  4. Match statements to account for varying arguments and malformed inputs, which may include nulls or a mix of local:// and file:// URIs.
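As a rough illustration of item 4, here is a minimal Python sketch of the URI-resolution logic (the actual implementation is Scala match statements in the submission client; the function name and the default download path below are hypothetical):

```python
from urllib.parse import urlparse

def resolve_file_path(uri, files_download_path="/var/spark-data/spark-files"):
    """Return the container-local path for a submitted file URI.

    local://  -> the file is already baked into the docker image.
    file://   -> the file is staged, then downloaded to files_download_path.
    bare path -> treated like file://.
    """
    if uri is None:
        raise ValueError("primary resource must not be null")
    parsed = urlparse(uri)
    name = parsed.path.rsplit("/", 1)[-1]
    if parsed.scheme == "local":
        return parsed.path
    elif parsed.scheme in ("file", ""):
        return f"{files_download_path}/{name}"
    else:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
```

The resolved paths are what end up in the environment variables mounted on the driver pod.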

Example Spark Submit

This is an example spark-submit invocation that uses the custom PySpark docker images and distributes the staged sort.py file across the cluster. The resulting entry point for the driver is: org.apache.spark.deploy.PythonRunner <FILE_DOWNLOADS_PATH>/pi.py <FILE_DOWNLOADS_PATH>/sort.py 100

bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://<k8s-api-url> \
  --kubernetes-namespace default \
  --conf spark.executor.memory=500m \
  --conf spark.driver.memory=1G \
  --conf spark.driver.cores=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.instances=1 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=spark-driver-py:latest \
  --conf spark.kubernetes.executor.docker.image=spark-executor-py:latest \
  --conf spark.kubernetes.initcontainer.docker.image=spark-init:latest \
  --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.99.100:31000 \
  --py-files examples/src/main/python/sort.py \
  examples/src/main/python/pi.py 100
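To make the mapping from the submit arguments to the driver entry point concrete, here is a hedged Python sketch of how the PythonRunner argument list might be assembled once the staged files land in the download directory (the function name is hypothetical, and the download path is an illustrative placeholder for <FILE_DOWNLOADS_PATH>):

```python
def python_runner_invocation(primary, py_files, app_args, downloads_path):
    """Build the argument list the driver container ultimately runs:
    PythonRunner <primary> <py-files...> <application args...>."""
    def downloaded(path):
        # Staged files are fetched into downloads_path, keeping the file name.
        return f"{downloads_path}/{path.rsplit('/', 1)[-1]}"
    return (["org.apache.spark.deploy.PythonRunner", downloaded(primary)]
            + [downloaded(f) for f in py_files]
            + list(app_args))

# For the submit above, with downloads_path standing in for <FILE_DOWNLOADS_PATH>:
# org.apache.spark.deploy.PythonRunner <FILE_DOWNLOADS_PATH>/pi.py \
#     <FILE_DOWNLOADS_PATH>/sort.py 100
```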

How was this patch tested?

This was fully tested by building a distribution with make-distribution.sh and running on a local minikube cluster with a single executor. The following is an example build-and-submit session:

$ build/mvn install -Pkubernetes -pl resource-managers/kubernetes/core -am -DskipTests
$ build/mvn compile -T 4C -Pkubernetes -pl resource-managers/kubernetes/core -am -DskipTests
$ dev/make-distribution.sh --pip --tgz -Phadoop-2.7 -Pkubernetes
$ tar -xvf spark-2.1.0-k8s-0.2.0-SNAPSHOT-bin-2.7.3.tgz
$ cd spark-2.1.0-k8s-0.2.0-SNAPSHOT-bin-2.7.3
$ minikube start --insecure-registry=localhost:5000 --cpus 8 --disk-size 20g --memory 8000 --kubernetes-version v1.5.3; eval $(minikube docker-env)
$ # Build all docker images using docker build ....
$ # Make sure staging server is up 
$ kubectl cluster-info
Kubernetes master is running at https://192.168.99.100:8443
KubeDNS is running at https://192.168.99.100:8443/api/v1/proxy/namespaces/kube-system/services/kube-dns
kubernetes-dashboard is running at https://192.168.99.100:8443/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard
$ docker images
REPOSITORY                                          
spark-integration-test-asset-server                 
spark-init                                           
spark-resource-staging-server                         
spark-shuffle                                      
spark-executor-py                                    
spark-executor                                      
spark-driver-py                                      
spark-driver                                        
spark-base                                         
$ bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://192.168.99.100:8443 \
  --kubernetes-namespace default \
  --conf spark.executor.memory=500m \
  --conf spark.driver.memory=1G \
  --conf spark.driver.cores=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.instances=1 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=spark-driver-py:latest \
  --conf spark.kubernetes.executor.docker.image=spark-executor-py:latest \
  --conf spark.kubernetes.initcontainer.docker.image=spark-init:latest \
  --conf spark.kubernetes.resourceStagingServer.uri=http://192.168.99.100:31000 \
  --py-files examples/src/main/python/sort.py \
  local:///opt/spark/examples/src/main/python/pi.py 100

Integration and Unit tests have been added.

Future Versions of this PR

  * Launching the JVM from Python (log issue)
  * MemoryOverhead testing (OOMKilled errors)

mccheah commented 7 years ago

👍 This is on my radar to look at. Thanks a lot for submitting this.

erikerlandson commented 7 years ago

IIRC, you had concerns about additional container size - but should we consider folding the python-specific images into the standard spark images, to avoid the need for specifying special images? What is the size impact?

ifilonenko commented 7 years ago

@erikerlandson That could be something to look into. The Python environment that I am loading in roughly doubles the size of the driver image, from the original 258 MB to 573 MB.
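For context, a minimal sketch of how such a Python layer might sit on top of spark-base (the actual Dockerfiles in this PR may differ; the package manager, package names, and paths below are assumptions, not the real image definition):

```dockerfile
# Hypothetical sketch only; the real spark-driver-py Dockerfile may differ.
FROM spark-base

# A full Python runtime plus pip accounts for the bulk of the extra
# ~315 MB discussed above (258 MB base vs 573 MB with Python).
RUN apk add --no-cache python py-pip

# Make the bundled PySpark libraries importable inside the container.
ENV PYTHONPATH ${SPARK_HOME}/python:${SPARK_HOME}/python/lib
```

Folding this layer into the standard images would avoid the separate -py image names at the cost of that size increase for all users.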

foxish commented 7 years ago

Really excited to see this. Thanks @ifilonenko!

kimoonkim commented 7 years ago

Yes, an awesome contribution! @ifilonenko

ifilonenko commented 7 years ago

rerun integration test please

mccheah commented 7 years ago

rerun integration tests please

ifilonenko commented 7 years ago

rerun integration tests please

ifilonenko commented 7 years ago

rerun unit tests please

ifilonenko commented 7 years ago

@erikerlandson @mccheah PTAL

ifilonenko commented 7 years ago

rerun unit tests please

ifilonenko commented 7 years ago

rerun unit tests please

ifilonenko commented 7 years ago

PR moved to https://github.com/apache-spark-on-k8s/spark/pull/364