apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Docker image jars should be automatically added #222

Open mccheah opened 7 years ago

mccheah commented 7 years ago

Currently, when a custom Docker image is provided with an application's jars already on it, the user has to explicitly specify the location of those jars in the image, using paths with the local:// URI scheme. In practice this is redundant: multiple users submitting the same application all have to specify the exact same set of URIs, and each of them has to know where the jars live in the Docker image.
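
For concreteness, a submission against such a custom image today looks roughly like the sketch below; the image names, jar paths, and Kubernetes image config keys are illustrative rather than exact:

bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<apiserver-host>:<port> \
  --conf spark.kubernetes.driver.docker.image=registry.example.com/my-app-driver:latest \
  --conf spark.kubernetes.executor.docker.image=registry.example.com/my-app-executor:latest \
  --jars local:///app/jars/dep-a.jar,local:///app/jars/dep-b.jar \
  --class com.example.MyApp \
  local:///app/jars/my-app.jar

Every user submitting this application has to repeat the same local:// list, and has to know that the image happens to put the jars under /app/jars.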

One idea is for the docker images to support the presence of an environment variable, say SPARK_EXTRA_CLASSPATH. The environment variable could be set on both the driver and the executor images. When the variable is set, the CMD of the base Spark image would add entries from SPARK_EXTRA_CLASSPATH to the driver and executor classpath.

For example, the base driver docker image that we can provide could have this:

CMD java $DRIVER_JAVA_OPTIONS -cp /opt/spark/jars/*:$SPARK_EXTRA_CLASSPATH $USER_CLASS $USER_ARGS

(spark-submit provides everything except SPARK_EXTRA_CLASSPATH), and a custom Docker image built on top of it could have this:

FROM kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1
ENV SPARK_EXTRA_CLASSPATH /app/jars/*
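
With that override in place, the base image's CMD above would effectively expand on the driver to something like the following (the user class and arguments are placeholders):

java $DRIVER_JAVA_OPTIONS -cp /opt/spark/jars/*:/app/jars/* com.example.MyApp <user args>

so the application's jars are picked up without any local:// URIs being passed at submit time.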

We will be changing the structure of the Docker images when submission is redone with the submission staging server, so we ought to include this change in that iteration as well.

mccheah commented 7 years ago

@foxish @erikerlandson @ash211 as discussed yesterday.

mccheah commented 7 years ago

One nuance is that a user might want to include jars only in the driver Docker image and have them shipped over to the executors, but we can assume the simplest case for now. This is effectively allowing spark.driver.extraClassPath and spark.executor.extraClassPath to be provided via the Docker image.
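
For comparison, this is roughly what a user achieves today by passing the standard classpath properties at submit time (the jar directory is illustrative):

--conf spark.driver.extraClassPath=/app/jars/*
--conf spark.executor.extraClassPath=/app/jars/*

Baking the equivalent SPARK_EXTRA_CLASSPATH setting into the image would save every submitter from repeating these flags.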