apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Bypass init-containers if `spark.jars` and `spark.files` are empty or contain only `local://` URIs #338

Closed mccheah closed 7 years ago

mccheah commented 7 years ago

If all dependencies are installed in the Docker images, the init-container will have no work to do, so we shouldn't run it.
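
A minimal sketch of the bypass check being proposed here. The helper name `needsInitContainer` and the plain `Seq[String]` inputs are illustrative assumptions, not the actual patch:

```scala
import java.net.URI

object InitContainerCheck {
  // Hypothetical helper (not the actual patch): an init-container is only
  // needed if some submitted jar or file must be fetched at pod startup.
  // URIs with the local:// scheme point at files already baked into the
  // Docker image, so they require no download step.
  def needsInitContainer(sparkJars: Seq[String], sparkFiles: Seq[String]): Boolean =
    (sparkJars ++ sparkFiles).exists { uriString =>
      // A missing scheme is treated as a dependency that must be staged.
      Option(URI.create(uriString).getScheme).forall(_ != "local")
    }
}
```

For example, `needsInitContainer(Seq("local:///opt/spark/jars/app.jar"), Nil)` returns `false`, so the submission client could skip attaching the init-container entirely.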

chenchun commented 7 years ago

+1 for this. Adding an init-container may slow down pod launching. What about making the driver container download all dependencies itself?

mccheah commented 7 years ago

@chenchun the preferable design is to use the init-container because it allows the driver container to be completely generic. Or to put it another way (see the sketch after this list):

1. Users who write custom Docker images don't need to call a specific class; they can just run the application main class directly.
2. When we write multiple driver runtime implementations (e.g. Python), they can all share the same init-container without modifying their commands.
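
To illustrate the second point, here is a hedged sketch using the fabric8 kubernetes-client builders (which this project uses for pod construction). The image names and exact wiring are assumptions, and it presumes a client/cluster version that exposes `initContainers` directly on the pod spec rather than via the older beta annotation:

```scala
import io.fabric8.kubernetes.api.model.{ContainerBuilder, PodBuilder}

// The same generic init-container can be attached to any driver runtime
// (JVM, Python, ...) because only the main container's image changes.
val dependencyFetcher = new ContainerBuilder()
  .withName("spark-init")                   // shared across all runtimes
  .withImage("spark-init:latest")           // assumed image name
  .withArgs("/etc/spark-init/spark-init.properties")
  .build()

val driverPod = new PodBuilder()
  .withNewMetadata()
    .withName("spark-driver")
  .endMetadata()
  .withNewSpec()
    .withInitContainers(dependencyFetcher)  // runs before the driver starts
    .addNewContainer()
      .withName("spark-kubernetes-driver")
      .withImage("spark-driver:latest")     // assumed image name
    .endContainer()
  .endSpec()
  .build()
```

Swapping in a Python driver image would change only the main container; the init-container and its arguments stay identical.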

We can look into improving Kubernetes itself in terms of speeding up init-container execution. When nodes cache the init-container Docker image, performance will improve on subsequent runs, but there is still the overhead of starting the container itself.