apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Support multiple directories for spark.local.dir #417

Open weiting-chen opened 7 years ago

weiting-chen commented 7 years ago

This is a feature request to support multiple directories in spark.local.dir. spark.local.dir uses "/tmp" as the default setting (link). In Spark-on-YARN, most customers configure multiple directories for spark.local.dir, which helps improve performance by spreading I/O across disks. In Spark-on-K8s, it points to the root volume and cannot be modified by default. Since Spark-on-K8s runs applications and creates containers on request, this feature must also create storage (PVs) on request and configure the directories in the Spark conf before the Spark application launches.
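For comparison, here is a minimal sketch of how multiple local directories are usually configured in a Spark application (the mount paths below are placeholders, not taken from this issue; spark.local.dir accepts a comma-separated list):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Spark spreads shuffle and spill files across every directory listed in
// spark.local.dir; the paths here are hypothetical mount points on two disks.
val conf = new SparkConf()
  .setAppName("multi-local-dir-example")
  .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")

val spark = SparkSession.builder().config(conf).getOrCreate()
```

The ask in this issue is for the Kubernetes back-end to provision equivalent per-executor storage and wire it into this setting automatically.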

There is one related feature implementation (https://github.com/kubernetes/features/issues/121) on the Kubernetes side. We may need to wait for that feature to be implemented.

ash211 commented 7 years ago

@weiting-chen do you need PV storage specifically, or would the EmptyDir from https://github.com/apache-spark-on-k8s/spark/pull/486 work for you?

I don't think you need persistence in static allocation mode, and dynamic allocation requires an external shuffle service, which stores data in spark.kubernetes.shuffle.dir, not in spark.local.dir.
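A rough sketch of the two configuration paths being contrasted here (the directory values are placeholders; spark.dynamicAllocation.enabled and spark.shuffle.service.enabled are standard Spark settings, and spark.kubernetes.shuffle.dir is the shuffle-service directory mentioned above):

```scala
import org.apache.spark.SparkConf

// Static allocation: executor scratch space only needs to live as long as the
// executor pod, so an ephemeral (e.g. emptyDir-backed) spark.local.dir suffices.
val staticConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.local.dir", "/spark-local-1,/spark-local-2") // placeholder mounts

// Dynamic allocation: shuffle files must outlive individual executors, so they
// go through the external shuffle service, which reads from
// spark.kubernetes.shuffle.dir rather than spark.local.dir.
val dynamicConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.kubernetes.shuffle.dir", "/tmp/spark-shuffle") // placeholder path
```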

weiting-chen commented 7 years ago

Yes, #486 is enough for static mode. Using PV storage doesn't make sense for spark.local.dir, since the data is temporary and its lifecycle is tied to the executor pod.

tangzhankun commented 7 years ago

@ash211 @weiting-chen What determines the medium of the emptyDir by default?
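For context: a Kubernetes emptyDir defaults to medium "" (backed by whatever storage backs the node, typically disk), while medium "Memory" backs it with tmpfs. A sketch of the two variants using the fabric8 Kubernetes model classes, which is one way such a volume could be constructed (the volume names are placeholders, not identifiers from this code base):

```scala
import io.fabric8.kubernetes.api.model.{EmptyDirVolumeSourceBuilder, Volume, VolumeBuilder}

// Default medium "": the emptyDir lives on the node's backing storage.
val diskBackedLocalDir: Volume = new VolumeBuilder()
  .withName("spark-local-dir-1") // placeholder name
  .withEmptyDir(new EmptyDirVolumeSourceBuilder().build())
  .build()

// Medium "Memory": the emptyDir is a tmpfs mount, counted against pod memory.
val memoryBackedLocalDir: Volume = new VolumeBuilder()
  .withName("spark-local-dir-2") // placeholder name
  .withEmptyDir(new EmptyDirVolumeSourceBuilder().withMedium("Memory").build())
  .build()
```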