apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Support for PVs, PVCs, and other K8s persistent volume support #627

Closed: bplein closed this issue 6 years ago

bplein commented 6 years ago

I work for Diamanti, a company that has a Kubernetes appliance featuring high-performance storage and networking, both offering QoS that is accessible to the developer via the K8s pod spec.

In attempting to help a customer use Spark on Kubernetes, we're running into the issue of not being able to control the type of storage used by Spark jobs.

I've reviewed several past (open and closed) issues reported in this repository, and they all seem to be single-use-case examples (and fixes), such as for emptyDir or hostPath support.

Shouldn't Spark-on-K8s generically support PVs/PVCs and other K8s mechanisms for temporary or persistent storage, instead of one-off fixes for hostPath, emptyDir, etc.?

Hosts in general, and K8s clusters in particular, have multiple classes of storage available to them. K8s clusters have a plethora of persistent storage options, including FlexVolume (today) and CSI (to replace FlexVolume going forward).

With regard to Spark, wouldn't it be ideal if temporary files could be directed to the lowest-latency, highest-throughput storage available to the cluster?

I would like to see if we could use spark-submit or some other method to describe K8s volumes or PVCs, so that users of Spark-on-K8s could use the storage best suited to the performance and capacity needs of their applications.

liyinan926 commented 6 years ago

Support for mounting PVCs and other types of volumes is being worked on in https://github.com/apache/spark/pull/21260. The title of the PR says hostPath, but the implementation is generic and works with other types of volumes too. Note that we moved development into upstream.

bplein commented 6 years ago

Thanks, I'll close this and follow the action upstream.

er0sin commented 6 years ago

@liyinan926, is there a config example for a PV or PVC? I downloaded and compiled the commits in apache #2600 and want to try this in my environment.

According to the JIRA, it's: spark.kubernetes.executor.volumes=hostPath:containerPath[:ro|rw] for hostPath, but what about PV?
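
For readers arriving at this thread later: the following is a minimal sketch, not an answer from the maintainers. It assumes the per-volume property scheme documented for later upstream Spark releases (`spark.kubernetes.{driver,executor}.volumes.[VolumeType].[VolumeName].…`), which may not match the build compiled from the PR at the time of this question. The helper name `pvc_mount_conf`, the volume name `scratch`, the claim name `spark-scratch-claim`, and the mount path `/scratch` are all hypothetical.

```python
def pvc_mount_conf(role, volume_name, claim_name, mount_path, read_only=False):
    """Build spark-submit --conf arguments that mount an existing PVC.

    Assumption: property names follow the scheme in the upstream Spark
    "Running on Kubernetes" documentation; verify them against the Spark
    version you actually run.
    """
    if role not in ("driver", "executor"):
        raise ValueError("role must be 'driver' or 'executor'")

    # Hypothetical volume_name is only a label that ties the three properties together.
    prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{volume_name}"
    props = {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": str(read_only).lower(),
        f"{prefix}.options.claimName": claim_name,
    }

    args = []
    for key, value in props.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command line.
    print(" ".join(pvc_mount_conf("executor", "scratch", "spark-scratch-claim", "/scratch")))
```

The PVC itself would have to exist in the job's namespace beforehand; the sketch only generates the flags and does not create or validate any Kubernetes objects.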