apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Init-container should support downloading remote files from any source compatible with the Hadoop file system #562

Open liyinan926 opened 6 years ago

liyinan926 commented 6 years ago

Currently the init-container can download files from the resource staging server or any HTTP endpoint out of the box. To be able to download files from a remote HDFS cluster, cloud storage, or S3, however, the init-container very likely needs:

1. Hadoop configuration (needed for both cloud storage and S3),
2. custom environment variables (e.g., `GOOGLE_APPLICATION_CREDENTIALS` for cloud storage and `HADOOP_TOKEN_FILE_LOCATION` for secured HDFS), and
3. credentials injected through user-specified secrets.

Some of this can be done through custom Docker images, but it would be a much better user experience if these were supported natively, e.g., by routing downloads through Hadoop's `FileSystem` API as sketched below.
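As a rough illustration (not the actual init-container code), a download step built on Hadoop's `FileSystem` abstraction would handle any scheme with a connector on the classpath (`hdfs://`, `s3a://`, `gs://`, ...). The object name `RemoteFileDownloader`, the `HADOOP_CONF_DIR` handling, and the argument layout below are assumptions for the sketch:

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal sketch: fetch a remote file into a local directory via the
// Hadoop FileSystem API, so any Hadoop-compatible source works.
object RemoteFileDownloader {

  def downloadTo(remoteUri: String, localDir: String): Unit = {
    val conf = new Configuration()
    // Hadoop configuration could be mounted into the init-container
    // (e.g., via a ConfigMap) and picked up here. The file names and the
    // HADOOP_CONF_DIR convention are illustrative assumptions.
    sys.env.get("HADOOP_CONF_DIR").foreach { dir =>
      conf.addResource(new Path(s"$dir/core-site.xml"))
      conf.addResource(new Path(s"$dir/hdfs-site.xml"))
    }

    val uri = new URI(remoteUri)
    // FileSystem.get selects the connector for the URI's scheme.
    val fs = FileSystem.get(uri, conf)

    // Copy the remote file next to other downloaded dependencies,
    // keeping its base name.
    val target = new Path(localDir, new Path(uri.getPath).getName)
    fs.copyToLocalFile(/* delSrc = */ false, new Path(remoteUri), target)
  }

  def main(args: Array[String]): Unit = {
    // Usage (hypothetical): RemoteFileDownloader <remote-uri> <local-dir>
    downloadTo(args(0), args(1))
  }
}
```

With downloads going through this layer, credential injection should reduce to standard Kubernetes mechanics: mounting a user-specified secret into the init-container and exporting `GOOGLE_APPLICATION_CREDENTIALS` or `HADOOP_TOKEN_FILE_LOCATION` as environment variables, which the respective connectors are expected to pick up on their own.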

@apache-spark-on-k8s/contributors