Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
Currently the init-container can download files from the resource staging server or any HTTP endpoint out-of-the-box. However, to download files from a remote HDFS cluster, cloud storage, or S3, the init-container very likely needs 1) Hadoop configuration (e.g., needed for both cloud storage and S3), 2) custom environment variables (e.g., GOOGLE_APPLICATION_CREDENTIALS for cloud storage and HADOOP_TOKEN_FILE_LOCATION for secured HDFS), and 3) credentials injected through user-specified secrets. Some of these can be handled through custom Docker images, but it would be a much better user experience if they were natively supported.
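The three requirements above all map onto standard Kubernetes pod-spec fields, so native support would amount to rendering something like the following into the init-container spec. This is an illustrative sketch only; the container, ConfigMap, and secret names (spark-init, hadoop-conf, gcs-key) are hypothetical, not actual output of the scheduler back-end:

```yaml
# Hypothetical init-container fragment showing the three injection points.
initContainers:
- name: spark-init
  image: spark-init:latest
  env:
  # 2) custom environment variable pointing at a mounted service-account key
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: /etc/secrets/gcs/key.json
  volumeMounts:
  # 1) Hadoop configuration (core-site.xml etc.) mounted from a ConfigMap
  - name: hadoop-conf
    mountPath: /etc/hadoop/conf
  # 3) credentials injected from a user-specified secret
  - name: gcs-key
    mountPath: /etc/secrets/gcs
    readOnly: true
volumes:
- name: hadoop-conf
  configMap:
    name: hadoop-conf
- name: gcs-key
  secret:
    secretName: gcs-service-account-key
```

Custom Docker images can bake in the Hadoop configuration, but secrets and per-user credentials are better delivered through volumes and environment variables like these at submission time.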