apache-spark-on-k8s / kubernetes-HDFS

Repository holding configuration files for running an HDFS cluster in Kubernetes
Apache License 2.0

How to correctly set up Spark to access HDFS #63

Closed: ljj7975 closed this issue 5 years ago

ljj7975 commented 6 years ago

I have Spark submitting jobs through k8s. It works perfectly with the Spark-Pi example code (great job!).

I have also set up HDFS with kubernetes-HDFS. I verified that it works fine and was able to reach the namenode on port 50070.

However, as you know, in order to let Spark use HDFS as the default filesystem, I have to provide HADOOP_CONF_DIR through spark-env.sh (https://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration).
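For reference, a minimal spark-env.sh along the lines the docs describe; the directory below is only an assumption and should point at wherever the Hadoop client XMLs actually live:

```bash
# conf/spark-env.sh -- a minimal sketch; /opt/hadoop/conf is an assumption and
# should point at the directory holding core-site.xml and hdfs-site.xml.
export HADOOP_CONF_DIR=/opt/hadoop/conf
```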

I tried running the WordCount example by copying hdfs-site.xml and core-site.xml from an HDFS datanode and updating spark-env.sh. Unfortunately, that was not the right way, and the job failed with "Path does not exist".
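For illustration, this is a sketch of the kind of core-site.xml that conf directory needs so hdfs:// paths resolve; the namenode Service hostname and RPC port (8020) are assumptions and have to match whatever Service the HDFS charts actually create:

```bash
# Sketch only: write a minimal core-site.xml pointing fs.defaultFS at the namenode.
# The hostname and port below are placeholders, not values taken from my cluster.
mkdir -p /opt/hadoop/conf
cat > /opt/hadoop/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020</value>
  </property>
</configuration>
EOF
```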

I looked at the Dockerfile for the Spark image, but it seems the conf folder is not being copied into it.

What is the correct way of setting up Spark so it can access HDFS and maximize data locality?

sambit19 commented 5 years ago

You can export HADOOP_CONF_DIR before running spark-submit.
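For example, a sketch assuming the Hadoop client configs were copied to /opt/hadoop/conf both on the submitting machine and inside the Spark image; the master URL, jar path, and input path are placeholders, and the rest of the k8s-specific flags should be the same ones that already worked for Spark-Pi:

```bash
# Sketch: point Spark at the Hadoop client configs, then submit as usual.
# /opt/hadoop/conf is an assumption; use wherever core-site.xml/hdfs-site.xml live.
export HADOOP_CONF_DIR=/opt/hadoop/conf

# <kube-apiserver>, <port>, the examples jar path, and the input path are placeholders.
spark-submit \
  --master k8s://https://<kube-apiserver>:<port> \
  --deploy-mode cluster \
  --class org.apache.spark.examples.JavaWordCount \
  local:///opt/spark/examples/jars/spark-examples.jar \
  hdfs:///path/to/input.txt
```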