apache-spark-on-k8s / kubernetes-HDFS

Repository holding configuration files for running an HDFS cluster in Kubernetes
Apache License 2.0

How to correctly set up Spark to access HDFS #63

Closed: ljj7975 closed this issue 5 years ago

ljj7975 commented 6 years ago

I have Spark submitting jobs through k8s. It works perfectly with the Spark-Pi example code (great job!).

I have also set up HDFS with kubernetes-HDFS. I verified that it works fine and was able to reach the namenode on port 50070.

However, as you know, in order to let Spark use HDFS as the default filesystem, I have to provide HADOOP_CONF_DIR through spark-env.sh (https://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration).
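For reference, a minimal spark-env.sh along the lines the docs describe; the directory below is only an assumption and should point at wherever the Hadoop client XMLs actually live:

```bash
# conf/spark-env.sh -- a minimal sketch; /opt/hadoop/conf is an assumption and
# should point at the directory holding core-site.xml and hdfs-site.xml.
export HADOOP_CONF_DIR=/opt/hadoop/conf
```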

I tried running the WordCount example by copying hdfs-site.xml and core-site.xml from an HDFS datanode and updating spark-env.sh. Unfortunately, that was not the right way, and the job failed with "Path does not exist".
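For illustration, this is a sketch of the kind of core-site.xml that conf directory needs so hdfs:// paths resolve; the namenode Service hostname and RPC port (8020) are assumptions and have to match whatever Service the HDFS charts actually create:

```bash
# Sketch only: write a minimal core-site.xml pointing fs.defaultFS at the namenode.
# The hostname and port below are placeholders, not values taken from my cluster.
mkdir -p /opt/hadoop/conf
cat > /opt/hadoop/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020</value>
  </property>
</configuration>
EOF
```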

I looked at the Dockerfile for the Spark image, but it seems the conf folder is not being copied into it.

What is the correct way of setting up Spark so it can access HDFS and maximize data locality?

sambit19 commented 5 years ago

You can export HADOOP_CONF_DIR before running spark-submit.
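For example, a sketch assuming the Hadoop client configs were copied to /opt/hadoop/conf both on the submitting machine and inside the Spark image; the master URL, jar path, and input path are placeholders, and the rest of the k8s-specific flags should be the same ones that already worked for Spark-Pi:

```bash
# Sketch: point Spark at the Hadoop client configs, then submit as usual.
# /opt/hadoop/conf is an assumption; use wherever core-site.xml/hdfs-site.xml live.
export HADOOP_CONF_DIR=/opt/hadoop/conf

# <kube-apiserver>, <port>, the examples jar path, and the input path are placeholders.
spark-submit \
  --master k8s://https://<kube-apiserver>:<port> \
  --deploy-mode cluster \
  --class org.apache.spark.examples.JavaWordCount \
  local:///opt/spark/examples/jars/spark-examples.jar \
  hdfs:///path/to/input.txt
```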