Open kimoonkim opened 7 years ago
IIUC, it seems it's just because `SPARK_USER` is not set, so `root` gets returned when `getCurrentUserName` gets called. So my understanding is that setting `SPARK_USER` to the submission user name, when creating the driver pod in the submission client and the executor pods in the driver, should fix this.
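To illustrate the fallback being described, here is a minimal sketch (not Spark's actual code; the `currentUserName` helper and the map-based env are inventions for illustration):

```java
import java.util.Map;

public class UserLookup {
    // Sketch of the lookup order described above: the SPARK_USER env var,
    // when set, wins; otherwise the OS-level user is returned, which is
    // root inside a typical k8s pod.
    static String currentUserName(Map<String, String> env, String osUser) {
        String sparkUser = env.get("SPARK_USER");
        return sparkUser != null ? sparkUser : osUser;
    }

    public static void main(String[] args) {
        System.out.println(currentUserName(Map.of(), "root"));                      // root
        System.out.println(currentUserName(Map.of("SPARK_USER", "alice"), "root")); // alice
    }
}
```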
In non-secure Hadoop, we can use `--conf spark.executorEnv.[EnvironmentVariableName]` to set `HADOOP_USER_NAME` via spark-submit. In non-secure mode, HDFS only checks the `HADOOP_USER_NAME` environment variable, so there may be no need to change the Linux user to a real one.
```
spark-submit \
  ... \
  --conf spark.executorEnv.HADOOP_USER_NAME=user1 \
  ...
```
Setting env variables this way is related to PR #424. I think `SPARK_USER` can be set up the same way.
LGTM
Use these flags to specify which user to communicate with HDFS as:

```
--conf spark.kubernetes.driverEnv.SPARK_USER=hadoop
--conf spark.kubernetes.driverEnv.HADOOP_USER_NAME=hadoop
--conf spark.executorEnv.HADOOP_USER_NAME=hadoop
--conf spark.executorEnv.SPARK_USER=hadoop
```
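Putting those flags together, a full invocation might look like the following sketch (the cluster URL, the user name `hadoop`, and the jar path are placeholders, not values from this issue):

```shell
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --conf spark.kubernetes.driverEnv.SPARK_USER=hadoop \
  --conf spark.kubernetes.driverEnv.HADOOP_USER_NAME=hadoop \
  --conf spark.executorEnv.HADOOP_USER_NAME=hadoop \
  --conf spark.executorEnv.SPARK_USER=hadoop \
  local:///opt/spark/examples/jars/spark-examples.jar
```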
Sub-issue of #128.
@weiting-chen, @foxish, @ifilonenko
When the driver and executor pods access HDFS, the usernames currently appear to HDFS as `root`, because k8s pods do not have Linux user accounts other than `root`.

Both the Spark and Hadoop libraries support environment variables that can override the username presented to HDFS. Spark supports the `SPARK_USER` env var, which is set to a field in the active `SparkContext`
.

From `SparkContext.scala` (code):

From `Utils.scala`
(code):

The `sparkUser` field is then used by `SparkHadoopUtil.runAsSparkUser`
.

From `SparkHadoopUtil.scala` (code):

But it might be the case that the K8s driver or executor code is not calling the
`runAsSparkUser` method properly. We should look into this and make sure it works end-to-end. Note that `SPARK_USER` is supposed to override the username for both secure and non-secure HDFS.

The Hadoop library has another env/property variable,
`HADOOP_USER_NAME`. But it appears to be redundant when `SPARK_USER` is specified, and it doesn't seem to work for secure HDFS. So we should probably focus first on `SPARK_USER` support.

From `UserGroupInformation.java` ([code](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L204)):
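The linked snippet is not quoted above. As a rough paraphrase of that lookup (from memory, not a verbatim quote; the wrapper class is an invention for illustration):

```java
public class HadoopUserLookup {
    // Rough paraphrase of UserGroupInformation's HADOOP_USER_NAME handling:
    // the environment variable is consulted first, then the JVM system
    // property of the same name. This override applies to simple
    // (non-Kerberos) authentication, which is consistent with it not
    // helping for secure HDFS.
    static String hadoopUserNameOverride() {
        String user = System.getenv("HADOOP_USER_NAME");
        if (user == null) {
            user = System.getProperty("HADOOP_USER_NAME");
        }
        return user;
    }

    public static void main(String[] args) {
        System.setProperty("HADOOP_USER_NAME", "alice");
        System.out.println(hadoopUserNameOverride());
    }
}
```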