apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Support SPARK_USER for specifying usernames to HDFS #408

Open kimoonkim opened 7 years ago

kimoonkim commented 7 years ago

Sub-issue of #128.

@weiting-chen, @foxish, @ifilonenko

When the driver and executors pods access HDFS, the usernames currently appear as root to HDFS because k8s pods do not have Linux user accounts other than root.

Both the Spark and Hadoop libraries support environment variables that can override the username presented to HDFS. Spark supports the SPARK_USER env var, whose value is stored in a field of the active SparkContext:

From SparkContext.scala (code):

  // Set SPARK_USER for user who is running SparkContext.
  val sparkUser = Utils.getCurrentUserName()

Utils.scala (code)

  /**
   * Returns the current user name. This is the currently logged in user, unless that's been
   * overridden by the `SPARK_USER` environment variable.
   */
  def getCurrentUserName(): String = {
    Option(System.getenv("SPARK_USER"))
      .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())
  }
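The precedence above (SPARK_USER wins, otherwise fall back to the current OS-level user) can be sketched as a tiny self-contained function. This is a hypothetical mirror of `Utils.getCurrentUserName`, not the Spark code itself; the environment is passed in as a map so the logic is testable without touching the real process environment or the Hadoop dependency:

```scala
// Hypothetical sketch of the SPARK_USER precedence: the env var, if set,
// overrides the OS user that UserGroupInformation would otherwise report.
def resolveSparkUser(env: Map[String, String], osUser: String): String =
  env.getOrElse("SPARK_USER", osUser)
```

In a k8s pod with no extra user accounts, `osUser` is effectively `root`, which is why setting SPARK_USER matters.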

The sparkUser field is then used by SparkHadoopUtil.runAsSparkUser.

From SparkHadoopUtil.scala (code):

  /**
   * Runs the given function with a Hadoop UserGroupInformation as a thread local variable
   * (distributed to child threads), used for authenticating HDFS and YARN calls.
   *
   * IMPORTANT NOTE: If this function is going to be called repeated in the same process
   * you need to look https://issues.apache.org/jira/browse/HDFS-3545 and possibly
   * do a FileSystem.closeAllForUGI in order to avoid leaking Filesystems
   */
  def runAsSparkUser(func: () => Unit) {
    val user = Utils.getCurrentUserName()
    logDebug("running as user: " + user)
    val ugi = UserGroupInformation.createRemoteUser(user)
    transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      def run: Unit = func()
    })
  }
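The essence of `runAsSparkUser` is the `doAs` wrapper pattern: run a function under a thread-local security identity. A dependency-free sketch of that pattern, using a plain JAAS `Subject` instead of Hadoop's `UserGroupInformation` (an assumption made to keep the example runnable without Hadoop on the classpath; `UGI.doAs` wraps `Subject.doAs` internally):

```scala
import java.security.PrivilegedExceptionAction
import javax.security.auth.Subject

// Hedged sketch of the doAs wrapper pattern used by runAsSparkUser:
// the function runs with `subject` installed as the thread's identity.
def runAs[T](subject: Subject)(func: () => T): T =
  Subject.doAs(subject, new PrivilegedExceptionAction[T] {
    def run(): T = func()
  })
```

In Spark the subject is derived from `Utils.getCurrentUserName()`, so if SPARK_USER is unset inside the pod, the identity handed to HDFS is root.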

But it might be the case that the K8s driver or executor code is not calling the runAsSparkUser method properly. We should look into this and make sure it works end-to-end. Note that SPARK_USER is supposed to override the username for both secure and non-secure HDFS.

The Hadoop library supports another variable, HADOOP_USER_NAME, available both as an environment variable and as a system property. But it appears redundant when SPARK_USER is specified, and it does not seem to work for secure HDFS. So we should probably focus first on SPARK_USER support.

From UserGroupInformation.java ([code](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L204)):

      //If we don't have a kerberos user and security is disabled, check
      //if user is specified in the environment or properties
      if (!isSecurityEnabled() && (user == null)) {
        String envUser = System.getenv(HADOOP_USER_NAME);
        if (envUser == null) {
          envUser = System.getProperty(HADOOP_USER_NAME);
        }
        user = envUser == null ? null : new User(envUser);
      }
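The Hadoop fallback above can be summarized in one small function: the username only comes from HADOOP_USER_NAME when security (Kerberos) is disabled, and the environment variable takes precedence over the JVM system property. A hypothetical Scala mirror of that logic, with env and properties passed in as maps for testability:

```scala
// Hypothetical mirror of the UserGroupInformation fallback: no effect
// under Kerberos; otherwise env var beats system property.
def hadoopUser(env: Map[String, String],
               props: Map[String, String],
               securityEnabled: Boolean): Option[String] =
  if (securityEnabled) None
  else env.get("HADOOP_USER_NAME").orElse(props.get("HADOOP_USER_NAME"))
```

This is why HADOOP_USER_NAME helps only on non-secure clusters, as noted above.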
liyinan926 commented 7 years ago

IIUC, this happens simply because SPARK_USER is not set, so root is returned when getCurrentUserName is called. My understanding is that setting SPARK_USER to the submission user's name when the submission client creates the driver pod, and when the driver creates the executor pods, should fix this.

weiting-chen commented 7 years ago

In non-secure Hadoop, we can use `--conf spark.executorEnv.[EnvironmentVariableName]` to set HADOOP_USER_NAME via spark-submit, because non-secure HDFS only checks the HADOOP_USER_NAME environment variable and does not require changing the Linux user to a real account.

  spark-submit \
    ... \
    --conf spark.executorEnv.HADOOP_USER_NAME=user1 \
    ...

Setting environment variables is related to PR #424. I think the same approach can be used to set the SPARK_USER env var.

rootsongjc commented 7 years ago

LGTM

Use these flags to specify which user communicates with HDFS:

  --conf spark.kubernetes.driverEnv.SPARK_USER=hadoop 
  --conf spark.kubernetes.driverEnv.HADOOP_USER_NAME=hadoop 
  --conf spark.executorEnv.HADOOP_USER_NAME=hadoop 
  --conf spark.executorEnv.SPARK_USER=hadoop