Pitfalls of SparkSession in Spark 2.0 #20

Open cjuexuan opened 7 years ago

cjuexuan commented 7 years ago

Pitfalls of SparkSession in Spark 2.0

SparkSession, which replaced SQLContext (HiveContext)
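
For context, this is roughly what the new entry point looks like in Spark 2.0 (a minimal sketch; the appName/master values are just illustrative):

import org.apache.spark.sql.SparkSession

// Spark 2.0: a single builder-based entry point instead of new SQLContext(sc) / new HiveContext(sc)
val spark = SparkSession.builder()
  .appName("demo")
  .master("local[*]")
  .enableHiveSupport() // optional, for HiveContext-style behaviour
  .getOrCreate()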

Background

My server-side logic runs inside actors, and I found that when several actors were running, one actor could see temporary views that had been registered in another actor's session.

Abstracted runtime code

The actor logic can be abstracted roughly like this:

package com.ximalaya.xql.datatask

import akka.actor.{Actor, ActorLogging}
import com.ximalaya.xql.datatask.TestActor.{Bean, CreateView}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.util.Random

/**
  * @author todd.chen at 17/11/2016 13:52.
  *         email : todd.chen@ximalaya.com
  */
class TestActor(sparkConf: SparkConf) extends Actor with ActorLogging {
  var sparkSession: SparkSession = _
  override def preStart(): Unit = {
    sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
  }
  override def postStop(): Unit = sparkSession.stop()
  override def receive: Receive = {
    case CreateView(limit) ⇒
      sparkSession.createDataFrame(getBeans(limit)).createOrReplaceTempView("test1")
    case s: String ⇒
      log.info(s"exec $s")
      sparkSession.sql(s).show(1000)
    case e: Any ⇒ println(e)
  }
  val ids = 1 to 1000
  val names = ('a' to 'z').map(_.toString)
  val beans = for {
    id ← ids
    name ← names
  } yield Bean(id, name)

  def getBeans(limit: Int): Seq[Bean] = {
    Random.shuffle(beans).take(limit)
  }
}

object TestActor {
  case class CreateView(limit: Int)
  case class Bean(id: Int, name: String)
}

The driver test class

package com.ximalaya.xql.datatask
import akka.actor.{ActorSystem, Props}
import com.ximalaya.xql.datatask.TestActor.CreateView
import org.apache.spark.SparkConf

import scala.util.Random

/**
  * @author todd.chen at 17/11/2016 13:59.
  *         email : todd.chen@ximalaya.com
  */
object Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("local").setMaster("local[*]")
    val actorSystem = ActorSystem("testSystem")
    val actor1 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor1")
    val actor2 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor2")
    val actor3 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor3")
    val actor4 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor4")
    val actors = actor1 :: actor2 :: actor3 :: actor4 :: Nil
    def getLimit = Random.nextInt(100)
    actors.foreach(_ ! CreateView(getLimit))
    Thread sleep 10000
    actors.foreach(_ ! "select * from test1")
    Thread sleep 10000
    actors.foreach(_ ! "show tables")
    Thread sleep 10000
  }
}

Locating the problem

The symptom was that several queries returned exactly the same result. With most of the INFO logs filtered out, the output looked roughly like this:

20:07:46 731  WARN (org.apache.spark.SparkContext:66) - Use an existing SparkContext, some configuration may not take effect.
20:07:46 845  INFO (org.apache.spark.sql.internal.SharedState:54) - Warehouse path is '/user/hive/warehouse'.
20:07:46 849  WARN (org.apache.spark.sql.SparkSession$Builder:66) - Using an existing SparkSession; some configuration may not take effect.
20:07:46 849  WARN (org.apache.spark.sql.SparkSession$Builder:66) - Using an existing SparkSession; some configuration may not take effect.
20:07:46 850  WARN (org.apache.spark.sql.SparkSession$Builder:66) - Using an existing SparkSession; some configuration may not take effect.
20:07:49 997  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:49 999  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:49 997  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:49 997  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:54 342  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:07:54 343  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:07:54 343  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
[INFO] [11/17/2016 20:07:54.341] [testSystem-akka.actor.default-dispatcher-4] [akka://testSystem/user/actor4] exec select * from test1
[INFO] [11/17/2016 20:07:54.341] [testSystem-akka.actor.default-dispatcher-3] [akka://testSystem/user/actor2] exec select * from test1
[INFO] [11/17/2016 20:07:54.341] [testSystem-akka.actor.default-dispatcher-5] [akka://testSystem/user/actor1] exec select * from test1
20:07:54 351  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
[INFO] [11/17/2016 20:07:54.351] [testSystem-akka.actor.default-dispatcher-2] [akka://testSystem/user/actor3] exec select * from test1
+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

Narrowing down the problem

We noticed several WARN lines in there, which pointed to the real cause: the actors were all ending up with the same shared sparkSession. To confirm this, we swapped createOrReplaceTempView for the stricter createTempView:

  override def receive: Receive = {
    case CreateView(limit) ⇒
      sparkSession.createDataFrame(getBeans(limit)).createTempView("test1")
    case s: String ⇒
      log.info(s"exec $s")
      sparkSession.sql(s).show(1000)
    case e: Any ⇒ println(e)
  }

The difference between the two methods is explained quite clearly in the source:

  /**
   * Creates a temporary view using the given name. The lifetime of this
   * temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
   *
   * @throws AnalysisException if the view name already exists
   *
   * @group basic
   * @since 2.0.0
   */
  @throws[AnalysisException]
  def createTempView(viewName: String): Unit = withPlan {
    val tableDesc = CatalogTable(
      identifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName),
      tableType = CatalogTableType.VIEW,
      schema = Seq.empty[CatalogColumn],
      storage = CatalogStorageFormat.empty)
    CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = false,
      isTemporary = true)
  }

  /**
   * Creates a temporary view using the given name. The lifetime of this
   * temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
   *
   * @group basic
   * @since 2.0.0
   */
  def createOrReplaceTempView(viewName: String): Unit = withPlan {
    val tableDesc = CatalogTable(
      identifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName),
      tableType = CatalogTableType.VIEW,
      schema = Seq.empty[CatalogColumn],
      storage = CatalogStorageFormat.empty)
    CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = true,
      isTemporary = true)
  }

In other words, createTempView will fail if a view with that name already exists. Running it again, the log output was:

[ERROR] [11/17/2016 20:13:34.499] [testSystem-akka.actor.default-dispatcher-6] [akka://testSystem/user/actor1] Temporary table 'test1' already exists;
org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'test1' already exists;
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:341)
    at org.apache.spark.sql.execution.command.CreateViewCommand.createTemporaryView(views.scala:146)
    at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:97)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
    at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2398)
    at com.ximalaya.xql.datatask.TestActor$$anonfun$receive$1.applyOrElse(TestActor.scala:43)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:482)
    at com.ximalaya.xql.datatask.TestActor.aroundReceive(TestActor.scala:30)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Exactly as we suspected.

Source code analysis

  /**
   * Internal catalog for managing table and database states.
   */
  lazy val catalog = new SessionCatalog(
    sparkSession.sharedState.externalCatalog,
    functionResourceLoader,
    functionRegistry,
    conf,
    newHadoopConf())

  /** List of temporary tables, mapping from table name to their logical plan. */
  @GuardedBy("this")
  protected val tempTables = new mutable.HashMap[String, LogicalPlan]

  /**
   * Create a temporary table.
   */
  def createTempView(
      name: String,
      tableDefinition: LogicalPlan,
      overrideIfExists: Boolean): Unit = synchronized {
    val table = formatTableName(name)
    if (tempTables.contains(table) && !overrideIfExists) {
      throw new TempTableAlreadyExistsException(name)
    }
    tempTables.put(table, tableDefinition)
  }

At this point we had figured out about 80% of it: all the actors were sharing the same sessionState, and therefore the same catalog, which is what caused the problem.
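
The sharing can be reproduced with a minimal sketch outside of akka (not part of the original service code): two getOrCreate() calls hand back the same session instance, so a temp view registered through one is visible through the other:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("shared-session-demo").setMaster("local[*]")
val s1 = SparkSession.builder().config(conf).getOrCreate()
val s2 = SparkSession.builder().config(conf).getOrCreate()

println(s1 eq s2)                          // true: the very same SparkSession instance
s1.range(10).createOrReplaceTempView("t")  // registered through s1 ...
s2.sql("show tables").show()               // ... visible through s2, because the catalog is shared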

A second look at how the SparkSession is created

The problem is that after the first session has been created, a defaultSession already exists.

    /**
     * Gets an existing [[SparkSession]] or, if there is no existing one, creates a new
     * one based on the options set in this builder.
     *
     * This method first checks whether there is a valid thread-local SparkSession,
     * and if yes, return that one. It then checks whether there is a valid global
     * default SparkSession, and if yes, return that one. If no valid global default
     * SparkSession exists, the method creates a new SparkSession and assigns the
     * newly created SparkSession as the global default.
     *
     * In case an existing SparkSession is returned, the config options specified in
     * this builder will be applied to the existing SparkSession.
     *
     * @since 2.0.0
     */
    def getOrCreate(): SparkSession = synchronized {
      // Get the session from current thread's active session.
      var session = activeThreadSession.get()
      if ((session ne null) && !session.sparkContext.isStopped) {
        options.foreach { case (k, v) => session.conf.set(k, v) }
        if (options.nonEmpty) {
          logWarning("Using an existing SparkSession; some configuration may not take effect.")
        }
        return session
      }

      // Global synchronization so we will only set the default session once.
      SparkSession.synchronized {
        // If the current thread does not have an active session, get it from the global session.
        session = defaultSession.get()
        if ((session ne null) && !session.sparkContext.isStopped) {
          options.foreach { case (k, v) => session.conf.set(k, v) }
          if (options.nonEmpty) {
            logWarning("Using an existing SparkSession; some configuration may not take effect.")
          }
          return session
        }

        // No active nor global default session. Create a new one.
        val sparkContext = userSuppliedContext.getOrElse {
          // set app name if not given
          val randomAppName = java.util.UUID.randomUUID().toString
          val sparkConf = new SparkConf()
          options.foreach { case (k, v) => sparkConf.set(k, v) }
          if (!sparkConf.contains("spark.app.name")) {
            sparkConf.setAppName(randomAppName)
          }
          val sc = SparkContext.getOrCreate(sparkConf)
          // maybe this is an existing SparkContext, update its SparkConf which maybe used
          // by SparkSession
          options.foreach { case (k, v) => sc.conf.set(k, v) }
          if (!sc.conf.contains("spark.app.name")) {
            sc.conf.setAppName(randomAppName)
          }
          sc
        }
        session = new SparkSession(sparkContext)
        options.foreach { case (k, v) => session.conf.set(k, v) }
        defaultSession.set(session)

        // Register a successfully instantiated context to the singleton. This should be at the
        // end of the class definition so that the singleton is updated only if there is no
        // exception in the construction of the instance.
        sparkContext.addSparkListener(new SparkListener {
          override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
            defaultSession.set(null)
            sqlListener.set(null)
          }
        })
      }

      return session
    }
  }

And defaultSession is an AtomicReference:

  /** Reference to the root SparkSession. */
  private val defaultSession = new AtomicReference[SparkSession]

To keep getOrCreate from entering the "Global synchronization so we will only set the default session once" branch and returning that shared session, we need to clear this reference after creating a session.

Setting a breakpoint confirmed this guess:

defaultSession

With 4 actors, this breakpoint was hit 3 times: the first actor created the session, and the other three fell through to the existing defaultSession.

  /**
   * Clears the default SparkSession that is returned by the builder.
   *
   * @since 2.0.0
   */
  def clearDefaultSession(): Unit = {
    defaultSession.set(null)
  }

The final fix

  // in object TestActor: build a SparkSession for an actor without leaving it behind as the global default
  def getSparkSession(sparkConf: SparkConf) = SparkSession.synchronized {
    // make sure getOrCreate does not return a previously registered default session
    SparkSession.clearDefaultSession()
    val session = SparkSession.builder().config(sparkConf).getOrCreate()
    // clear again so the session we just created is not picked up as the default by the next actor
    SparkSession.clearDefaultSession()
    session
  }

  override def preStart(): Unit = {
    sparkSession = TestActor.getSparkSession(sparkConf)
  }

The log output:

20:41:35 942  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:35 943  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:35 942  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:35 943  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:40 85  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:41:40 85  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:41:40 86  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
[INFO] [11/17/2016 20:41:40.083] [testSystem-akka.actor.default-dispatcher-4] [akka://testSystem/user/actor1] exec select * from test1
[INFO] [11/17/2016 20:41:40.083] [testSystem-akka.actor.default-dispatcher-2] [akka://testSystem/user/actor2] exec select * from test1
[INFO] [11/17/2016 20:41:40.083] [testSystem-akka.actor.default-dispatcher-3] [akka://testSystem/user/actor4] exec select * from test1
[INFO] [11/17/2016 20:41:40.093] [testSystem-akka.actor.default-dispatcher-5] [akka://testSystem/user/actor3] exec select * from test1
20:41:40 93  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:41:40 709  INFO (org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator:54) - Code generated in 365.008452 ms
+---+----+
| id|name|
+---+----+
+---+----+

20:41:40 785  INFO (org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator:54) - Code generated in 17.495479 ms
+---+----+
| id|name|
+---+----+
|192|   p|
| 65|   t|
|594|   c|
| 26|   w|
|787|   s|
|456|   j|
|687|   m|
+---+----+

+---+----+
| id|name|
+---+----+
|993|   b|
|290|   a|
|706|   n|
| 85|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|510|   w|
|473|   x|
|877|   c|
| 80|   f|
|826|   k|
+---+----+

The breakpoint was never hit this time, and the problem was solved.
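
As a side note, and not part of the fix above: Spark 2.0 also exposes SparkSession.newSession(), which shares the SparkContext and cached data but gets its own SessionState, and therefore its own temporary views. If a single base session were created up front in the driver, each actor could take an isolated child session instead; a rough sketch under that assumption (base and perActorSession are hypothetical names):

// hypothetical alternative: one base session in the driver, one child session per actor
val base = SparkSession.builder().config(sparkConf).getOrCreate()
val perActorSession = base.newSession()
perActorSession.range(5).createOrReplaceTempView("test1") // not visible from other sessions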

zhangzhikuan commented 7 years ago

Nice, this helped me solve my problem, especially the Spark metadata cache issue.