Pitfalls of SparkSession in Spark 2.0 #20

Open cjuexuan opened 7 years ago

cjuexuan commented 7 years ago

Pitfalls of SparkSession in Spark 2.0

SparkSession, which replaced SQLContext (HiveContext)
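
For context, this is roughly what the new entry point looks like in Spark 2.0 (a minimal sketch; the appName/master values are just illustrative):

import org.apache.spark.sql.SparkSession

// Spark 2.0: a single builder-based entry point instead of new SQLContext(sc) / new HiveContext(sc)
val spark = SparkSession.builder()
  .appName("demo")
  .master("local[*]")
  .enableHiveSupport() // optional, for HiveContext-style behaviour
  .getOrCreate()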

Background

My server-side logic runs inside actors, and I found that when several actors were running, one actor could see temporary views that had been registered in another actor's session.

Abstracted runtime code

The actor logic can be abstracted roughly like this:

package com.ximalaya.xql.datatask

import akka.actor.{Actor, ActorLogging}
import com.ximalaya.xql.datatask.TestActor.{Bean, CreateView}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.util.Random

/**
  * @author todd.chen at 17/11/2016 13:52.
  *         email : todd.chen@ximalaya.com
  */
class TestActor(sparkConf: SparkConf) extends Actor with ActorLogging {
  var sparkSession: SparkSession = _
  override def preStart(): Unit = {
    sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
  }
  override def postStop(): Unit = sparkSession.stop()
  override def receive: Receive = {
    case CreateView(limit) ⇒
      sparkSession.createDataFrame(getBeans(limit)).createOrReplaceTempView("test1")
    case s: String ⇒
      log.info(s"exec $s")
      sparkSession.sql(s).show(1000)
    case e: Any ⇒ println(e)
  }
  val ids = 1 to 1000
  val names = ('a' to 'z').map(_.toString)
  val beans = for {
    id ← ids
    name ← names
  } yield Bean(id, name)

  def getBeans(limit: Int): Seq[Bean] = {
    Random.shuffle(beans).take(limit)
  }
}

object TestActor {
  case class CreateView(limit: Int)
  case class Bean(id: Int, name: String)
}

The driver test class

package com.ximalaya.xql.datatask
import akka.actor.{ActorSystem, Props}
import com.ximalaya.xql.datatask.TestActor.CreateView
import org.apache.spark.SparkConf

import scala.util.Random

/**
  * @author todd.chen at 17/11/2016 13:59.
  *         email : todd.chen@ximalaya.com
  */
object Test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("local").setMaster("local[*]")
    val actorSystem = ActorSystem("testSystem")
    val actor1 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor1")
    val actor2 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor2")
    val actor3 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor3")
    val actor4 = actorSystem.actorOf(Props(new TestActor(sparkConf)), name = "actor4")
    val actors = actor1 :: actor2 :: actor3 :: actor4 :: Nil
    def getLimit = Random.nextInt(100)
    actors.foreach(_ ! CreateView(getLimit))
    Thread sleep 10000
    actors.foreach(_ ! "select * from test1")
    Thread sleep 10000
    actors.foreach(_ ! "show tables")
    Thread sleep 10000
  }
}

Locating the problem

The symptom was that several queries returned exactly the same result. With most of the INFO logs filtered out, the output looked roughly like this:

20:07:46 731  WARN (org.apache.spark.SparkContext:66) - Use an existing SparkContext, some configuration may not take effect.
20:07:46 845  INFO (org.apache.spark.sql.internal.SharedState:54) - Warehouse path is '/user/hive/warehouse'.
20:07:46 849  WARN (org.apache.spark.sql.SparkSession$Builder:66) - Using an existing SparkSession; some configuration may not take effect.
20:07:46 849  WARN (org.apache.spark.sql.SparkSession$Builder:66) - Using an existing SparkSession; some configuration may not take effect.
20:07:46 850  WARN (org.apache.spark.sql.SparkSession$Builder:66) - Using an existing SparkSession; some configuration may not take effect.
20:07:49 997  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:49 999  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:49 997  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:49 997  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:07:54 342  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:07:54 343  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:07:54 343  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
[INFO] [11/17/2016 20:07:54.341] [testSystem-akka.actor.default-dispatcher-4] [akka://testSystem/user/actor4] exec select * from test1
[INFO] [11/17/2016 20:07:54.341] [testSystem-akka.actor.default-dispatcher-3] [akka://testSystem/user/actor2] exec select * from test1
[INFO] [11/17/2016 20:07:54.341] [testSystem-akka.actor.default-dispatcher-5] [akka://testSystem/user/actor1] exec select * from test1
20:07:54 351  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
[INFO] [11/17/2016 20:07:54.351] [testSystem-akka.actor.default-dispatcher-2] [akka://testSystem/user/actor3] exec select * from test1
+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|110|   u|
|720|   f|
+---+----+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|    test1|       true|
+---------+-----------+

Narrowing down the problem

We noticed several WARN lines in there, which pointed to the real cause: the actors were all ending up with the same shared sparkSession. To confirm this, we swapped createOrReplaceTempView for the stricter createTempView:

  override def receive: Receive = {
    case CreateView(limit) ⇒
      sparkSession.createDataFrame(getBeans(limit)).createTempView("test1")
    case s: String ⇒
      log.info(s"exec $s")
      sparkSession.sql(s).show(1000)
    case e: Any ⇒ println(e)
  }

The difference between the two methods is explained quite clearly in the source:

  /**
   * Creates a temporary view using the given name. The lifetime of this
   * temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
   *
   * @throws AnalysisException if the view name already exists
   *
   * @group basic
   * @since 2.0.0
   */
  @throws[AnalysisException]
  def createTempView(viewName: String): Unit = withPlan {
    val tableDesc = CatalogTable(
      identifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName),
      tableType = CatalogTableType.VIEW,
      schema = Seq.empty[CatalogColumn],
      storage = CatalogStorageFormat.empty)
    CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = false,
      isTemporary = true)
  }

  /**
   * Creates a temporary view using the given name. The lifetime of this
   * temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
   *
   * @group basic
   * @since 2.0.0
   */
  def createOrReplaceTempView(viewName: String): Unit = withPlan {
    val tableDesc = CatalogTable(
      identifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName),
      tableType = CatalogTableType.VIEW,
      schema = Seq.empty[CatalogColumn],
      storage = CatalogStorageFormat.empty)
    CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = true,
      isTemporary = true)
  }

In other words, createTempView will fail if a view with that name already exists. Running it again, the log output was:

[ERROR] [11/17/2016 20:13:34.499] [testSystem-akka.actor.default-dispatcher-6] [akka://testSystem/user/actor1] Temporary table 'test1' already exists;
org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'test1' already exists;
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:341)
    at org.apache.spark.sql.execution.command.CreateViewCommand.createTemporaryView(views.scala:146)
    at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:97)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
    at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2398)
    at com.ximalaya.xql.datatask.TestActor$$anonfun$receive$1.applyOrElse(TestActor.scala:43)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:482)
    at com.ximalaya.xql.datatask.TestActor.aroundReceive(TestActor.scala:30)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Exactly as we suspected.

Source code analysis

  /**
   * Internal catalog for managing table and database states.
   */
  lazy val catalog = new SessionCatalog(
    sparkSession.sharedState.externalCatalog,
    functionResourceLoader,
    functionRegistry,
    conf,
    newHadoopConf())

  /** List of temporary tables, mapping from table name to their logical plan. */
  @GuardedBy("this")
  protected val tempTables = new mutable.HashMap[String, LogicalPlan]

  /**
   * Create a temporary table.
   */
  def createTempView(
      name: String,
      tableDefinition: LogicalPlan,
      overrideIfExists: Boolean): Unit = synchronized {
    val table = formatTableName(name)
    if (tempTables.contains(table) && !overrideIfExists) {
      throw new TempTableAlreadyExistsException(name)
    }
    tempTables.put(table, tableDefinition)
  }

At this point we had figured out about 80% of it: all the actors were sharing the same sessionState, and therefore the same catalog, which is what caused the problem.
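
The sharing can be reproduced with a minimal sketch outside of akka (not part of the original service code): two getOrCreate() calls hand back the same session instance, so a temp view registered through one is visible through the other:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("shared-session-demo").setMaster("local[*]")
val s1 = SparkSession.builder().config(conf).getOrCreate()
val s2 = SparkSession.builder().config(conf).getOrCreate()

println(s1 eq s2)                          // true: the very same SparkSession instance
s1.range(10).createOrReplaceTempView("t")  // registered through s1 ...
s2.sql("show tables").show()               // ... visible through s2, because the catalog is shared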

A second look at how the SparkSession is created

The problem is that after the first session has been created, a defaultSession already exists.

    /**
     * Gets an existing [[SparkSession]] or, if there is no existing one, creates a new
     * one based on the options set in this builder.
     *
     * This method first checks whether there is a valid thread-local SparkSession,
     * and if yes, return that one. It then checks whether there is a valid global
     * default SparkSession, and if yes, return that one. If no valid global default
     * SparkSession exists, the method creates a new SparkSession and assigns the
     * newly created SparkSession as the global default.
     *
     * In case an existing SparkSession is returned, the config options specified in
     * this builder will be applied to the existing SparkSession.
     *
     * @since 2.0.0
     */
    def getOrCreate(): SparkSession = synchronized {
      // Get the session from current thread's active session.
      var session = activeThreadSession.get()
      if ((session ne null) && !session.sparkContext.isStopped) {
        options.foreach { case (k, v) => session.conf.set(k, v) }
        if (options.nonEmpty) {
          logWarning("Using an existing SparkSession; some configuration may not take effect.")
        }
        return session
      }

      // Global synchronization so we will only set the default session once.
      SparkSession.synchronized {
        // If the current thread does not have an active session, get it from the global session.
        session = defaultSession.get()
        if ((session ne null) && !session.sparkContext.isStopped) {
          options.foreach { case (k, v) => session.conf.set(k, v) }
          if (options.nonEmpty) {
            logWarning("Using an existing SparkSession; some configuration may not take effect.")
          }
          return session
        }

        // No active nor global default session. Create a new one.
        val sparkContext = userSuppliedContext.getOrElse {
          // set app name if not given
          val randomAppName = java.util.UUID.randomUUID().toString
          val sparkConf = new SparkConf()
          options.foreach { case (k, v) => sparkConf.set(k, v) }
          if (!sparkConf.contains("spark.app.name")) {
            sparkConf.setAppName(randomAppName)
          }
          val sc = SparkContext.getOrCreate(sparkConf)
          // maybe this is an existing SparkContext, update its SparkConf which maybe used
          // by SparkSession
          options.foreach { case (k, v) => sc.conf.set(k, v) }
          if (!sc.conf.contains("spark.app.name")) {
            sc.conf.setAppName(randomAppName)
          }
          sc
        }
        session = new SparkSession(sparkContext)
        options.foreach { case (k, v) => session.conf.set(k, v) }
        defaultSession.set(session)

        // Register a successfully instantiated context to the singleton. This should be at the
        // end of the class definition so that the singleton is updated only if there is no
        // exception in the construction of the instance.
        sparkContext.addSparkListener(new SparkListener {
          override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
            defaultSession.set(null)
            sqlListener.set(null)
          }
        })
      }

      return session
    }
  }

And defaultSession is an AtomicReference:

  /** Reference to the root SparkSession. */
  private val defaultSession = new AtomicReference[SparkSession]

To keep getOrCreate from entering the "Global synchronization so we will only set the default session once" branch and returning that shared session, we need to clear this reference after creating a session.

Setting a breakpoint confirmed this guess:

defaultSession

With 4 actors, this breakpoint was hit 3 times: the first actor created the session, and the other three fell through to the existing defaultSession.

  /**
   * Clears the default SparkSession that is returned by the builder.
   *
   * @since 2.0.0
   */
  def clearDefaultSession(): Unit = {
    defaultSession.set(null)
  }

The final fix

  // in object TestActor: build a SparkSession for an actor without leaving it behind as the global default
  def getSparkSession(sparkConf: SparkConf) = SparkSession.synchronized {
    // make sure getOrCreate does not return a previously registered default session
    SparkSession.clearDefaultSession()
    val session = SparkSession.builder().config(sparkConf).getOrCreate()
    // clear again so the session we just created is not picked up as the default by the next actor
    SparkSession.clearDefaultSession()
    session
  }

  override def preStart(): Unit = {
    sparkSession = TestActor.getSparkSession(sparkConf)
  }

The log output:

20:41:35 942  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:35 943  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:35 942  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:35 943  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: test1
20:41:40 85  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:41:40 85  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:41:40 86  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
[INFO] [11/17/2016 20:41:40.083] [testSystem-akka.actor.default-dispatcher-4] [akka://testSystem/user/actor1] exec select * from test1
[INFO] [11/17/2016 20:41:40.083] [testSystem-akka.actor.default-dispatcher-2] [akka://testSystem/user/actor2] exec select * from test1
[INFO] [11/17/2016 20:41:40.083] [testSystem-akka.actor.default-dispatcher-3] [akka://testSystem/user/actor4] exec select * from test1
[INFO] [11/17/2016 20:41:40.093] [testSystem-akka.actor.default-dispatcher-5] [akka://testSystem/user/actor3] exec select * from test1
20:41:40 93  INFO (org.apache.spark.sql.execution.SparkSqlParser:54) - Parsing command: select * from test1
20:41:40 709  INFO (org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator:54) - Code generated in 365.008452 ms
+---+----+
| id|name|
+---+----+
+---+----+

20:41:40 785  INFO (org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator:54) - Code generated in 17.495479 ms
+---+----+
| id|name|
+---+----+
|192|   p|
| 65|   t|
|594|   c|
| 26|   w|
|787|   s|
|456|   j|
|687|   m|
+---+----+

+---+----+
| id|name|
+---+----+
|993|   b|
|290|   a|
|706|   n|
| 85|   f|
+---+----+

+---+----+
| id|name|
+---+----+
|510|   w|
|473|   x|
|877|   c|
| 80|   f|
|826|   k|
+---+----+

The breakpoint was never hit this time, and the problem was solved.
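
As a side note, and not part of the fix above: Spark 2.0 also exposes SparkSession.newSession(), which shares the SparkContext and cached data but gets its own SessionState, and therefore its own temporary views. If a single base session were created up front in the driver, each actor could take an isolated child session instead; a rough sketch under that assumption (base and perActorSession are hypothetical names):

// hypothetical alternative: one base session in the driver, one child session per actor
val base = SparkSession.builder().config(sparkConf).getOrCreate()
val perActorSession = base.newSession()
perActorSession.range(5).createOrReplaceTempView("test1") // not visible from other sessions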

zhangzhikuan commented 7 years ago

Nice, this helped me solve my problem, especially the Spark metadata cache issue.