bartosz25 / spark-scala-playground

Sample processing code using Spark 2.1+ and Scala

About Serialization #6

Closed bithw1 closed 5 years ago

bithw1 commented 5 years ago

Hi @bartosz25

The following code is derived from your post (http://www.waitingforcode.com/apache-spark/serialization-issues-part-1/read). Thank you for these great articles!

import org.apache.spark.{SparkConf, SparkContext, SparkException}
import org.scalatest.{BeforeAndAfter, FlatSpec, Matchers}

class FunctionExposedObjectTest_Yuzt extends FlatSpec with Matchers with BeforeAndAfter {

  val conf = new SparkConf().setAppName("Spark lazy loading singleton test").setMaster("local")
  var sparkContext: SparkContext = null

  before {
    sparkContext = SparkContext.getOrCreate(conf)
  }

  after {
    sparkContext.stop()
  }

  "not serializable object" should "make succeed" in {
    sparkContext.parallelize(0 to 1)
      .map(number => new MyLabelMaker().convertToLabel(number))
      .collect()
  }

  "not serializable object" should "make fail" in {
    sparkContext.parallelize(0 to 1)
      .map(number => new MyLabelMaker())
      .collect()
  }

}

class MyLabelMaker() {

  def convertToLabel(number: Int): String = {
    s"Number#${number}"
  }
}

The second test case fails while the first one succeeds, and I don't understand why. It looks like it involves the closure serialization mechanism, but I am not clear on the details. @bartosz25 could you please take a look?

bartosz25 commented 5 years ago

The collect() method brings all the data generated by the executors back to the driver. This means the executors need to serialize the data and send it over the network, and the driver then needs to deserialize it.

In your case, the first test creates the non-serializable MyLabelMaker instance on the executor side and returns a String from the map transformation. Since String is serializable, the test passes. The second test fails because the mapped type, MyLabelMaker, is not serializable. You can add extends Serializable to MyLabelMaker to confirm that this is the reason.
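To make the difference concrete without starting Spark, here is a minimal sketch that pushes the three kinds of result through plain Java serialization. It assumes the job uses Spark's default Java-based serializer for task results. The names SerializableLabelMaker, ResultSerializationSketch and javaSerializable are hypothetical helpers for this illustration, not part of the original test.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Same shape as the class in the question: no Serializable marker.
class MyLabelMaker {
  def convertToLabel(number: Int): String = s"Number#$number"
}

// Hypothetical variant with the marker added, as suggested above.
class SerializableLabelMaker extends Serializable {
  def convertToLabel(number: Int): String = s"Number#$number"
}

object ResultSerializationSketch {
  // Returns true when plain Java serialization accepts the value.
  // Assumption: this mirrors what happens to a task result under the default serializer.
  def javaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // First test: the map returns a String, which is serializable.
    println(javaSerializable(new MyLabelMaker().convertToLabel(1))) // true
    // Second test: the map returns MyLabelMaker itself.
    println(javaSerializable(new MyLabelMaker()))                   // false
    // With the marker added, the same kind of result can be sent back.
    println(javaSerializable(new SerializableLabelMaker()))         // true
  }
}
```

Only the value returned by map matters here. Creating a MyLabelMaker inside the function is harmless as long as the instance never has to leave the executor.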

bithw1 commented 5 years ago

Thanks @bartosz25 .

After reading your answer, I changed collect to count, and there is no serialization issue anymore. Now I understand: I had only considered the serialization of these two test cases from driver to executor and forgot the opposite direction, from executor to driver. (><)
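For reference, a rough sketch of the count versus collect comparison, assuming a local master as in the test above. The object CountVersusCollect and its two helper methods are hypothetical names used only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Same non-serializable class as in the question.
class MyLabelMaker {
  def convertToLabel(number: Int): String = s"Number#$number"
}

object CountVersusCollect {
  // count() sends only a per-partition element count back to the driver,
  // so the MyLabelMaker instances never leave the executor.
  def countMakers(sparkContext: SparkContext): Long =
    sparkContext.parallelize(0 to 1).map(_ => new MyLabelMaker()).count()

  // collect() has to ship the MyLabelMaker instances themselves to the driver,
  // which fails because the result type is not serializable.
  def collectFails(sparkContext: SparkContext): Boolean =
    Try(sparkContext.parallelize(0 to 1).map(_ => new MyLabelMaker()).collect()).isFailure

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count vs collect sketch").setMaster("local")
    val sparkContext = SparkContext.getOrCreate(conf)
    try {
      println(countMakers(sparkContext))  // 2
      println(collectFails(sparkContext)) // true
    } finally {
      sparkContext.stop()
    }
  }
}
```

In both cases the function passed to map captures nothing from the driver, so the driver-to-executor direction is not the problem. The only difference is what travels back.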

Thank you @bartosz25, you have very solid knowledge. :100: I have been following your articles these days and have learned a lot.