databricks / spark-integration-tests

Integration tests for Spark
Apache License 2.0

Append Spark assembly JARs to classpath rather than adding SPARK_HOME as an SBT subproject #4

Open · JoshRosen opened this issue 9 years ago

JoshRosen commented 9 years ago

The current approach of adding SPARK_HOME as an SBT subproject makes it difficult to change Spark versions cleanly, since it is very easy to wind up using assembly JARs that were built against one revision while the test driver itself runs against classes from SPARK_HOME. Instead, we should scrap the SBT subprojects and just add SBT logic that appends the assembly JARs to the run and test classpaths.
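
In build-file terms, the idea is roughly the following sketch (sbt 0.13 syntax; the SparkClasspath object name, the scala-2.10 assembly path, and the exactly-one-JAR assumption are illustrative, not a tested implementation):

import java.io.File
import sbt._
import Keys._

// Sketch only: locate the assembly JAR produced by sbt/sbt assembly under
// SPARK_HOME and expose it to compile, run, and test as an unmanaged JAR,
// instead of depending on Spark as an SBT subproject.
object SparkClasspath {
  val sparkHome: String =
    sys.env.getOrElse("SPARK_HOME", sys.error("SPARK_HOME must be defined"))

  val sparkAssemblyJar: File = {
    val assemblyDir = new File(sparkHome, "assembly/target/scala-2.10/")
    val jars = Option(assemblyDir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.getName.startsWith("spark-assembly") && f.getName.endsWith(".jar"))
    require(jars.length == 1, s"Expected exactly one Spark assembly JAR, found: ${jars.toSeq}")
    jars.head
  }

  // Settings fragment to mix into the root project's settings.
  val settings: Seq[Setting[_]] = Seq(
    unmanagedJars in Compile += Attributed.blank(sparkAssemblyJar),
    unmanagedJars in Runtime += Attributed.blank(sparkAssemblyJar),
    unmanagedJars in Test += Attributed.blank(sparkAssemblyJar)
  )
}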

JoshRosen commented 9 years ago

I tried doing this, but ran into a few issues:

diff --git a/project/SparkIntegrationTestsBuild.scala b/project/SparkIntegrationTestsBuild.scala
index 3985927..d7e8539 100644
--- a/project/SparkIntegrationTestsBuild.scala
+++ b/project/SparkIntegrationTestsBuild.scala
@@ -30,34 +30,46 @@ object SparkIntegrationTestsBuild extends Build {
     throw new Exception("SPARK_HOME must be defined")
   }

-  lazy val sparkCore = ProjectRef(
-    file(SPARK_HOME),
-    "core"
-  )
-
+  val SPARK_ASSEMBLY_JAR = {
+    val assemblyTargetDir = new File(SPARK_HOME, "assembly/target/scala-2.10/")
+    require(assemblyTargetDir.exists,
+      "Missing Spark assembly JAR; run sbt/sbt assembly in SPARK_HOME")
+    val jars = assemblyTargetDir.listFiles().filter(_.getName.endsWith(".jar"))
+      .filter(_.getName.startsWith("spark-assembly")).toSeq
+    require(jars.size != 0, "Missing Spark assembly JAR; run sbt/sbt assembly in SPARK_HOME")
+    require(jars.size == 1, s"Found multiple Spark assembly JARs; remove all but one: $jars")
+    jars.head
+  }

-  lazy val sparkStreaming = ProjectRef(
-    file(SPARK_HOME),
-    "streaming"
-  )
+  val SPARK_KAFKA_JAR = {
+    val kafkaTargetDir = new File(SPARK_HOME, "external/kafka/target/scala-2.10/")
+    require(kafkaTargetDir.exists, "Missing Streaming Kafka JAR; run sbt/sbt package in SPARK_HOME")
+    val jars = kafkaTargetDir.listFiles().filter(_.getName.endsWith(".jar"))
+      .filter(_.getName.startsWith("spark-streaming-kafka")).toSeq
+    require(jars.size != 0, "Missing Streaming Kafka JAR; run sbt/sbt package in SPARK_HOME")
+    require(jars.size == 1, s"Found multiple Streaming Kafka JARs; remove all but one: $jars")
+    jars.head
+  }

-  lazy val streamingKafka = ProjectRef(
-    file(SPARK_HOME),
-    "streaming-kafka"
-  )
+  println(SPARK_ASSEMBLY_JAR)

   lazy val root = Project(
     "spark-integration-tests",
     file("."),
     settings = buildSettings ++ Seq(
       scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature"),
+      unmanagedJars in Compile ++= { Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).classpath },
+      unmanagedJars in Runtime ++= { Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).classpath },
+      unmanagedJars in Test ++= { Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).classpath },
       libraryDependencies ++= Seq(
         "com.jsuereth" %% "scala-arm" % "1.4",
         "fr.janalyse"   %% "janalyse-ssh" % "0.9.14",
         "com.jcraft" % "jsch" % "0.1.51",
         "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
-        "net.sf.jopt-simple" % "jopt-simple" % "3.2" % "test"  // needed by Kafka, excluded by Spark
+        "net.sf.jopt-simple" % "jopt-simple" % "3.2" % "test", // needed by Kafka, excluded by Spark
+        "com.google.guava" % "guava" % "14.0.1"
       )
     )
-  ).dependsOn(sparkCore, sparkStreaming, streamingKafka)
+  )
 }

It seems that sbt/sbt clean package assembly may generate multiple JARs in Spark's assembly folder, one of which is the actual assembly JAR and one of which is empty. Once I delete the extra JAR, I still run into some issues with Kafka. I guess I shouldn't rely on Spark / Spark Streaming to provide the transitive dependencies that the integration-tests project needs, and should probably just declare them directly.
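
One possible workaround for the duplicate-JAR problem, sketched here as an untested assumption rather than a fix that was applied, would be to pick the largest matching JAR instead of requiring exactly one:

// Hypothetical workaround (not used above): Spark's assembly directory may
// contain both the real assembly JAR and a nearly-empty one, so pick the
// largest candidate by file size instead of failing when more than one matches.
def findAssemblyJar(assemblyDir: java.io.File): java.io.File = {
  val candidates = Option(assemblyDir.listFiles()).getOrElse(Array.empty[java.io.File])
    .filter(f => f.getName.startsWith("spark-assembly") && f.getName.endsWith(".jar"))
  require(candidates.nonEmpty, "Missing Spark assembly JAR; run sbt/sbt assembly in SPARK_HOME")
  candidates.maxBy(_.length)  // File.length() is the size in bytes; the stray JAR is tiny
}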

JoshRosen commented 9 years ago

I could just add "provided" dependencies on spark-core, spark-streaming, spark-streaming-kafka, etc., but this results in the published (provided) versions being used on the test classpath, so we would have to do some tricky hacks to make sure that we actually run against the classes in SPARK_HOME.
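
For illustration only, the kind of hack this would involve might look roughly like the sketch below; the version numbers are placeholders, and the fullClasspath overrides are an assumption about how SPARK_HOME's classes could be forced to shadow the published artifacts (SPARK_ASSEMBLY_JAR refers to the File located by the build logic above):

// Hypothetical sketch: declare the published Spark artifacts as "provided" for
// compilation, then prepend the locally-built assembly so that classes from
// SPARK_HOME shadow the published ones on the run and test classpaths.
// Version numbers are placeholders; assumes the usual import sbt._ / Keys._.
val providedSparkSettings: Seq[Setting[_]] = Seq(
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0" % "provided"
  ),
  fullClasspath in Runtime := Attributed.blank(SPARK_ASSEMBLY_JAR) +: (fullClasspath in Runtime).value,
  fullClasspath in Test := Attributed.blank(SPARK_ASSEMBLY_JAR) +: (fullClasspath in Test).value
)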