JoshRosen opened this issue 9 years ago
I tried doing this, but ran into a few issues:
diff --git a/project/SparkIntegrationTestsBuild.scala b/project/SparkIntegrationTestsBuild.scala
index 3985927..d7e8539 100644
--- a/project/SparkIntegrationTestsBuild.scala
+++ b/project/SparkIntegrationTestsBuild.scala
@@ -30,34 +30,46 @@ object SparkIntegrationTestsBuild extends Build {
     throw new Exception("SPARK_HOME must be defined")
   }
-  lazy val sparkCore = ProjectRef(
-    file(SPARK_HOME),
-    "core"
-  )
-
+  val SPARK_ASSEMBLY_JAR = {
+    val assemblyTargetDir = new File(SPARK_HOME, "assembly/target/scala-2.10/")
+    require(assemblyTargetDir.exists,
+      "Missing Spark assembly JAR; run sbt/sbt assembly in SPARK_HOME")
+    val jars = assemblyTargetDir.listFiles().filter(_.getName.endsWith(".jar"))
+      .filter(_.getName.startsWith("spark-assembly")).toSeq
+    require(jars.size != 0, "Missing Spark assembly JAR; run sbt/sbt assembly in SPARK_HOME")
+    require(jars.size == 1, s"Found multiple Spark assembly JARs; remove all but one: $jars")
+    jars.head
+  }
-  lazy val sparkStreaming = ProjectRef(
-    file(SPARK_HOME),
-    "streaming"
-  )
+  val SPARK_KAFKA_JAR = {
+    val kafkaTargetDir = new File(SPARK_HOME, "external/kafka/target/scala-2.10/")
+    require(kafkaTargetDir.exists, "Missing Streaming Kafka JAR; run sbt/sbt package in SPARK_HOME")
+    val jars = kafkaTargetDir.listFiles().filter(_.getName.endsWith(".jar"))
+      .filter(_.getName.startsWith("spark-streaming-kafka")).toSeq
+    require(jars.size != 0, "Missing Streaming Kafka JAR; run sbt/sbt package in SPARK_HOME")
+    require(jars.size == 1, s"Found multiple Streaming Kafka JARs; remove all but one: $jars")
+    jars.head
+  }
-  lazy val streamingKafka = ProjectRef(
-    file(SPARK_HOME),
-    "streaming-kafka"
-  )
+  println(SPARK_ASSEMBLY_JAR)
   lazy val root = Project(
     "spark-integration-tests",
     file("."),
     settings = buildSettings ++ Seq(
       scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature"),
+      unmanagedJars in Compile ++= { Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).classpath },
+      unmanagedJars in Compile += Attributed.blank(SPARK_ASSEMBLY_JAR),
+      unmanagedJars in Runtime ++= { Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).classpath },
+      unmanagedJars in Test ++= { Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).classpath },
       libraryDependencies ++= Seq(
         "com.jsuereth" %% "scala-arm" % "1.4",
         "fr.janalyse" %% "janalyse-ssh" % "0.9.14",
         "com.jcraft" % "jsch" % "0.1.51",
         "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
-        "net.sf.jopt-simple" % "jopt-simple" % "3.2" % "test" // needed by Kafka, excluded by Spark
+        "net.sf.jopt-simple" % "jopt-simple" % "3.2" % "test", // needed by Kafka, excluded by Spark
+        "com.google.guava" % "guava" % "14.0.1"
       )
     )
-  ).dependsOn(sparkCore, sparkStreaming, streamingKafka)
+  )
 }
It seems that `sbt/sbt clean package assembly` may generate multiple JARs in Spark's assembly folder, one of which is the actual assembly JAR and one of which is empty. Once I delete the empty JAR, I still run into some issues with Kafka. I suspect that I shouldn't rely on Spark / Spark Streaming providing the transitive dependencies that the integration tests project needs, and should probably just declare those as dependencies of this project.
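For the multiple-JARs problem specifically, one option might be to make the lookup skip the empty stub instead of requiring exactly one match. A rough sketch, reusing the `assemblyTargetDir` lookup from the diff above (the size cutoff is an arbitrary assumption about what the empty JAR looks like, not something the Spark build guarantees):

```scala
// Sketch: pick the real assembly JAR by ignoring tiny stub JARs.
// The 1 MB cutoff is an arbitrary guess, not a value from the Spark build.
val candidates = assemblyTargetDir.listFiles()
  .filter(f => f.getName.startsWith("spark-assembly") && f.getName.endsWith(".jar"))
  .filter(_.length > 1024 * 1024)
  .toSeq
require(candidates.size == 1, s"Expected exactly one Spark assembly JAR, found: $candidates")
val sparkAssemblyJar = candidates.head
```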
I could just add `provided` dependencies on `spark-core`, `spark-streaming`, `spark-streaming-kafka`, etc., but that results in the provided versions being used on the test classpath, so we would have to do some tricky hacks to make sure that we actually run against the classes in `SPARK_HOME`.
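For reference, the `provided` route would look roughly like this (the Spark version here is purely illustrative):

```scala
// Sketch of the "provided" approach; the Spark version is illustrative.
// sbt keeps "provided" dependencies on the test classpath, so tests would run
// against these artifacts rather than the bits in SPARK_HOME.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0" % "provided"
)
```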
The current approach of adding `SPARK_HOME` as an SBT subproject makes it difficult to properly change Spark versions, since it makes it very easy to wind up using assembly JARs that were built against one revision while the test driver itself runs against classes from `SPARK_HOME`. Instead, we should scrap the SBT subprojects and just add some SBT logic to put the assembly JARs on the run and test classpaths.
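Concretely, that could be as small as something like the following, reusing the `SPARK_ASSEMBLY_JAR` / `SPARK_KAFKA_JAR` lookups from the diff above (a sketch, not a final design):

```scala
// Sketch: no Spark subprojects; just put the pre-built assembly and Kafka JARs
// found under SPARK_HOME on the runtime and test classpaths.
unmanagedClasspath in Runtime ++= Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).map(Attributed.blank(_)),
unmanagedClasspath in Test ++= Seq(SPARK_ASSEMBLY_JAR, SPARK_KAFKA_JAR).map(Attributed.blank(_))
```

Compilation would still need something to compile against, either the assembly JAR itself via `unmanagedJars in Compile` (as in the diff above) or published Spark artifacts.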