AbsaOSS / spline

Data Lineage Tracking And Visualization Solution
https://absaoss.github.io/spline/
Apache License 2.0

Can't initialize Lineage Tracking #753

Closed sidp-dev closed 4 years ago

sidp-dev commented 4 years ago

Background

I have all of the pieces set up: the Spline web UI and ArangoDB.

Trying to run Spline with SBT/IntelliJ locally through a unit test and getting the below error:

20/07/13 20:20:11 ERROR SparkLineageInitializer$: Initialization failed! Spark Lineage tracking is DISABLED.
java.lang.ClassNotFoundException: za.co.absa.spline.coresparkadapterapi.SparkVersionRequirementImpl
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at za.co.absa.spline.coresparkadapterapi.AdapterFactory$class.instance(AdapterFactory.scala:23)
    at za.co.absa.spline.coresparkadapterapi.SparkVersionRequirement$.instance$lzycompute(SparkVersionRequirement.scala:31)
    at za.co.absa.spline.coresparkadapterapi.SparkVersionRequirement$.instance(SparkVersionRequirement.scala:31)
    at za.co.absa.spline.core.SparkLineageInitializer$SparkSessionWrapper.attemptInitialization(SparkLineageInitializer.scala:72)
    at za.co.absa.spline.core.SparkLineageInitializer$SparkSessionWrapper.liftedTree1$1(SparkLineageInitializer.scala:56)
    at za.co.absa.spline.core.SparkLineageInitializer$SparkSessionWrapper.enableLineageTracking(SparkLineageInitializer.scala:55)
    at com.unity3d.spark.ads.AdsEventsEnrichmentPrototypeTest$$anonfun$1.apply$mcV$sp(AdsEventsEnrichmentPrototypeTest.scala:64)

I have the below dependencies set up in my build.sbt and added as dependencies to the module where I'm running the test:

val spline             = "za.co.absa.spline"            % "spline-core"                   % "0.3.5"
val splineSparkAdapter = "za.co.absa.spline"            % "spline-core-spark-adapter-api" % "0.3.5"

Set up spline.properties as below:

spline.mode=BEST_EFFORT
spline.producer.url=http://localhost:8080/producer

My sbt build seems to be working fine, and I've set up the Spark session to enable lineage tracking in the test suite as shown below. I'm only including the lines that I think are relevant; please let me know if more info is needed:

import za.co.absa.spline.core.SparkLineageInitializer._

  before {
    val df_map = EnrichmentPrototype.graph(sparkSession.enableLineageTracking(), conf, debug = "True")
    // graph() is basically a sequence of dataframe reads and transforms, all using the sparkSession
    // passed as the argument; it returns a map of dataframes keyed by name
    val emptyDf = sparkSession.emptyDataFrame

    developersDf = df_map.getOrElse("df_developers", emptyDf)
  }

  test("enrich") {
    assert(developersDf.count() == 12)
  }

Question

The test is passing but without lineage tracking.

Any help on how to resolve this exception is highly appreciated as I'm not able to understand why it isn't able to locate the class. Apologies in advance for any lacking information, just let me know. Cheers!

cerveada commented 4 years ago

Hello, version 0.3.x is no longer supported and will not work with ArangoDB anyway. Use a newer version, such as 0.5.3.

Your program should depend only on the Spline agent. Spline core/gateway should be running on Tomcat and providing the producer and consumer APIs. It's all described in the documentation: https://absaoss.github.io/spline/

sidp-dev commented 4 years ago

Thanks @cerveada. My Scala and Spark versions are set as below in my build.sbt:

scalaVersion in ThisBuild := "2.11.8"
val sparkVersion      = "2.3.0"

Sorry, I'm a bit confused about how to set up the right dependencies from this repo: https://github.com/AbsaOSS/spline-spark-agent

I currently have below:

val spline         = "za.co.absa.spline.agent.spark" % "agent-core_2.12"        % "0.5.3"

Can you please help with the exact way to identify which version/bundle to use for my dependency above? Thanks so much in advance!

cerveada commented 4 years ago

Well, the 2.12 at the end of "agent-core_2.12" is the Scala version. I think in sbt you don't have to write it out like that because sbt handles it for you, but I mainly use Maven, so I don't know sbt that well.

In the end, all Scala versions in your project must match. Otherwise, it seems to be OK.
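For readers following along, the suffix point can be sketched in build.sbt form. This is only an illustration, not the configuration the poster ended up with: it assumes that an `agent-core_2.11` artifact is published for 0.5.3 and that this agent version works with Spark 2.3.0, neither of which is confirmed in this thread. The `%%` operator is standard sbt and appends the project's Scala binary version to the artifact name, so the suffix cannot drift away from `scalaVersion`.

```scala
// build.sbt (sketch; versions copied from the posts above)
scalaVersion in ThisBuild := "2.11.8"

val sparkVersion = "2.3.0"

// %% appends the Scala binary version, so with scalaVersion 2.11.8 this
// resolves to agent-core_2.11 instead of the hard-coded agent-core_2.12.
// Assumption: an _2.11 build of agent-core 0.5.3 is available.
val spline   = "za.co.absa.spline.agent.spark" %% "agent-core" % "0.5.3"
val sparkSql = "org.apache.spark"              %% "spark-sql"  % sparkVersion

libraryDependencies ++= Seq(spline, sparkSql)
```

Writing `"agent-core_2.11"` with a single `%` would resolve the same artifact, but the suffix would then have to be updated by hand whenever `scalaVersion` changes.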

sidp-dev commented 4 years ago

Thanks so much @cerveada. I have all the pieces set up and running without issues. I ran my Spark test successfully, but the Spline UI doesn't show anything. I also checked the collections of the Spline DB in the ArangoDB UI, and they don't contain any data.

Looking at the examples in this repo: https://github.com/AbsaOSS/spline-spark-agent/blob/release/0.5.3/examples/src/main/scala/za/co/absa/spline/example/batch/Example1Job.scala

Can you please advise whether my Spark job needs to perform a write for the lineage to work? Thanks again!

sidp-dev commented 4 years ago

Never mind! I added a write to my test and the lineage now shows up. Please keep up the great work! Cheers!
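For anyone landing here with the same symptom, a minimal sketch of the resolution might look like the following. It is not the poster's actual test: the object name, sample data, and output location are made up for illustration, and the import path `za.co.absa.spline.harvester.SparkLineageInitializer._` is assumed from the 0.5.x agent examples linked above. It also assumes `spline.producer.url` is configured as in the earlier `spline.properties`.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{SaveMode, SparkSession}
// Assumption: 0.5.x agent package (the 0.3.x posts above used za.co.absa.spline.core)
import za.co.absa.spline.harvester.SparkLineageInitializer._

// Hypothetical job, for illustration only
object LineageWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lineage-write-sketch")
      .getOrCreate()

    // Hook the Spline agent into this session before running any jobs
    spark.enableLineageTracking()

    import spark.implicits._
    val developers = Seq((1, "ann"), (2, "bob"), (3, "cy")).toDF("id", "name")

    // An action such as count() alone did not produce lineage in this thread
    println(developers.count())

    // Adding a write is what made the lineage show up
    val outputPath = Files.createTempDirectory("lineage-sketch")
      .resolve("developers")
      .toString
    developers.write.mode(SaveMode.Overwrite).parquet(outputPath)

    spark.stop()
  }
}
```

In a unit test, the same idea applies: write the DataFrame under test to a temporary location in addition to asserting on it.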