Open singh-ranvir opened 4 months ago
@singh-ranvir column-level lineage does not work for Spark 2
Hi Team, we tested a sample job to capture lineage and recorded the following observations:
Please find the tested code below:
// Build a small user table (u_id, u_age) with a fixed random seed and save it to Hive
val r = new scala.util.Random(42)
val user = for (i <- 10 to 100) yield (i, r.nextInt(100))
val userData = spark.createDataFrame(user).toDF("u_id","u_age")
userData.write.mode("overwrite").saveAsTable("ol_hive.user_data")
// Build a usage table (u_id, u_name, u_usage) and save it to Hive
val data = for (i <- 0 to 1000) yield (i, "user-" + r.alphanumeric.take(5).mkString(""), r.nextInt(1000))
val dsUsage = spark.createDataFrame(data).toDF("u_id","u_name","u_usage")
dsUsage.write.mode("overwrite").saveAsTable("ol_hive.ds_usage")
// Join the two tables on u_id and write the result to a third Hive table
val finalData = spark.sql("select ds.u_id, ds.u_name, ds.u_usage, ud.u_age from ol_hive.ds_usage ds inner join ol_hive.user_data ud on ds.u_id = ud.u_id")
finalData.write.mode("overwrite").saveAsTable("ol_hive.final_data")
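For reference, the direct column-level dependencies implied by the final join can be written down by hand from its SELECT list. The following is a minimal sketch in plain Scala (no Spark or OpenLineage calls). The names SourceField and expectedLineage are hypothetical and used only for illustration; they are not part of the OpenLineage API, and the mapping is derived from the query text, not from any emitted event.

```scala
// Hypothetical helper type for illustration only; not an OpenLineage class.
case class SourceField(table: String, column: String)

// Direct dependencies for each column of ol_hive.final_data,
// read off the SELECT list of the join query above.
val expectedLineage: Map[String, Seq[SourceField]] = Map(
  "u_id"    -> Seq(SourceField("ol_hive.ds_usage", "u_id")),
  "u_name"  -> Seq(SourceField("ol_hive.ds_usage", "u_name")),
  "u_usage" -> Seq(SourceField("ol_hive.ds_usage", "u_usage")),
  "u_age"   -> Seq(SourceField("ol_hive.user_data", "u_age"))
)

// Print one line per output column with its source columns.
expectedLineage.foreach { case (out, sources) =>
  val inputs = sources.map(s => s"${s.table}.${s.column}").mkString(", ")
  println(s"ol_hive.final_data.$out <- $inputs")
}
```

This can be pasted into the same spark-shell session and compared by eye against whatever column-level information appears for ol_hive.final_data.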
Additional Details:
Spark Version: 2.4
OpenLineage Version: 1.8.0
Marquez Version: 0.46
Tested in spark-shell, launched with the following configs:
spark-shell --master yarn \
  --num-executors 4 \
  --driver-memory 3g \
  --executor-memory 3g \
  --jars openlineage-spark-1.8.0.jar \
  --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --conf "spark.openlineage.transport.type=http" \
  --conf "spark.openlineage.transport.url=" \
  --conf "spark.openlineage.namespace=sample-job-hive-ss"