OpenLineage / OpenLineage

An Open Standard for lineage metadata collection
http://openlineage.io
Apache License 2.0

[SPARK] Column-Level Lineage is not captured for Final Output Datasets (Hive Tables) in Spark Job #2592

Open singh-ranvir opened 4 months ago

singh-ranvir commented 4 months ago

Hi team, we tested a sample job to capture lineage and recorded the following observations:

  1. Job-level lineage, along with "Schema" details for all "input" and "output" datasets, was captured and populated in the Marquez UI.
  2. "Column-level" lineage is not captured for the output datasets. On checking, we found an "exprId" for each output field in the captured lineage data, which should be matched against the input fields taken from the plan.

Please find below the code we tested:


// Seeded random generator so the sample data is reproducible
val r = new scala.util.Random(42)

// Input table 1: user IDs 10 to 100, each with a random age
val user = for (i <- 10 to 100) yield (i, r.nextInt(100))
val userData = spark.createDataFrame(user).toDF("u_id","u_age")
userData.write.mode("overwrite").saveAsTable("ol_hive.user_data")

// Input table 2: user IDs 0 to 1000, each with a random name and usage value
val data = for (i <- 0 to 1000) yield (i, "user-" + r.alphanumeric.take(5).mkString(""), r.nextInt(1000))
val dsUsage = spark.createDataFrame(data).toDF("u_id","u_name","u_usage")
dsUsage.write.mode("overwrite").saveAsTable("ol_hive.ds_usage")

// Output table: inner join of the two inputs on u_id
val finalData = spark.sql("select ds.u_id, ds.u_name, ds.u_usage, ud.u_age from ol_hive.ds_usage ds inner join ol_hive.user_data ud on ds.u_id = ud.u_id")
finalData.write.mode("overwrite").saveAsTable("ol_hive.final_data")


Additional Details:

Spark Version: 2.4
OpenLineage Version: 1.8.0
Marquez Version: 0.46

Tested in spark-shell, launched with the following configuration:

spark-shell --master yarn \
  --num-executors 4 \
  --driver-memory 3g \
  --executor-memory 3g \
  --jars openlineage-spark-1.8.0.jar \
  --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --conf "spark.openlineage.transport.type=http" \
  --conf "spark.openlineage.transport.url=" \
  --conf "spark.openlineage.namespace=sample-job-hive-ss"

mobuchowski commented 4 months ago

@singh-ranvir column-level lineage does not work for Spark 2
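
A quick way to tell whether a session is affected is to check the Spark major version before expecting column-level lineage in the emitted events. The sketch below is only an illustration: `ColumnLineageCheck` and its methods are hypothetical names, not part of OpenLineage or Spark, and the threshold of Spark 3 or later is an assumption based on the statement above that Spark 2 is not supported.

```scala
// Hypothetical helper for illustration; not part of OpenLineage or Spark.
object ColumnLineageCheck {

  /** Extracts the leading major version from a string such as "2.4" or "3.5.0". */
  def majorVersion(sparkVersion: String): Option[Int] =
    sparkVersion.trim.takeWhile(_.isDigit) match {
      case ""     => None // no leading digits, version cannot be parsed
      case digits => scala.util.Try(digits.toInt).toOption
    }

  /** Assumption: column-level lineage can only be expected on Spark 3 or later. */
  def columnLineageExpected(sparkVersion: String): Boolean =
    majorVersion(sparkVersion).exists(_ >= 3)
}
```

In spark-shell this could be called as `ColumnLineageCheck.columnLineageExpected(spark.version)`. For the Spark 2.4 setup reported above it returns `false`, which matches the missing column-level lineage on the output datasets.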