steffengr opened this issue 3 months ago (status: Open)
+1, @treff7es are you going to spend some time on it?
@Michalosu or @steffengr, can you share a sample PySpark job that I can use to reproduce the issue?
@treff7es You can use a minimal job such as the following one with any input file. I ran it on AWS Glue for testing.
Make sure that the path s3://my-bucket/in/ is registered as a dataset in Datahub before running the job, or set the configuration spark.datahub.metadata.dataset.materialize=true.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from the upstream path and write to the downstream path.
spark.read.parquet("s3://my-bucket/in/") \
    .write.parquet("s3://my-bucket/out/")
I ran it with as little configuration as possible:
--conf spark.extraListeners=datahub.spark.DatahubSparkListener --conf spark.datahub.rest.server=https://gms.my.datahub.com
--extra-jars s3://my-bucket/jars/acryl-spark-lineage-0.2.16.jar
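For reference, here is a minimal sketch (not from the thread) of the same job with those settings applied via the SparkSession builder instead of submit-time --conf flags. It assumes the acryl-spark-lineage jar is already on the driver classpath (e.g. via --extra-jars as above) and that no SparkContext has been created yet; on Glue the context is often pre-created, so the --conf form above may be the more reliable route. The bucket paths and GMS URL are the same placeholders as above.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register the DataHub lineage listener; this only takes effect if the
    # SparkContext has not been created yet.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "https://gms.my.datahub.com")  # placeholder GMS URL
    # Materialize the input path as a dataset if it is not already registered in Datahub.
    .config("spark.datahub.metadata.dataset.materialize", "true")
    .getOrCreate()
)

# Same minimal read/write: only in/ should appear as upstream and out/ as downstream.
spark.read.parquet("s3://my-bucket/in/") \
    .write.parquet("s3://my-bucket/out/")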
@steffengr Thanks. Could you please test with this rc jar?
io.acryl:acryl-spark-lineage:0.2.17-rc3
I tried it with your test code and it looks good.
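If copying the rc jar to S3 is inconvenient, one possible alternative (an assumption, not something suggested in the thread) is resolving the artifact from Maven via Spark's spark.jars.packages setting, assuming the cluster can reach Maven Central at startup; on AWS Glue the --extra-jars approach above may still be the practical route.

from pyspark.sql import SparkSession

# Resolve the release-candidate listener from Maven instead of shipping the jar to S3.
# Assumes network access to Maven Central; the GMS URL is a placeholder as in the earlier job.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:0.2.17-rc3")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "https://gms.my.datahub.com")
    .getOrCreate()
)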
Describe the bug
acryl-spark-lineage 0.2.15 introduced a bug that reports S3 upstream dependencies as both upstream and downstream for PySpark jobs.
To Reproduce
Steps to reproduce the behavior: see the minimal PySpark job and configuration in the comments above.
Expected behavior
Upstream dependencies should only be reported as upstream.
Additional context
The error occurred with Datahub v0.13.2 and acryl-spark-lineage 0.2.15. Using acryl-spark-lineage 0.2.14 or older works as expected.