steffengr opened this issue 3 months ago (status: Open)
+1, @treff7es are you going to spend some time on it?
@Michalosu or @steffengr, can you share a sample PySpark job that I can use to reproduce the issue?
@treff7es You can use a minimal job such as the following one with any input file. I ran it on AWS Glue for testing.
Make sure that the path s3://my-bucket/in/ is registered as a dataset in Datahub before running the job, or set the configuration spark.datahub.metadata.dataset.materialize=true.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from the upstream path and write to the downstream path.
spark.read.parquet("s3://my-bucket/in/") \
    .write.parquet("s3://my-bucket/out/")
I ran it with as little configuration as possible:
--conf spark.extraListeners=datahub.spark.DatahubSparkListener --conf spark.datahub.rest.server=https://gms.my.datahub.com
--extra-jars s3://my-bucket/jars/acryl-spark-lineage-0.2.16.jar
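For reference, here is a minimal sketch (not from the thread) of the same job with those settings applied via the SparkSession builder instead of submit-time --conf flags. It assumes the acryl-spark-lineage jar is already on the driver classpath (e.g. via --extra-jars as above) and that no SparkContext has been created yet; on Glue the context is often pre-created, so the --conf form above may be the more reliable route. The bucket paths and GMS URL are the same placeholders as above.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register the DataHub lineage listener; this only takes effect if the
    # SparkContext has not been created yet.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "https://gms.my.datahub.com")  # placeholder GMS URL
    # Materialize the input path as a dataset if it is not already registered in Datahub.
    .config("spark.datahub.metadata.dataset.materialize", "true")
    .getOrCreate()
)

# Same minimal read/write: only in/ should appear as upstream and out/ as downstream.
spark.read.parquet("s3://my-bucket/in/") \
    .write.parquet("s3://my-bucket/out/")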
@steffengr Thanks. Could you please test with this rc jar?
io.acryl:acryl-spark-lineage:0.2.17-rc3
I tried it with your test code and it looks good.
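If copying the rc jar to S3 is inconvenient, one possible alternative (an assumption, not something suggested in the thread) is resolving the artifact from Maven via Spark's spark.jars.packages setting, assuming the cluster can reach Maven Central at startup; on AWS Glue the --extra-jars approach above may still be the practical route.

from pyspark.sql import SparkSession

# Resolve the release-candidate listener from Maven instead of shipping the jar to S3.
# Assumes network access to Maven Central; the GMS URL is a placeholder as in the earlier job.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:0.2.17-rc3")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "https://gms.my.datahub.com")
    .getOrCreate()
)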
Describe the bug
acryl-spark-lineage 0.2.15 introduced a bug that reports S3 upstream dependencies as both upstream and downstream for PySpark jobs.
To Reproduce
Steps to reproduce the behavior: see the minimal PySpark job and configuration in the comments above.
Expected behavior
Upstream dependencies should only be reported as upstream.
Additional context
The error occurred with Datahub v0.13.2 and acryl-spark-lineage 0.2.15. Using acryl-spark-lineage 0.2.14 or older works as expected.