Can you try to write the content of the DataFrame somewhere and check whether it sends metadata in that case? In the meantime I will check and try to fix it with your example. It seems like we initialize the DataHub emitter in an event which only gets triggered if you read from or write to a data source.
Hi @treff7es,
Thanks for your reply. I've updated the Spark job so it saves the DataFrame to a PostgreSQL table, and the agent has started pushing metadata to DataHub.
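For anyone hitting the same behavior, a minimal sketch of that change is below. The JDBC URL, table name, and credentials are placeholders, not the ones from my actual job, and the PostgreSQL JDBC driver must be on the classpath (e.g. add `org.postgresql:postgresql` to `--packages`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datahub-lineage-test").getOrCreate()

# Example data; column names and rows are placeholders.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Writing to an external destination gives the agent an output to record,
# which is what triggers the emitter.
(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/testdb")
    .option("dbtable", "public.test_table")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save()
)

spark.stop()
```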
Now it seems to me that this behavior of the Spark agent is correct, since the main goal of the agent is to provide lineage, which is based on a "source -> transformation -> destination" chain.
Am I right?
Yes, it is not really useful to use the agent to capture a Spark job without inputs and outputs.
I fixed this issue as well; now you should be able to see the Spark job without inputs.
Describe the bug
I've tried to set up a test Spark example according to the article: https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta
DataHub 0.13.1 is running in Docker.
I use `spark-submit` running locally to execute the PySpark job listed below:

```
spark-submit --packages io.acryl:acryl-spark-lineage:0.2.11 --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" test.py
```
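The original `test.py` isn't included in the issue; a minimal sketch of what such a job might look like (the DataFrame contents and column names are assumptions) is:

```python
# test.py -- hypothetical reconstruction of the job from the issue.
# Note it builds a DataFrame purely in memory, with no read from or
# write to an external data source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datahub-lineage-test").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()

spark.stop()
```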
The Spark application finishes as expected and, according to the logs, prints some details about the Spark agent. But nothing appears in DataHub; it seems the agent doesn't even try to publish metadata events to GMS.
Expected behavior
DataHub shows the metadata extracted from the Spark application.