datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Spark lineage not recorded in DataHub from AWS Glue #9942

Open denys-tyshetskyy opened 6 months ago

denys-tyshetskyy commented 6 months ago

Describe the bug I am evaluating DataHub as a data catalog and lineage tool for our Data Platform, using this article as an example: https://aws.amazon.com/blogs/big-data/part-2-deploy-datahub-using-aws-managed-services-and-ingest-metadata-from-aws-glue-and-amazon-redshift/ After running the Glue job, I can see the Spark DataPipeline and DataTask created in DataHub, but the lineage doesn't exist. The GMS logs look OK, without errors. An identical issue was raised previously (https://github.com/datahub-project/datahub/issues/8997) but was closed without any comments. Another related issue was closed without resolution or explanation: https://github.com/datahub-project/datahub/issues/5724

To Reproduce Steps to reproduce the behavior:

  1. Follow the instructions in the "Capture data lineage" section of the web page mentioned above
  2. Run the Glue job
  3. Open the DataHub UI and go to Platform->Spark.
  4. Go to the DataTask created after the Glue job run and open the Lineage section

Expected behavior Can see upstream and downstream components in the lineage for the DataTask

Screenshots

image

dorac-git commented 6 months ago

Is there any update on this issue?

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

vinothdataeng commented 4 months ago

Any update on this issue?

denystyshetskyy commented 3 months ago

This still hasn't been answered so can't really close it

treff7es commented 3 months ago

@denystyshetskyy @vinothdataeng Our latest Spark plugin (0.2.8 or 0.2.9) supports Glue, so please give it a try. On Glue, please enable the spark.datahub.stage_metadata_coalescing config parameter, because Glue doesn't send an explicit application end event.
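
For illustration, here is a minimal Python sketch of how the setting mentioned above could be passed as Spark configuration. It is not taken from this thread: the GMS URL is a placeholder, the value `true` for the coalescing flag is an assumption, and `to_conf_args` is a hypothetical helper. Check the plugin documentation for your version before relying on it.

```python
# Hypothetical settings for a Spark job that uses the acryl-spark-lineage plugin.
# Assumptions: the GMS URL below is a placeholder, and "true" is an assumed
# value for the coalescing flag discussed in this comment.
DATAHUB_SPARK_CONF = {
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server": "http://localhost:8080",
    "spark.datahub.stage_metadata_coalescing": "true",
}


def to_conf_args(conf):
    """Render a dict of Spark properties as spark-submit style arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments on one line so they can be copied into a job definition.
    print(" ".join(to_conf_args(DATAHUB_SPARK_CONF)))
```
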

You can check other config parameters here:

denys-tyshetskyy commented 2 months ago

Hi @treff7es, thanks for your response. I am a bit confused about which jar to use for Glue jobs (datahub-spark-lineage or acryl-spark-lineage). Using acryl-spark-lineage version 0.2.9, which I assume is what you are referring to, I managed to get the lineage created after the Glue job finishes. The issue I get now is that the resulting Hive dataset table doesn't have a schema in it.

image image

Is there anything else I need to do to make the schema come through?

Also, when I run the Glue ingestion from the DataHub CLI, it creates a new Glue dataset for the same table. So now, for the same table, I have one Hive table generated by the Glue job with Spark and one Glue table generated by the Glue ingestion.

image

treff7es commented 2 months ago

@denystyshetskyy:

  1. The Spark plugin, by default, only emits the upstream lineage edge and not the datasets and the schema. Ideally, you should capture datasets and schema with the specific DataHub source (for example, using the MySQL source). If you want the Spark lineage plugin to emit them instead, you can enable that with the following config parameters:

    • --conf "spark.datahub.metadata.dataset.materialize=true" to materialize datasets and --conf "spark.datahub.metadata.dataset.experimental_include_schema_metadata=true" to capture schema from the Spark Plugin
  2. To capture Glue tables as glue and not as hive, you should set the config property: --conf "spark.datahub.metadata.dataset.hivePlatformAlias=glue"
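
The three properties above could be combined for a Glue job roughly as follows. This is a sketch under assumptions, not something stated in this thread: it relies on the commonly used workaround of chaining several properties inside the single `--conf` job parameter, and `glue_conf_value` and `LINEAGE_OPTIONS` are hypothetical names introduced only for illustration.

```python
def glue_conf_value(conf):
    """Chain several Spark properties into one value for a single --conf parameter.

    Assumption: the job takes one --conf job parameter, so extra properties are
    appended inside its value, each one prefixed with " --conf ".
    """
    pairs = [f"{key}={value}" for key, value in conf.items()]
    return " --conf ".join(pairs)


# Property names and values are the ones quoted in the comment above.
LINEAGE_OPTIONS = {
    "spark.datahub.metadata.dataset.materialize": "true",
    "spark.datahub.metadata.dataset.experimental_include_schema_metadata": "true",
    "spark.datahub.metadata.dataset.hivePlatformAlias": "glue",
}

# Hypothetical job-arguments mapping, e.g. for a Glue job's default arguments.
default_arguments = {"--conf": glue_conf_value(LINEAGE_OPTIONS)}

if __name__ == "__main__":
    print(default_arguments["--conf"])
```

These options would be added alongside the listener and server settings the job already uses, not in place of them.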