datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0
9.93k stars 2.94k forks source link

dbt/BigQuery: Incomplete Column Level Lineage when using deduplication macros #11670

Open Starkie opened 1 month ago

Starkie commented 1 month ago

Hey all!

Describe the bug We have a dbt core project that targets a BigQuery data warehouse. When ingesting the metadata into DataHub with include_column_lineage=true and prefer_sql_parser_lineage=true, the column-level lineage (CLL) is missing for some of the datasets. The table-level lineage is fine for all of them.

We've tracked down the issue to a row deduplication macro from dbt_utils. The problem seems to be in the generated SQL code:

select unique.*
from (
     select
         array_agg (
                 original
                     order by article_name desc
            limit 1
         )[offset(0)] unique
     from all_articles original
     group by id
)

When replacing it with a simple SELECT * FROM all_articles statement, the CLL is generated correctly.

We're not sure if this is specific to DataHub or should be reported to sqlglot instead. Let me know and I can create the issue there.

To Reproduce We have created a repository with a small dbt project to reproduce it: https://github.com/Starkie/datahub-dbt-lineage-repro

Steps to reproduce the behavior:

  1. Setup a dbt project that targets BigQuery
  2. Add a model whose last statement has the following code. Replace the values between <> with the correct ones for your model:
select unique.*
from (
     select
         array_agg (
                 original
                     order by <table_column> desc
            limit 1
         )[offset(0)] unique
     from <table_name> original
     group by id
)
  1. Execute dbt with DataHub's recommended commands:
dbt source snapshot-freshness
dbt build
cp target/run_results.json target/run_results_backup.json
dbt docs generate
cp target/run_results_backup.json target/run_results.json
  1. Execute a dbt DataHub ingestion recipe that includes column-level lineage and uses the SQL parser for lineage.
    
    pipeline_name: "kafka-inventory-01-dbt"

source: type: "dbt" config: env: dev

# Coordinates
manifest_path: "./target/manifest.json"
catalog_path: "./target/catalog.json"
run_results_paths: ["./target/run_results.json"]

# Options
convert_column_urns_to_lowercase: true
target_platform: "bigquery"
skip_sources_in_lineage: true
prefer_sql_parser_lineage: true
include_column_lineage: true
infer_dbt_schemas: true


**Expected behavior**
We expected the column-level lineage between the source table and the final model to be generated.

![01-expected-lineage](https://github.com/user-attachments/assets/9a0b12fd-133e-402e-a91a-94df543ac399)

**Actual behavior**
![02-actual-lineage](https://github.com/user-attachments/assets/c38afa75-216a-4f65-9d6f-a6c805df298a)

**Desktop (please complete the following information):**
 - OS: Windows 11
 - Browser Firefox 
 - Version 131.03

- DataHub version: 0.14.1
- DataHub CLI version: 0.14.1
jjoyce0510 commented 2 weeks ago

Thank you for all the details! This is super helpful. We can try to reproduce but I think it's not a bad idea to raise this directly on SQL glot as well to ensure the parser can in general handle these types of statements.