datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

BigQuery Ingestion Fails to Create Lineage Due to SQL Parsing Errors in sqlglot #11654

Open Nirvikalpa108 opened 2 weeks ago

Nirvikalpa108 commented 2 weeks ago

Describe the bug

We are experiencing issues with our BigQuery ingestion jobs where there is no lineage between BigQuery tables and views in a significant number of cases.

The ingestion logs show multiple errors from the sqlglot library, which typically end in contains unsupported syntax. Falling back to parsing as a 'Command'.

We tried running sqlglot locally to parse some sample SQL statements that appear in our audit logs and have broken lineage, and we get the same errors. We tried both with no dialect and with the BigQuery dialect specified.

We suspect the SQL we are trying to parse is too complex: the sqlglot parser falls back to parsing these statements as commands, which we believe is what prevents lineage creation. In addition, our SQL scripts typically create temporary tables, which may also be difficult for sqlglot to handle.

BigQuery Ingestion Job Parsing Errors

Here is a sample of some of the sqlglot parsing errors we receive in our bigquery ingestion job logs.

WARNING  {sqlglot.parser:1432} - 'EXCEPTION WHEN ERROR THEN RAISE USING message = FORMAT(@@error.message)' contains unsupported syntax. Falling back to parsing as a 'Command'.

'END IF' contains unsupported syntax. Falling back to parsing as a 'Command'.

'DECLARE is_full_refresh bool DEFAULT false' contains unsupported syntax. Falling back to parsing as a 'Command'.

'project.views.SCHEMA' contains unsupported syntax. Falling back to parsing as a 'Command'.

To Reproduce

Steps to reproduce the behavior:

Here is a snippet of an example SQL script that triggers the errors. We have obfuscated anything specific to our business:

DECLARE current_job_start TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
DECLARE max_record_ts TIMESTAMP DEFAULT NULL;
DECLARE partitions STRUCT<max_record_ts TIMESTAMP, dates ARRAY<DATE>> DEFAULT NULL;

CALL `internal_project.get_partitions`(
  ('project.dataset.view_name', 'EventTimestamp'),
  ('project.dataset.other_table', 'BusinessDate', 'EventTimestamp'),
  partitions
);

IF ARRAY_LENGTH(partitions.dates) > 0 THEN
  CREATE OR REPLACE TEMP TABLE temp_table AS
  SELECT * EXCEPT (SnapshotTimestamp)
  FROM `project.dataset.view_name`
  WHERE (IDField, FlagField, ForeignKeyField, StartDate) IN UNNEST(partitions.dates)
  ORDER BY EventTimestamp;

  -- Check if the delta table contains new data
  IF (SELECT COUNT(1) FROM temp_table_delta) > 0 THEN
    CREATE OR REPLACE TEMP TABLE final_output AS
    SELECT DISTINCT IDField, Email, UserID, EventTimestamp, BusinessDate
    FROM temp_table_delta
    WHERE EventTimestamp BETWEEN '2023-01-01' AND '2023-12-31';
  END IF;
END IF;

Here is the Python script we used to parse the above locally. We are using Python 3.8.18.

import sqlglot
import argparse

# Setup argument parser to accept the SQL file path from the command line
parser = argparse.ArgumentParser(description="Parse an SQL file using sqlglot.")
parser.add_argument("file", help="Path to the SQL file you want to parse")

args = parser.parse_args()

# Open the file specified by the user
with open(args.file, "r") as file:
    sql_query = file.read()

# Try parsing the SQL; note that parse_one parses the entire input
# but returns only the first statement of a multi-statement script
try:
    parsed = sqlglot.parse_one(sql_query, read="bigquery")
    print(parsed)
except sqlglot.errors.ParseError as e:
    print(f"Failed to parse SQL: {e}")

When we parse this locally using sqlglot, we get the following:

'DECLARE current_job_start TIMESTAMP DEFAULT CURRENT_TIMESTAMP' contains unsupported syntax. Falling back to parsing as a 'Command'.
'DECLARE max_record_ts TIMESTAMP DEFAULT NULL' contains unsupported syntax. Falling back to parsing as a 'Command'.
'DECLARE partitions STRUCT<max_record_ts TIMESTAMP, dates ARRAY<DATE>> DEFAULT NULL' contains unsupported syntax. Falling back to parsing as a 'Command'.
'CALL `internal_project.get_partitions`(
  ('project.dataset.view_name', 'EventTimestamp'),
  ('proje' contains unsupported syntax. Falling back to parsing as a 'Command'.
'IF ARRAY_LENGTH(partitions.dates) > 0 THEN
  CREATE OR REPLACE TEMP TABLE temp_table AS
  SELECT * E' contains unsupported syntax. Falling back to parsing as a 'Command'.
Failed to parse SQL: Required keyword: 'this' missing for <class 'sqlglot.expressions.If'>. Line 19, Col: 11.
  partitions.dates)
  ORDER BY EventTimestamp;

  -- Check if the delta table contains new data
  IF (SELECT COUNT(1) FROM temp_table_delta) > 0 THEN
    CREATE OR REPLACE TEMP TABLE final_output AS
    SELEC

Expected behavior

We expect sqlglot to parse complex SQL scripts like the sample above, which we hope will allow the BigQuery ingestion job to correctly create lineage between tables and views.


Additional context

Our BigQuery ingestion job configuration

source:
  type: bigquery
  config:
    project_on_behalf: {{ bq_slots_project }}
    project_id_pattern:
      allow:
        - .*{{ gcp_project }}

    # Dataset pattern configurations, including deny rules for temporary datasets
    dataset_pattern:
      deny:
        - ^temp_.*
        - .*_temp$
        - .*_temp_.*
        - .*-temp.*
        - .*temporary.*
        - tpc_ds

    use_exported_bigquery_audit_metadata: true
    bigquery_audit_metadata_datasets:
      - {{ gcp_project }}.bigquery_audit_log
    use_date_sharded_audit_log_tables: true
    upstream_lineage_in_report: true
    include_usage_statistics: true
    capture_table_label_as_tag: true
    capture_dataset_label_as_tag: true
    extract_column_lineage: true
    convert_urns_to_lowercase: true

    stateful_ingestion:
      enabled: true
      state_provider:
        type: "datahub"
        config:
          datahub_api:
            server: "http://gott-datahub-gms:8080"

pipeline_name: {{ pipeline }}
sink:
  type: "datahub-rest"
  config:
    server: "http://gott-datahub-gms:8080"

Questions:

I am an SQL newbie and really appreciate any guidance you can provide!

  1. Is it possible for sqlglot to parse this kind of SQL, and are we just doing something wrong?
  2. Are there any known workarounds or configurations we could apply to better handle this?
  3. Is there any plan to improve support for the kind of SQL we need to parse in sqlglot?
  4. Given the kind of SQL we need to parse, do you think it would be possible for us to contribute changes to sqlglot? If so, do you happen to know if this would be something the maintainers would likely welcome?
  5. Do you have any advice on SQL parsing libraries that would be better able to parse our SQL?
jjoyce0510 commented 9 hours ago

Thanks for the issue! Our team is looking into this and will try to get back on your set of questions. :)