We are experiencing issues with our BigQuery ingestion jobs where there is no lineage between BigQuery tables and views in a significant number of cases.
The ingestion logs show multiple errors from the sqlglot library, which typically end in contains unsupported syntax. Falling back to parsing as a 'Command’.
We tried running sqlglot locally to parse some sample SQL that appear in our audit logs and have broken lineage and we get the same errors. We tried using no dialect and also specifying the BigQuery dialect.
We wonder if the SQL we are trying to parse is too complex as the sqlglot parser seems to fall back to parsing these statements as commands, which we think could be leading to the failure of lineage creation. In addition our SQL scripts typically create temporary tables, which could be a challenge for sqlglot to parse.
BigQuery Ingestion Job Parsing Errors
Here is a sample of some of the sqlglot parsing errors we receive in our bigquery ingestion job logs.
WARNING {sqlglot.parser:1432} - 'EXCEPTION WHEN ERROR THEN RAISE USING message = FORMAT(@@error.message)' contains unsupported syntax. Falling back to parsing as a 'Command’
'END IF' contains unsupported syntax. Falling back to parsing as a 'Command'.
'DECLARE is_full_refresh bool DEFAULT false' contains unsupported syntax. Falling back to parsing as a 'Command'.
project.views.SCHEMA' contains unsupported syntax. Falling back to parsing as a 'Command’
To Reproduce
Steps to reproduce the behavior:
Here is an example SQL script that we need to parse
Here’s a snippet of the SQL that triggers the error, We have obfuscated anything specific to our business:
DECLARE current_job_start TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
DECLARE max_record_ts TIMESTAMP DEFAULT NULL;
DECLARE partitions STRUCT<max_record_ts TIMESTAMP, dates ARRAY<DATE>> DEFAULT NULL;
CALL `internal_project.get_partitions`(
('project.dataset.view_name', 'EventTimestamp'),
('project.dataset.other_table', 'BusinessDate', 'EventTimestamp'),
partitions
);
IF ARRAY_LENGTH(partitions.dates) > 0 THEN
CREATE OR REPLACE TEMP TABLE temp_table AS
SELECT * EXCEPT (SnapshotTimestamp)
FROM `project.dataset.view_name`
WHERE (IDField, FlagField, ForeignKeyField, StartDate) IN UNNEST(partitions.dates)
ORDER BY EventTimestamp;
-- Check if the delta table contains new data
IF (SELECT COUNT(1) FROM temp_table_delta) > 0 THEN
CREATE OR REPLACE TEMP TABLE final_output AS
SELECT DISTINCT IDField, Email, UserID, EventTimestamp, BusinessDate
FROM temp_table_delta
WHERE EventTimestamp BETWEEN '2023-01-01' AND '2023-12-31';
END IF;
END IF;
Here is the python script that we used to parse the above locally. We are using Python 3.8.18.
import sqlglot
import argparse
# Setup argument parser to accept the SQL file path from the command line
parser = argparse.ArgumentParser(description="Parse an SQL file using sqlglot.")
parser.add_argument("file", help="Path to the SQL file you want to parse")
args = parser.parse_args()
# Open the file specified by the user
with open(args.file, "r") as file:
sql_query = file.read()
# Try parsing the SQL
try:
parsed = sqlglot.parse_one(sql_query, read="bigquery")
print(parsed)
except sqlglot.errors.ParseError as e:
print(f"Failed to parse SQL: {e}")
When we parse this locally using sqlglot, we get the following:
'DECLARE current_job_start TIMESTAMP DEFAULT CURRENT_TIMESTAMP' contains unsupported syntax. Falling back to parsing as a 'Command'.
'DECLARE max_record_ts TIMESTAMP DEFAULT NULL' contains unsupported syntax. Falling back to parsing as a 'Command'.
'DECLARE partitions STRUCT<max_record_ts TIMESTAMP, dates ARRAY<DATE>> DEFAULT NULL' contains unsupported syntax. Falling back to parsing as a 'Command'.
'CALL `internal_project.get_partitions`(
('project.dataset.view_name', 'EventTimestamp'),
('proje' contains unsupported syntax. Falling back to parsing as a 'Command'.
'IF ARRAY_LENGTH(partitions.dates) > 0 THEN
CREATE OR REPLACE TEMP TABLE temp_table AS
SELECT * E' contains unsupported syntax. Falling back to parsing as a 'Command'.
Failed to parse SQL: Required keyword: 'this' missing for <class 'sqlglot.expressions.If'>. Line 19, Col: 11.
partitions.dates)
ORDER BY EventTimestamp;
-- Check if the delta table contains new data
IF (SELECT COUNT(1) FROM temp_table_delta) > 0 THEN
CREATE OR REPLACE TEMP TABLE final_output AS
SELEC
Expected behavior
We expect sqlglot to parse complex SQL statements, like the sample above. (We hope that this will lead to the BigQuery ingestion job correctly creating lineage between tables and views)
Desktop (please complete the following information):
I am an SQL newbie and really appreciate any guidance you can provide!
Is it possible for sqlglot to parse this kind of SQL and we’re just doing something wrong?
Are there any known workarounds or configurations we could apply to better handle this?
Is there any plan to improve support for the kind of SQL we need to parse in sqlglot?
Given the kind of SQL we need to parse, do you think it would be possible for us to contribute changes to sqlglot? If so, do you happen to know if this would be something the maintainers would likely welcome?
Do you have any advice on SQL parsing libraries that would be better able to parse our SQL?
Describe the bug
We are experiencing issues with our BigQuery ingestion jobs where there is no lineage between BigQuery tables and views in a significant number of cases.
The ingestion logs show multiple errors from the
sqlglot
library, which typically end incontains unsupported syntax. Falling back to parsing as a 'Command’
.We tried running
sqlglot
locally to parse some sample SQL that appear in our audit logs and have broken lineage and we get the same errors. We tried using no dialect and also specifying the BigQuery dialect.We wonder if the SQL we are trying to parse is too complex as the
sqlglot
parser seems to fall back to parsing these statements as commands, which we think could be leading to the failure of lineage creation. In addition our SQL scripts typically create temporary tables, which could be a challenge forsqlglot
to parse.BigQuery Ingestion Job Parsing Errors
Here is a sample of some of the sqlglot parsing errors we receive in our bigquery ingestion job logs.
To Reproduce
Steps to reproduce the behavior:
Here is an example SQL script that we need to parse
Here’s a snippet of the SQL that triggers the error, We have obfuscated anything specific to our business:
Here is the python script that we used to parse the above locally. We are using Python 3.8.18.
When we parse this locally using sqlglot, we get the following:
Expected behavior
We expect sqlglot to parse complex SQL statements, like the sample above. (We hope that this will lead to the BigQuery ingestion job correctly creating lineage between tables and views)
Desktop (please complete the following information):
Additional context
Our BigQuery ingestion job configuration
Questions:
I am an SQL newbie and really appreciate any guidance you can provide!
sqlglot
?