GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0
35 stars 8 forks source link

lineage from union statement not parsed #21

Closed iyinoluwaayoola closed 9 months ago

iyinoluwaayoola commented 10 months ago

Hi maintainers, First of all, thanks for this awesome contribution to the community.

I'm trying to extract the column lineage from a union query, for example:

CREATE OR REPLACE TABLE `project.dataset.table` AS (
    SELECT
        *,
        MIN(title) OVER (
            PARTITION BY title
            ORDER BY title
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS min_title
    FROM (
        SELECT
            title, comment
        FROM
            `bigquery-public-data.samples.wikipedia` UNION ALL
        SELECT
            corpus, word
        FROM
            `bigquery-public-data.samples.shakespeare`
    )
)

The extracted lineage looks like this:

project.dataset.table.title
                <- $union_all.title
project.dataset.table.comment
                <- $union_all.comment
project.dataset.table.min_title
                <- $union_all.title

But I'll expect:

project.dataset.table.title
                <- bigquery-public-datasamples.wikipedia.title
                <- bigquery-public-datasamples.shakespeare.corpus
project.dataset.table.comment
                <- bigquery-public-datasamples.wikipedia.comment
                <- bigquery-public-datasamples.shakespeare.word
project.dataset.table.min_title
                <- bigquery-public-datasamples.wikipedia.title
                <- bigquery-public-datasamples.shakespeare.corpus

Can you kindly point me to the required changes or help to support this case? Also what are the limitations of the current extractors, are there unsupported scenarios?

ppaglilla commented 9 months ago

Thank you for reporting this and apologies for the late reply!

We recently released version 0.4.1 of the ZetaSQL Toolkit, which expands the column-level lineage support with a few new features. That includes properly mapping columns for set operations (i.e. UNIONs, INTERSECTs, etc). I tested the query you shared explicitly and it works as expected.

Regarding your question, the current version of column-level lineage extraction should have all bases covered. It supports joins, subqueries, WITH clauses, set operations, STRUCT columns and operations, etc. The only major features that are explicitly not supported at the moment are the PIVOT and UNPIVOT operators.

However, we are yet to do more thorough testing to verify edge cases work as intended. I hope to do that in the near future. In the meantime, feel free to open an issue for any case you find not working as expected!