GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0
38 stars 9 forks source link

Extract column lineage from indirect dependencies #27

Open Chr96er opened 10 months ago

Chr96er commented 10 months ago

The current implementation of column lineage extraction does not address real world applications as far as I can tell.

The two most common applications I can think of are: 1) Impact analysis when changing definition of columns or considering deletion 2) Understanding definition of columns

Both of these applications require extracting indirect column dependencies from e.g. WHERE, GROUP BY, IF and CASE WHEN clauses. As far as I can tell this is not possible at the moment (unless I'm missing a configuration option).

Example 1) (this is actually from the ExtractColumnLevelLineage.java examples):

UPDATE `bigquery-public-data.samples.wikipedia` W
    SET title = S.corpus, comment = S.word
FROM (SELECT corpus, UPPER(word) AS word FROM `bigquery-public-data.samples.shakespeare`) S
WHERE W.title = S.corpus;

extracts:

Lineage:
bigquery-public-data.samples.wikipedia.comment
                <- bigquery-public-data.samples.shakespeare.word
bigquery-public-data.samples.wikipedia.title
                <- bigquery-public-data.samples.shakespeare.corpus

when I would like to see:

Lineage:
bigquery-public-data.samples.wikipedia.comment
                <- bigquery-public-data.samples.shakespeare.word
                <- bigquery-public-data.samples.shakespeare.corpus*
                <- bigquery-public-data.samples.wikipedia.title*
bigquery-public-data.samples.wikipedia.title
                <- bigquery-public-data.samples.shakespeare.corpus
                <- bigquery-public-data.samples.wikipedia.title*

The * indicates an indirect dependency (of course the output is generated by the user, so zetasql would have to provide some attribute for indirect relationships). Both columns indirectly depend on corpus and title. The definition of comment changes, if we make changes to either corpus or title in upstream tables. I would like to be aware of that when I make changes to those columns.

Another two made up examples: Example 2) GROUP BY:

Query:
CREATE TABLE foo.bar AS
SELECT
    SUM(word_count) words,
    corpus,
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

Lineage:
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus
foo.bar.words
                <- bigquery-public-data.samples.shakespeare.word_count

where I would like to see:

Lineage:
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus
foo.bar.words
                <- bigquery-public-data.samples.shakespeare.word_count
                <- bigquery-public-data.samples.shakespeare.corpus*

Example 3) IF:

Query:
CREATE TABLE foo.bar AS
SELECT
    SUM(IF(word_count > 10, 1, 0)) word_count_gt_ten,
    corpus,
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;

Lineage:
foo.bar.word_count_gt_ten
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus

where I would like to see:

Lineage:
foo.bar.word_count_gt_ten
                <- bigquery-public-data.samples.shakespeare.word_count*
foo.bar.corpus
                <- bigquery-public-data.samples.shakespeare.corpus

Please add configuration options for detecting these indirect dependencies, otherwise I don't see anyone adapting column level lineage which could be extremely powerful.

ppaglilla commented 10 months ago

You bring up a very good point, thank you for reporting this!

I originally envisioned the column lineage feature to focus only on columns that were used to write to others directly. But now that you bring this discussion up; I agree that we should have visibility on filters, groups, etc.

I'll need to look into it a bit to give you a timeline. My first impression is that it'd require decent refactor of how lineage is currently implemented. But I'd like to have this as well!

Chr96er commented 10 months ago

Thanks, would be amazing to have that feature!