GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0
39 stars 10 forks source link

Resolved column no longer have project id or dataset id since 0.5.0 #67

Closed dion-ricky closed 4 months ago

dion-ricky commented 4 months ago

Resolved column lineage doesn't have project id and dataset id beginning with v0.5.0.

Try to execute this example code from README using v0.4.1 and v0.5.0 then compare both output, you'll notice that the project id and dataset id that are present in 0.4.1 is missing in 0.5.0.

My questions are:

  1. Is this intentional and what is the goal?
  2. If so, how can I get project id and dataset id using only the table name?

BigQuery example:

String query =
    "INSERT INTO `bigquery-public-data.samples.wikipedia` (title) VALUES ('random title');\n"
    + "SELECT title, language FROM `bigquery-public-data.samples.wikipedia` WHERE title = 'random title';";

// Create a BigQueryCatalog
// By default, it will use the BigQuery API with application-default credentials
// to fetch BigQuery resources.
BigQueryCatalog catalog = new BigQueryCatalog(/*bqProjectId=*/"bigquery-public-data");

// Add resources to the catalog
// After a resource is added, it will be available when ZetaSQL perform analysis
catalog.addTable("bigquery-public-data.samples.wikipedia");

// Configure the analyzer options using the BigQuery feature set
AnalyzerOptions options = new AnalyzerOptions();
options.setLanguageOptions(BigQueryLanguageOptions.get());

// Use the ZetaSQLToolkitAnalyzer to run the analyzer
// It results an iterator over the resulting AnalyzedStatements
ZetaSQLToolkitAnalyzer analyzer = new ZetaSQLToolkitAnalyzer(options);
Iterator<AnalyzedStatement> statementIterator = analyzer.analyzeStatements(query, catalog);

// Use the resulting AnalyzedStatements
statementIterator.forEachRemaining(analyzedStatement -> {
    analyzedStatement.getResolvedStatement().ifPresent(System.out::println);
});

Output for v0.4.1:

InsertStmt
+-table_scan=
| +-TableScan(column_list=bigquery-public-data.samples.wikipedia.[title#1, id#2, language#3, wp_namespace#4, is_redirect#5, revision_id#6, contributor_ip#7, contributor_id#8, contributor_username#9, timestamp#10, is_minor#11, is_bot#12, reversion_id#13, comment#14, num_characters#15], table=bigquery-public-data.samples.wikipedia, column_index_list=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
+-insert_column_list=[bigquery-public-data.samples.wikipedia.title#1]
+-row_list=
| +-InsertRow
|   +-value_list=
|     +-DMLValue
|       +-value=
|         +-Literal(type=STRING, value=string_value: "random title")
+-column_access_list=WRITE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE

QueryStmt
+-output_column_list=
| +-bigquery-public-data.samples.wikipedia.title#1 AS `title` [STRING]
| +-bigquery-public-data.samples.wikipedia.language#3 AS `language` [STRING]
+-query=
  +-ProjectScan
    +-column_list=bigquery-public-data.samples.wikipedia.[title#1, language#3]
    +-input_scan=
      +-FilterScan
        +-column_list=bigquery-public-data.samples.wikipedia.[title#1, id#2, language#3, wp_namespace#4, is_redirect#5, revision_id#6, contributor_ip#7, contributor_id#8, contributor_username#9, timestamp#10, is_minor#11, is_bot#12, reversion_id#13, comment#14, num_characters#15]
        +-input_scan=
        | +-TableScan(column_list=bigquery-public-data.samples.wikipedia.[title#1, id#2, language#3, wp_namespace#4, is_redirect#5, revision_id#6, contributor_ip#7, contributor_id#8, contributor_username#9, timestamp#10, is_minor#11, is_bot#12, reversion_id#13, comment#14, num_characters#15], table=bigquery-public-data.samples.wikipedia, column_index_list=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
        +-filter_expr=
          +-FunctionCall(ZetaSQL:$equal(STRING, STRING) -> BOOL)
            +-ColumnRef(type=STRING, column=bigquery-public-data.samples.wikipedia.title#1)
            +-Literal(type=STRING, value=string_value: "random title")

Output for v0.5.0:

InsertStmt
+-table_scan=
| +-TableScan(column_list=wikipedia.[title#1, id#2, language#3, wp_namespace#4, is_redirect#5, revision_id#6, contributor_ip#7, contributor_id#8, contributor_username#9, timestamp#10, is_minor#11, is_bot#12, reversion_id#13, comment#14, num_characters#15], table=bigquery-public-data.samples.wikipedia, column_index_list=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
+-insert_column_list=[wikipedia.title#1]
+-row_list=
| +-InsertRow
|   +-value_list=
|     +-DMLValue
|       +-value=
|         +-Literal(type=STRING, value=string_value: "random title")
+-column_access_list=WRITE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE

QueryStmt
+-output_column_list=
| +-wikipedia.title#1 AS `title` [STRING]
| +-wikipedia.language#3 AS `language` [STRING]
+-query=
  +-ProjectScan
    +-column_list=wikipedia.[title#1, language#3]
    +-input_scan=
      +-FilterScan
        +-column_list=wikipedia.[title#1, id#2, language#3, wp_namespace#4, is_redirect#5, revision_id#6, contributor_ip#7, contributor_id#8, contributor_username#9, timestamp#10, is_minor#11, is_bot#12, reversion_id#13, comment#14, num_characters#15]
        +-input_scan=
        | +-TableScan(column_list=wikipedia.[title#1, id#2, language#3, wp_namespace#4, is_redirect#5, revision_id#6, contributor_ip#7, contributor_id#8, contributor_username#9, timestamp#10, is_minor#11, is_bot#12, reversion_id#13, comment#14, num_characters#15], table=bigquery-public-data.samples.wikipedia, column_index_list=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
        +-filter_expr=
          +-FunctionCall(ZetaSQL:$equal(STRING, STRING) -> BOOL)
            +-ColumnRef(type=STRING, column=wikipedia.title#1)
            +-Literal(type=STRING, value=string_value: "random title")
ppaglilla commented 4 months ago

Thanks for pointing this out!

It was caused by an upstream ZetaSQL change and is fixed in version 0.5.2.