GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0

ResolvedProjectScans in ResolvedWithEntry do not preserve column aliases. #78

Closed TunahanOcal closed 4 months ago

TunahanOcal commented 4 months ago

Hello,

I noticed that when a projected column is not transformed, the columnList of the ResolvedProjectScan inside a ResolvedWithEntry does not preserve its alias. For example:

with base as (
  select
    a.id as t_id,
    ifnull(a.name, "") as calculated_name
  from table a
)

In the column list, the column for t_id has tableName table and name id, so the alias is lost. However, the column for calculated_name has tableName base and name calculated_name.

Is this expected behavior?

Thanks in advance.

ppaglilla commented 4 months ago

That is the default behavior of ZetaSQL for all ResolvedProjectScan, but you can change it by setting the createNewColumnForEachProjectedOutput option in your AnalyzerOptions.

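import com.google.zetasql.AnalyzerOptions;
import com.google.zetasql.toolkit.options.BigQueryLanguageOptions;

AnalyzerOptions options = new AnalyzerOptions();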
options.setCreateNewColumnForEachProjectedOutput(true);
options.setLanguageOptions(BigQueryLanguageOptions.get());  // If you're analyzing for BigQuery

TunahanOcal commented 4 months ago

Thanks for the quick response.

TunahanOcal commented 4 months ago

I have tried the setCreateNewColumnForEachProjectedOutput and setPreserveColumnAliases methods, but they do not change anything. There are a few problems I cannot understand: user-defined functions cannot be found even though they are in the catalog, table names do not include the schema and project name, and the columns in WITH queries do not include aliases.

This column alias problem affects the lineage methods.

ppaglilla commented 4 months ago

Lineage should work well even without column aliases preserved, unless there's some bug you're encountering that I don't know about. Lineage ultimately doesn't care about the aliases you used in the middle of your query if no computation was done to get there.
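
To make that concrete, here is a toy sketch in plain Java. It is not the toolkit's lineage implementation. The "out" columns stand for a hypothetical final SELECT reading from base (the snippet in this issue does not include one), and the parent maps are hand-written assumptions about what each column reads from. It only shows that tracing back to the physical column gives the same answer whether or not an intermediate base.t_id column exists.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LineageSketch {

    // Follows parent edges until reaching columns that have no parents,
    // which this toy model treats as physical table columns.
    // Assumes the parent graph has no cycles.
    static Set<String> sources(String column, Map<String, List<String>> parents) {
        Set<String> result = new TreeSet<>();
        List<String> direct = parents.get(column);
        if (direct == null || direct.isEmpty()) {
            result.add(column);
            return result;
        }
        for (String parent : direct) {
            result.addAll(sources(parent, parents));
        }
        return result;
    }

    public static void main(String[] args) {
        // Pass-through column reused: the hypothetical output reads table.id directly.
        Map<String, List<String>> reused = Map.of(
            "out.t_id", List.of("table.id"),
            "out.calculated_name", List.of("base.calculated_name"),
            "base.calculated_name", List.of("table.name"));

        // New column per projected output: base.t_id sits in between.
        Map<String, List<String>> aliased = Map.of(
            "out.t_id", List.of("base.t_id"),
            "base.t_id", List.of("table.id"),
            "out.calculated_name", List.of("base.calculated_name"),
            "base.calculated_name", List.of("table.name"));

        System.out.println(sources("out.t_id", reused));             // [table.id]
        System.out.println(sources("out.t_id", aliased));            // [table.id]
        System.out.println(sources("out.calculated_name", aliased)); // [table.name]
    }
}
```

In both maps the source of t_id resolves to table.id; the intermediate alias only adds a hop.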

As for tables and UDFs, maybe I can help you take a closer look at what you're working on. Can you provide any code snippets of things that don't work as you expect so that we can take a look?

TunahanOcal commented 4 months ago

Thanks again, I get the idea. I created a different type of lineage extractor for my specific needs, so I will review my code again. I created a workaround for the table names. As for the UDF problem, I will open another issue with more details. Do you have any other suggestions for the column alias problem?

ppaglilla commented 4 months ago

Sounds good! I'm closing the issue then.

Not much to comment about the column aliases bit on my end. The createNewColumnForEachProjectedOutput option should project a new column (which will take the alias) for all outputs of ResolvedProjectScans. Feel free to share a snippet to reproduce if you see a different behavior.
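
As a rough mental model of the two behaviors discussed in this thread, here is a self-contained toy sketch in plain Java. It does not use ZetaSQL; Column, SelectItem and project are hypothetical names made up for illustration, and giving new columns the WITH entry's name (base) as their table name is an assumption based on what was reported above for calculated_name.

```java
import java.util.ArrayList;
import java.util.List;

class ProjectScanColumnsSketch {

    // Simplified stand-in for a resolved column: owning table name plus column name.
    static final class Column {
        final String tableName;
        final String name;

        Column(String tableName, String name) {
            this.tableName = tableName;
            this.name = name;
        }

        @Override
        public String toString() {
            return tableName + "." + name;
        }
    }

    // One SELECT-list item: its alias, and the source column when it is a plain
    // pass-through (null when the item is a computed expression).
    static final class SelectItem {
        final String alias;
        final Column passThrough;

        SelectItem(String alias, Column passThrough) {
            this.alias = alias;
            this.passThrough = passThrough;
        }
    }

    // Builds the toy column list of a projection. Computed items always get a new
    // column; pass-through items get one only when newColumnForEachOutput is true,
    // otherwise the source column is reused as-is.
    static List<Column> project(String scanName, List<SelectItem> items, boolean newColumnForEachOutput) {
        List<Column> columns = new ArrayList<>();
        for (SelectItem item : items) {
            if (item.passThrough == null || newColumnForEachOutput) {
                columns.add(new Column(scanName, item.alias));
            } else {
                columns.add(item.passThrough);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<SelectItem> items = List.of(
            new SelectItem("t_id", new Column("table", "id")),
            new SelectItem("calculated_name", null));

        System.out.println(project("base", items, false)); // [table.id, base.calculated_name]
        System.out.println(project("base", items, true));  // [base.t_id, base.calculated_name]
    }
}
```

The first line printed matches what was reported in this issue; the second is what the option is expected to produce under the assumptions above.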