GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use the ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0

Add support for BigQuery wildcard table #70

Open dion-ricky opened 4 months ago

dion-ricky commented 4 months ago

Consider adding support for BigQuery wildcard table references: https://cloud.google.com/bigquery/docs/querying-wildcard-tables

dion-ricky commented 4 months ago

I have added support for wildcard references in #73, but it's half-baked: only the last table registered in the catalog is detected as the parent in column lineage.

For example, in the bigquery-public-data.noaa_gsod dataset, gsod202* matches gsod2020, gsod2021, gsod2022, gsod2023, and gsod2024. However, since gsod202* is registered in the catalog pointing to gsod2024, the parent column reference comes only from that one table, when the parents should be all tables whose names begin with gsod202.

ppaglilla commented 4 months ago

Commenting here for future reference.

Following the latest developments on #73, lineage will point to the wildcard reference directly (e.g. column X of table gsod202*). This is primarily because we're no longer registering every table that matches the wildcard, to keep the catalog from growing too large; there's a limit on how big a Java catalog can be.

I think that's a fair tradeoff. Someone analyzing the lineage could somewhat easily expand wildcards themselves using the BigQuery API.
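
For anyone who needs that expansion, here is a minimal sketch of the matching step. It assumes the dataset's table names have already been fetched (for example, by listing the dataset's tables with the BigQuery client) and that the wildcard is a single trailing `*` acting as a prefix match. `WildcardExpander` and `expand` are hypothetical names for illustration; they are not part of zetasql-toolkit.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of zetasql-toolkit.
final class WildcardExpander {
  private WildcardExpander() {}

  /**
   * Expands a wildcard table name such as "gsod202*" into the concrete table
   * names it matches.
   *
   * Assumption: datasetTables holds the names of all tables in the dataset,
   * obtained separately (e.g. from the BigQuery API).
   */
  static List<String> expand(String wildcardTable, List<String> datasetTables) {
    if (!wildcardTable.endsWith("*")) {
      // Not a wildcard reference: it only refers to itself.
      return List.of(wildcardTable);
    }
    // Drop the trailing '*' and treat the rest as a name prefix.
    String prefix = wildcardTable.substring(0, wildcardTable.length() - 1);
    return datasetTables.stream()
        .filter(name -> name.startsWith(prefix))
        .sorted()
        .collect(Collectors.toList());
  }
}
```

With the table names `gsod2019`, `gsod2020`, and `gsod2024`, calling `WildcardExpander.expand("gsod202*", ...)` yields `gsod2020` and `gsod2024`, which can then be substituted for the wildcard parent in the lineage output.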