Open dion-ricky opened 4 months ago
I have added support for wildcard references (#73), but it's half-baked: only the last table registered to the catalog is detected as the parent in column lineage.
For example, in the `bigquery-public-data.noaa_gsod` dataset, `gsod202*` matches `gsod2020`, `gsod2021`, `gsod2022`, `gsod2023`, and `gsod2024`. However, since `gsod202*` is registered in the catalog pointing to `gsod2024`, the parent column reference comes only from that one table, when the parents should be all tables whose names begin with `gsod202`.
Commenting here for future reference.
Following the latest developments on #73, lineage would point to the wildcard reference directly (e.g. column X of table `gsod202*`). This is primarily because we no longer register every table that matches the wildcard, to keep the catalog from growing too large. There's a limit on how big a Java catalog can be.
I think that's a fair tradeoff. Someone analyzing the lineage could somewhat easily expand wildcards themselves using the BigQuery API.
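For anyone who lands here needing the expansion, here is a rough sketch of how it could look in Python. It is not part of this project. The helper names (`expand_wildcard`, `list_table_ids`) and the sample table list are made up for illustration, and it assumes the wildcard is a single trailing `*`, so matching reduces to a prefix check. The only BigQuery call is `Client.list_tables` from the `google-cloud-bigquery` package, which needs credentials and is not invoked by the example itself.

```python
from typing import Iterable, List


def expand_wildcard(wildcard: str, table_ids: Iterable[str]) -> List[str]:
    """Return the table IDs that a wildcard table reference would match.

    Assumes the wildcard is a single trailing `*`, so this is a prefix match.
    """
    if not wildcard.endswith("*"):
        # Not a wildcard: the reference matches only a table with that exact name.
        return [t for t in table_ids if t == wildcard]
    prefix = wildcard[:-1]
    return sorted(t for t in table_ids if t.startswith(prefix))


def list_table_ids(dataset: str) -> List[str]:
    """Fetch the table IDs of a dataset, e.g. "bigquery-public-data.noaa_gsod".

    Hypothetical helper: requires google-cloud-bigquery and valid credentials.
    """
    from google.cloud import bigquery  # imported lazily, optional dependency

    client = bigquery.Client()
    return [table.table_id for table in client.list_tables(dataset)]


# Made-up listing standing in for the result of list_table_ids(...).
sample = ["gsod2019", "gsod2020", "gsod2021", "gsod2022",
          "gsod2023", "gsod2024", "stations"]

# Lineage reports a column of `gsod202*`; expand it to the concrete parents.
parents = expand_wildcard("gsod202*", sample)
```

With a real dataset, `expand_wildcard("gsod202*", list_table_ids("bigquery-public-data.noaa_gsod"))` would give the concrete parent tables for a column whose lineage points at `gsod202*`.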
Consider adding support for BigQuery wildcard table reference https://cloud.google.com/bigquery/docs/querying-wildcard-tables.