GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0
39 stars 10 forks source link

Unable to find table when parsing query containing TVF #77

Closed dion-ricky closed 4 months ago

dion-ricky commented 4 months ago

Hi,

I've tried extracting lineage from a query containing TVF but I got this exception:

com.google.zetasql.toolkit.catalog.bigquery.exceptions.MissingFunctionResultType: BigQuery routine <project.dataset.tvf> is missing an explicit return type and it could not be inferred. Consider adding an explicit RETURNS clause to it.

Here's a sample of what the TVF look like:

CREATE OR REPLACE FUNCTION `project.dataset.tvf`()
  AS (
    SELECT a, b, c FROM `project.dataset.foo`
  )

The catalog is populated using addAllResourcesUsedInQuery. I did check the language options and I think TVF should already be supported when using BigQueryLanguageOptions.get().

During debugging it looks like the issue is caused by the parser unable to find table project.dataset.foo which were read inside the inner body of the TVF: image

Should I register that table to the catalog as well? Is it not automatically added on addAllResourcesUsedInQuery?

Thank you

dion-ricky commented 4 months ago

I solved it by registering tables referenced in the inner body of the TVF.

But then I got this error:

com.google.zetasql.toolkit.AnalysisException: TVF calls should be fully quoted. Expected `project.dataset.tvf`, found `project`.`dataset`.`tvf`.

Is there any reason why this is enforced? As BigQuery supports executing TVF without it being fully quoted.

ppaglilla commented 4 months ago

The correct workaround is to register the tables used in your TVF as you did. For context, what's happening is the following.

When the BigQuery catalog is loading routines from the API, the return type is only present in the API response if you've included an explicit RETURNS clause. If that return type is present, we're good to build the proper function signature and register it in the catalog. However, if that isn't present, we do a best effort attempt at analyzing the the function body. That will fail if the resources referenced in the function are not pre-loaded in the catalog.

I haven't explored potentially calling catalog.addAllResourcesUsedInQuery(...) in that situation. That could improve this, but it ends with us performing API calls to request resources without the user intending to. A better solution might be to just document this behavior well and have users pre-register resources they need.

ppaglilla commented 4 months ago

I solved it by registering tables referenced in the inner body of the TVF.

But then I got this error:

com.google.zetasql.toolkit.AnalysisException: TVF calls should be fully quoted. Expected `project.dataset.tvf`, found `project`.`dataset`.`tvf`.

Is there any reason why this is enforced? As BigQuery supports executing TVF without it being fully quoted.

That's actually an old warning we should remove. Thanks for pointing that out!

For context of why that was there. Yes, BigQuery supports it. The thing that doesn't support it is the SimpleCatalog we have available in the Java world. Previously we threw that error if we found a TVF or procedure which was not fully quoted, and asked the user to fully quote them.

The latest releases actually rewrite your query and fully quote everything before analyzing, so that error is not longer necessary. See here.

Feel free to submit a PR removing that check from the ZetaSQLToolkitAnalyzer class if you'd like. I can get around to it before the next release otherwise 🙂.

ppaglilla commented 4 months ago

Closing this issue and opening a separate one about the old error check which should be removed