GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0
39 stars 9 forks source link

Analyzing Queries Without Catalog Validations #18

Closed SDorgan closed 1 year ago

SDorgan commented 1 year ago

I’m having troubles with the usage of the analyzer in the AnalyzeWithoutCatlog example. If I try to validate the following queries by themselves: SELECT column from t; or UPDATE t SET column1 = '';

The client will always answers with: INVALID_ARGUMENT: Table not found: t

I’m looking into analyzing queries and accessing their syntax tree without any Catalog Validation. Is there a way to achieve this with the current toolkit?

ppaglilla commented 1 year ago

The first thing that comes to mind is that you might want to use the parser instead of using the analyzer. The parser only performs syntactic validation on queries and returns the syntax tree (in the form of an ASTStatement).

You'll see that the parse tree is a direct representation of the query text. It does not have things like the types of columns, the function signatures, etc. But it might be enough for your use case. Feel free to share what your use case is and I might be able to help determine if you can just use the parser.

This is a short example of how to only parse a statement. The ASTStatement is the root of the syntax tree. You can work with it directly or use an implementation of the ParseTreeVisitor class to traverse it.

String query = "UPDATE t SET column1 = '';";
ASTStatement parsedStatement = Parser.parseStatement(query, BigQueryLanguageOptions.get());
System.out.println(parsedStatement);

Now, talking about the analyzer a little.

The analyzer takes the output from the parser, performs semantic validations and builds the resolved tree. Since it performs semantic validations; it fails when a table doesn't exist, when a function call has an invalid signature, when there's a missing GROUP BY, etc. The resolved tree can't really be built without having catalog information.

If you don't provide a catalog yourself when analyzing with the toolkit, the library will start with an empty catalog. It then takes care of maintaining that catalog when you, for example, analyze a CREATE statement. I see now that the AnalyzeWithoutCatalog name might not be great for the example 😅.

So, if you're using the toolkit, this will be analyzed just fine:

CREATE TEMP TABLE `t` (
  column1 STRING
);

INSERT INTO `t`(column1) VALUES ('');

SELECT column1 FROM `t`;

But analyzing this will always fail, unless you provide a catalog where t exists:

SELECT column1 FROM `t`;
SDorgan commented 1 year ago

Sorry for the late response. I think the Parser with the parseScript function might be what I was looking for actually. I'll look into the ParseTreeVisitor as well. Thanks once again 😄

ppaglilla commented 1 year ago

You're welcome! Thank you for the question!