GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0
39 stars 10 forks source link

Unrecognized BigQuery alias for unquoted table reference #97

Open dion-ricky opened 2 months ago

dion-ricky commented 2 months ago

When querying an unquoted table reference, BigQuery automatically provided alias for the table name in case insensitive manner. Example:

SELECT
  BLOCKS,
  blocks
FROM
  bigquery-public-data.crypto_bitcoin.blocks
WHERE
  timestamp_month = "2024-09-01"
LIMIT
  1

However, when parsing such query I got exception saying BLOCKS is not a recognized name: image

This won't work on quoted table reference, for example BigQuery will refuse to execute this query:

SELECT
  BLOCKS
FROM
  `bigquery-public-data.crypto_bitcoin.blocks`
WHERE
  timestamp_month = "2024-09-01"
LIMIT
  1

So, I think this error is caused by the toolkit automatically adding quotes to all table reference. I'm not sure about this, haven't look into it in more detail.

Anyone available to check and fix is really appreciated. Thanks.

dion-ricky commented 2 months ago

Confirmed automatic quoting name paths on this line: https://github.com/GoogleCloudPlatform/zetasql-toolkit/blob/e196efea5d8c70491dffd95b343e5c23bf9a4b13/zetasql-toolkit-core/src/main/java/com/google/zetasql/toolkit/ZetaSQLToolkitAnalyzer.java#L210

Tried to comment it but it ends up making the parser not recognizing the table name. I'm thinking of other workaround.

dion-ricky commented 2 months ago

Tried this workaround by creating a statement rewriter and adding an alias to all table path expression that doesn't currently have alias. Since BQ automatically created alias by using the table name, we can do the same thing here.

I won't open a PR for this, because I believe the core package can do something better than rewriting the query. Therefore I will keep this issue open for now.

Here's the code that I use to add alias. You can call AddAliasForTableName.rewrite( {{ your query here }} ) before parsing your query.

import com.google.zetasql.Parser;
import com.google.zetasql.parser.ASTNodes;
import com.google.zetasql.parser.ParseTreeVisitor;
import com.google.zetasql.toolkit.ParseTreeUtils;
import com.google.zetasql.toolkit.StatementRewriter;
import com.google.zetasql.toolkit.catalog.bigquery.BigQueryReference;
import com.google.zetasql.toolkit.options.BigQueryLanguageOptions;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class AddAliasForTableName {
  private static StatementRewriter.Rewrite createRewrite(int from, int to, String content) {
    return new StatementRewriter.Rewrite(from, to, content);
  }

  private static StatementRewriter.Rewrite createRewrite(ASTNode node, String content) {
    return createRewrite(
        node.getParseLocationRange().start(), node.getParseLocationRange().end(), content);
  }

  private static String tableNameFromTablePath(ASTNodes.ASTTablePathExpression expression) {
    final String[] tableName = {""};
    expression.accept(new ParseTreeVisitor() {
      @Override
      public void visit(ASTNodes.ASTIdentifier node) {
        tableName[0] = String.join(".", node.getIdString());
      }
    });
    return tableName[0];
  }

  private static boolean shouldAddAlias(ASTNodes.ASTTablePathExpression expression) {
    return Objects.isNull(expression.getAlias()) && !Objects.isNull(expression.getPathExpr());
  }

  private static boolean validTableName(ASTNodes.ASTTablePathExpression expression) {
    String tableName = tableNameFromTablePath(expression);
    return (BigQueryReference.isQualified(tableName)
        && !BigQueryReference.isWildcardReference(tableName));
  }

  private static List<StatementRewriter.Rewrite> addAlias(ASTNodes.ASTFromClause fromClause) {
    ArrayList<StatementRewriter.Rewrite> rewrites = new ArrayList<>();

    HashSet<ASTNodes.ASTTablePathExpression> expressionsToAlias = new HashSet<>();

    fromClause.accept(
        new ParseTreeVisitor() {
          @Override
          public void visit(ASTNodes.ASTTablePathExpression node) {
            if (shouldAddAlias(node) && validTableName(node)) {
              expressionsToAlias.add(node);
            }
          }
        });

    expressionsToAlias.forEach(
        expr -> {
          String tableNameStr = tableNameFromTablePath(expr);
          String[] tblNameParts = tableNameStr.split("\\.");
          rewrites.add(
              createRewrite(
                  expr, "`" + tableNameStr + "`" + " AS " + tblNameParts[tblNameParts.length - 1]));
        });

    return rewrites;
  }

  private static List<StatementRewriter.Rewrite> rewriteSelect(ASTNodes.ASTSelect select) {
    ASTNodes.ASTFromClause fromClause = select.getFromClause();
    return addAlias(fromClause);
  }

  public static String rewrite(String query) {
    ASTNodes.ASTScript parsedScript = Parser.parseScript(query, BigQueryLanguageOptions.get());

    List<ASTNodes.ASTSelect> selects =
        ParseTreeUtils.findDescendantSubtreesWithKind(parsedScript, ASTNodes.ASTSelect.class);

    List<StatementRewriter.Rewrite> rewrites =
        selects.stream()
            .flatMap(select -> rewriteSelect(select).stream())
            .collect(Collectors.toList());

    return StatementRewriter.applyRewrites(query, rewrites);
  }
}