GoogleCloudPlatform / zetasql-toolkit

The ZetaSQL Toolkit is a library that helps users use ZetaSQL Java API to perform SQL analysis for multiple query engines, including BigQuery and Cloud Spanner.
Apache License 2.0
39 stars 10 forks source link

InvalidProtocolBufferException with complex queries #31

Closed goktugkose closed 6 months ago

goktugkose commented 12 months ago

Hi, We are trying to employ ZetaSQL Toolkit for our SQL-based lineage and decision-making processes. However, with some complex queries (around 20 temp tables and a CTAS script that uses all of the temp tables), we encounter an error that prevents us from properly parsing some of the queries that we use. The code lines below represent a simplified version of our implementation in Java. During execution, the program throws the exception below.

It seems that the next() function in the AnalyzedStatement class analyzes the query while creating a communication channel between the client and BigQuery itself. It also seems that the CodedInputStream class in the protobuf-java library configures the limits of this communication channel. According to the error message that we got, the default value of defaultRecursionLimit which is 100, is not sufficient for our case. Could you help us to deal with this issue?

ZetaSQLToolkitAnalyzer analyzer = new ZetaSQLToolkitAnalyzer(options);
Iterator<AnalyzedStatement> it = analyzer.analyzeStatements(query, catalog);
while(it.hasNext()){
    ASTNodes.ASTStatement s = it.next().getParsedStatement();
}
Caused by: com.google.zetasql.io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
...
Caused by: com.google.zetasql.io.grpc.StatusRuntimeException: INTERNAL: Invalid protobuf byte sequence
...
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message had too many levels of nesting.  
May be malicious.  Use CodedInputStream.setRecursionLimit() to increase the depth limit.
ppaglilla commented 6 months ago

We implemented a workaround for this issue in release 0.5.0. TLDR, call the following method before performing any analysis using ZetaSQL or the ZetaSQL Toolkit. This example sets the max nesting depth limit to 1000 (the default being 100).

ZetaSQLPatcher.patchMaxProtobufNestingDepth(1000);

For context, this happens because ZetaSQL's Java API uses a GRPC service to call into the actual C++ implementation. By default, grpc-java sets the max nesting depth for protos to 100 levels. Support for increasing that limit is experimental in grpc-java (source, tacking issue) and thus ZetaSQL compiles its GRPC client with the default limit of 100.

This feature of the toolkit monkey patches the ZetaSQL's GRPC client and modifies that limit to the value provided by the caller. Setting the limit to 1000 seems like a reasonable target and should be able to parse and analyze queries much larger than before.

However, do use this with caution and only when needed. This uses reflection to modify the internals of ZetaSQL's GRPC client at runtime. The approach is brittle by design and could potentially break in a later release. Consider the feature as experimental. In the future, once increasing the limit isn't experimental in grpc-java, it might be increased to a more sensible value in ZetaSQL itself and we might be able to drop this.