awslabs / aws-athena-query-federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.
Apache License 2.0

[BUG] Queries fail on tables with array-of-struct type with DynamoDB Connector #184

Closed. SeanPrendi closed this issue 4 years ago.

SeanPrendi commented 4 years ago

Describe the bug

The DynamoDB connector appears to break when it encounters a field of type array-of-struct. This seems closely related to the issue I was facing in #182. We have partly worked around it by using DDL to define copies of the offending tables that omit the array-of-struct fields. Interestingly, the CloudWatch logs suggest that running the connector on tables with such columns is fine by itself, because the connector parses them as list-of-struct rather than array-of-struct, and that works. The Glue schema inference, however, marks these fields as array-of-struct, which breaks the connector. The connector then falls back to its own inference, which brings us back to the null reference exception that was the original reason for using the Glue connector.

To Reproduce

1. Create a table in DynamoDB with an array of structs.
2. Crawl the table in Glue.
3. Query the Glue table with the connector.

Expected behavior

The query should execute successfully.

Screenshots / Exceptions / Errors

First the connector encounters the array-of-struct type, and CloudWatch logs:

INFO GlueMetadataHandler:360 - Column [offending column] with registered type array<struct<[fields]>>

It then fails and falls back to the connector's own schema inference:

doGetTable: Unable to retrieve table [table] from AWSGlue in database/schema [database]. Falling back to schema inference

However, when querying a table with similar data (but no empty values, so the DynamoDB connector can perform inference successfully), we see:

INFO RecordHandler:154 - doHandleRequest: request[ReadRecordsRequest{queryId=[query id], tableName=TableName{schemaName=[schema], tableName=[table]}, schema=Schema<..., [field]: List<[field].element: Struct<[struct fields]>>, ...>, ...]


avirtuos commented 4 years ago

@atennak1 and @soojinj any thoughts on this? I think you are most familiar with this code path.

atennak1 commented 4 years ago

Was a stack trace printed along with the message "doGetTable: Unable to retrieve table [table] from AWSGlue in database/schema [database]. Falling back to schema inference"? If so, could you provide it?

SeanPrendi commented 4 years ago

Sorry for the late response. Here is the stack trace for that warning:

doGetTable: Unable to retrieve table [table] from AWSGlue in database/schema [database]. Falling back to schema inference. If inferred schema is incorrect, create a matching table in Glue to define schema (see README)
java.lang.NullPointerException: null
    at org.apache.arrow.util.Preconditions.checkNotNull(Preconditions.java:767) ~[task/:?]
    at org.apache.arrow.vector.types.pojo.FieldType.<init>(FieldType.java:49) ~[task/:?]
    at org.apache.arrow.vector.types.pojo.FieldType.nullable(FieldType.java:34) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.data.FieldBuilder.build(FieldBuilder.java:256) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.metadata.glue.GlueFieldLexer.lexComplex(GlueFieldLexer.java:88) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.metadata.glue.GlueFieldLexer.lex(GlueFieldLexer.java:59) ~[task/:?]
    at com.amazonaws.athena.connectors.dynamodb.DynamoDBMetadataHandler.convertField(DynamoDBMetadataHandler.java:465) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.GlueMetadataHandler.doGetTable(GlueMetadataHandler.java:361) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.GlueMetadataHandler.doGetTable(GlueMetadataHandler.java:308) ~[task/:?]
    at com.amazonaws.athena.connectors.dynamodb.DynamoDBMetadataHandler.doGetTable(DynamoDBMetadataHandler.java:230) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.MetadataHandler.doHandleRequest(MetadataHandler.java:245) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:132) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:100) [task/:?]
    at lambdainternal.EventHandlerLoader$2.call(EventHandlerLoader.java:909) [LambdaSandboxJava-1.0.jar:?]
    at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:341) [LambdaSandboxJava-1.0.jar:?]
    at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:63) [LambdaSandboxJava-1.0.jar:?]
    at java.lang.Class.forName0(Native Method) ~[?:1.8.0_201]
    at java.lang.Class.forName(Class.java:348) [?:1.8.0_201]
    at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:119) [LambdaJavaRTEntry-1.0.jar:?]

Let me know if anything else would be useful and I will try to provide it.

atennak1 commented 4 years ago

https://github.com/awslabs/aws-athena-query-federation/blob/d6e9b8391c1658a9c82d8243578515e54759f33c/athena-federation-sdk/src/main/java/com/amazonaws/athena/connector/lambda/metadata/glue/GlueFieldLexer.java#L85-L89

Looks like when it comes to Lists we only go one level deep. And arrayType.getValue() is null for some reason
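
To illustrate what going deeper than one level could look like, here is a minimal sketch of a recursive-descent parser for Glue-style type strings such as array<struct<...>>. It is not the SDK's GlueFieldLexer and it does not use Arrow; the class GlueTypeSketch, its Node type, and the "List"/"Struct" labels are hypothetical names chosen for this example. It assumes only array, struct, and simple primitive names, so parameterized types such as decimal(p,s) and map types are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: recursive-descent parser for Glue-style type strings. */
public class GlueTypeSketch {

    /** A parsed type: an optional field name, a kind, and any nested child types. */
    public static final class Node {
        final String name; // field name for struct members, null otherwise
        final String kind; // "List", "Struct", or a primitive name such as "int"
        final List<Node> children = new ArrayList<>();

        Node(String name, String kind) {
            this.name = name;
            this.kind = kind;
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            if (name != null) sb.append(name).append(": ");
            sb.append(kind);
            if (!children.isEmpty()) {
                sb.append('<');
                for (int i = 0; i < children.size(); i++) {
                    if (i > 0) sb.append(", ");
                    sb.append(children.get(i));
                }
                sb.append('>');
            }
            return sb.toString();
        }
    }

    private final String src;
    private int pos = 0;

    private GlueTypeSketch(String src) {
        this.src = src;
    }

    /** Parses a whole type string and rejects trailing input. */
    public static Node parse(String type) {
        GlueTypeSketch p = new GlueTypeSketch(type.replaceAll("\\s+", ""));
        Node node = p.parseType(null);
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected trailing input at " + p.pos);
        }
        return node;
    }

    private Node parseType(String fieldName) {
        String token = readToken();
        if (token.equals("array")) {
            expect('<');
            Node list = new Node(fieldName, "List");
            list.children.add(parseType(null)); // recurse: the element may itself be complex
            expect('>');
            return list;
        }
        if (token.equals("struct")) {
            expect('<');
            Node struct = new Node(fieldName, "Struct");
            do {
                String childName = readToken();
                expect(':');
                struct.children.add(parseType(childName)); // recurse for each member
            } while (consumeIf(','));
            expect('>');
            return struct;
        }
        return new Node(fieldName, token); // anything else is treated as a primitive
    }

    /** Reads characters up to the next delimiter (<, >, :, or ,). */
    private String readToken() {
        int start = pos;
        while (pos < src.length() && "<>:,".indexOf(src.charAt(pos)) < 0) pos++;
        if (start == pos) throw new IllegalArgumentException("Expected a name at " + pos);
        return src.substring(start, pos);
    }

    private boolean consumeIf(char c) {
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!consumeIf(c)) throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
    }
}
```

With this sketch, GlueTypeSketch.parse("array<struct<a:int,b:array<string>>>").toString() yields List<Struct<a: int, b: List<string>>>. Because the array and struct branches both call parseType again, the element type is always resolved before the enclosing node is built, so no level is left with a null type.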

joshuanapoli commented 4 years ago

Will you be able to go deeper than one level, or is this a limitation of the platform?

atennak1 commented 4 years ago

Should be doable. This bug fix is in the queue for someone on our team to pick up.

fallonbrianmr commented 4 years ago

I am facing the same issue with the DocumentDB connector whenever I include arrays of structs in the schema in the Glue Catalog. When I drop the fields with arrays of structs, it starts working.

GENERIC_USER_ERROR: Encountered an exception[null] from your LambdaFunction

shurvitz commented 4 years ago

Fixed by #228 and #232