awslabs / aws-athena-query-federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.
Apache License 2.0

[BUG] Unable to resolve Glue tables with docdb connector #403

Closed jainnishant closed 3 years ago

jainnishant commented 3 years ago

As per the documentation of docdb connector:

Unlike traditional relational data stores, DocumentDB collections do not have set schema. Each entry can have different fields and data types. While we are investigating the best way to support schema-on-read usecases for this connector, it presently supports two mechanisms for generating traditional table schema information. The default mechanism is for the connector to scan a small number of documents in your collection in order to form a union of all fields and coerce fields with non-overlap data types. This basic schema inference works well for collections that have mostly uniform entries. For more diverse collections, the connector supports retrieving meta-data from the Glue Data Catalog. If the connector sees a database and table which match your DocumentDB database and collection names it will use the corresponding Glue table for schema. We recommend creating your Glue table such that it is a superset of all fields you may want to access from your DocumentDB Collection.
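The default inference described in the quoted documentation can be illustrated with a minimal sketch. This is not the connector's code: the function name `infer_schema`, the `sample_size` default, and the choice to coerce conflicting types to a string are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable


def infer_schema(documents: Iterable[Dict[str, Any]], sample_size: int = 10) -> Dict[str, str]:
    """Union the fields of the first `sample_size` documents into one schema."""
    schema: Dict[str, str] = {}
    for index, doc in enumerate(documents):
        if index >= sample_size:
            break
        for field, value in doc.items():
            type_name = type(value).__name__
            if field not in schema:
                # First time this field is seen: record its type.
                schema[field] = type_name
            elif schema[field] != type_name:
                # Conflicting types across documents: fall back to a string.
                # (Assumption: the real connector may coerce differently.)
                schema[field] = "str"
    return schema


# Hypothetical documents with one conflicting field ("rank") and one
# field that appears in only some documents ("active").
sample = [
    {"_id": 1, "name": "a", "rank": 3},
    {"_id": 2, "name": "b", "rank": "high", "active": True},
]
inferred = infer_schema(sample)
print(inferred)
```

A field that only appears in documents outside the sampled range would be missing from the result, which is why the documentation suggests a Glue table for more diverse collections.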

Since DocumentDB allows a flexible schema, we created a Glue crawler to generate the table metadata and create a table in the Glue database.

Issue :

The Glue crawler created the table with a name in the format: databasename_tablename

The connector looks up the table in Glue using only tablename.

As a result, it cannot resolve the table that the crawler generated.

Expected behavior

Either add a provision to provide a list of table names to lookup in Glue or add a flag to resolve whether to lookup using the actual table name or using the generated one with the databasename prefixed.
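The flag-based option requested here could look roughly like the sketch below. It is only an illustration of the proposed behavior, not something the connector implements: `candidate_glue_names`, `resolve_glue_table`, and the `try_prefixed` flag are hypothetical names, and the Glue catalog is stood in for by a plain list of table names.

```python
from typing import Iterable, List, Optional


def candidate_glue_names(database: str, collection: str, try_prefixed: bool = True) -> List[str]:
    """Return Glue table names to try, in order of preference."""
    names = [collection]  # the name the connector looks up today
    if try_prefixed:
        # Crawler-style name as described in this issue: databasename_tablename
        names.append(f"{database}_{collection}")
    return names


def resolve_glue_table(
    database: str,
    collection: str,
    glue_tables: Iterable[str],
    try_prefixed: bool = True,
) -> Optional[str]:
    """Return the first candidate name present in the Glue database, or None."""
    existing = set(glue_tables)
    for name in candidate_glue_names(database, collection, try_prefixed):
        if name in existing:
            return name
    return None


# Names taken from the log in this issue; the crawler-generated name is
# assumed to follow the databasename_tablename format described above.
crawler_tables = ["refdata_identifier_precedence"]
with_flag = resolve_glue_table("refdata", "identifier_precedence", crawler_tables)
without_flag = resolve_glue_table(
    "refdata", "identifier_precedence", crawler_tables, try_prefixed=False
)
print(with_flag, without_flag)
```

Trying the plain table name first keeps the current behavior for catalogs that already match the collection names.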

Screenshots / Exceptions / Errors


Connector Lambda logs

WARN DocDBMetadataHandler:197 - doGetTable: Unable to retrieve table[refdata:identifier_precedence] from AWS Glue.
com.amazonaws.services.glue.model.EntityNotFoundException: Table identifier_precedence not found. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; Request ID: 5968cc7b-5292-45c3-9ed0-4b1749d063c7; Proxy: null)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1811) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1395) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1371) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[task/:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[task/:?]
    at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:8202) ~[task/:?]
    at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:8169) ~[task/:?]
    at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:8158) ~[task/:?]
    at com.amazonaws.services.glue.AWSGlueClient.executeGetTable(AWSGlueClient.java:4774) ~[task/:?]
    at com.amazonaws.services.glue.AWSGlueClient.getTable(AWSGlueClient.java:4745) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.GlueMetadataHandler.doGetTable(GlueMetadataHandler.java:330) ~[task/:?]
    at com.amazonaws.athena.connectors.docdb.DocDBMetadataHandler.doGetTable(DocDBMetadataHandler.java:192) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.MetadataHandler.doHandleRequest(MetadataHandler.java:250) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:132) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:100) [task/:?]
    at lambdainternal.EventHandlerLoader$2.call(EventHandlerLoader.java:902) [LambdaSandboxJava-1.0.jar:?]
    at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:340) [LambdaSandboxJava-1.0.jar:?]
    at lambdainternal.AWSLambda.(AWSLambda.java:63) [LambdaSandboxJava-1.0.jar:?]
    at java.lang.Class.forName0(Native Method) ~[?:1.8.0_201]
    at java.lang.Class.forName(Class.java:348) [?:1.8.0_201]
    at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:150) [LambdaJavaRTEntry-1.0.jar:?]

Connector Details

avirtuos commented 3 years ago

I'm not sure this is a bug. It looks like your table names in the Data Catalog are incorrect and do not match your table (collection) names in DocumentDB. The table names in the Glue Data Catalog should not be prefixed with the database name. Can you elaborate on what behavior you expect? The Connector is not compatible with Glue crawlers at present, mostly due to the differences in schema inference between the crawler and the Connector. The naming convention difference is somewhat superficial and could be easily fixed in the Connector. Please reopen this issue if you have further questions or a suggestion for what an enhancement to this capability might look like. Thanks for raising the issue!

aafaq-rashid-comprinno commented 3 months ago

I encountered a similar issue when running queries on MongoDB from Athena using the DocumentDB connector. The problem was caused by the schema being in uppercase. You can resolve this by using the enable_case_insensitive_match option. When set to true, it allows case-insensitive searches against schema and table names in Amazon DocumentDB. By default, this option is set to false. Enable it if your query includes uppercase schema or table names.
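As a rough illustration of what case-insensitive matching means here, the sketch below resolves a requested name against the actual names in the data store. It is not the connector's implementation: `match_name` and its `case_insensitive` parameter are hypothetical, and the example assumes the name arrives from the query in lowercase while the stored name contains uppercase letters.

```python
from typing import Iterable, Optional


def match_name(requested: str, actual_names: Iterable[str], case_insensitive: bool = False) -> Optional[str]:
    """Return the stored name matching `requested`, or None if there is no match."""
    names = list(actual_names)
    # An exact match always wins.
    if requested in names:
        return requested
    if case_insensitive:
        # Compare case-folded forms; only accept an unambiguous match.
        matches = [name for name in names if name.casefold() == requested.casefold()]
        if len(matches) == 1:
            return matches[0]
    return None


# Hypothetical schema names stored with uppercase letters.
stored_schemas = ["RefData", "Orders"]
strict_result = match_name("refdata", stored_schemas)
relaxed_result = match_name("refdata", stored_schemas, case_insensitive=True)
print(strict_result, relaxed_result)
```

With the strict default the lowercase request finds nothing, which matches the symptom described in this comment; with the relaxed setting it resolves to the stored name.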

danielOfir1 commented 1 month ago

I encountered the same issue, and this was the solution in my case as well. Thanks!