awslabs / aws-athena-query-federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.
Apache License 2.0
560 stars 295 forks source link

[QUESTION] Connecting Athena, DynamoDB Connector, and a Glue table schema #376

Closed keelzebub closed 2 years ago

keelzebub commented 3 years ago

Hi! I have a question that's very similar to https://github.com/awslabs/aws-athena-query-federation/issues/182, but that one was closed due to it being a separate issue.

Basically, I have a DynamoDB table called evt_development that contains different types of records. I've set up the DynamoDB connector and I can, for the most part, successfully query the DynamoDB table using Athena.

However, there are certain fields that, when I try to query them, return messages like Column 'converted_low' cannot be resolved. I know this has to do with schema inference, so I've created a Glue crawler and used it to get the metadata for the DynamoDB table. I've confirmed that the metadata contains the field converted_low.

So this is my setup:

After getting all that set up, I'm still receiving the same Column 'converted_low' cannot be resolved error message. My question is, how do I get the Athena query to use the Glue table metadata when querying DynamoDB?

For reference, this is the query that I'm running:

SELECT converted_low
FROM "dynamo-db-catalog"."default"."evt_development";

Thank you!

shurvitz commented 3 years ago

Hi @beaksandclaws, can you share the Glue schema and the complete CloudWatch logs (including the error). Also, which version of the connector are you using?

keelzebub commented 3 years ago

Hi @shurvitz!

Connector Version

I'm not totally sure how to get the version of the connector that was uploaded—there doesn't seem to be any way to view that via the Lambda console (or maybe there's something that I'm missing). I do know that it's 7 months old at this point.

Schema

{
  "FieldSchema": [
    { "name": "crop_longevity_ids", "type": "bigint" },
    { "name": "crop_ids", "type": "set<bigint>" },
    { "name": "crop_use_ids", "type": "set<bigint>" },
    { "name": "pk2", "type": "string" },
    { "name": "pk1", "type": "string" },
    { "name": "sk1", "type": "string" },
    { "name": "crop_uses", "type": "set<string>" },
    { "name": "crop_longevities", "type": "string" },
    { "name": "sk2", "type": "string" },
    { "name": "id", "type": "string" },
    { "name": "crops", "type": "set<string>" },
    { "name": "soil_amendment_types", "type": "set<string>" },
    { "name": "rotational", "type": "boolean" },
    { "name": "soil_amendment_type_ids", "type": "set<bigint>" },
    { "name": "weed_pest_control_general_ids", "type": "set<bigint>" },
    { "name": "weed_pest_control_specific_ids", "type": "set<bigint>" },
    { "name": "weed_pest_control_generals", "type": "set<string>" },
    { "name": "weed_pest_control_specifics", "type": "set<string>" },
    { "name": "soil_amendment_activities", "type": "set<string>" },
    { "name": "soil_amendment_activity_ids", "type": "set<bigint>" },
    { "name": "tillages", "type": "string" },
    { "name": "tillage_ids", "type": "bigint" },
    { "name": "cropping_ids", "type": "set<bigint>" },
    { "name": "irrigation_general_ids", "type": "bigint" },
    { "name": "irrigation_specifics", "type": "string" },
    { "name": "croppings", "type": "set<string>" },
    { "name": "irrigation_specific_ids", "type": "bigint" },
    { "name": "irrigation_generals", "type": "string" },
    { "name": "certification_managements", "type": "set<string>" },
    { "name": "certification_management_ids", "type": "set<bigint>" },
    { "name": "animal_specifics", "type": "set<string>" },
    { "name": "animal_specific_ids", "type": "set<bigint>" },
    { "name": "animal_generals", "type": "set<string>" },
    { "name": "animal_general_ids", "type": "set<bigint>" },
    { "name": "crop_densities", "type": "string" },
    { "name": "crop_density_ids", "type": "bigint" },
    { "name": "burnings", "type": "string" },
    { "name": "burning_ids", "type": "bigint" },
    { "name": "value_status_id", "type": "bigint" },
    { "name": "continent_ids", "type": "set<bigint>" },
    { "name": "published_dimension_unit_id", "type": "bigint" },
    { "name": "climate_group_id", "type": "bigint" },
    { "name": "country_ids", "type": "set<bigint>" },
    { "name": "percent_approved", "type": "bigint" },
    { "name": "valuation_methodology_general", "type": "string" },
    { "name": "valuation_methodology_specific_id", "type": "bigint" },
    { "name": "transcription_id", "type": "bigint" },
    { "name": "ecosystem_descriptor_specific_id", "type": "bigint" },
    { "name": "ecosystem_service_specific", "type": "string" },
    { "name": "ecosystem_descriptor_general", "type": "string" },
    { "name": "climate_group", "type": "string" },
    { "name": "area", "type": "double" },
    { "name": "dataset_ids", "type": "set<bigint>" },
    { "name": "ecosystem_descriptor_sub_specific_id", "type": "bigint" },
    { "name": "published_dimension_unit", "type": "string" },
    { "name": "ecosystem_service_general", "type": "string" },
    { "name": "ecosystem_service_category_id", "type": "bigint" },
    { "name": "ecosystem_service_sub_specific_id", "type": "bigint" },
    { "name": "countries", "type": "set<string>" },
    { "name": "ecosystem_service_specific_id", "type": "bigint" },
    { "name": "publication_type", "type": "string" },
    { "name": "withins", "type": "set<string>" },
    { "name": "continents", "type": "set<string>" },
    { "name": "ecosystem_service_id", "type": "bigint" },
    { "name": "ecosystem_descriptor_id", "type": "bigint" },
    { "name": "proximities", "type": "set<string>" },
    { "name": "bibliographic_reference", "type": "string" },
    { "name": "calculation_type_id", "type": "bigint" },
    { "name": "ecosystem_descriptor_sub_specific", "type": "string" },
    { "name": "name", "type": "string" },
    { "name": "site_id", "type": "bigint" },
    { "name": "calculated_high", "type": "double" },
    { "name": "calculated_low", "type": "double" },
    { "name": "inflated_low", "type": "double" },
    { "name": "valuation_methodology_general_id", "type": "bigint" },
    { "name": "authors", "type": "string" },
    { "name": "is_usd_acre_yr", "type": "string" },
    { "name": "transcription_state_id", "type": "bigint" },
    { "name": "subcountry_ids", "type": "set<bigint>" },
    { "name": "within_ids", "type": "set<bigint>" },
    { "name": "ecosystem_descriptor_specific", "type": "string" },
    { "name": "title", "type": "string" },
    { "name": "currency_code", "type": "string" },
    { "name": "published_low", "type": "double" },
    { "name": "calculation_type", "type": "string" },
    { "name": "proximity_ids", "type": "set<bigint>" },
    { "name": "flagged", "type": "string" },
    { "name": "primary_authors", "type": "string" },
    { "name": "valuation_methodology_specific", "type": "string" },
    { "name": "climate_type_id", "type": "bigint" },
    { "name": "ecosystem_service_category", "type": "string" },
    { "name": "published_currency_year", "type": "bigint" },
    { "name": "currency", "type": "string" },
    { "name": "published_time_unit", "type": "string" },
    { "name": "value_id", "type": "bigint" },
    { "name": "ecosystem_service_sub_specific", "type": "string" },
    { "name": "valuation_methodology_comments", "type": "string" },
    { "name": "site_scale", "type": "string" },
    { "name": "year_published", "type": "bigint" },
    { "name": "value_status", "type": "string" },
    { "name": "valuation_methodology_id", "type": "bigint" },
    { "name": "inflated_high", "type": "double" },
    { "name": "published_time_unit_id", "type": "bigint" },
    { "name": "published_high", "type": "double" },
    { "name": "ecosystem_service_general_id", "type": "bigint" },
    { "name": "climate_type", "type": "string" },
    { "name": "inflated_currency_year", "type": "bigint" },
    { "name": "ecosystem_descriptor_general_id", "type": "bigint" },
    { "name": "bibliography_url", "type": "string" },
    { "name": "subcountries", "type": "set<string>" },
    { "name": "currency_id", "type": "bigint" },
    { "name": "greenhouse_gasses", "type": "set<string>" },
    { "name": "water_type", "type": "string" },
    { "name": "water_type_id", "type": "bigint" },
    { "name": "percent_rejected", "type": "bigint" },
    { "name": "site_comments", "type": "string" },
    { "name": "other_bibliographic_information", "type": "string" },
    { "name": "bibliography_notes", "type": "string" },
    { "name": "land_cover_nlcd_general", "type": "string" },
    { "name": "converted_time_unit_id", "type": "bigint" },
    { "name": "converted_high", "type": "double" },
    { "name": "converted_time_unit", "type": "string" },
    { "name": "converted_dimension_unit_id", "type": "bigint" },
    { "name": "converted_dimension_unit", "type": "string" },
    { "name": "converted_low", "type": "double" },
    { "name": "ecosystem_descriptor_comments", "type": "string" },
    { "name": "air_pollutants", "type": "set<string>" },
    { "name": "nutrients_and_contaminants", "type": "set<string>" },
    { "name": "provisioning_use", "type": "string" },
    { "name": "calculated_time_unit", "type": "string" },
    { "name": "calculated_dimension_unit", "type": "string" },
    { "name": "valuation_methodology_sub_specific_id", "type": "bigint" },
    { "name": "valuation_methodology_sub_specific", "type": "string" },
    { "name": "ecosystem_service_comments", "type": "string" },
    { "name": "recreation_activities", "type": "set<string>" },
    { "name": "percent_pending", "type": "bigint" },
    { "name": "sub_sub_country", "type": "string" },
    { "name": "ag_mgmt_irrigation_specific_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_ids", "type": "set<string>" },
    { "name": "ag_mgmt_irrigation_specifics", "type": "set<string>" },
    { "name": "ag_mgmt_irrigation_generals", "type": "set<string>" },
    { "name": "ag_mgmt_irrigation_general_ids", "type": "set<bigint>" },
    { "name": "municipality", "type": "string" },
    { "name": "value_type_id", "type": "bigint" },
    { "name": "ag_mgmt_crop_uses", "type": "set<string>" },
    { "name": "ag_mgmt_crops", "type": "set<string>" },
    { "name": "ag_mgmt_crop_ids", "type": "set<bigint>" },
    { "name": "value_type", "type": "string" },
    { "name": "ag_mgmt_crop_use_ids", "type": "set<bigint>" },
    { "name": "elevation_id", "type": "bigint" },
    { "name": "elevation", "type": "string" },
    { "name": "publication_comments", "type": "string" },
    { "name": "year_accepted", "type": "bigint" },
    { "name": "material_valued_general", "type": "string" },
    { "name": "soil_type", "type": "string" },
    { "name": "provisioning_use_other", "type": "string" },
    { "name": "ag_mgmt_animal_specific_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_animal_specifics", "type": "set<string>" },
    { "name": "ag_mgmt_animal_generals", "type": "set<string>" },
    { "name": "ag_mgmt_animal_general_ids", "type": "set<bigint>" },
    { "name": "non_monetary_dimension_unit", "type": "string" },
    { "name": "non_monetary_dimension_type", "type": "string" },
    { "name": "material_valued_sub_specific", "type": "string" },
    { "name": "material_valued_specific", "type": "string" },
    { "name": "material_valued_general_other", "type": "string" },
    { "name": "certifications", "type": "set<string>" },
    { "name": "ag_mgmt_crop_longevity_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_soil_amendment_types", "type": "set<string>" },
    { "name": "ag_mgmt_crop_longevities", "type": "set<string>" },
    { "name": "ag_mgmt_tillage_ids", "type": "set<bigint>" },
    { "name": "material_valued_specific_other", "type": "string" },
    { "name": "ag_mgmt_tillages", "type": "set<string>" },
    { "name": "ag_mgmt_soil_amendment_type_ids", "type": "set<bigint>" },
    { "name": "non_monetary_dimension_unit_other", "type": "string" },
    { "name": "ag_mgmt_weed_pest_control_generals", "type": "set<string>" },
    { "name": "ag_mgmt_weed_pest_control_general_ids", "type": "set<bigint>" },
    { "name": "material_valued_sub_specific_other", "type": "string" },
    { "name": "agricultural_design_strategies", "type": "set<string>" },
    { "name": "ag_mgmt_soil_amendment_activity_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_crop_density_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_crop_densities", "type": "set<string>" },
    { "name": "ag_mgmt_soil_amendment_activities", "type": "set<string>" },
    { "name": "agricultural_design_strategy_other", "type": "string" },
    { "name": "ag_mgmt_cropping_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_croppings", "type": "set<string>" },
    { "name": "ag_mgmt_rotational", "type": "boolean" },
    { "name": "farm_scale_general", "type": "string" },
    { "name": "ag_mgmt_weed_pest_control_specific_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_weed_pest_control_specifics", "type": "set<string>" },
    { "name": "ag_mgmt_certification_management_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_certification_managements", "type": "set<string>" },
    { "name": "farm_scale_specific", "type": "string" },
    { "name": "crop_intensity", "type": "string" },
    { "name": "non_monetary_dimension_type_other", "type": "string" },
    { "name": "ag_mgmt_burning_ids", "type": "set<bigint>" },
    { "name": "ag_mgmt_burnings", "type": "set<string>" },
    { "name": "federal_program_generals", "type": "set<string>" }
  ]
}

CloudWatch Logs

1616770807107   "START RequestId: fcefc5f7-a62e-42d6-9a86-5c2a7b15d222 Version: $LATEST"
1616770807112   "2021-03-26 15:00:07 fcefc5f7-a62e-42d6-9a86-5c2a7b15d222 INFO  MetadataHandler:232 - doHandleRequest: request[GetTableRequest{queryId=463e7f8a-d762-43e2-b697-ade1e1176131, tableName=TableName{schemaName=default, tableName=evt_development}}]"
1616770807265   "2021-03-26 15:00:07 fcefc5f7-a62e-42d6-9a86-5c2a7b15d222 INFO  GlueMetadataHandler:360 - Column crop_longevity_ids with registered type bigint"
1616770807303   "2021-03-26 15:00:07 fcefc5f7-a62e-42d6-9a86-5c2a7b15d222 INFO  GlueMetadataHandler:360 - Column crop_ids with registered type set<bigint>"
1616770807311   "2021-03-26 15:00:07 fcefc5f7-a62e-42d6-9a86-5c2a7b15d222 WARN  DynamoDBMetadataHandler:233 - doGetTable: Unable to retrieve table evt_development from AWSGlue in database/schema default. Falling back to schema inference. If inferred schema is incorrect, create a matching table in Glue to define schema (see README)
java.lang.RuntimeException: Unexpected start type set
    at com.amazonaws.athena.connector.lambda.metadata.glue.GlueFieldLexer.lexComplex(GlueFieldLexer.java:108) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.metadata.glue.GlueFieldLexer.lex(GlueFieldLexer.java:68) ~[task/:?]
    at com.amazonaws.athena.connectors.dynamodb.DynamoDBMetadataHandler.convertField(DynamoDBMetadataHandler.java:465) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.GlueMetadataHandler.doGetTable(GlueMetadataHandler.java:361) ~[task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.GlueMetadataHandler.doGetTable(GlueMetadataHandler.java:308) ~[task/:?]
    at com.amazonaws.athena.connectors.dynamodb.DynamoDBMetadataHandler.doGetTable(DynamoDBMetadataHandler.java:230) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.MetadataHandler.doHandleRequest(MetadataHandler.java:250) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:132) [task/:?]
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:100) [task/:?]
    at lambdainternal.EventHandlerLoader$2.call(EventHandlerLoader.java:902) [LambdaSandboxJava-1.0.jar:?]
    at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:340) [LambdaSandboxJava-1.0.jar:?]
    at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:63) [LambdaSandboxJava-1.0.jar:?]
    at java.lang.Class.forName0(Native Method) ~[?:1.8.0_201]
    at java.lang.Class.forName(Class.java:348) [?:1.8.0_201]
    at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:150) [LambdaJavaRTEntry-1.0.jar:?]"
1616770807392   "2021-03-26 15:00:07 fcefc5f7-a62e-42d6-9a86-5c2a7b15d222 INFO  MetadataHandler:251 - doHandleRequest: response[GetTableResponse{tableName=TableName{schemaName=default, tableName=evt_development}, schema=Schema<value_status_id: Decimal(38, 9), continent_ids: List<continent_ids.element: Decimal(38, 9)>, published_dimension_unit_id: Decimal(38, 9), climate_group_id: Decimal(38, 9), country_ids: List<country_ids.element: Decimal(38, 9)>, percent_approved: Decimal(38, 9), valuation_methodology_general: Utf8, valuation_methodology_specific_id: Decimal(38, 9), transcription_id: Decimal(38, 9), ecosystem_descriptor_specific_id: Decimal(38, 9), pk2: Utf8, ecosystem_service_specific: Utf8, pk1: Utf8, ecosystem_descriptor_general: Utf8, climate_group: Utf8, area: Decimal(38, 9), dataset_ids: List<dataset_ids.element: Decimal(38, 9)>, ecosystem_descriptor_sub_specific_id: Decimal(38, 9), published_dimension_unit: Utf8, ecosystem_service_general: Utf8, ecosystem_service_category_id: Decimal(38, 9), ecosystem_service_sub_specific_id: Decimal(38, 9), countries: List<countries.element: Utf8>, ecosystem_service_specific_id: Decimal(38, 9), publication_type: Utf8, withins: List<withins.element: Utf8>, continents: List<continents.element: Utf8>, ecosystem_service_id: Decimal(38, 9), ecosystem_descriptor_id: Decimal(38, 9), proximities: List<proximities.element: Utf8>, bibliographic_reference: Utf8, calculation_type_id: Decimal(38, 9), ecosystem_descriptor_sub_specific: Utf8, name: Utf8, site_id: Decimal(38, 9), calculated_high: Decimal(38, 9), calculated_low: Decimal(38, 9), inflated_low: Decimal(38, 9), valuation_methodology_general_id: Decimal(38, 9), authors: Utf8, is_usd_acre_yr: Utf8, transcription_state_id: Decimal(38, 9), subcountry_ids: List<subcountry_ids.element: Decimal(38, 9)>, within_ids: List<within_ids.element: Decimal(38, 9)>, ecosystem_descriptor_specific: Utf8, title: Utf8, currency_code: Utf8, published_low: Decimal(38, 9), calculation_type: Utf8, proximity_ids: List<proximity_ids.element: Decimal(38, 9)>, flagged: Utf8, primary_authors: Utf8, valuation_methodology_specific: Utf8, climate_type_id: Decimal(38, 9), ecosystem_service_category: Utf8, published_currency_year: Decimal(38, 9), sk1: Utf8, currency: Utf8, published_time_unit: Utf8, value_id: Decimal(38, 9), sk2: Utf8, ecosystem_service_sub_specific: Utf8, valuation_methodology_comments: Utf8, site_scale: Utf8, year_published: Decimal(38, 9), value_status: Utf8, valuation_methodology_id: Decimal(38, 9), inflated_high: Decimal(38, 9), published_time_unit_id: Decimal(38, 9), published_high: Decimal(38, 9), ecosystem_service_general_id: Decimal(38, 9), climate_type: Utf8, inflated_currency_year: Decimal(38, 9), ecosystem_descriptor_general_id: Decimal(38, 9), bibliography_url: Utf8, subcountries: List<subcountries.element: Utf8>, currency_id: Decimal(38, 9), greenhouse_gasses: List<greenhouse_gasses.element: Utf8>>, partitionColumns=[], requestType=GET_TABLE, catalogName=dynamo-db-catalog}]"
1616770807411   "END RequestId: fcefc5f7-a62e-42d6-9a86-5c2a7b15d222"

I do notice that it looks like it might be choking on the schema type "set", although that was generated directly from the Glue Crawler. Do those need to be manually to change to "array"?

atennak1 commented 3 years ago

@beaksandclaws yes please try changing them to array. I don't remember the Glue Crawler defining set columns as sets. It tended to use Hive compatible types in its definition, so perhaps something was changed recently on the crawler side.

keelzebub commented 3 years ago

Are there any plans to support set? Because otherwise, every time the crawler runs I would have to manually go in and update the schema, which could end up being a lot of overhead.

keelzebub commented 3 years ago

Did a little bit more research, and it looks like the Glue Crawler uses a built-in classifier specifically for DynamoDB, which is why it makes schemas with the set type (because set is a valid type in DynamoDB). Given that, I think this might actually need to be logged as a bug, although possibly with the DynamoDB connector.

keelzebub commented 3 years ago

For posterity's sake, I've create a bug report here: https://github.com/awslabs/aws-athena-query-federation/issues/378

fmcastro commented 3 years ago

I'm glad I found this otherwise I'd never suspect the "Set" type... I was reading the "set" as a verb... like "Unexpected start type IS set" and had no clue to what was happening... Thank you!

RajasGujarathi commented 2 years ago

We are facing exactly the same issue, but in our case we did not define any crawler or to that extent did not even touch Glue service directly.

We took the DDB connector and installed it as an application in the AWS account. Then went to Athena and defined a DataSource. That is it, and started using Athena to query DynamoDB. Now the schema for DynamoDB table has changes and Athena does not understand the newly added column and hence throws an error Column 'xxxyyyzzz' cannot be resolved.

Is there a way to refresh the connector or tell the connector and to refresh the DynamoDB schema that it possess ?

henrymai commented 2 years ago

@fmcastro

This pull request has been merged and should address your set type request: https://github.com/awslabs/aws-athena-query-federation/pull/728

We don't currently have plans to cut a new release with this yet, but just wanted to give you a heads up in case you wanted to build the connector with this update yourselves.

henrymai commented 2 years ago

@fmcastro

We finally cut a release (2022.30.2) with this fix. Try it out and let us know if you have any problems.

fmcastro commented 2 years ago

Thanks for letting me know!

gfang-work commented 2 years ago

@RajasGujarathi - the connector automatically infers the schema by scanning a subset of items in the DDB table. You will need to follow the advice provided by Stuart in https://dev.to/aws-builders/finally-dynamodb-support-in-aws-quicksight-sort-of-2lbl

RajasGujarathi commented 2 years ago

Thanks, @gfang-work for the reference, I will give it a try.

For better/easy reachability

Stuart advice is present in the comments of the article