aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

Missing classification-parameter when creating table in Glue #7826

Open jorgenfroland opened 4 years ago

jorgenfroland commented 4 years ago

Hey. I haven't reported bugs before, so I hope I'm doing things correctly here.

When creating a Glue table using aws_cdk.aws_glue.Table with data_format = _glue.DataFormat.JSON, the classification is set to Unknown, and querying the table fails.

Reproduction Steps

from aws_cdk import aws_glue as _glue

glue_table = _glue.Table(
    self, 'GlueTable',
    database=_glue.Database.from_database_arn(
        self, 'GlueDatabase',
        'arn:aws:glue:region:{}:database/abc'.format(account_id)
    ),
    table_name='def_ghi',
    data_format=_glue.DataFormat.JSON,
    bucket=s3_bucket,
    s3_prefix='prefix/'
)

If I manually add "classification" with value "json" in the Table properties, after deploying with CDK, the query works fine.

Error Log

Amazon Invalid operation: Invalid DataCatalog response for external table "abc"."def_ghi": Cannot deserialize table. Missing mandatory field: Parameters in response from external catalog. ;

Environment


This is :bug: Bug Report

jorgenfroland commented 4 years ago

After some more fiddling around, I discovered that the problem probably has nothing to do with the classification=json parameter. I managed to make the query work just by editing the table and pressing Apply. I then compared the table definition before and after, and the only difference I could find was this:

SerdeInfo before:

'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe'}

SerdeInfo after:

'SerdeInfo': {'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe', 'Parameters': {}}

After some further thought, I see that this also correlates with the error message above.
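
The before/after difference can also be checked programmatically. This is a minimal sketch, assuming the table is a dict shaped like the `Table` entry of a boto3 `get_table` response. The helper name `serde_parameters_missing` is hypothetical and is not part of boto3 or CDK:

```python
def serde_parameters_missing(table: dict) -> bool:
    """Return True if the table's SerdeInfo has no 'Parameters' key at all.

    An empty map ({}) counts as present, matching the "after" state above.
    """
    serde_info = table.get("StorageDescriptor", {}).get("SerdeInfo", {})
    return "Parameters" not in serde_info


# The two SerdeInfo states from the comment above.
before = {"StorageDescriptor": {"SerdeInfo": {
    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"}}}
after = {"StorageDescriptor": {"SerdeInfo": {
    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
    "Parameters": {}}}}

print(serde_parameters_missing(before))  # True
print(serde_parameters_missing(after))   # False
```

The check deliberately tests for the presence of the key rather than for a non-empty value, since an empty map was enough to make the query work.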

jorgenfroland commented 4 years ago

To get around this I have added a post-deploy code snippet using boto3 to update the table, like this:

import boto3

glue_client = boto3.client('glue')

# Fetch the table definition that CDK deployed.
response = glue_client.get_table(
    DatabaseName=database_name,
    Name=table_name
)
table = response['Table']

# Add the empty SerdeInfo Parameters map that the error message asks for.
table['StorageDescriptor']['SerdeInfo']['Parameters'] = {}
# Not necessary, but removes the "classification: Unknown".
table.setdefault('Parameters', {})['classification'] = 'json'

# get_table also returns read-only fields that TableInput does not accept,
# so copy only these fields, skipping any that are absent.
input_keys = ('Name', 'Description', 'Retention',
              'StorageDescriptor', 'TableType', 'Parameters')
glue_client.update_table(
    DatabaseName=table['DatabaseName'],
    TableInput={key: table[key] for key in input_keys if key in table}
)
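
If several tables are affected, the same fix could be wrapped in a function and run after each deploy. This is a sketch under stated assumptions, not a tested production script. `patch_missing_serde_parameters` is a hypothetical helper name, and `glue_client` is a boto3 Glue client supplied by the caller. The copied `TableInput` fields are the ones from the snippet above plus `PartitionKeys`, which is added here on the assumption that partitioned tables should keep their keys:

```python
# Fields copied into TableInput. 'PartitionKeys' is an addition to the
# original snippet (assumption: partitioned tables should keep their keys).
TABLE_INPUT_KEYS = ("Name", "Description", "Retention", "StorageDescriptor",
                    "PartitionKeys", "TableType", "Parameters")


def patch_missing_serde_parameters(glue_client, database_name):
    """Add an empty SerdeInfo 'Parameters' map to every table in the
    database that lacks one. Returns the names of the patched tables."""
    patched = []
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            serde_info = table.get("StorageDescriptor", {}).get("SerdeInfo")
            if serde_info is None or "Parameters" in serde_info:
                continue  # no SerdeInfo, or Parameters already present
            serde_info["Parameters"] = {}
            glue_client.update_table(
                DatabaseName=database_name,
                TableInput={k: table[k] for k in TABLE_INPUT_KEYS if k in table},
            )
            patched.append(table["Name"])
    return patched
```

A call such as `patch_missing_serde_parameters(boto3.client('glue'), 'abc')` would leave tables that already have a `Parameters` map untouched, so it should be safe to run repeatedly.
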

iliapolo commented 4 years ago

Hi @jorgenfroland - Thanks for reporting this.

I believe this is rooted in either the Glue API or how CloudFormation invokes it. In any case, passing an empty map should be the same as not passing it at all, and CDK can probably mitigate this quirk.

Filing 👍

inirudebwoy commented 4 years ago

Thanks @jorgenfroland :) Your comment helped me solve the same problem.

riteshgrandhi commented 3 years ago

Can confirm this is happening in the TypeScript construct as well. Kudos to @jorgenfroland. Currently, the inability to add parameters such as classification and the S3 exclude path with the L2 construct is a real problem when using CDK to create Glue resources. Hope it gets stable soon.

vuchetichbalint commented 3 years ago

It is still a thing, is there any update on this?

github-actions[bot] commented 2 years ago

This issue has not received any attention in 1 year. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

javierdiegof commented 1 year ago

It is still a thing, is there any update on this?

eduardowillame commented 1 year ago

I've tested with Avro files and I'm facing the same error when using Spectrum. For testing, I ran UNLOAD in Athena to produce Avro files, then crawled them, and the table has classification=avro. Athena can read it with no issues, but in Redshift I get:

Cannot deserialize Table. Error: ----------------------------------------------- error: IsObject() code: 1000 context: rapidjson::GenericValue<Encoding, Allocator>::MemberIterator rapidjson::GenericValue<Encoding, Allocator>::FindMember(co

I've tested also unloading into Parquet and it worked well.

Does anybody know how to fix/workaround it?