aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-glue-alpha: S3-Table table properties are added to the wrong parameters section #27365

Open guyernest opened 1 year ago

guyernest commented 1 year ago

Describe the bug

The TableInput property of AWS::Glue::Table has two separate Parameters sections: one at the table level and one inside the StorageDescriptor. The current S3Table implementation puts all custom parameters into the StorageDescriptor Parameters section and leaves the table-level Parameters section hard-coded.

The use case is dynamic partitioning, which uses projection.<dynamic-partitioning>.format and similar parameters to define how Glue (and Athena) parse the dynamic partitioning field. This is a common way to archive data into S3 using Kinesis Firehose.
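To make the parameter family concrete, here is a minimal sketch that builds the projection keys for one date-typed partition column. The key names are the ones used in this issue; the DateProjection interface and the dateProjectionParameters helper are hypothetical names introduced only for illustration and are not part of aws-glue-alpha.

```typescript
// Hypothetical helper (not part of the CDK): builds the partition-projection
// table properties for a single date-typed partition column.
interface DateProjection {
  column: string;       // partition key name, e.g. 'datehour'
  format: string;       // date format of the partition value, e.g. 'yyyy/MM/dd'
  range: string;        // e.g. '2021/01/01,NOW'
  interval: number;     // step between consecutive partitions
  intervalUnit: string; // e.g. 'DAYS'
}

function dateProjectionParameters(p: DateProjection): Record<string, string> {
  const prefix = `projection.${p.column}`;
  return {
    'projection.enabled': 'true',
    [`${prefix}.type`]: 'date',
    [`${prefix}.format`]: p.format,
    [`${prefix}.range`]: p.range,
    [`${prefix}.interval`]: String(p.interval),
    [`${prefix}.interval.unit`]: p.intervalUnit,
  };
}

// Usage with the values from this issue.
const projectionParams = dateProjectionParameters({
  column: 'datehour',
  format: 'yyyy/MM/dd',
  range: '2021/01/01,NOW',
  interval: 1,
  intervalUnit: 'DAYS',
});
console.log(projectionParams);
```

All of these keys are table properties, so they are expected in the table-level Parameters section rather than in the StorageDescriptor.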

Expected Behavior

When using the following code in the CDK:

    var replication_table = new glue.S3Table(this, 'ReplicationTable', {
      database: replication_database,
      tableName: <Glue-Table-Name>, 
      columns: <Columns>,
      partitionKeys: [{
        name: 'datehour',
        type: glue.Schema.STRING,
      }],
      bucket: eventsBucket,
      s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
      storedAsSubDirectories: true,
      storageParameters: [
        glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
        // Parameters relevant for the calculation of the dynamic partitioning
        glue.StorageParameter.custom('projection.enabled', 'true'),
        glue.StorageParameter.custom('projection.datehour.type', 'date'), 
        glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'), 
        glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'), 
        glue.StorageParameter.custom('projection.datehour.interval', '1'), 
        glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'), 
        glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'), 
        glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
        glue.StorageParameter.custom('compressionType', 'gzip'),
      ],
      dataFormat: glue.DataFormat.JSON,
      enablePartitionFiltering: true,
      compressed: true,
    });

I expect to get the following CFN snippet:

  "ReplicationTable2E30ABDE": {
   "Type": "AWS::Glue::Table",
   "Properties": {
    "CatalogId": {
     "Ref": "AWS::AccountId"
    },
    "DatabaseName": {
     "Ref": "DatabaseB269D8BB"
    },
    "TableInput": {
     "Name": <Glue-Table-Name>,
     "Parameters": {
      "classification": "json",
      "partition_filtering.enabled": true,
      "projection.enabled": "true",
      "projection.datehour.type": "date",
      "projection.datehour.format": "yyyy/MM/dd",
      "projection.datehour.range": "2021/01/01,NOW",
      "projection.datehour.interval": "1",
      "projection.datehour.interval.unit": "DAYS",
      "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
      "EXTERNAL": "TRUE",
      "compressionType": "gzip"
     },
     "PartitionKeys": [
      {
       "Name": "datehour",
       "Type": "string"
      }
     ],
     "StorageDescriptor": {
      "Columns": [<Columns>],
      "Compressed": true,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location": {
       "Fn::Join": [
        "",
        [
         "s3://",
         {
          "Ref": "EventsBucketCD4657F9"
         },
         "/events/table=<DynamoDB-Table-Name>/"
        ]
       ]
      },
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Parameters": {
       "compression_type": "gzip"
      },
      "SerdeInfo": {
       "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "StoredAsSubDirectories": true
     },
     "TableType": "EXTERNAL_TABLE"
    }
   }
  }

Please note that the projection parameters are under the TableInput Parameters section, not under the StorageDescriptor Parameters section.

Current Behavior

Instead, I get the following stack snippet:

  "Type": "AWS::Glue::Table",
   "Properties": {
    "CatalogId": {
     "Ref": "AWS::AccountId"
    },
    "DatabaseName": {
     "Ref": "DatabaseB269D8BB"
    },
    "TableInput": {
     "Name": <Glue-Table-Name>,
     "Parameters": {
      "classification": "json",
      "has_encrypted_data": true,
      "partition_filtering.enabled": true
     },
     "PartitionKeys": [
      {
       "Name": "datehour",
       "Type": "string"
      }
     ],
     "StorageDescriptor": {
      "Columns": [ <Columns>],
      "Compressed": true,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location": {
       "Fn::Join": [
        "",
        [
         "s3://",
         {
          "Ref": "EventsBucketCD4657F9"
         },
         "/events/table=<DynamoDB-Table-Name>/"
        ]
       ]
      },
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Parameters": {
       "compression_type": "gzip",
       "projection.datehour.enabled": "true",
       "projection.datehour.type": "date",
       "projection.datehour.format": "yyyy/MM/dd",
       "projection.datehour.range": "2021/01/01,NOW",
       "projection.datehour.interval": "1",
       "projection.datehour.interval.unit": "DAYS",
       "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
       "EXTERNAL": "TRUE",
       "compressionType": "gzip"
      },
      "SerdeInfo": {
       "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "StoredAsSubDirectories": true
     },
     "TableType": "EXTERNAL_TABLE"
    }
   }

Please note that the dynamic partitioning parameters are added to the wrong Parameters section (the one under StorageDescriptor).

Reproduction Steps

Use code similar to the following in your stack definition under /lib:

    var replication_table = new glue.S3Table(this, 'ReplicationTable', {
      database: replication_database,
      tableName: <Glue-Table-Name>, 
      columns: <Columns>,
      partitionKeys: [{
        name: 'datehour',
        type: glue.Schema.STRING,
      }],
      bucket: eventsBucket,
      s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
      storedAsSubDirectories: true,
      storageParameters: [
        glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
        // Parameters relevant for the calculation of the dynamic partitioning
        glue.StorageParameter.custom('projection.enabled', 'true'),
        glue.StorageParameter.custom('projection.datehour.type', 'date'), 
        glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'), 
        glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'), 
        glue.StorageParameter.custom('projection.datehour.interval', '1'), 
        glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'), 
        glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'), 
        glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
        glue.StorageParameter.custom('compressionType', 'gzip'),
      ],
      dataFormat: glue.DataFormat.JSON,
      enablePartitionFiltering: true,
      compressed: true,
    });

Possible Solution

I can think of three options to solve the bug:

Additional Information/Context

As mentioned above, this is part of a common pipeline that replicates a DynamoDB table to S3 so that the data can be queried analytically from Athena. In the example above (extendedS3DestinationConfiguration), the user can define the format of the dynamic partitioning of the data in Firehose. If we fix this issue with a similarly focused method (option 3 above), it will be easy to extend constructs such as KinesisStreamsToKinesisFirehoseToS3, AwsDynamoDBKinesisStreamsS3 or KinesisFirehoseToS3 to support creating the Glue table on top of the data in S3.

CDK CLI Version

2.99.0 (build 0aa1096)

Framework Version

No response

Node.js Version

v16.18.1

OS

macOS

Language

TypeScript

Language Version

No response

Other information

No response

indrora commented 1 year ago

Can you point to the Glue docs (or CloudFormation docs for the Glue CFN) where these are described?

guyernest commented 1 year ago

Thank you @indrora for your attention.

If you check the TableInput in Glue CFN, you can see that it has Parameters and StorageDescriptor.
The StorageDescriptor CFN also has a Parameters section.

This is the source of the confusion, as some of the parameters should go to the TableInput Parameters section and some to the StorageDescriptor Parameters section.
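The split between the two sections can be sketched as a small routing function. This is only an illustration under stated assumptions: the names ParameterMap, isTableLevel and splitParameters are hypothetical, and the default rule (keys starting with projection. plus storage.location.template go to the table level) is an assumption drawn from the expected output in this issue, not the behavior of the CDK.

```typescript
// Hypothetical types and routing rule; none of this is CDK API.
type ParameterMap = Record<string, string>;

interface SplitParameters {
  tableParameters: ParameterMap;   // destined for TableInput.Parameters
  storageParameters: ParameterMap; // destined for TableInput.StorageDescriptor.Parameters
}

// Assumption: partition-projection keys belong at the table level.
function isTableLevel(key: string): boolean {
  return key.startsWith('projection.') || key === 'storage.location.template';
}

function splitParameters(
  all: ParameterMap,
  tableLevel: (key: string) => boolean = isTableLevel,
): SplitParameters {
  const result: SplitParameters = { tableParameters: {}, storageParameters: {} };
  for (const [key, value] of Object.entries(all)) {
    const target = tableLevel(key) ? result.tableParameters : result.storageParameters;
    target[key] = value;
  }
  return result;
}

// Usage with a few of the parameters from this issue.
const split = splitParameters({
  compression_type: 'gzip',
  'projection.enabled': 'true',
  'projection.datehour.type': 'date',
  'storage.location.template': 's3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/',
});
console.log(split.tableParameters);
console.log(split.storageParameters);
```

The routing predicate is a parameter so that a caller can supply a different rule, since the exact list of table-level keys is not settled in this thread.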

guyernest commented 1 year ago

Here is another link to the specific parameters that are needed for Athena: https://docs.aws.amazon.com/athena/latest/ug/partition-projection-setting-up.html

As described above in the possible solutions, we can either add a general option for adding parameters to the TableInput in Glue, or add a more specific option for the projection parameters that Athena defines.

fastrockstar commented 1 year ago

I ran into this when I tried to add the skip.header.line.count property to the table using

storage_parameters=[glue.StorageParameter.skip_header_line_count(1)]

As you showed, it was written into the wrong parameter section.

After copying it to the correct place in the template and deploying it manually, the table property was correctly configured as expected.

Thank you for fixing it.
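The manual fix described in the comment above (moving a key to the correct place in the template) can be sketched as a function over the resource JSON. This is a minimal sketch, assuming the resource has the shape shown in the snippets in this issue; GlueTableResource and moveToTableParameters are hypothetical names, and the classification value in the sample input is made up for the example.

```typescript
// Minimal shape of the relevant part of an AWS::Glue::Table resource.
interface GlueTableResource {
  Properties: {
    TableInput: {
      Parameters?: Record<string, unknown>;
      StorageDescriptor?: { Parameters?: Record<string, unknown> };
    };
  };
}

// Hypothetical helper: moves the named keys from
// TableInput.StorageDescriptor.Parameters to TableInput.Parameters, in place.
function moveToTableParameters(resource: GlueTableResource, keys: string[]): void {
  const tableInput = resource.Properties.TableInput;
  const storage = tableInput.StorageDescriptor?.Parameters;
  if (!storage) {
    return; // nothing to move
  }
  const table = tableInput.Parameters ?? {};
  tableInput.Parameters = table;
  for (const key of keys) {
    if (Object.prototype.hasOwnProperty.call(storage, key)) {
      table[key] = storage[key];
      delete storage[key];
    }
  }
}

// Usage: sample input only; the classification value is an assumption.
const tableResource: GlueTableResource = {
  Properties: {
    TableInput: {
      Parameters: { classification: 'csv' },
      StorageDescriptor: { Parameters: { 'skip.header.line.count': '1' } },
    },
  },
};
moveToTableParameters(tableResource, ['skip.header.line.count']);
console.log(JSON.stringify(tableResource.Properties.TableInput, null, 2));
```

Keys that are not listed stay where they are, so storage-level settings such as compression_type are left untouched.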