aws-cloudformation / cloudformation-coverage-roadmap

The AWS CloudFormation Public Coverage Roadmap
https://aws.amazon.com/cloudformation/
Creative Commons Attribution Share Alike 4.0 International

Creating Iceberg tables via CloudFormation without using the Athena API #1827

Closed dmschauer closed 9 months ago

dmschauer commented 10 months ago

Name of the resource

AWS::Glue::Table

Resource name

No response

Description

The Iceberg format has been available in Athena for a year now (https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-athena-acid-transactions-powered-apache-iceberg/), but CloudFormation still doesn't support creating Iceberg tables. The only available option is to run a DDL query directly in Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html), which is inconvenient in large production environments where all cloud infrastructure is maintained in CloudFormation.

This issue https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/1595 already pointed out the same but unfortunately it was closed without an actual solution being implemented.

Please take a look at this response https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/1595#issuecomment-1583194377 which was liked by at least 7 others who are still facing the original problem.

There is still no direct CloudFormation support for creating Iceberg tables, so you have to go via the Athena API, which is inconvenient and unexpected.

Other Details

We are working with AWS CDK to generate our CloudFormation specifications. As a workaround we are currently using a CustomResource that calls a Lambda function, which in turn calls the Athena API to execute an Iceberg CREATE TABLE statement. We don't consider this a long-term solution, only a workaround until support is added to CloudFormation.
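For readers unfamiliar with this kind of workaround, here is a minimal sketch of what such a Lambda handler might look like. It is not the code used in our project. The property names (`Database`, `Table`, `Columns`, `Location`, `OutputLocation`) are hypothetical, and the return shape assumes the handler sits behind the CDK custom resource provider framework. Only Create is handled.

```python
def build_create_iceberg_ddl(database, table, columns, location):
    """Return an Athena DDL statement that creates an Iceberg table.

    columns is a list of (name, type) pairs; location is an S3 URI.
    """
    column_list = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE TABLE {database}.{table} ({column_list}) "
        f"LOCATION '{location}' "
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )


def handle_create(properties, athena_client):
    """Submit the DDL for a Create event and return the Athena query ID."""
    ddl = build_create_iceberg_ddl(
        properties["Database"],
        properties["Table"],
        [(c["Name"], c["Type"]) for c in properties["Columns"]],
        properties["Location"],
    )
    response = athena_client.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": properties["Database"]},
        ResultConfiguration={"OutputLocation": properties["OutputLocation"]},
    )
    return response["QueryExecutionId"]


def lambda_handler(event, context):
    """Entry point; boto3 is imported lazily so the helpers can be tested offline."""
    import boto3

    if event["RequestType"] == "Create":
        query_id = handle_create(event["ResourceProperties"], boto3.client("athena"))
        return {"PhysicalResourceId": query_id}
    # Update and Delete are deliberately left unimplemented in this sketch.
    return {"PhysicalResourceId": event.get("PhysicalResourceId", "iceberg-table")}
```

Note that the sketch only submits the query. A real handler would also need to poll the query status and decide what Update and Delete should do, which is part of why this approach feels like a workaround.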

milashenko commented 10 months ago

@dmschauer Does the following chunk of CloudFormation code from https://aws.amazon.com/blogs/big-data/introducing-aws-glue-crawler-and-create-table-support-for-apache-iceberg-format/ solve the issue for you?

      OpenTableFormatInput:
        IcebergInput:
          MetadataOperation: "CREATE"
          Version: "2"
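To show where that fragment sits in a full resource, here is a small sketch that builds an AWS::Glue::Table resource as a plain Python dict and prints it as a JSON template. The logical ID, database, table, bucket, and column names are placeholders, and the helper function is hypothetical rather than part of any AWS library.

```python
import json


def iceberg_table_resource(database, table, location, columns, catalog_id):
    """Build an AWS::Glue::Table resource that asks Glue to create Iceberg metadata."""
    return {
        "Type": "AWS::Glue::Table",
        "Properties": {
            "CatalogId": catalog_id,
            "DatabaseName": database,
            "TableInput": {
                "Name": table,
                "TableType": "EXTERNAL_TABLE",
                "StorageDescriptor": {
                    "Location": location,
                    "Columns": [{"Name": n, "Type": t} for n, t in columns],
                },
            },
            # This is the part that turns the table into an Iceberg table.
            "OpenTableFormatInput": {
                "IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}
            },
        },
    }


template = {
    "Resources": {
        "IcebergExampleTable": iceberg_table_resource(
            database="my_database",
            table="iceberg_example_table",
            location="s3://my-bucket/iceberg_example_table/",
            columns=[("mycol1", "date"), ("mycol2", "string")],
            catalog_id={"Ref": "AWS::AccountId"},
        )
    }
}
print(json.dumps(template, indent=2))
```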

dmschauer commented 10 months ago

@milashenko Thanks for the reply, do you know how to specify this in AWS CDK? So far in our project we only use the CDK to specify CF templates

milashenko commented 10 months ago

@dmschauer Didn't try myself, but this looks like the one https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_glue.CfnTable.OpenTableFormatInputProperty.html

dmschauer commented 10 months ago

Hi @milashenko thanks for the reply.

This didn't help directly in our case as we're using AWS CDK to generate the CF templates, but it gave me hope that CF actually does support Iceberg tables and it does! My bad! This issue can be closed.

For anyone else who stumbles upon this issue, the AWS CDK (Python) code below can be used to construct Iceberg tables via CloudFormation without any weird workarounds. The trick is specifying open_table_format_input:

from aws_cdk import (
    Stack,
    aws_glue as glue,
)
from constructs import Construct  # needed for the type hint on `scope`

class IcebergtabletestStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        iceberg_table = glue.CfnTable(
            scope=self,
            id="iceberg_example_table",
            database_name="my_database",
            table_input=glue.CfnTable.TableInputProperty(
                table_type="EXTERNAL_TABLE",
                description="Enter description here",
                name="iceberg_example_table",
                storage_descriptor=glue.CfnTable.StorageDescriptorProperty(
                    location="s3://<my_bucket>/iceberg_example_table/",
                    columns=[
                        glue.CfnTable.ColumnProperty(name="mycol1", type="date"),
                        glue.CfnTable.ColumnProperty(name="mycol2", type="string"),
                        glue.CfnTable.ColumnProperty(name="mycol3", type="timestamp"),
                    ],
                )
            ),
            open_table_format_input=glue.CfnTable.OpenTableFormatInputProperty(
                iceberg_input=glue.CfnTable.IcebergInputProperty(
                    metadata_operation="CREATE",
                    version="2"
                )
            )
        )

aws-jeffrey-yang commented 9 months ago

Closing issue, can create Iceberg tables via CF.

oleksiiburov commented 9 months ago

Hi @milashenko, could you please help with creating an Iceberg table with partitions? I am using the snippet from the article you shared (https://aws.amazon.com/blogs/big-data/introducing-aws-glue-crawler-and-create-table-support-for-apache-iceberg-format/) but I also added partition keys:

        PartitionKeys:
          - Name: year
            Type: int
          - Name: month
            Type: int
          - Name: day
            Type: int

During CF stack deployment I got:

Cannot create partitions in an iceberg table (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 8e0a6c4f-c48e-4ddf-adc3-b3763d812d76; Proxy: null)
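
Since Glue only rejects this combination at deploy time, a pre-deploy check can catch it earlier. The following is a minimal sketch, assuming the template has already been loaded into a Python dict. The function name is hypothetical, and the check only covers the case reported above (PartitionKeys together with IcebergInput).

```python
def find_partitioned_iceberg_tables(template):
    """Return logical IDs of Glue tables that combine IcebergInput with PartitionKeys."""
    offenders = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Glue::Table":
            continue
        props = resource.get("Properties", {})
        is_iceberg = "IcebergInput" in props.get("OpenTableFormatInput", {})
        has_partitions = bool(props.get("TableInput", {}).get("PartitionKeys"))
        if is_iceberg and has_partitions:
            offenders.append(logical_id)
    return offenders
```

An empty list means no table in the template triggers the error shown above.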

milashenko commented 9 months ago

@oleksiiburov Unfortunately, I was also unable to add partitions as part of the template. I could only add them afterwards from an Athena Spark notebook, like this:

ALTER TABLE telemetry_iceberg ADD PARTITION FIELD deviceid AS deviceid
ALTER TABLE telemetry_iceberg ADD PARTITION FIELD months(date_field) AS month

More can be found here: https://iceberg.apache.org/docs/latest/spark-ddl/#partitioned-by
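
If several partition fields need to be added after the stack is deployed, the statements can be generated rather than typed by hand. This is a small sketch. The helper name is hypothetical, and it only builds strings in the shape shown above without validating the transform expressions.

```python
def add_partition_field_statements(table, fields):
    """Build one ALTER TABLE ... ADD PARTITION FIELD statement per (expression, alias) pair."""
    return [
        f"ALTER TABLE {table} ADD PARTITION FIELD {expression} AS {alias}"
        for expression, alias in fields
    ]


statements = add_partition_field_statements(
    "telemetry_iceberg",
    [("deviceid", "deviceid"), ("months(date_field)", "month")],
)
for statement in statements:
    # In a Spark session with the Iceberg SQL extensions enabled,
    # each statement would be passed to spark.sql(statement).
    print(statement)
```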

sfgarcia commented 7 months ago

Hi @dmschauer. I think the issue with Iceberg tables in CDK is not totally solved, because for now the only allowed metadata operation is CREATE (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-glue-table-iceberginput.html). It would still be necessary to run queries in Athena if you want to add columns or change column or table names, so not all operations can be managed through AWS CDK.

dmschauer commented 7 months ago

@sfgarcia I'm aware of that. The issue I opened here is merely about support for CREATE being in place at all, and as it turned out, it is (although partitioning via CloudFormation isn't supported, that's a separate issue: https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/1866). "Iceberg tables in CDK" being "totally solved" isn't what this issue is supposed to be about. I agree with what you say, but I'm not sure why this information is directed at me. I'm a user as well, I don't work for AWS.

Regarding the actual problem, someone else has already opened another issue about updates to Iceberg tables not being supported by CloudFormation. Coincidentally, both of us have been tagged there (https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/1919#issuecomment-1923993187).

Smotrov commented 1 month ago

For now, you can create an Iceberg table like this:

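  // Assumed imports (not shown in the original comment):
  //   import * as glue from '@aws-cdk/aws-glue-alpha';   // L2 constructs such as S3Table
  //   import * as mainGlue from 'aws-cdk-lib/aws-glue';  // L1 constructs such as CfnTable
  const myTable = new glue.S3Table(props.scope, 'IcebergTest2', {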
    database: props.database,
    tableName: 'iceberg_test2',
    bucket: props.bucket,
    s3Prefix: 'iceberg_test2',
    dataFormat: glue.DataFormat.PARQUET,
    columns: [{
      name: 'col1',
      type: glue.Schema.STRING,
    }],
  });

  // Hack starts here to make the table Iceberg
  const cfnTable = myTable.node.defaultChild as mainGlue.CfnTable;

  cfnTable.openTableFormatInput = {
    icebergInput: {
      metadataOperation: 'CREATE',
      version: '2',
    }
  };