Tomme / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
Apache License 2.0

Feature/table iceberg #135

Closed nicor88 closed 1 year ago

nicor88 commented 1 year ago

What

This PR introduces Iceberg as a table materialization.

As Iceberg doesn't support CTAS, the implementation does the following:
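The PR's step list isn't reproduced in this extract. As an illustration only (an assumption, not the PR's actual code), a table materialization without CTAS generally has to issue two statements: an explicit CREATE TABLE with the target schema and location, then an INSERT ... SELECT of the model query. A minimal sketch of rendering that pair; all names (`analytics`, `example_iceberg`, the columns) are hypothetical:

```python
# Hypothetical sketch: without CTAS, an Iceberg table materialization needs an
# explicit CREATE TABLE (schema + location) followed by an INSERT ... SELECT.

def render_iceberg_create_and_insert(database, table, columns, location, select_sql):
    """Render the CREATE TABLE + INSERT statement pair for an Iceberg table."""
    column_ddl = ",\n    ".join(f"{name} {dtype}" for name, dtype in columns)
    create = (
        f"CREATE TABLE {database}.{table} (\n    {column_ddl}\n)\n"
        f"LOCATION '{location}'\n"
        f"TBLPROPERTIES ('table_type'='ICEBERG')"
    )
    insert = f"INSERT INTO {database}.{table}\n{select_sql}"
    return create, insert

create_sql, insert_sql = render_iceberg_create_and_insert(
    "analytics",
    "example_iceberg",
    [("user_id", "string"), ("status", "string"), ("amount", "double")],
    "s3://my_bucket/my_table/",
    "SELECT 'A' AS user_id, 'active' AS status, 17.89 AS amount",
)
```

Both statements would then be executed against Athena in sequence, which is why a failed insert can leave an empty table behind, unlike an atomic CTAS.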

Notes

Models used to test

Without partitions

{{ config(
    materialized='table',
    format='iceberg'
) }}

SELECT 'A' AS user_id, 'pi' AS name, 'active' AS status, 17.89 AS amount, 1 AS quantity
UNION ALL
SELECT 'B' AS user_id, 'sh' AS name, 'active' AS status, 1 AS amount, 10000 AS quantity
UNION ALL
SELECT 'C' AS user_id, 'zh' AS name, 'not_active' AS status, 20.54 AS amount, 340000 AS quantity

With partitions

{{ config(
    materialized='table',
    format='iceberg',
    partitioned_by=['status']
) }}

SELECT 'A' AS user_id, 'pi' AS name, 'active' AS status, 17.89 AS amount, 1 AS quantity
UNION ALL
SELECT 'B' AS user_id, 'sh' AS name, 'active' AS status, 1 AS amount, 10000 AS quantity
UNION ALL
SELECT 'C' AS user_id, 'zh' AS name, 'not_active' AS status, 20.54 AS amount, 340000 AS quantity

With external location

{{ config(
    materialized='table',
    format='iceberg',
    external_location='s3://my_bucket/my_table/'
) }}

SELECT 'A' AS user_id, 'pi' AS name, 'active' AS status, 17.89 AS amount, 1 AS quantity
UNION ALL
SELECT 'B' AS user_id, 'sh' AS name, 'active' AS status, 1 AS amount, 10000 AS quantity
UNION ALL
SELECT 'C' AS user_id, 'zh' AS name, 'not_active' AS status, 20.54 AS amount, 340000 AS quantity

With different data types

{{ config(
    materialized='table',
    format='iceberg',
    partitioned_by=['status']
) }}

SELECT
    'A' AS user_id,
    'pi' AS name,
    'active' AS status,
    17.89 AS cost,
    1 AS quantity,
    100000000 AS quantity_big,
    current_date AS my_date,
    cast(current_timestamp as timestamp) AS my_timestamp

Table properties

{{ config(
    materialized='table',
    format='iceberg',
    partitioned_by=['status'],
    table_properties={
        'write_target_data_file_size_bytes': '134217728',
        'optimize_rewrite_delete_file_threshold': '2'
        }
) }}

SELECT
    'A' AS user_id,
    'pi' AS name,
    'active' AS status,
    17.89 AS cost,
    1 AS quantity,
    100000000 AS quantity_big,
    current_date AS my_date,
    cast(current_timestamp as timestamp) AS my_timestamp
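As a quick sanity check on the value used above, `write_target_data_file_size_bytes` of 134217728 is simply 128 MiB expressed in bytes:

```python
# 134217728 bytes = 128 MiB (2**27), the target data file size set in the
# table_properties example above.
target_bytes = 128 * 1024 * 1024
```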

Not strict location

{{ config(
    materialized='table',
    format='iceberg',
    partitioned_by=['status'],
    external_location='s3://my_bucket/silver_athena/example_iceberg/',
    strict_location=False,
    table_properties={
        'optimize_rewrite_delete_file_threshold': '2'
        }
) }}

SELECT
    'A' AS user_id,
    'pi' AS name,
    'active' AS status,
    17.89 AS cost,
    1 AS quantity,
    100000000 AS quantity_big,
    current_date AS my_date,
    cast(current_timestamp as timestamp) AS my_timestamp

nicor88 commented 1 year ago

@Tomme could you have a look??

rumbin commented 1 year ago

a drop statement of iceberg table lead to deleting all the data in S3

What do you mean by "all data"? Does this imply any risk of losing data of other tables/relations? What would be the desired outcome of dropping the relation?

nicor88 commented 1 year ago

@rumbin dropping a table in the current adapter setup means: looking up the table's location path, cleaning it up via the S3 APIs, then finally dropping the table in Athena. See implementation here.

When working with Iceberg, a simple drop table will drop the data in the specified location of the table; that's what I mean by "all data". Also, as we are talking about materialized='table' here: in the dbt world, a table materialization drops the original table and re-creates it from scratch, so it's the first basic primitive when working with dbt.

I re-worked the comment saying:

a drop statement of an Iceberg table leads to deleting the data in S3 automatically in the specified location
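The drop flow described above (look up the table's location, clean up the objects via the S3 APIs, then drop the table) can be sketched as follows. This is an illustrative shape under assumptions, not the adapter's actual code: `drop_table_with_cleanup` is a hypothetical name, and the Glue/S3 clients are assumed to be boto3 clients passed in by the caller.

```python
from urllib.parse import urlparse

def split_s3_location(location):
    """Split a table's 's3://bucket/prefix/' location into (bucket, prefix)."""
    parsed = urlparse(location)
    return parsed.netloc, parsed.path.lstrip("/")

def drop_table_with_cleanup(glue_client, s3_client, database, table):
    """Hypothetical sketch of the drop behaviour described in the comment."""
    # 1. look up the table's storage location in the Glue catalog
    meta = glue_client.get_table(DatabaseName=database, Name=table)
    location = meta["Table"]["StorageDescriptor"]["Location"]
    bucket, prefix = split_s3_location(location)
    # 2. delete every object under that prefix via the S3 APIs
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3_client.delete_object(Bucket=bucket, Key=obj["Key"])
    # 3. finally, drop the table in the catalog
    glue_client.delete_table(DatabaseName=database, Name=table)
```

With Iceberg, step 2 becomes unnecessary for a plain drop, since dropping the table already removes the data in its location; that is what the quoted note warns about.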

rumbin commented 1 year ago

Thanks for clarifying 😊 I know what DROP means, but your initial description confused me.

nicor88 commented 1 year ago

@rumbin well, in the Athena world a drop statement on a table is not like Postgres/Redshift/Snowflake, but with Iceberg it should now get closer to that. Anyhow, the Iceberg helper adds a simple macro to just drop the table, without calling the S3 path pruning.

rumbin commented 1 year ago

I see. Thanks again for explaining.

nicor88 commented 1 year ago

Closing, as this has been added to dbt-athena-community.