Tomme / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
Apache License 2.0
142 stars 79 forks source link

Support s3_data_dir and s3_data_naming #50

Open chronitis opened 2 years ago

chronitis commented 2 years ago

Problem

Currently, the only options for determining where data ends up in S3 are to set s3_staging_bucket in the connection properties or set external_location on each model.

The s3_staging_bucket argument is ignored if the Athena workgroup already has a staging bucket configured (but is used by dbt seed, which always sets a location).

We have a (not uncommon?) layout of one S3 bucket for the results of all Athena queries from which objects are rapidly expired, and another with appropriate lifecycle configuration for storing created tables we want to keep.

Aside from setting external_location on every model (which isn't so easily composable across different profiles), there isn't a nice way to control S3 layout in dbt-athena as it stands.

Implementation

This adds two new (optional) connection options:

If s3_data_dir is unset, the behaviour should be unchanged:

chronitis commented 2 years ago

Worth noting that if using a static external location (ie, s3_data_naming=schema_table), this is affected by #54, so after #43 non-incremental tables don't get cleaned up, so seeds get their data duplicated and non-incremental tables will fail with a warning that there is data in the prefix already.

Either reverting #43 or merging #49 should fix this. s3_data_naming=uuid isn't affected since it won't try and reuse old prefixes.

chronitis commented 2 years ago

Rebased on 1.0.1

jessedobbelaere commented 2 years ago

That looks exactly what I was looking for. We want to move to a similar S3 structure, where we have a query bucket (s3_staging_dir) and a data bucket. With a lot of dbt deploys, the S3 bucket grows a lot because of the new UUID folders which we want to clean-up and set lifecycle policy rules on both the query results and parquet data.

jessedobbelaere commented 2 years ago

FYI, #49 has been merged (actually by #73) so this PR should be unblocked now? 😃

mrshu commented 2 years ago

@Tomme is there anything preventing this from being merged? It does seem like a very useful addition 🙂

VDFaller commented 2 years ago

Not sure if it helps if I comment to gain visibility on this approval. But my team is also waiting on this. If there's anything I can do to help I'd love to help.

Edit I did run it (s3_data_naming: schema_table, as well as s3_data_naming: uuid) and they both ran as expected.

rumbin commented 1 year ago

I'd also be interested in having this PR merged.

mattiamatrix commented 1 year ago

This PR is currently being review at https://github.com/dbt-athena/dbt-athena/pull/4

nicor88 commented 1 year ago

This PR and its refactor is being merged in https://github.com/dbt-athena/dbt-athena/pull/39 and available in dbt-athena-community==1.3.2