Support s3_data_dir and s3_data_naming

chronitis commented 2 years ago

Problem

Currently, the only options for determining where data ends up in S3 are to set s3_staging_bucket in the connection properties or set external_location on each model.

The s3_staging_bucket argument is ignored if the Athena workgroup already has a staging bucket configured (but is used by dbt seed, which always sets a location).

We have a (not uncommon?) layout of one S3 bucket for the results of all Athena queries from which objects are rapidly expired, and another with appropriate lifecycle configuration for storing created tables we want to keep.

Aside from setting external_location on every model (which isn't so easily composable across different profiles), there isn't a nice way to control S3 layout in dbt-athena as it stands.

Implementation

This adds two new (optional) connection options:

s3_data_dir: if set, the default root directory to create tables and seeds in (eg, s3://my-data-bucket/dbt/)
s3_data_naming: the strategy for naming subdirectories of s3_data_dir in which we'll actually store tables; two implemented options:
- uuid: tables or seeds are saved to {s3_data_dir}/{uuid4}/ (this was how seed tables were already named)
- schema_table: name according to the schema (ie, Athena database) and table like {s3_data_dir}/{schema}/{table}/

If s3_data_dir is unset, the behaviour should be unchanged:

Seed tables are saved to {s3_staging_dir}/tables/{uuid4}/
Tables are saved to external_location if set, and Athena's default otherwise.

chronitis commented 2 years ago

Worth noting that if using a static external location (ie, s3_data_naming=schema_table), this is affected by #54, so after #43 non-incremental tables don't get cleaned up, so seeds get their data duplicated and non-incremental tables will fail with a warning that there is data in the prefix already.

Either reverting #43 or merging #49 should fix this. s3_data_naming=uuid isn't affected since it won't try and reuse old prefixes.

chronitis commented 2 years ago

Rebased on 1.0.1

jessedobbelaere commented 2 years ago

That looks exactly what I was looking for. We want to move to a similar S3 structure, where we have a query bucket (s3_staging_dir) and a data bucket. With a lot of dbt deploys, the S3 bucket grows a lot because of the new UUID folders which we want to clean-up and set lifecycle policy rules on both the query results and parquet data.

jessedobbelaere commented 2 years ago

FYI, #49 has been merged (actually by #73) so this PR should be unblocked now? 😃

mrshu commented 2 years ago

@Tomme is there anything preventing this from being merged? It does seem like a very useful addition 🙂

VDFaller commented 2 years ago

Not sure if it helps if I comment to gain visibility on this approval. But my team is also waiting on this. If there's anything I can do to help I'd love to help.

Edit I did run it (s3_data_naming: schema_table, as well as s3_data_naming: uuid) and they both ran as expected.

rumbin commented 1 year ago

I'd also be interested in having this PR merged.

mattiamatrix commented 1 year ago

This PR is currently being review at https://github.com/dbt-athena/dbt-athena/pull/4

nicor88 commented 1 year ago

This PR and its refactor is being merged in https://github.com/dbt-athena/dbt-athena/pull/39 and available in dbt-athena-community==1.3.2

Tomme / dbt-athena

Support s3_data_dir and s3_data_naming #50

Problem

Implementation