Open chronitis opened 2 years ago
Worth noting that if using a static external location (ie, s3_data_naming=schema_table), this is affected by #54, so after #43 non-incremental tables don't get cleaned up, so seeds get their data duplicated and non-incremental tables will fail with a warning that there is data in the prefix already.
Either reverting #43 or merging #49 should fix this. s3_data_naming=uuid isn't affected since it won't try and reuse old prefixes.
Rebased on 1.0.1
That looks like exactly what I was looking for. We want to move to a similar S3 structure, where we have a query bucket (s3_staging_dir) and a data bucket. With a lot of dbt deploys, the S3 bucket grows a lot because of the new UUID folders, which we want to clean up, and we want to set lifecycle policy rules on both the query results and the parquet data.
FYI, #49 has been merged (actually by #73) so this PR should be unblocked now? 😃
@Tomme is there anything preventing this from being merged? It does seem like a very useful addition 🙂
Not sure if it helps if I comment to gain visibility on this approval. But my team is also waiting on this. If there's anything I can do to help I'd love to help.
Edit: I did run it (s3_data_naming: schema_table, as well as s3_data_naming: uuid) and they both ran as expected.
I'd also be interested in having this PR merged.
This PR is currently being reviewed at https://github.com/dbt-athena/dbt-athena/pull/4
This PR and its refactor are being merged in https://github.com/dbt-athena/dbt-athena/pull/39 and are available in dbt-athena-community==1.3.2
Problem
Currently, the only options for determining where data ends up in S3 are to set s3_staging_bucket in the connection properties or set external_location on each model.
The s3_staging_bucket argument is ignored if the Athena workgroup already has a staging bucket configured (but is used by dbt seed, which always sets a location).
We have a (not uncommon?) layout of one S3 bucket for the results of all Athena queries from which objects are rapidly expired, and another with appropriate lifecycle configuration for storing created tables we want to keep.
Aside from setting external_location on every model (which isn't so easily composable across different profiles), there isn't a nice way to control S3 layout in dbt-athena as it stands.
Implementation
This adds two new (optional) connection options:
- s3_data_dir: if set, the default root directory to create tables and seeds in (eg, s3://my-data-bucket/dbt/)
- s3_data_naming: the strategy for naming subdirectories of s3_data_dir in which we'll actually store tables; two implemented options:
  - uuid: tables or seeds are saved to {s3_data_dir}/{uuid4}/ (this was how seed tables were already named)
  - schema_table: name according to the schema (ie, Athena database) and table like {s3_data_dir}/{schema}/{table}/
If s3_data_dir is unset, the behaviour should be unchanged: seeds are saved to {s3_staging_dir}/tables/{uuid4}/, and tables use external_location if set, and Athena's default otherwise.
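For illustration, the naming rules described above can be sketched as a small standalone function. This is not the code from this PR: resolve_table_location and its signature are hypothetical, the "uuid" default for s3_data_naming is an assumption, and so is the precedence of external_location over s3_data_dir (inferred from s3_data_dir being described as the "default" root). Only model tables are covered; seed handling is left out.

```python
import uuid
from typing import Optional


def resolve_table_location(
    schema: str,
    table: str,
    s3_data_dir: Optional[str] = None,
    s3_data_naming: str = "uuid",  # assumed default, not confirmed by the PR
    external_location: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: return the S3 prefix for a table.

    Returns None when nothing is configured, meaning Athena's own
    default location would apply.
    """
    # Assumption: a per-model external_location overrides the profile default.
    if external_location:
        return external_location

    # No s3_data_dir: behaviour is unchanged, so defer to Athena.
    if s3_data_dir is None:
        return None

    root = s3_data_dir.rstrip("/")
    if s3_data_naming == "uuid":
        # A fresh prefix on every run, so old prefixes are never reused.
        return f"{root}/{uuid.uuid4()}/"
    if s3_data_naming == "schema_table":
        # A stable prefix derived from the Athena database and table name.
        return f"{root}/{schema}/{table}/"
    raise ValueError(f"Unknown s3_data_naming: {s3_data_naming!r}")
```

With s3_data_dir set to s3://my-data-bucket/dbt/ and s3_data_naming set to schema_table, a table orders in schema analytics would resolve to s3://my-data-bucket/dbt/analytics/orders/. Because that prefix is stable across runs, it is the case where leftover data from a previous run matters, as noted in the comments above.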