dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
297 stars 119 forks source link

Add AWS Athena plugin #133

Closed daniel-cortez-stevenson closed 5 months ago

daniel-cortez-stevenson commented 2 years ago

Description & motivation

Added AWS Athena so we can use AWS Athena external sources in DBT. This is nice for using public data sources like Open Street Maps and Facebook Data for Good Population Density.

Assumes use of the dbt-athena adapter

Considerations:

New env variables need to be set for CI:

Checklist

daniel-cortez-stevenson commented 2 years ago

ci/circleci: integration-athena will fail as long as the environment variables are not set and the infra is not set up in the AWS account that circleci is running in.

daniel-cortez-stevenson commented 2 years ago

I hope this is a wanted contribution to the code! TBH I did this for myself and then got inspired to contribute. I'm 💯 OK with running my own modified local dbt package, so no hard feelings if this is out of scope and closed. Cheers 🍾

daniel-cortez-stevenson commented 2 years ago

The query-comment: thing is troublesome ... any ideas how to fix this 'hack' a little more sustainably? Is it the job of this package, dbt-athena, dbt-core, or AWS Athena to resolve this?

daniel-cortez-stevenson commented 2 years ago

@jtcohen6 now that we've resolved the issue with the query comment in the DBT Athena adapter, do we want to move forward with this PR?

The steps might be:

  1. Resolve the architecture issue for the Redshift and Athena plugins by duplicating some code OR by moving some shared macros to common
  2. Add CI/CD support for the Athena plugin
jessedobbelaere commented 1 year ago

I have copied the macro's in this PR as overrides for the dbt-external-tables plugin to be able to read cloudtrail data, and the macro's worked great with Athena. Any idea how this PR should proceed to get merged in?

nicor88 commented 1 year ago

@daniel-cortez-stevenson @jtcohen6 is there any plan to move this PR forward? it's a pretty interesting feature that will be important to have.

brabster commented 1 year ago

I've been having a play with this PR content (against the dbt-athena-community adapter@1.3.5, which is the recommended one afaik now) - I'm having difficulties with later versions of dbt-athena-community. How are the dependencies on the adapters handled? Seems like a change could be unwittingly introduced in an adapter that could break this build and consumers of this package. (Aside, if I get it working against latest to my satisfaction I'll raise a new PR, assuming this one doesn't get sorted in the meantime)

mathewcohle commented 1 year ago

would love to see this merge - right now it seems there is no good solution how to manage EXTERNAL TABLE resources with dbt and Athena setup

aidan-o-boyle-kroo commented 11 months ago

what is left to do with this pr? can i help complete it?

rstml commented 8 months ago

It seems that this won't be merged anytime soon. Does anyone have a working solution with the latest dbt-athena version?

I'm looking for a solution to define source Athena tables and lftag permissions.

jessedobbelaere commented 8 months ago

Yeah too bad that this PR is stuck.

I am using external tables in my day to day, I've installed this package:

 - package: dbt-labs/dbt_external_tables
    version: 0.8.2

Then added these lines to my dbt_project.yml:

dispatch:
  - macro_namespace: dbt_external_tables
    search_order: [my_custom_dbt_macros, dbt_external_tables]

# Stage the external tables when dbt starts running, using https://github.com/dbt-labs/dbt-external-tables
on-run-start:
  - "{{ dbt_external_tables.stage_external_sources() }}"

And created a private git repo with the macro from this PR, which I installed in packages.yml:

  - git: git@gitlab.com:XXX/dbt-macros.git
    revision: 0.0.1

Which has a dbt_project.yml:

name: 'my_custom_dbt_macros'
version: '1.0.0'
config-version: 2

require-dbt-version: ">=1.0.0"

And the files from this PR in a dbt_external_tables subfolder. Seems to work, but I only have a basic use case... (Cloudtrail S3 data)

rstml commented 8 months ago

Thanks @jessedobbelaere!

I'd rather had this merged, but in the mean time, if anyone needs this, I packaged the solution above at https://github.com/rstml/dbt-external-tables-athena

nicor88 commented 8 months ago

@rstml nice one, feel free to post this also in #db-athena in dbt slack workspace, I believe that some folks will be interested.

daniel-cortez-stevenson commented 8 months ago

I imagine adding Athena support could be out of scope for this repository.

@jtcohen6 can you confirm whether or not this feature is desirable?

If yes, I'd get it merge-ready.

nicor88 commented 8 months ago

There is actually even another PR https://github.com/dbt-labs/dbt-external-tables/pull/203 not sure what it's the difference.

Also looking at the supported engines I see:

rstml commented 8 months ago

203 looks very similar, but it also changes redshift helper for some reason https://github.com/dbt-labs/dbt-external-tables/pull/203/files#diff-462105eddcb198df9f7384b4bef97c092c55e93924896866db36c41373a811dc

I imagine adding Athena support could be out of scope for this repository.

Frankly, this is how I initially felt reading the docs.

However, there's great demand for something like dbt-external-tables for Athena/Trino/etc since data often loaded as files and there's a need to define metadata somewhere.

Maybe initially this wasn't intended use for this extension. But why not extend it to encompass all those databases/use cases?

I noticed there's a PR to add Trino support too: #153

brabster commented 8 months ago

@nicor88 the original PR was based on the original athena adapter, not the community one. When I wanted to use Athena with external tables last year, I saw the original was already stalled for a year so used it as a basis and made minimal changes to create #203 PR based on dbt-athena-community - which I asked for advice and merge with no response and we've been using at my client off my GitHub PR branch since last summer.

@rstml I can't remember why I had to change the redshift helper now, but I suspect it was because there was shared code between the two (in the original PR that I started from) and I had to make it deal with the serious funky quoting behaviour of Athena.

I think that now dbt-athena-community probably supports most of what's needed internally and patching into dbt-external-tables could be neater than either current PR - but I gave up putting any thought into it after being ignored for so long :shrug:

It would be great, one way or another, to properly support external tables in Athena!

nicor88 commented 7 months ago

@brabster thanks for the context, and It would be great, one way or another, to properly support external tables in Athena! I can't agree more.

@jtcohen6 @dataders could you consider to have a look at this?

brabster commented 5 months ago

@nicor88 FYI, bit of a warning on an issue I've had with my PR #203, dbt-athena-community will go and try to delete the data in S3 backing a table, with no way I can see to disable this behaviour from dbt-external-tables.

If you have laid an external table over some data in S3, then change from an external table to a normal model, it will try to delete the data underneath as part of the table drop. This is inappropriate and either results in breaking dbt run with permissions errors or potentially uncoverable loss of data.

nicor88 commented 5 months ago

@brabster f those external tables are treated as sources in dbt-athena-community nothing should happen, because they are managed outside dbt-athena, therefore there isn't any delete happening. Of course if a some point the user start to treat them inside dbt-athena, then there will be a delete underlying.

brabster commented 5 months ago

Yeah that's it. If you have a source external table called "transactions" over a bunch of files landing from somewhere, and you decide you want to put a model there instead, you change the name of the source to "transactions-external" and create a model "transactions".

The new external table "transactions-external" gets created fine, and the model "transactions" sees the existing table and dbt-athena-community starts deleting all the data underneath, which could be bad. I think that's the behaviour I've seen, anyway, just giving you a heads up about that potential issue to avoid :+1:

dataders commented 5 months ago

closing in favor of #203