Closed daniel-cortez-stevenson closed 5 months ago
ci/circleci: integration-athena will fail as long as the environment variables are not set and the infra is not set up in the AWS account that circleci is running in.
I hope this is a wanted contribution to the code! TBH I did this for myself and then got inspired to contribute. I'm 💯 OK with running my own modified local dbt package, so no hard feelings if this is out of scope and closed. Cheers 🍾
The query-comment:
thing is troublesome ... any ideas how to fix this 'hack' a little more sustainably? Is it the job of this package, dbt-athena, dbt-core, or AWS Athena to resolve this?
@jtcohen6 now that we've resolved the issue with the query comment in the DBT Athena adapter, do we want to move forward with this PR?
The steps might be:
I have copied the macro's in this PR as overrides for the dbt-external-tables plugin to be able to read cloudtrail data, and the macro's worked great with Athena. Any idea how this PR should proceed to get merged in?
@daniel-cortez-stevenson @jtcohen6 is there any plan to move this PR forward? it's a pretty interesting feature that will be important to have.
I've been having a play with this PR content (against the dbt-athena-community adapter@1.3.5, which is the recommended one afaik now) - I'm having difficulties with later versions of dbt-athena-community. How are the dependencies on the adapters handled? Seems like a change could be unwittingly introduced in an adapter that could break this build and consumers of this package. (Aside, if I get it working against latest to my satisfaction I'll raise a new PR, assuming this one doesn't get sorted in the meantime)
would love to see this merge - right now it seems there is no good solution how to manage EXTERNAL TABLE
resources with dbt and Athena setup
what is left to do with this pr? can i help complete it?
It seems that this won't be merged anytime soon. Does anyone have a working solution with the latest dbt-athena version?
I'm looking for a solution to define source Athena tables and lftag permissions.
Yeah too bad that this PR is stuck.
I am using external tables in my day to day, I've installed this package:
- package: dbt-labs/dbt_external_tables
version: 0.8.2
Then added these lines to my dbt_project.yml
:
dispatch:
- macro_namespace: dbt_external_tables
search_order: [my_custom_dbt_macros, dbt_external_tables]
# Stage the external tables when dbt starts running, using https://github.com/dbt-labs/dbt-external-tables
on-run-start:
- "{{ dbt_external_tables.stage_external_sources() }}"
And created a private git repo with the macro from this PR, which I installed in packages.yml
:
- git: git@gitlab.com:XXX/dbt-macros.git
revision: 0.0.1
Which has a dbt_project.yml
:
name: 'my_custom_dbt_macros'
version: '1.0.0'
config-version: 2
require-dbt-version: ">=1.0.0"
And the files from this PR in a dbt_external_tables
subfolder.
Seems to work, but I only have a basic use case... (Cloudtrail S3 data)
Thanks @jessedobbelaere!
I'd rather had this merged, but in the mean time, if anyone needs this, I packaged the solution above at https://github.com/rstml/dbt-external-tables-athena
@rstml nice one, feel free to post this also in #db-athena in dbt slack workspace, I believe that some folks will be interested.
I imagine adding Athena support could be out of scope for this repository.
@jtcohen6 can you confirm whether or not this feature is desirable?
If yes, I'd get it merge-ready.
There is actually even another PR https://github.com/dbt-labs/dbt-external-tables/pull/203 not sure what it's the difference.
Also looking at the supported engines I see:
Azure SQL
that are all verified adapters. dbt-athena-community is now trusted not sure if this is a criteria used to merge such integration.
I imagine adding Athena support could be out of scope for this repository.
Frankly, this is how I initially felt reading the docs.
However, there's great demand for something like dbt-external-tables
for Athena/Trino/etc since data often loaded as files and there's a need to define metadata somewhere.
Maybe initially this wasn't intended use for this extension. But why not extend it to encompass all those databases/use cases?
I noticed there's a PR to add Trino support too: #153
@nicor88 the original PR was based on the original athena adapter, not the community one. When I wanted to use Athena with external tables last year, I saw the original was already stalled for a year so used it as a basis and made minimal changes to create #203 PR based on dbt-athena-community - which I asked for advice and merge with no response and we've been using at my client off my GitHub PR branch since last summer.
@rstml I can't remember why I had to change the redshift helper now, but I suspect it was because there was shared code between the two (in the original PR that I started from) and I had to make it deal with the serious funky quoting behaviour of Athena.
I think that now dbt-athena-community probably supports most of what's needed internally and patching into dbt-external-tables could be neater than either current PR - but I gave up putting any thought into it after being ignored for so long :shrug:
It would be great, one way or another, to properly support external tables in Athena!
@brabster thanks for the context, and It would be great, one way or another, to properly support external tables in Athena!
I can't agree more.
@jtcohen6 @dataders could you consider to have a look at this?
@nicor88 FYI, bit of a warning on an issue I've had with my PR #203, dbt-athena-community will go and try to delete the data in S3 backing a table, with no way I can see to disable this behaviour from dbt-external-tables.
If you have laid an external table over some data in S3, then change from an external table to a normal model, it will try to delete the data underneath as part of the table drop. This is inappropriate and either results in breaking dbt run with permissions errors or potentially uncoverable loss of data.
@brabster f those external tables are treated as sources in dbt-athena-community nothing should happen, because they are managed outside dbt-athena, therefore there isn't any delete happening. Of course if a some point the user start to treat them inside dbt-athena, then there will be a delete underlying.
Yeah that's it. If you have a source external table called "transactions" over a bunch of files landing from somewhere, and you decide you want to put a model there instead, you change the name of the source to "transactions-external" and create a model "transactions".
The new external table "transactions-external" gets created fine, and the model "transactions" sees the existing table and dbt-athena-community starts deleting all the data underneath, which could be bad. I think that's the behaviour I've seen, anyway, just giving you a heads up about that potential issue to avoid :+1:
closing in favor of #203
Description & motivation
Added AWS Athena so we can use AWS Athena external sources in DBT. This is nice for using public data sources like Open Street Maps and Facebook Data for Good Population Density.
Assumes use of the dbt-athena adapter
Considerations:
query-comment:
indbt_project.yml
must be set to nothing/*...*/
is not supported in Athena./integration_tests/public_data/json/**/*.json
modifiedNew env variables need to be set for CI:
Checklist