Tomme / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
Apache License 2.0

Specifying external_location isn't cleaning up the existing data. #107

Open · VDFaller opened this issue 2 years ago

VDFaller commented 2 years ago

Problem

I'm not understanding the accepted way to make external_location dynamic from dev to production.

Unfortunately I can't use the s3_staging_dir from profiles.yml because of a constraint from my IT department. Also, the standard {s3_staging_dir}/tables/uuid layout doesn't make it easy to tell which data belongs to which table when browsing S3.

Right now, if I run it twice, I get:

HIVE_PATH_ALREADY_EXISTS: Target directory for table 'dev_test.tbl_test' already exists: s3://my_external_location/PARQUET_STORE/dev_test/tbl_test. You may need to manually clean the data at location 's3://my_s3_staging_dir/tables/4d8dcf55-5fc0-424d-8668-0b6f182fb254' before retrying. Athena will not delete data in your account.

My tbl_test model is simply:

{{ config(materialized='table', external_location='s3://my_external_location/PARQUET_STORE/dev_test/tbl_test/')  }}

SELECT 'Test' AS "Fun"

I'm hoping I don't have to drop the table manually.
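For reference, the manual cleanup today amounts to dropping the table in Athena and then deleting the objects under the external location yourself (Athena won't touch the data, as the error message says). Roughly:

    -- Drops the catalog entry only; the objects under
    -- s3://my_external_location/PARQUET_STORE/dev_test/tbl_test/
    -- still have to be removed separately, e.g. with aws s3 rm --recursive.
    DROP TABLE IF EXISTS dev_test.tbl_test;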

Pipe dream

I'd love it if I could have the external_location in the form of

s3://my-bucket/some_extra_path/tbl_schema/tbl_name

where I supply the external_location prefix (a different one for dev, qa, and prod) and the model would know which schema and table name it is writing to without me having to configure them.
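The closest I can get today, as far as I can tell, is deriving the environment-specific prefix from target.name and the schema from target.schema, with the table name still spelled out by hand in each model. A sketch (the bucket naming here is just an illustration):

    {{ config(
        materialized='table',
        external_location='s3://my-bucket-' + target.name + '/some_extra_path/' + target.schema + '/tbl_name/'
    ) }}

    SELECT 'Test' AS "Fun"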

Possible fix

PR #74 seems to fix the issue.

ojaved-equip commented 2 years ago

The issue of tables not getting cleaned up was resolved by #73.

To make external_location dynamic across environments, I specify a different staging directory for dev vs. prod and use the target.s3_staging_dir variable when setting the external location:

{{ config(materialized='table', external_location=target.s3_staging_dir+'_tables/[table name]') }}

VDFaller commented 2 years ago

I saw that one, but I updated to the latest version of dbt and it's still happening. Has #73 not been released yet?

Also, I can't use the same bucket for staging as I do for external_location (a constraint from my IT department).
I tried adding an ex_loc var to my profile and doing

{{ config(materialized='table', external_location=target.ex_loc+target.schema+'/table_workorder_component/')  }}

But that's no bueno.

'ex_loc' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".
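From what I can tell, target only exposes the fields the adapter's credentials actually define, so a custom ex_loc key in profiles.yml never shows up there. One workaround might be dbt's env_var() function, assuming an environment variable (EXTERNAL_LOCATION_PREFIX is my own made-up name) is set to a different prefix per environment:

    {{ config(
        materialized='table',
        external_location=env_var('EXTERNAL_LOCATION_PREFIX') + target.schema + '/table_workorder_component/'
    ) }}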

ojaved-equip commented 2 years ago

Did you update just dbt or the dbt-athena plugin? I just updated the dbt-athena plugin and it's working fine for me.

I'm not sure if there's a way to get around the different-buckets constraint, but I did see that #50 is still open, which looks like what we really need.

VDFaller commented 2 years ago

I just double-checked and upgraded again, and it's still giving me the issue. From the code change in #73, it looks like it only applies to incremental tables, and mine is not incremental. Maybe that's the disconnect?

And yeah, #50 looks like what I'm looking for for the dev-environment side of things.

mrshu commented 2 years ago

@VDFaller #73 has already been released (in the sense that it's in the master branch). That change now applies to all tables, not just the incremental ones.

VDFaller commented 2 years ago

@mrshu Well, I'm still getting this issue. /shrug I'll try dbt-glue.

VDFaller commented 1 year ago

Okay, so I think I found the problem. clean_up_table is using my default AWS credentials (which don't have permission to get the table for this environment; they're not even remotely related, actually), so it's failing. And since the error code was AccessDeniedException rather than EntityNotFoundException, it failed blindly, cascading into something that was hard to debug.

I changed my profile to use the default credentials and updated those to the credentials that actually matter here, and now it runs fine, with or without an external location.

I also changed the function to the following, and it worked even with the default credentials being unrelated:

    @available
    def clean_up_table(
        self, database_name: str, table_name: str
    ):
        # Relies on the adapter module's existing imports/helpers:
        # re, boto3, botocore's ClientError, logger and boto3_client_lock.
        # Build the boto3 session from the connection's configured AWS profile
        # instead of the default credentials.
        conn = self.connections.get_thread_connection()
        session = boto3.session.Session(profile_name=conn._credentials.aws_profile_name)
        with boto3_client_lock:
            glue_client = session.client('glue')
        # Look up the table in Glue to find its storage location
        try:
            table = glue_client.get_table(
                DatabaseName=database_name,
                Name=table_name
            )
        except ClientError as e:
            if e.response['Error']['Code'] == 'EntityNotFoundException':
                logger.debug("Table '{}' does not exist - Ignoring", table_name)
                return
            # Re-raise anything else (e.g. AccessDeniedException) so the real
            # cause isn't hidden behind a later, unrelated error
            raise
        # Delete every S3 object under the table's location
        if table is not None:
            logger.debug("Deleting table data from '{}'", table["Table"]["StorageDescriptor"]["Location"])
            p = re.compile('s3://([^/]*)/(.*)')
            m = p.match(table["Table"]["StorageDescriptor"]["Location"])
            if m is not None:
                bucket_name = m.group(1)
                prefix = m.group(2)
                s3_resource = session.resource('s3')
                s3_bucket = s3_resource.Bucket(bucket_name)
                s3_bucket.objects.filter(Prefix=prefix).delete()

Edit: Oh, but this won't work for your CI pipeline since you use env vars. Crap.

Gatsby-Lee commented 1 year ago

@VDFaller Thank you for sharing your research in this issue!! I had a similar problem where something failed blindly.