dbt-labs / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
https://dbt-athena.github.io
Apache License 2.0

[Bug] When creating a dbt model to replace the refresh of an existing Athena table, dbt deletes historic data in the S3 path #698

Closed (mafreihaut closed this issue 2 months ago)

mafreihaut commented 3 months ago

Is this a new bug in dbt-athena?

Current Behavior

Text from the db-athena Slack channel:

I'm using Athena and trying to move my historical reports to dbt, but I've hit a snag. My current config uses incremental materialization with the insert_overwrite strategy, partitioned by day and hour, and pointing to an external S3 location.
The problem is that it's deleting past data in S3. I want to keep the table's history intact while moving to dbt for incremental updates. Also, some reports get updated at the end of the day for earlier hours within the same day.
I'm aiming to:
Keep existing historical data in S3
Use DBT for new incremental updates
Only affect the intended day and hour partitions in each run

and

A bit of discovery on my end: if the table doesn't exist, dbt.adapters.athena.constants adapter: Deleting table data: path= shows up in the logs, clearing out the existing S3 data.
If the table does exist, it produces the desired incremental update in the location I wanted.

Expected Behavior

Steps To Reproduce

Model Config:

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    format='parquet',
    partitioned_by=['day', 'hour'],
    external_location='s3://bucket/key/existing-data'
) }}
1. Run the model and observe the data in external_location.
2. Examine the log files: [Thread-1 (]: dbt.adapters.athena.constants adapter: Deleting table data: path='s3://bucket/key/existing-data', bucket='bucket', prefix='key/existing-data/'
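
The model SQL itself was not posted; a minimal hypothetical body that would exercise this config (source and column names are placeholders, and Athena requires the partition columns to come last in the select list) could look like:

select
    report_id,
    day,
    hour
from {{ source('raw', 'reports') }}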

Environment

- OS: 14.3.1 (also reproduced in AWS MWAA)
- Python: 3.11
- dbt: 1.8.1
- dbt-athena-community: 1.8.1

Additional Context

Anders pointed me toward creating a bug ticket for this on Slack. I think I covered it pretty well, but please contact me if any additional information is needed. Thanks!

nicor88 commented 3 months ago

It's hard to reproduce the issue if you don't post the SQL code used in your transformation.

insert_overwrite works as intended, and it's covered by functional tests that verify the behavior is correct. I believe the issue you are facing is due to how you refresh the data incrementally in your model, so please post your full model SQL so we can help you fix the issue properly.

Also, as a possible hint: it looks to me like you might benefit from an Iceberg table rather than a classic Hive table for what you want to achieve.
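
For reference, a minimal sketch of what an Iceberg-based incremental config could look like in dbt-athena (the unique_key columns here are assumptions and must uniquely identify a row in your data):

{{ config(
    materialized='incremental',
    table_type='iceberg',
    incremental_strategy='merge',
    unique_key=['report_id', 'day', 'hour'],
    partitioned_by=['day', 'hour']
) }}

With merge on an Iceberg table, re-running a period updates matching rows in place instead of dropping and rewriting whole partitions, which avoids the data-deletion concern described above.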

nicor88 commented 3 months ago

I recommend having a look here: https://docs.getdbt.com/docs/build/incremental-models

In your model you need to add a conditional like this, for example:

{% if is_incremental() %}

where event_date >= current_date - interval '1' day

{% endif %}

Doing so, on your first run all the data is processed, but on the second run only the partitions produced by your where condition are overwritten by the model. That way you can achieve what you stated as "Only affect the intended day and hour partitions in each run".
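
Putting the config and the incremental filter together, a minimal sketch of a complete model (the source, column names, and timestamp logic are assumptions; adapt them to your data):

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    format='parquet',
    partitioned_by=['day', 'hour'],
    external_location='s3://bucket/key/existing-data'
) }}

select
    report_id,
    event_timestamp,
    date_format(event_timestamp, '%Y-%m-%d') as day,
    date_format(event_timestamp, '%H') as hour
from {{ source('raw', 'events') }}

{% if is_incremental() %}
  where event_timestamp >= current_date - interval '1' day
{% endif %}

On the first run the whole source is processed; on subsequent runs only the day/hour partitions selected by the where clause are rewritten, leaving older partitions untouched.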

nicor88 commented 2 months ago

Based on the discussion that happened in Slack, it looks like the issue comes from an attempt to modify a table created outside dbt. Due to a mismatch in the schema (most probably caused by the generate_schema_name macro), dbt doesn't find the existing table and therefore deletes the data in the specified location before re-creating the table.

I believe this is not a bug per se but rather an issue with the user's configuration.
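
If the mismatch does come from schema resolution, the standard dbt pattern is to override generate_schema_name in your project so a custom schema is used verbatim instead of being prefixed with target.schema; a minimal sketch (this is the variant shown in the dbt docs):

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

With this in place, dbt resolves the model to the schema the existing table actually lives in, finds the relation, and takes the incremental path instead of re-creating the table.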