aws-samples / dbt-glue

This repository contains de dbt-glue adapter
Apache License 2.0
91 stars 59 forks source link

Fix tmp table location for the default file_format when custom_location is provided #361

Closed yaroslav-ost closed 3 months ago

yaroslav-ost commented 3 months ago

resolves #360

Description

This PR addresses the fix for a bug related to the usage of custom_location for tmp table creation for the default table format in incremental materialization. It was decided to use the location set up in the profile instead of writing to the same location where the original table resides.

Checklist

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

yaroslav-ost commented 3 months ago

Hi @moomindani ,

Could you please review it once you have a chance?

Thanks!

moomindani commented 3 months ago

Thank you for your contribution.

To me, it makes sense to have tmp table location under custom_location path. How about adding unique random prefix or suffix to the location in case table is temp table instead of ignoring custom_location?

Could you please add unit/integ test for that?

yaroslav-ost commented 3 months ago

Hi @moomindani,

Here is an example of custom_location: s3://s3_bucket/warehouse/my_table1 image

custom_location represents the location of the table where parquets files are stored. Making tmp tables share the same S3 location might lead to a risk of data overwriting. Ignoring custom_location for tmp table creation isn`t new approach for dbt-glue adapter. The iceberg incremental table uses the same strategy to avoid conflicts.

We already have an integration test for verifying the base incremental model. This must be already covered: https://github.com/aws-samples/dbt-glue/blob/1b4d2f12695ebd4607a80342d3fa9ba8d1b4fcc1/tests/functional/adapter/test_basic.py#L148C9-L148C25

Let me know your thoughts!

Thanks!

moomindani commented 3 months ago

I meant that we still need some fix to make the temp table location unique and dedicated, but I do not think just ignoring custom location is good idea.

We already have an integration test for verifying the base incremental model. This must be already covered:

The integ test has succeeded without this fix. We need a separate test case to verify that this fix solves the issue.

yaroslav-ost commented 3 months ago

Hi @moomindani,

Sorry for asking these questions, but I just want to clarify your idea.

but I do not think just ignoring custom location is good idea.

Other options besides using profile location will require fully reconsidering usage of custom_location for table location. custom_location already includes dedicated folder where the data for the table of the model is stored. Example: custom_location: s3://bucket/table_name/: file1.parquet file2.parquet file3.parquet

If we want to use the same custom_location for temp table, how will it look like in the example above? Will it just create tmp folder inside custom_location like s3://bucket/table_name/table_name_tmp/*? Will it be valid to create tmp table inside existing table from the storage organization?

moomindani commented 3 months ago

Fmm. I do not want to have temp table location inside the actual table location. I was thinking to add suffix like s3://<custom_location>_tmp_abcdef/* but it's not very simple and I could not come up with simple solution to give temp table location. Please ignore that portion of my comments.

I think I can approve the PR after you add the integration test with iceberg.

yaroslav-ost commented 3 months ago

Hi @moomindani, I have added the commit with the integration test. Could you please run the integration test once again to make sure everything works?

Thanks!

moomindani commented 3 months ago

Thanks. I triggered the integ test here: https://github.com/aws-samples/dbt-glue/actions/runs/8383702333

moomindani commented 3 months ago

The integ test failed. Could you please take a look? Of course, you can run the integ test in your environment by setting environment variables like following:

export DBT_AWS_ACCOUNT=123456789012
export DBT_S3_LOCATION=s3://<your-s3-location>
export DBT_GLUE_ROLE_ARN=arn:aws:iam::123456789012:role/GlueServiceRole
export DBT_GLUE_REGION=us-east-1 
pytest tests/functional/
yaroslav-ost commented 3 months ago

Hi @moomindani,

Sorry for asking you to run integration tests a third time, but I was able to run them on my personal account and everything must work now. Could you please run last time (fingers crossed) and merge the changes if everything is passed?

Thanks!

moomindani commented 3 months ago

Reran: https://github.com/aws-samples/dbt-glue/actions/runs/8422396850

moomindani commented 3 months ago

Nice, the integ test succeeded. I expect that this new integ test fails if there is not this change. Am I correct?

Could you also please try with iceberg table? It seems that glue__location_clause has specific logic for iceberg, so I am wondering if this change still works for iceberg tables too.

yaroslav-ost commented 3 months ago

Hi @moomindani, My pull-request only brings the changes to "glue__create_tmp_table_as" function which is referenced only in the creation of the default table format. The default table format was covered in integration tests. image As to the iceberg, it continues to reference "glue__location_clause" and it has its own chain of function calls for tmp table creation which doesn`t intersect with my change.

I agree with you that it would be better to cover all table formats like iceberg with integration tests since we have only the default one available right now, but let`s address this in another pull-request.

Thanks!

moomindani commented 3 months ago

That makes sense. Thanks for the explanation.

moomindani commented 3 months ago

Thank you for your contribution! Now I have merged your PR.