Closed yaroslav-ost closed 8 months ago
Hi @moomindani ,
Could you please review it once you have a chance?
Thanks!
Thank you for your contribution.
To me, it makes sense to have tmp table location under custom_location path. How about adding unique random prefix or suffix to the location in case table is temp table instead of ignoring custom_location?
Could you please add unit/integ test for that?
Hi @moomindani,
Here is an example of custom_location: s3://s3_bucket/warehouse/my_table1
custom_location represents the location of the table where parquets files are stored. Making tmp tables share the same S3 location might lead to a risk of data overwriting. Ignoring custom_location for tmp table creation isn`t new approach for dbt-glue adapter. The iceberg incremental table uses the same strategy to avoid conflicts.
We already have an integration test for verifying the base incremental model. This must be already covered: https://github.com/aws-samples/dbt-glue/blob/1b4d2f12695ebd4607a80342d3fa9ba8d1b4fcc1/tests/functional/adapter/test_basic.py#L148C9-L148C25
Let me know your thoughts!
Thanks!
I meant that we still need some fix to make the temp table location unique and dedicated, but I do not think just ignoring custom location is good idea.
We already have an integration test for verifying the base incremental model. This must be already covered:
The integ test has succeeded without this fix. We need a separate test case to verify that this fix solves the issue.
Hi @moomindani,
Sorry for asking these questions, but I just want to clarify your idea.
but I do not think just ignoring custom location is good idea.
Other options besides using profile location will require fully reconsidering usage of custom_location for table location. custom_location already includes dedicated folder where the data for the table of the model is stored. Example: custom_location: s3://bucket/table_name/: file1.parquet file2.parquet file3.parquet
If we want to use the same custom_location for temp table, how will it look like in the example above? Will it just create tmp folder inside custom_location like s3://bucket/table_name/table_name_tmp/*? Will it be valid to create tmp table inside existing table from the storage organization?
Fmm. I do not want to have temp table location inside the actual table location.
I was thinking to add suffix like s3://<custom_location>_tmp_abcdef/*
but it's not very simple and I could not come up with simple solution to give temp table location. Please ignore that portion of my comments.
I think I can approve the PR after you add the integration test with iceberg.
Hi @moomindani, I have added the commit with the integration test. Could you please run the integration test once again to make sure everything works?
Thanks!
Thanks. I triggered the integ test here: https://github.com/aws-samples/dbt-glue/actions/runs/8383702333
The integ test failed. Could you please take a look? Of course, you can run the integ test in your environment by setting environment variables like following:
export DBT_AWS_ACCOUNT=123456789012
export DBT_S3_LOCATION=s3://<your-s3-location>
export DBT_GLUE_ROLE_ARN=arn:aws:iam::123456789012:role/GlueServiceRole
export DBT_GLUE_REGION=us-east-1
pytest tests/functional/
Hi @moomindani,
Sorry for asking you to run integration tests a third time, but I was able to run them on my personal account and everything must work now. Could you please run last time (fingers crossed) and merge the changes if everything is passed?
Thanks!
Nice, the integ test succeeded. I expect that this new integ test fails if there is not this change. Am I correct?
Could you also please try with iceberg table?
It seems that glue__location_clause
has specific logic for iceberg, so I am wondering if this change still works for iceberg tables too.
Hi @moomindani, My pull-request only brings the changes to "glue__create_tmp_table_as" function which is referenced only in the creation of the default table format. The default table format was covered in integration tests. As to the iceberg, it continues to reference "glue__location_clause" and it has its own chain of function calls for tmp table creation which doesn`t intersect with my change.
I agree with you that it would be better to cover all table formats like iceberg with integration tests since we have only the default one available right now, but let`s address this in another pull-request.
Thanks!
That makes sense. Thanks for the explanation.
Thank you for your contribution! Now I have merged your PR.
resolves #360
Description
This PR addresses the fix for a bug related to the usage of custom_location for tmp table creation for the default table format in incremental materialization. It was decided to use the location set up in the profile instead of writing to the same location where the original table resides.
Checklist
CHANGELOG.md
and added information about my change to the "dbt-glue next" section.By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.