aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

athena.to_iceberg function is not deleting temp_table_xxxxx properly in Athena #2826

Open Jasonxdy opened 1 month ago

Jasonxdy commented 1 month ago

Describe the bug

When calling the to_iceberg function, I found that some of the temp tables it creates (e.g. temp_table_xxxxx) persist in Athena. According to CloudTrail, the Glue DeleteTable API call failed to delete the temp table with an EntityNotFoundException. The CreateTable and DeleteTable APIs were called at nearly the same time, and since the CreateTable API is async, I assume the DeleteTable call failed because the temp table had not been created yet.

I suggest adding logic to the to_iceberg function that checks whether the table has been created before calling the DeleteTable API.
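A minimal sketch of that suggestion, assuming the cleanup step can poll for the table before deleting it. The helper names `delete_when_visible` and `cleanup_temp_table`, the retry count, and the delay are hypothetical; this is not how to_iceberg is implemented. The only library calls used are `wr.catalog.does_table_exist` and `wr.catalog.delete_table_if_exists`.

```python
import time
from typing import Callable


def delete_when_visible(
    exists: Callable[[], bool],
    delete: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> bool:
    """Poll until the table is visible, then delete it.

    Returns the result of the delete call, or False if the table
    never became visible within the given number of attempts.
    """
    for attempt in range(attempts):
        if exists():
            return delete()
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # wait before checking again
    return False


def cleanup_temp_table(database: str, table: str) -> bool:
    """Hypothetical wiring to awswrangler (needs AWS credentials to run)."""
    import awswrangler as wr  # lazy import: the helper above has no AWS dependency

    return delete_when_visible(
        exists=lambda: wr.catalog.does_table_exist(database=database, table=table),
        delete=lambda: wr.catalog.delete_table_if_exists(database=database, table=table),
    )
```

Passing the existence check and the delete call in as callables keeps the retry logic testable without touching AWS.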

How to Reproduce

import awswrangler as wr
import pandas as pd

# Placeholder data; the original report did not include the DataFrame.
df = pd.DataFrame({"id": [1, 2, 3]})

wr.athena.to_iceberg(
    df=df,
    database="sample_database",
    table="sample_iceberg_table",
    temp_path="s3://sample_bucket/temp_dir",
    keep_files=False,
)

Expected behavior

The temp_table_xxxxx table should be deleted from Athena once to_iceberg finishes.

Your project

No response

Screenshots

No response

OS

Windows

Python version

-

AWS SDK for pandas version

-

Additional context

No response

LeonLuttenberger commented 1 month ago

Hey,

I haven't been able to reproduce this issue. After the temp table is created, it's referenced in the Athena query that follows. So if the temp table had not been created in time, I would expect the Athena query to fail before the Glue table deletion. Secondly, the Glue table deletion method checks whether the table exists, so it should skip the deletion step if it can't find the table.
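For illustration, the "skip if missing" behaviour described above can be sketched as a guarded delete. This is an assumption about the general pattern, not the library's actual code: the function name `delete_table_if_present` is hypothetical, and `glue_client` is expected to be a boto3 Glue client (`delete_table` and `exceptions.EntityNotFoundException` are part of that client).

```python
from typing import Any


def delete_table_if_present(glue_client: Any, database: str, table: str) -> bool:
    """Delete a Glue table, treating 'not found' as a no-op.

    Returns True if the table was deleted, False if Glue reported it missing.
    """
    try:
        glue_client.delete_table(DatabaseName=database, Name=table)
    except glue_client.exceptions.EntityNotFoundException:
        # The table is already gone or was never created: nothing to clean up.
        return False
    return True
```

With a real client this would be called as `delete_table_if_present(boto3.client("glue"), "sample_database", "temp_table_xxxxx")`. Note that a guard like this hides the EntityNotFoundException from the caller, while the failed DeleteTable call still appears in CloudTrail.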

Can you please share a stack trace for when this error occurs, along with the Python and AWS SDK for pandas versions that you are using?

Best regards, Leon
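One possible way to collect the requested details is sketched below. The helper names `environment_report` and `run_and_capture` are hypothetical; the sketch assumes the library is installed under the distribution name `awswrangler` and uses only the Python standard library.

```python
import platform
import sys
import traceback
from importlib import metadata
from typing import Callable, Optional


def environment_report() -> dict:
    """Collect the Python, OS, and AWS SDK for pandas versions."""
    try:
        wrangler_version = metadata.version("awswrangler")
    except metadata.PackageNotFoundError:
        wrangler_version = "not installed"
    return {
        "python": sys.version.split()[0],
        "os": platform.system(),
        "awswrangler": wrangler_version,
    }


def run_and_capture(func: Callable[[], object]) -> Optional[str]:
    """Run func and return the formatted stack trace if it raises, else None."""
    try:
        func()
    except Exception:
        return traceback.format_exc()
    return None


print(environment_report())
```

Wrapping the failing `wr.athena.to_iceberg(...)` call in a zero-argument function and passing it to `run_and_capture` would yield the stack trace as a string that can be pasted into the issue.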