aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.87k stars 680 forks source link

Question regarding wr.athena.to_iceberg #2782

Closed rossiellog closed 3 months ago

rossiellog commented 4 months ago

I am having a hard time trying to figure where the to_iceberg method tries to create and subsequently destroy the temporary table needed for the INSERT INTO … SELECT statement.

wr.athena.to_iceberg(df=data, index=True, database=os.getenv("GLUE_DATABASE"), table=os.getenv("GLUE_TABLE").lower(), merge_cols=[time_measure], workgroup=athena_workgroup, encryption="SSE-KMS", kms_key=kms_key)

This is how I am using the method. I have set up the IAM role to have necessary IAM permissions on the destination bucket, the corresponding glue db and table, as well as LakeFormation permissions on said db and table. The code still raises the following exceptions.

botocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the GetTable operation: Insufficient Lake Formation permission(s) on temp_table_dca47e409f4a494781e27ea08cc1f74c

botocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the DeleteTable operation: Insufficient Lake Formation permission(s): Required Drop on temp_table_dca47e409f4a494781e27ea08cc1f74c

I was wondering, could it be that the temp tables are created in a different db on which I did not set enough LakeFormation permissions? I tried adding LakeFormation permissions to the relevant IAM role even on default db, but nothing changed.

Any help is appreciated!

jaidisido commented 4 months ago

The temporary table is created/deleted from the same database. As the temp table has a random name, you must have Lake Formation permissions to create, describe and delete any table in that database

rossiellog commented 4 months ago

@jaidisido Thanks! I finally got what the problem was thanks to your answer. Now I am facing a different problem, but looking at wrangler's source code I think I know what's happening. Basically, it seems that the temporary table is missing a column. This column is in fact the index of the pandas dataframe I am trying to merge into the Iceberg table. Check this. Unlike the final merge operation that takes in consideration the index parameter given as input to the to_iceberg method, the linked invocation of to_parquet does not pass the value of the parameter. Since the default is False, the temporary table will not be aligned with the final one in this particular case. This leads to the error I am having. Now, I guess I can reset the dataframe index in order to retrieve the index as a column, but I think it would be nice to fix this. Let me know if my assumption is correct, or if I missed some other line that covers this case!

jaidisido commented 4 months ago

The to_iceberg API is already significantly overloaded with parameters. We would prefer not to add yet another parameter especially since as you mentioned you can reset the index before calling the method

rossiellog commented 4 months ago

I understand, but that's not the point I was trying to make. You do not need to add a new parameter, you just have to pass the index parameter from outer method to inner method. If you'd like, I can fork and add a pull request so that you can check what I mean and evaluate if it makes sense.