aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Calling athena.to_iceberg can lead to unexpected permission related issues due to default query output location #2710

Closed NBaySellier closed 6 months ago

NBaySellier commented 6 months ago

Describe the bug

We cannot pass s3_output to athena.to_iceberg.

Within athena.to_iceberg, the function _start_query_execution is called in multiple places, and none of those calls passes an s3_output parameter.

Instead, a default s3_output is constructed from the boto3_session, based on the account_id and region. This can lead to unexpected access-related issues, as the caller may not have access to this bucket.

For example:

InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Unable to verify/create output bucket aws-athena-query-results-XXXX-YYYY

This could be fixed by allowing the user to explicitly pass s3_output to athena.to_iceberg which is then passed down to the corresponding function calls.
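To make the proposal concrete, here is a minimal sketch of the propagation pattern. It is not the library's actual code: default_s3_output, start_query_execution and to_iceberg_sketch are hypothetical stand-ins, and the fallback name only follows the aws-athena-query-results-ACCOUNT-REGION pattern seen in the error above. No AWS calls are made; the helper just builds the request it would send.

```python
from typing import Dict, List, Optional


def default_s3_output(account_id: str, region: str) -> str:
    """Hypothetical fallback: bucket name derived from account and region."""
    return f"s3://aws-athena-query-results-{account_id}-{region}/"


def start_query_execution(
    sql: str,
    account_id: str,
    region: str,
    s3_output: Optional[str] = None,
) -> Dict[str, object]:
    """Hypothetical internal helper: use the default only if nothing was passed."""
    output = s3_output if s3_output is not None else default_s3_output(account_id, region)
    # Build (but do not send) the request that would go to Athena.
    return {"QueryString": sql, "ResultConfiguration": {"OutputLocation": output}}


def to_iceberg_sketch(
    statements: List[str],
    account_id: str,
    region: str,
    s3_output: Optional[str] = None,
) -> List[Dict[str, object]]:
    """Hypothetical public entry point: forward s3_output to every internal call."""
    return [
        start_query_execution(sql, account_id, region, s3_output=s3_output)
        for sql in statements
    ]
```

The important part is that every internal call receives the caller's s3_output, so the account/region default is only used when the caller passes nothing.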

How to Reproduce

The error only occurs if the caller lacks the permissions needed (e.g. for StartQueryExecution) to use the default bucket aws-athena-query-results-ACCOUNT-REGION.

Call

import awswrangler
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# database, table_location, temp_table_path and boto3_session are supplied by the caller
awswrangler.athena.to_iceberg(df, database=database, table='iceberg_table_test', table_location=table_location, temp_path=temp_table_path, boto3_session=boto3_session)

Expected behavior

No access-related issues caused by an automatically generated default bucket. Instead, I should be able to pass the s3_output location myself.

Your project

No response

Screenshots

No response

OS

Mac M1

Python version

3.9

AWS SDK for pandas version

3.7

Additional context

No response

jaidisido commented 6 months ago

Sounds fair to me, should be addressed in #2727

NBaySellier commented 6 months ago

@jaidisido Forgive me if I'm wrong but in that MR, the s3_output is not actually being propagated to any of the functions that are being called within to_iceberg, right? Therefore the actual functionality doesn't seem to have changed?