aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretsManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Why does 3.7.0 s3.to_parquet require Glue:CreateTable permissions? #2708

Closed · sandra-selfdecode closed 6 months ago

sandra-selfdecode commented 6 months ago

After upgrading to 3.7.0 our lambda stopped working with the error:

[ERROR] AccessDeniedException: An error occurred (AccessDeniedException) when calling the CreateTable operation: User: arn:aws:sts::***:assumed-role**** is not authorized to perform: glue:CreateTable on resource: arn:aws:glue:us-east-1:****:catalog because no identity-based policy allows the glue:CreateTable action
Traceback (most recent call last):
  File "/var/lang/lib/python3.11/site-packages/sentry_sdk/integrations/aws_lambda.py", line 169, in sentry_handler
    reraise(*exc_info)
  File "/var/lang/lib/python3.11/site-packages/sentry_sdk/_compat.py", line 115, in reraise
    raise value
  File "/var/lang/lib/python3.11/site-packages/sentry_sdk/integrations/aws_lambda.py", line 160, in sentry_handler
    return handler(aws_event, aws_context, *args, **kwargs)
  File "/var/task/app.py", line 81, in handler
    response = write_parquet(df.drop(columns="GT Score"))
  File "/var/task/utils/db.py", line 20, in write_parquet
    return wr.s3.to_parquet(
  File "/var/lang/lib/python3.11/site-packages/awswrangler/_config.py", line 715, in wrapper
    return function(**args)
  File "/var/lang/lib/python3.11/site-packages/awswrangler/_utils.py", line 177, in inner
    return func(*args, **kwargs)
  File "/var/lang/lib/python3.11/site-packages/awswrangler/s3/_write_parquet.py", line 721, in to_parquet
    return strategy.write(
  File "/var/lang/lib/python3.11/site-packages/awswrangler/s3/_write.py", line 396, in write
    self._create_glue_table(**create_table_args)
  File "/var/lang/lib/python3.11/site-packages/awswrangler/s3/_write_parquet.py", line 289, in _create_glue_table
    return _create_parquet_table(
  File "/var/lang/lib/python3.11/site-packages/awswrangler/catalog/_create.py", line 307, in _create_parquet_table
    _create_table(
  File "/var/lang/lib/python3.11/site-packages/awswrangler/catalog/_create.py", line 155, in _create_table
    client_glue.create_table(**args)
  File "/var/lang/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/lang/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)

Why is adding this permission necessary? I do not want the Lambda to be able to create a new table in the database; I only want it to be able to add Parquet files to an existing table.
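
To make the intent concrete: the role is only meant to cover partition maintenance on existing tables, along these lines (a sketch with placeholder names, and an assumed action list rather than a verified minimal set):

from aws_cdk import aws_iam as iam

# Illustrative placeholders, not the real identifiers.
account_id = "111111111111"
database_name = "my_database"

# Allow partition maintenance on existing tables without granting
# glue:CreateTable. The action list is an assumption about what
# partition-only writes need.
partition_only = iam.PolicyStatement(
    actions=[
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions",
        "glue:BatchCreatePartition",
        "glue:BatchDeletePartition",
    ],
    resources=[
        f"arn:aws:glue:us-east-1:{account_id}:catalog",
        f"arn:aws:glue:us-east-1:{account_id}:database/{database_name}",
        f"arn:aws:glue:us-east-1:{account_id}:table/{database_name}/*",
    ],
)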

jaidisido commented 6 months ago

@sandra-selfdecode we had a breaking change in 3.7.0 that might be related to this, but I can't tell without more details. In short, Glue tables of type GOVERNED are no longer supported. Is your Glue table of that type, by any chance?

To help debug, can you please:

  1. Share an anonymised version of the API call you are making (e.g. wr.s3.to_parquet(path='s3://bucket/...'))?
  2. Turn on logging in your Lambda (a sketch follows this list)? You could then compare the logs between 3.6 and 3.7 to spot a difference. In particular, we would want to compare the logged table_input: my suspicion is that this new version detects some difference that leads to the new table-creation call.
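
A minimal way to surface awswrangler's debug logs from a Lambda handler (a sketch; assumes the default CloudWatch Logs wiring):

import logging

# Per the comment above, the generated table_input shows up in awswrangler's
# logs. The Python Lambda runtime usually pre-attaches a handler to the root
# logger, so raising the levels is typically enough to reach CloudWatch Logs.
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
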
sandra-selfdecode commented 6 months ago

Does GOVERNED include EXTERNAL? It's not Lake Formation, but it's made with this CDK code:

# Requires: from aws_cdk.aws_glue import CfnTable
# (scope, id, table_name, description, partition_keys, columns and location
# are supplied by the enclosing factory function.)
return CfnTable(
    scope,
    id,
    catalog_id=ENVIRONMENT.aws_account_id,
    database_name=ATHENA_DATABASE_NAME,
    table_input=CfnTable.TableInputProperty(
        name=table_name,
        description=description,
        parameters={
            "EXTERNAL": "TRUE",
            "has_encrypted_data": False,
            "parquet.compression": "GZIP",
        },
        partition_keys=[
            CfnTable.ColumnProperty(name=key[0], type=key[1])
            for key in partition_keys
        ]
        if partition_keys
        else None,
        storage_descriptor=CfnTable.StorageDescriptorProperty(
            columns=[
                CfnTable.ColumnProperty(name=column[0], type=column[1])
                for column in columns
            ],
            input_format="org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            output_format="org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            location=location,
            serde_info=CfnTable.SerdeInfoProperty(
                serialization_library="org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            ),
        ),
        table_type="EXTERNAL_TABLE",
    ),
)
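
To rule out GOVERNED, the registered type can also be checked directly against the catalog (a sketch using boto3; the table identifiers are reused from the CDK snippet above):

import boto3

# EXTERNAL_TABLE and GOVERNED are distinct TableType values in Glue.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName=ATHENA_DATABASE_NAME, Name=table_name)["Table"]
print(table["TableType"])  # an EXTERNAL_TABLE here rules out GOVERNED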

Here is the write command:

# Requires: import awswrangler as wr
# (df, partition, column_dtypes, DATABASE_NAME and TABLE come from the caller.)
return wr.s3.to_parquet(
    df=df,
    dataset=True,
    compression="gzip",
    use_threads=True,
    partition_cols=[partition],
    database=DATABASE_NAME,
    table=TABLE,
    dtype=column_dtypes,
    mode="overwrite_partitions",
)
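
Since the suspicion above is that a detected schema difference triggers the CreateTable call, diffing the catalog's registered column types against the dtype argument might narrow it down (a sketch using wr.catalog.get_table_types; names reused from the snippet above):

import awswrangler as wr

# Compare what the Glue catalog already has with what we pass to to_parquet;
# any mismatch here is a candidate trigger for the table re-creation in 3.7.0.
catalog_types = wr.catalog.get_table_types(database=DATABASE_NAME, table=TABLE) or {}
for col, dtype in column_dtypes.items():
    if catalog_types.get(col) != dtype:
        print(f"{col}: catalog={catalog_types.get(col)!r} vs dtype arg={dtype!r}")
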
jaidisido commented 6 months ago

We believe we have identified the issue; it should be fixed by #2711. A patch release (3.7.1) will follow.
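
Until the patch lands, a quick sanity check of which version a given Lambda is actually running (trivial, but Lambda layers make this easy to get wrong):

import awswrangler as wr

# Anything on 3.7.0 is affected until the 3.7.1 patch release ships.
print(wr.__version__)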