aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Governed Tables Fails to create with catalog.create_parquet_table method #1049

Closed AdrianoNicolucci closed 2 years ago

AdrianoNicolucci commented 2 years ago

When I attempt to create a new governed table using the catalog.create_parquet_table method, I run into the error below. The governed table does not exist yet, so the method shouldn't be trying to run the "UpdateTable" operation. The method accepts GOVERNED as a table type, so as a user of the API, my assumption is that I can use it to create a new governed table.

InvalidInputException: An error occurred (InvalidInputException) when calling the UpdateTable operation: Glue UpdateTable operation cannot change TableType to or from GOVERNED.

jaidisido commented 2 years ago

Are you certain that the table does not exist? One explanation could be that another user has already created the table and has not granted Lake Formation permissions to other users. It would explain why you cannot see it although it does exist.

The below code reproduces the error:

import awswrangler as wr

# With table_type set to EXTERNAL_TABLE (default)
wr.catalog.create_parquet_table(
    database="my_db",
    table="my_tbl",
    path="s3://my_lf_bucket/my_tbl",
    columns_types={"col0": "int", "col1": "double"},
    compression="snappy",
)

# With table_type set to GOVERNED
wr.catalog.create_parquet_table(
    database="my_db",
    table="my_tbl",
    path="s3://my_lf_bucket/my_tbl",
    columns_types={"col0": "int", "col1": "double"},
    compression="snappy",
    table_type="GOVERNED",
)

The second call throws the InvalidInputException, but only because the first call already created a non-governed table with the same name.
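One way to guard against this is to look up the type of any existing table before calling create_parquet_table. The following is only a minimal sketch and not part of awswrangler: plan_table_action and existing_table_type are hypothetical helpers, and the rule they encode is inferred solely from the error message above ("cannot change TableType to or from GOVERNED"). The lookup assumes boto3 is installed and credentials are configured. It also cannot help if Lake Formation permissions hide the table from the caller.

```python
from typing import Optional


def plan_table_action(existing_type: Optional[str], requested_type: str) -> str:
    """Hypothetical helper: predict what a create call would run into.

    Returns "create" when no table is visible, "conflict" when the call would
    have to change TableType to or from GOVERNED, and "update" otherwise.
    """
    if existing_type is None:
        return "create"
    if existing_type != requested_type and "GOVERNED" in (existing_type, requested_type):
        return "conflict"
    return "update"


def existing_table_type(database: str, table: str) -> Optional[str]:
    """Hypothetical helper: return the Glue TableType, or None if no table is visible."""
    import boto3  # imported here so the planning logic above has no AWS dependency

    glue = boto3.client("glue")
    try:
        response = glue.get_table(DatabaseName=database, Name=table)
    except glue.exceptions.EntityNotFoundException:
        return None
    return response["Table"].get("TableType")
```

For the example above, plan_table_action(existing_table_type("my_db", "my_tbl"), "GOVERNED") would be expected to return "conflict" after the first call has run. In that case the existing table would need to be dropped first (for example with wr.catalog.delete_table_if_exists) or a different table name chosen.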

AdrianoNicolucci commented 2 years ago

Hi jaidisido,

Yes, I have confirmed that the table does not exist already. The user has admin access so I figured it would not be an issue. However, I wonder if this is a larger issue related to governed tables and IAM. I am not able to perform:

transaction_id = wr.lakeformation.start_transaction(read_only=False)

which gives this error: ClientError: An error occurred (ThrottlingException) when calling the StartTransaction operation (reached max retries: 5): Rate exceeded

When I attempt to find the required permissions in IAM in the console, the permissions simply don't exist:

image

jaidisido commented 2 years ago

Believe it or not, even a role with both LF and IAM admin permissions would not see a table created by another user in LF unless it has explicitly been granted permissions on it. Have you tried with a different table name altogether?

I don't think the Invalid Action error is of concern. I see it as well in my account and can still perform the actions and create tables. My guess is that IAM has simply not updated its policy validator tool, but the actions are still valid. Plus, if permissions were the problem, you would receive a 403 Access Denied error, not a throttling error.
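To make the distinction between throttling and permission errors concrete, here is a rough sketch of a retry wrapper that backs off only on throttling codes and re-raises everything else (such as an access-denied error) immediately. The names error_code, call_with_backoff and THROTTLING_CODES are hypothetical, and the set of codes treated as throttling is an assumption. It relies only on the fact that botocore's ClientError exposes the code under response["Error"]["Code"]. It does not address why StartTransaction was throttled in the first place.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: error codes treated as "retry later"; adjust for your service.
THROTTLING_CODES = {"ThrottlingException", "Throttling", "TooManyRequestsException"}


def error_code(exc: Exception) -> str:
    """Read the AWS error code from a botocore-style exception, or "" if absent."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponential backoff on throttling errors only."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            is_last_attempt = attempt == max_attempts - 1
            if error_code(exc) not in THROTTLING_CODES or is_last_attempt:
                raise  # permission errors and exhausted retries surface unchanged
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With awswrangler imported as wr, this could wrap the failing call as call_with_backoff(lambda: wr.lakeformation.start_transaction(read_only=False)).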

AdrianoNicolucci commented 2 years ago

Thanks, I had no idea the policy validator could be out of sync. I created the appropriate permissions to configure the data lake locations and permissions in Lake Formation, and I registered the S3 bucket as a data location in Lake Formation. I am no longer getting that same error.