aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.91k stars 700 forks

Lake Formation Table Creation misnames table causing EntityNotFoundException during wr.s3.to_parquet() #579

Closed fliverance closed 2 years ago

fliverance commented 3 years ago

Describe the bug

import awswrangler as wr
import pandas as pd

def write_table(name):
    # Write a small DataFrame as a governed (Lake Formation) table named `name`
    wr.s3.to_parquet(
        df=pd.DataFrame({
            'col': [1, 2, 3],
            'col2': ['A', 'A', 'B'],
        }),
        path="s3://lf-hubble-preview/gov_legislators/" + name,
        dataset=True,
        mode='overwrite',
        database='gov_legislators',
        table=name,
        table_type='GOVERNED'
    )

write_table("table_with_isolated_2_number") # Succeeds
write_table("table_with_combined_2NUM_number") # Fails

To Reproduce

Run the code above. The call whose table name concatenates a numeral and a string ("2NUM") fails with the following stack trace:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/fliver/Library/Application Support/JetBrains/IdeaIC2020.3/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Users/fliver/Library/Application Support/JetBrains/IdeaIC2020.3/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/fliver/workplace/PandasLakeformation/perf/main.py", line 86, in <module>
    write_table("table_with_combined_2NUM_number") # Fails
  File "/Users/fliver/workplace/PandasLakeformation/perf/main.py", line 72, in write_table
    wr.s3.to_parquet(
  File "/Volumes/workplace/PandasLakeformation/.env/lib/python3.9/site-packages/awswrangler/_config.py", line 418, in wrapper
    return function(**args)
  File "/Volumes/workplace/PandasLakeformation/.env/lib/python3.9/site-packages/awswrangler/s3/_write_parquet.py", line 611, in to_parquet
    paths, partitions_values = _to_dataset(
  File "/Volumes/workplace/PandasLakeformation/.env/lib/python3.9/site-packages/awswrangler/s3/_write_dataset.py", line 202, in _to_dataset
    del_objects: List[Dict[str, Any]] = _get_table_objects(
  File "/Volumes/workplace/PandasLakeformation/.env/lib/python3.9/site-packages/awswrangler/lakeformation/_utils.py", line 83, in _get_table_objects
    response = client_lakeformation.get_table_objects(**scan_kwargs)
  File "/Volumes/workplace/PandasLakeformation/.env/lib/python3.9/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Volumes/workplace/PandasLakeformation/.env/lib/python3.9/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the GetTableObjects operation: Table table_with_combined_2num_number not found.

If you run wr.catalog.tables(database="gov_legislators"), you'll find that the second table was created with an extra underscore, as 'table_with_combined_2_num_number'. The write then fails partway through: GetTableObjects is called with 'table_with_combined_2num_number' (see the error message above), which does not match the name under which the table was actually created.
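
The mismatch can be illustrated with a minimal sketch. The two helper functions below are hypothetical stand-ins, not the actual awswrangler code: one only lowercases the name, while the other also inserts an underscore between a lowercase letter or digit and a following uppercase letter (a common camel-case-to-snake-case rule). Under that assumption, the two agree on the first table name but diverge on the second, which reproduces the two names seen in the catalog and in the error message.

```python
import re


def lowercase_only(name: str) -> str:
    """Hypothetical sanitizer: lowercases the name and nothing else."""
    return name.lower()


def camel_to_snake(name: str) -> str:
    """Hypothetical sanitizer: adds '_' at lower/digit-to-upper boundaries, then lowercases."""
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()


for name in ("table_with_isolated_2_number", "table_with_combined_2NUM_number"):
    a, b = lowercase_only(name), camel_to_snake(name)
    status = "match" if a == b else "MISMATCH"
    print(f"{name!r}: {a!r} vs {b!r} -> {status}")
```

If different code paths sanitize the same user-supplied name in different ways like this, the table can be created under one name and looked up under another.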

jaidisido commented 3 years ago

Thank you for raising this. I can confirm that this is a bug. It is caused by the _sanitize_table_name(str) / _sanitize_name(str) methods, which are used to ensure that the passed argument is a valid Athena table name (e.g. no uppercase characters, accents stripped).

It looks like it returns the wrong output in this particular case, though. Working on a fix.
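
Until a fix is available, one possible precaution is to check table names before writing. The sketch below is only an assumption-based helper, not part of awswrangler: it treats a name as "stable" when it consists solely of lowercase ASCII letters, digits, and underscores, on the assumption (suggested by the first call in the report succeeding) that such names pass through sanitization unchanged. The function name `is_stable_table_name` is hypothetical.

```python
import re

# Assumption: names made only of lowercase ASCII letters, digits and
# underscores are left unchanged by table-name sanitization.
_STABLE_NAME = re.compile(r"[a-z0-9_]+")


def is_stable_table_name(name: str) -> bool:
    """Return True if `name` contains only lowercase letters, digits and underscores."""
    return _STABLE_NAME.fullmatch(name) is not None


for name in ("table_with_isolated_2_number", "table_with_combined_2NUM_number"):
    print(name, "->", "ok" if is_stable_table_name(name) else "may be renamed")
```

A name flagged by this check is not necessarily broken; it only means that sanitization may alter it, so the resulting table name should be verified in the catalog.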

jaidisido commented 3 years ago

This issue is linked to https://github.com/awslabs/aws-data-wrangler/issues/533 and should be addressed in the same release.