aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

wr.lakeformation.read_sql_query() on table with deleted S3 files returns empty dataframe instead of error #626

Closed · fliverance closed this issue 2 years ago

fliverance commented 3 years ago

Describe the bug

When using Wrangler to write a Parquet file to a governed table, the upload can fail partway through (e.g. if the network is interrupted). This can leave the Lake Formation catalog referencing an S3 file that does not actually exist. When the table is subsequently read, Wrangler returns an empty DataFrame instead of throwing an exception, leading to silent failures, data corruption, etc. Wrangler should be able to identify that the Parquet file registered with Lake Formation was not actually found, fail fast, and throw an exception instead.
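A fail-fast check along these lines could live either in Wrangler or in caller code. The sketch below is hypothetical and not part of Wrangler's API; it assumes the governed table's object manifest is readable through the Lake Formation GetTableObjects API, and verifies each registered object with wr.s3.does_object_exist:

```python
import boto3
import awswrangler as wr


def validate_governed_table_objects(database: str, table: str) -> None:
    """Hypothetical helper: raise if the governed table's manifest
    references S3 objects that no longer exist."""
    lf = boto3.client("lakeformation")
    tx_id = lf.start_transaction(TransactionType="READ_ONLY")["TransactionId"]
    try:
        token = None
        while True:
            kwargs = {"DatabaseName": database, "TableName": table, "TransactionId": tx_id}
            if token:
                kwargs["NextToken"] = token
            page = lf.get_table_objects(**kwargs)
            # Each entry groups the table objects registered for one partition.
            for partition in page.get("Objects", []):
                for obj in partition.get("Objects", []):
                    if not wr.s3.does_object_exist(obj["Uri"]):
                        raise FileNotFoundError(
                            f"Catalog references a missing S3 object: {obj['Uri']}"
                        )
            token = page.get("NextToken")
            if not token:
                break
    finally:
        lf.commit_transaction(TransactionId=tx_id)
```

A check like this trades extra S3 HEAD requests for the fail-fast behavior described above.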

To Reproduce

VENV setup:

```bash
BRANCH=main-governed-tables

VENV=./venv
rm -rf $VENV
python3.9 -m venv $VENV && source $VENV/bin/activate
pip3.9 install git+https://github.com/awslabs/aws-data-wrangler@$BRANCH

pip3.9 install numpy pandas faker aiobotocore[boto3] fsspec s3fs

# Also remember to update botocore to match the main-governed-tables APIs
```

Example:

```python
import awswrangler as wr
import pandas as pd

table_name = "whatever"
database_name = "some_database"
s3_prefix = "s3://somewhere-good/"

# Any sample data will do.
some_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

wr.s3.to_parquet(
    some_df,
    path=s3_prefix + table_name,
    dataset=True,
    mode="overwrite",
    database=database_name,
    table=table_name,
    table_type="GOVERNED",
)

# <go delete the created S3 file in the AWS console>

should_be_same_df = wr.lakeformation.read_sql_query(
    sql=f"SELECT * FROM {table_name}", database=database_name
)

# returns an empty DataFrame instead of a 'file not found' or some other error
```


jaidisido commented 3 years ago

After running some tests, it appears that the Lake Formation engine does not return work units for an empty Glue table.

To reproduce:

```python
import awswrangler as wr

wr.catalog.create_parquet_table(
    database="my_db",
    table="my_empty_table",
    path="s3://my_bucket/my_prefix",
    table_type="GOVERNED",
    columns_types={"name": "string", "value": "bigint"},
)
```

The above command effectively creates an empty table in the AWS Glue catalog.

Then, when running a Lake Formation query against this table and attempting to get work units from the engine, none are returned. Without work units, an Arrow table cannot be obtained and the schema cannot be inferred, so an empty list is returned instead of a DataFrame.
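This can be observed directly with boto3's Lake Formation client. A minimal sketch, assuming the governed-tables query planning APIs (start_query_planning, get_query_state, get_work_units) and the my_db / my_empty_table names from above:

```python
import time

import boto3

lf = boto3.client("lakeformation")

# Plan a query against the empty governed table.
query_id = lf.start_query_planning(
    QueryPlanningContext={"DatabaseName": "my_db"},
    QueryString="SELECT * FROM my_empty_table",
)["QueryId"]

# Wait for query planning to finish before asking for work units.
while lf.get_query_state(QueryId=query_id)["State"] == "PENDING":
    time.sleep(1)

# For an empty table the engine returns no work unit ranges, so there is
# nothing to fetch and no schema to infer from.
print(lf.get_work_units(QueryId=query_id)["WorkUnitRanges"])  # -> []
```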