aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

wr.athena.to_iceberg not using temp_path #2978

Closed lautarortega closed 1 month ago

lautarortega commented 1 month ago

Describe the bug

I created an Iceberg table in Athena through AWS SDK for pandas, then renamed a column through an Athena query. When I try to write more rows to the table, with the new column name in the DataFrame, I get this error:

QueryFailed: TYPE_MISMATCH: Insert query has mismatched column types: Table: [varchar, integer, varchar], Query: [varchar, integer, varchar, integer]. If a data manifest file was generated at 's3://aws-athena-query-results-494340620388-eu-west-1/68cb1983-852e-4f12-9d17-7af88520b02a-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

The S3 path mentioned in the error makes me believe it is not using the `temp_path` I passed as an argument.

How to Reproduce

import awswrangler as wr
import pandas as pd

data = {
    "first_name": ["John"],
    "age": [52],
    "city": ["Nashville"],
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

dtype = {"first_name": "string", "age": "int", "city": "string"}
wr.athena.to_iceberg(
    df=df,
    database=DATABASE,
    table=TABLE_NAME,
    table_location=f"s3://{BUCKET}/{TABLE_NAME}",
    temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
    schema_evolution=True,
    keep_files=False,
    dtype=dtype,
)
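(Editor's note: the mismatch can be illustrated offline without AWS access. The sketch below uses a plain dict to stand in for the table schema that the catalog would report; the helper and the schema values are assumptions for illustration, not part of the awswrangler API.)

```python
import pandas as pd


def extra_columns(df: pd.DataFrame, table_schema: dict) -> list:
    """Return DataFrame columns that the target table schema lacks.

    Columns in this list are what produce TYPE_MISMATCH /
    COLUMN_NOT_FOUND errors when the generated INSERT runs.
    """
    return [c for c in df.columns if c not in table_schema]


# Assumed table schema as the catalog would report it (3 columns).
schema = {"first_name": "string", "age": "int", "city": "string"}

# DataFrame carrying a fourth column the table does not know about,
# mirroring the [varchar, integer, varchar, integer] query signature
# in the error message.
df = pd.DataFrame({
    "first_name": ["John"],
    "age": [52],
    "city": ["Nashville"],
    "new_age": [52],
})

print(extra_columns(df, schema))  # -> ['new_age']
```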

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.10

AWS SDK for pandas version

3.9.1

Additional context

No response

kukushking commented 1 month ago

Hi @lautarortega, it looks like the table was not updated correctly. Can you share the query you used to update the table?

lautarortega commented 1 month ago

Hi @kukushking! Yes, sure. This is the command I used.

ALTER TABLE test_table CHANGE COLUMN new_age age int

kukushking commented 1 month ago

Thanks, I am able to reproduce this with the following snippet:

import awswrangler as wr
import pandas as pd

DATABASE = "default"
BUCKET = "<REDACTED>"
TABLE_NAME = "iceberg1"

data = {
    "first_name": ["John"],
    "city": ["Nashville"],
}
df = pd.DataFrame(data)

wr.athena.to_iceberg(
    df=df,
    database=DATABASE,
    table=TABLE_NAME,
    table_location=f"s3://{BUCKET}/{TABLE_NAME}",
    temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
    schema_evolution=True,
    keep_files=False,
)

wr.athena.start_query_execution(
    f"ALTER TABLE {TABLE_NAME} CHANGE COLUMN first_name new_first_name string",
    database=DATABASE,
)

data = {
    "new_first_name": ["Lily"],
    "city": ["Ontario"],
}
df = pd.DataFrame(data)

wr.athena.to_iceberg(
    df=df,
    database=DATABASE,
    table=TABLE_NAME,
    table_location=f"s3://{BUCKET}/{TABLE_NAME}",
    temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
    schema_evolution=True,
    keep_files=False,
)
Traceback (most recent call last):
  ...

    raise exceptions.QueryFailed(response["Status"].get("StateChangeReason"))
awswrangler.exceptions.QueryFailed: COLUMN_NOT_FOUND: Insert column name does not exist in target table: first_name. If a data manifest file was generated at 's3://<REDACTED>/f6a7953b-dbe7-4699-a5ce-2f13168f0253-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

Looking into the fix.

kukushking commented 1 month ago

This relates to https://github.com/apache/iceberg/issues/7584: Glue still displays the old columns as if they were present in the schema, so subsequent INSERT statements include columns that Iceberg no longer considers "current".
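(Editor's note: the drift described above can be sketched offline. The two schema dicts below are assumptions standing in for the Glue catalog view and the current Iceberg schema after the `CHANGE COLUMN first_name new_first_name` rename; they are illustrative, not actual catalog output.)

```python
# Glue catalog view: stale, still shows the pre-rename column name.
glue_schema = {"first_name": "string", "city": "string"}

# Iceberg's current schema after the rename.
iceberg_schema = {"new_first_name": "string", "city": "string"}

# Columns Glue reports that Iceberg no longer recognises. Any code that
# builds the INSERT column list from the Glue view ends up targeting
# these, which triggers the COLUMN_NOT_FOUND failure seen above.
stale = sorted(set(glue_schema) - set(iceberg_schema))
print(stale)  # -> ['first_name']
```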