aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Unable to use copy_from_files to load into a redshift table w/ an identity column. #2871

Closed: nlm4145 closed this issue 2 months ago

nlm4145 commented 3 months ago

Describe the bug

In reference to this earlier issue, it appears we are still unable to run copy_from_files when copying Parquet data into a Redshift table that has an IDENTITY column. The same load works with to_sql, but not with copy_from_files.
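For context, a minimal sketch of the to_sql path that does work, assuming an already-open redshift_connector connection and hypothetical table and schema names (presumably it succeeds because to_sql generates INSERT statements with an explicit column list, so the IDENTITY column is never referenced):

    import awswrangler as wr
    import pandas as pd

    # Hypothetical sample data; the IDENTITY column is intentionally
    # absent from the DataFrame so Redshift can generate its values.
    df = pd.DataFrame({"foo": ["a", "b", "c"]})

    wr.redshift.to_sql(
        df=df,
        con=redshift_con,   # assumed open redshift_connector connection
        table="my_table",   # hypothetical table with an IDENTITY column
        schema="public",
        mode="append",
    )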

How to Reproduce

  1. Upload Parquet data to S3.
  2. Create a Redshift table with an IDENTITY column for said file.
  3. Run awswrangler.redshift.copy_from_files() to copy the file into the target table (a sketch follows the error message below). This returns the error:

    redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': '42601', 'M': 'NOT NULL column without DEFAULT must be included in column list', 'F': '../src/pg/src/backend/commands/commands_copy.c', 'L': '2836', 'R': 'DoTheCopy'}
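A minimal sketch of the failing call, assuming hypothetical names (S3 prefix, connection, table, IAM role) that mirror the steps above:

    import awswrangler as wr

    # Assumed setup: Parquet files already uploaded under the prefix below,
    # and the target table created with an IDENTITY column, e.g.
    # CREATE TABLE public.my_table (id BIGINT IDENTITY(1, 1), foo VARCHAR(100));
    wr.redshift.copy_from_files(
        path="s3://my-bucket/staging/",  # hypothetical S3 prefix
        con=redshift_con,                # assumed open redshift_connector connection
        table="my_table",                # hypothetical target table
        schema="public",
        data_format="parquet",
        iam_role="arn:aws:iam::123456789012:role/my-redshift-role",  # hypothetical
    )
    # Raises the ProgrammingError shown above (code 42601): the COPY runs
    # without a column list, and the IDENTITY column is NOT NULL with no DEFAULT.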

Expected behavior

Expected the copy query to succeed and the resulting table to contain the same data as the parquet file, with the identity column auto-incrementing.

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.11.7

AWS SDK for pandas version

3.8.0

Additional context

No response

jaidisido commented 3 months ago

I have tried this snippet and it seems to do what you are describing without issue:

    import awswrangler as wr
    import pandas as pd

    # path, schema, redshift_table, redshift_con and databases_parameters
    # come from the surrounding test fixtures.
    df = pd.DataFrame({"foo": ["a", "b", "c"]})
    wr.s3.to_parquet(df, f"{path}test.parquet")

    # Create the target table with an auto-generated IDENTITY column.
    with redshift_con.cursor() as cursor:
        cursor.execute(
            f"""
            CREATE TABLE {schema}.{redshift_table} (
                id BIGINT IDENTITY(1, 1),
                foo VARCHAR(100),
                PRIMARY KEY(id)
            );
            """
        )

    # COPY the Parquet file into the table; the IDENTITY column is
    # populated by Redshift rather than read from the file.
    wr.redshift.copy_from_files(
        path=path,
        path_suffix=".parquet",
        con=redshift_con,
        table=redshift_table,
        data_format="parquet",
        schema=schema,
        iam_role=databases_parameters["redshift"]["role"],
    )
    df_out = wr.redshift.read_sql_table(redshift_table, redshift_con)

returns (note the id values are not consecutive: Redshift IDENTITY columns guarantee uniqueness, not sequential or ordered values)

       id foo
    0   4   b
    1   2   a
    2   6   c
2   6   c