aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.94k stars 701 forks source link

Duplicates values in identity column when mixing redshift.copy and redshift.to_sql #2996

Open laderjk opened 1 month ago

laderjk commented 1 month ago

Hello everyone! We are having an issue with duplicated values in an identity column when executing redshift.to_sql() after executing a redshift.copy(). When trying to insert more than 1000 records we use the COPY method and when trying to insert less than 1000 records we use the redshift.to_sql() method as is recommended in the docs.

This is the structure of the table:

create table if not exists dim_testing
(
    id BIGINT IDENTITY NOT NULL PRIMARY KEY,
    name varchar
);

This is the code in which we are able to reproduce the issue:

import random
import string

import awswrangler as wr
import pandas as pd

def generate_df(length):
    return pd.DataFrame([''.join(random.choices(string.ascii_letters + string.digits, k=30)) for _ in range(length)], columns=['name'])

df = generate_df(1500)
wr.redshift.copy(
    df=df,
    con=redshift.conn,
    path=f's3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

df = generate_df(500)
wr.redshift.to_sql(
    df,
    redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

After executing that code if we look for duplicates by the identity column id, we get duplicated values:

select id, count(*)
from dim_testing
group by id
having count(*) > 1;

If we only use to_sql() or copy(), the issue does not happens. It happens when executing to_sql() after a copy(). Is this normal? Are we doing something wrong? Should we only chose one write method and not use the other?

Thanks!

kukushking commented 6 days ago

Hi @laderjk thanks for opening this - looks like a race condition. Can you try passing lock=True to both calls? This will lock the table line up all the updates/inserts.