Hello everyone! We are having an issue with duplicated values in an identity column when executing redshift.to_sql() after executing redshift.copy(). When inserting more than 1000 records we use the COPY method, and when inserting fewer than 1000 records we use the redshift.to_sql() method, as recommended in the docs.
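Roughly, our write path looks like this (a simplified sketch; write_records is a hypothetical name for our actual helper):

import awswrangler as wr

# Route large batches through COPY via S3, small ones through plain INSERTs.
def write_records(df, con, path, schema='public', table='dim_testing', threshold=1000):
    if len(df) > threshold:
        wr.redshift.copy(df=df, con=con, path=path, schema=schema,
                         table=table, use_column_names=True, mode='append')
    else:
        wr.redshift.to_sql(df=df, con=con, schema=schema, table=table,
                           use_column_names=True, mode='append')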
This is the structure of the table:
create table if not exists dim_testing
(
    id BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    name VARCHAR
);
This is the code with which we can reproduce the issue:
import random
import string
import awswrangler as wr
import pandas as pd
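# Build a DataFrame of random 30-character strings in a single 'name' column.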
def generate_df(length):
    return pd.DataFrame(
        [''.join(random.choices(string.ascii_letters + string.digits, k=30)) for _ in range(length)],
        columns=['name'],
    )
df = generate_df(1500)
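# redshift.conn below is our already-open connection object
# (e.g. the result of wr.redshift.connect()).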
wr.redshift.copy(
    df=df,
    con=redshift.conn,
    path='s3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)
df = generate_df(500)
wr.redshift.to_sql(
    df=df,
    con=redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)
After executing that code, if we look for duplicates in the identity column id, we find duplicated values:
select id, count(*)
from dim_testing
group by id
having count(*) > 1;
If we only use to_sql() or copy(), the issue does not happen; it only happens when executing to_sql() after a copy().
Is this normal? Are we doing something wrong? Should we choose only one write method and not use the other? Thanks!
Hi @laderjk, thanks for opening this - looks like a race condition. Can you try passing lock=True to both calls? This will lock the table and line up all the updates/inserts.
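For reference, the change would look roughly like this (same table, path, and connection as in your snippet; lock=True executes a LOCK command inside the write transaction to force serializable isolation):

df = generate_df(1500)
wr.redshift.copy(
    df=df,
    con=redshift.conn,
    path='s3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append',
    lock=True  # serialize this write against other writers on the table
)

df = generate_df(500)
wr.redshift.to_sql(
    df=df,
    con=redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append',
    lock=True  # same here, so the identity allocation can't race
)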