aws / amazon-redshift-python-driver

Redshift Python Connector. It supports Python Database API Specification v2.0.
Apache License 2.0

write_dataframe() is writing value with Pandas dtype datetime64[us, UTC] without time information #206

Closed — techxorcist closed this issue 4 months ago

techxorcist commented 5 months ago

Driver version

2.0.918

Redshift version

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.61678

Client Operating System

Windows 11 Enterprise 22621.2861

Python version

3.10.11

Table schema

create table scratch.datetime_issue (
    dw_inserted_at timestamptz
);

Problem description

  1. Expected behavior: When passed a dataframe with a column of dtype datetime64[us, UTC], e.g. 2024-01-31 21:17:06.093418+00:00, a timestamp with time zone should be written to the destination table.

  2. Actual behavior: A date, e.g. 2024-01-31 00:00:00.000000 +00:00, is written for the column.

  3. Error message/stack trace: core.Connection.make_params() appears to return a (<RedshiftOID.DATE: 1082>, 0, <function date_out at 0x0000026C570E9750>) tuple for the datetime value when converting the dataframe values into query parameters. This may be where the time is dropped, although retval in core.Connection.execute() still shows the time intact.

Python Driver trace logs

Reproduction code

import redshift_connector
import pandas as pd
from datetime import datetime, timezone

columns = ['dw_inserted_at']
output = pd.DataFrame({col: [datetime.now(timezone.utc)] * 5 for col in columns})
redshift_parameters = <DETAILS FOR YOUR ENV>

out_table = 'scratch.datetime_issue'

with redshift_connector.connect(**redshift_parameters) as conn:
    with conn.cursor() as cursor:
        cursor.write_dataframe(output, out_table)
        conn.commit()
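
For reference (a minimal check added for illustration, not part of the original report), reading the rows back with the standard DB-API cursor methods makes the truncation visible directly; the table and connection parameters are the same as above:

# Illustrative check: select the rows back and print them.
with redshift_connector.connect(**redshift_parameters) as conn:
    with conn.cursor() as cursor:
        cursor.execute("select dw_inserted_at from scratch.datetime_issue")
        for (dw_inserted_at,) in cursor.fetchall():
            # On the affected driver version this prints midnight values,
            # e.g. 2024-01-31 00:00:00+00:00, instead of the inserted time.
            print(dw_inserted_at)
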
Brooke-white commented 4 months ago

Hi @techxorcist, thank you for reporting this and providing an analysis; it was helpful in identifying the root cause. This is an interesting issue :)

What's happening here is that the data values have type <class 'pandas._libs.tslibs.timestamps.Timestamp'> when passed into redshift-connector.

Internally, redshift-connector defines a type mapping between Python and Redshift types to ensure we only send data types Redshift knows about. We do not define a mapping for <class 'pandas._libs.tslibs.timestamps.Timestamp'>, so our fallback behavior is to check whether the value is an instance of any type we do have defined. By chance, the first hit is datetime.date, hence the trace showing RedshiftOID.DATE as you noticed above, and the time information being chopped off.
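
For context (an illustration added here, not from the original comment): pandas.Timestamp subclasses datetime.datetime, which in turn subclasses datetime.date, so an isinstance-based fallback can legitimately match a Timestamp against the date mapping:

import datetime
import pandas as pd

ts = pd.Timestamp.now(tz="UTC")
# Timestamp -> datetime.datetime -> datetime.date, so both checks pass;
# whichever mapping the fallback consults first determines the wire type.
print(isinstance(ts, datetime.datetime))  # True
print(isinstance(ts, datetime.date))      # True

If the date mapping is consulted before the datetime one, the value is serialized with date_out and the time of day is lost, which matches the RedshiftOID.DATE tuple seen in make_params().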

I'm working on a fix for this and anticipate this will be included in our February release.
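
Until the fixed release is available, one possible interim workaround, sketched here rather than taken from the thread, is to bypass write_dataframe() and insert plain Python datetime values directly, so the driver's explicit datetime mapping applies:

from datetime import datetime, timezone
import redshift_connector

# Reuses the redshift_parameters placeholder from the reproduction code above.
rows = [(datetime.now(timezone.utc),) for _ in range(5)]
with redshift_connector.connect(**redshift_parameters) as conn:
    with conn.cursor() as cursor:
        # Assumes the default 'format' paramstyle (%s placeholders);
        # plain datetime.datetime values hit an explicit type mapping instead
        # of the isinstance fallback, so the time portion is preserved.
        cursor.executemany(
            "insert into scratch.datetime_issue (dw_inserted_at) values (%s)",
            rows,
        )
        conn.commit()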

techxorcist commented 4 months ago

Thanks @Brooke-white. I appreciate your attention to the issue, and I'm glad it was useful and interesting.

Brooke-white commented 4 months ago

Hi @techxorcist, we've fixed this in 2.1.0. Thanks for your patience :)