aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Save to S3 with timestamp[us] data fails if Athena table already exists #2950

Closed: eliabrio closed this issue 6 days ago

eliabrio commented 1 week ago

Describe the bug

When using the function

wr.s3.to_parquet(...)

if the dataframe being saved to S3 contains timestamp[us] data that is out of bounds for timestamp[ns], the write fails when the Athena table already exists. If the table does not exist, the data is saved to S3 successfully.

How to Reproduce

Run the following code; the second write attempt will fail.

import logging

import awswrangler as wr
import boto3
import colorlog
import pandas as pd

def add_color_handler(logger) -> None:
    """Set up the logger to display colored and formatted messages."""
    handler = colorlog.StreamHandler()
    handler.setFormatter(
        colorlog.ColoredFormatter(
            "%(asctime)s %(log_color)s%(levelname)s%(reset)s %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",  # Add the desired date format here
            reset=True,
            log_colors={
                "DEBUG": "cyan",
                "INFO": "green",
                "WARNING": "yellow",
                "ERROR": "red",
                "CRITICAL": "red,bg_white",
            },
            secondary_log_colors={},
            style="%",
        )
    )
    logger.addHandler(handler)


def setup_root_loggers(file_prefix=None) -> None:
    """Set up the root logger."""
    logger = logging.getLogger()
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    add_color_handler(logger)    

setup_root_loggers()

# Example DataFrame with a problematic timestamp column
data = {
    'schedulingstartdate': [
        '2023-01-01 00:00:00.000',
        '2023-01-02 00:00:00.000',
        '0800-01-01 00:00:00.000',  # Out-of-bounds timestamp
        '1877-09-21 00:12:43.000'
    ]
}
df = pd.DataFrame(data)

# Convert the column to microsecond-resolution datetime64 (note: astype does not coerce out-of-bounds values)
df['schedulingstartdate'] = df['schedulingstartdate'].astype('datetime64[us]')
# df['schedulingstartdate'] = pd.to_datetime(df['schedulingstartdate'])
# df['schedulingstartdate'] = df['schedulingstartdate'].astype('datetime64[us]')

# Filter out rows with out-of-bounds timestamps
# df = df.dropna(subset=['schedulingstartdate'])

# Convert the column to the desired format
# df['schedulingstartdate'] = df['schedulingstartdate'].astype('datetime64[ns]')

print(df.dtypes)
print(df)

client = boto3.client("s3")
region = boto3.Session().region_name
account_id = boto3.Session().client("sts").get_caller_identity().get("Account")        
print(f"Boto3 is using the region: {region}, on account: {account_id}")

bucket_name = '<test_bucket_name>'
table_name = "test_table"
database_name = "database_name"
wr.s3.to_parquet(
    df=df,
    path="s3://" + bucket_name + '/' + table_name,
    dataset=True,
    compression="snappy",
    use_threads=True,
    database=database_name,
    table=table_name,
    pyarrow_additional_kwargs={"coerce_timestamps": "us"},
)

print("Starting Second Attempt")

wr.s3.to_parquet(
    df=df,
    path="s3://" + bucket_name + '/' + table_name,
    dataset=True,
    compression="snappy",
    use_threads=True,
    database=database_name,
    table=table_name,
    pyarrow_additional_kwargs={"coerce_timestamps": "us"},
)

Expected behavior

One of two outcomes:

  1. Refuse to save any datetime that is out of bounds for timestamp[ns].
  2. Support saving timestamp[us/s/ms], not only timestamp[ns] (preferred). A short sketch of the underlying ns vs. us bounds follows this list.
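A minimal sketch of why the same value fits one resolution but not the other (assuming pandas 2.x, which supports non-nanosecond datetime64 columns):

import pandas as pd

s = pd.Series(['0800-01-01 00:00:00.000'])

# Microsecond resolution can represent the year 800...
print(s.astype('datetime64[us]').dtype)  # datetime64[us]

# ...but nanosecond resolution only covers roughly 1677-2262,
# so the same conversion overflows.
try:
    s.astype('datetime64[ns]')
except ValueError as exc:  # pandas raises OutOfBoundsDatetime, a ValueError subclass
    print(f"out of bounds for timestamp[ns]: {exc}")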

Your project

No response

Screenshots

No response

OS

Mac / Linux

Python version

3.11, 3.12

AWS SDK for pandas version

3.9.1

Additional context

Looking at the code and doing some local tests, I think the issue is in the following line of code: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/_data_types.py#L704 , and subsequently this line: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/_data_types.py#L332 , which performs a one-to-one translation from the Athena timestamp type to timestamp[ns], regardless of the actual type in the dataframe.
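
To illustrate the shape of the problem (a hypothetical simplification, not the library's actual source), a mapping like the following always hands back a nanosecond type for Athena's timestamp, whatever unit the dataframe column actually uses:

import pyarrow as pa

# Hypothetical simplification: the Athena type string alone decides the
# pyarrow type, so "timestamp" is pinned to nanoseconds and a dataframe
# column stored as timestamp[us] ends up being cast to timestamp[ns].
def athena2pyarrow_simplified(athena_type: str) -> pa.DataType:
    if athena_type == "timestamp":
        return pa.timestamp("ns")
    raise ValueError(f"unhandled Athena type: {athena_type}")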

Possible solution approach: in general I am not aware of any reason not to respect the actual data type in the dataframe, as long as it is also a timestamp type, and an existing check such as is_timestamp can be used to validate this (see the sketch below).
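
A minimal sketch of that idea, assuming pyarrow is available; the function name and signature are hypothetical, not the library's actual API:

import pyarrow as pa

def resolve_timestamp_type(df_field: pa.Field) -> pa.DataType:
    # If the dataframe column is already a timestamp, keep its unit
    # (s / ms / us / ns) instead of forcing nanoseconds.
    if pa.types.is_timestamp(df_field.type):
        return df_field.type
    # Otherwise fall back to the current nanosecond default.
    return pa.timestamp("ns")

# Example: a microsecond column keeps its unit.
field = pa.field("schedulingstartdate", pa.timestamp("us"))
print(resolve_timestamp_type(field))  # timestamp[us]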

I can take on the task of fixing this bug, either with the relatively simple solution described above or with any other solution we come up with as part of a discussion.

Also probably somewhat related: https://github.com/aws/aws-sdk-pandas/issues/2926

eliabrio commented 1 week ago

PR with suggested fix: https://github.com/aws/aws-sdk-pandas/pull/2953