Describe the bug
If the DataFrame being saved to S3 contains timestamp[us] values that are out of bounds for timestamp[ns], the write will fail when the Athena table already exists. If the table does not exist yet, the data is saved to S3 successfully.
How to Reproduce
Run the following code; the second write attempt will fail.
import logging

import awswrangler as wr
import boto3
import colorlog
import numpy as np
import pandas as pd


def add_color_handler(logger) -> None:
    """Set up the logger to display colored and formatted messages."""
    handler = colorlog.StreamHandler()
    handler.setFormatter(
        colorlog.ColoredFormatter(
            "%(asctime)s %(log_color)s%(levelname)s%(reset)s %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
            reset=True,
            log_colors={
                "DEBUG": "cyan",
                "INFO": "green",
                "WARNING": "yellow",
                "ERROR": "red",
                "CRITICAL": "red,bg_white",
            },
            secondary_log_colors={},
            style="%",
        )
    )
    logger.addHandler(handler)


def setup_root_loggers(file_prefix=None) -> None:
    """Set up the root logger."""
    logger = logging.getLogger()
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    add_color_handler(logger)


setup_root_loggers()
# Example DataFrame with a problematic timestamp column
data = {
    'schedulingstartdate': [
        '2023-01-01 00:00:00.000',
        '2023-01-02 00:00:00.000',
        '0800-01-01 00:00:00.000',  # Out of bounds for timestamp[ns]
        '1877-09-21 00:12:43.000',
    ]
}
df = pd.DataFrame(data)
# Convert the column directly to microsecond precision
df['schedulingstartdate'] = df['schedulingstartdate'].astype('datetime64[us]')
# Alternatives tried:
# df['schedulingstartdate'] = pd.to_datetime(df['schedulingstartdate'])
# df['schedulingstartdate'] = df['schedulingstartdate'].astype('datetime64[us]')
# Filter out rows with out-of-bounds timestamps:
# df = df.dropna(subset=['schedulingstartdate'])
# Convert the column to nanosecond precision:
# df['schedulingstartdate'] = df['schedulingstartdate'].astype('datetime64[ns]')
print(df.dtypes)
print(df)
client = boto3.client("s3")
region = boto3.Session().region_name
account_id = boto3.Session().client("sts").get_caller_identity().get("Account")
print(f"Boto3 is using the region: {region}, on account: {account_id}")
bucket_name = '<test_bucket_name>'
table_name = "test_table"
database_name = "database_name"
wr.s3.to_parquet(
    df=df,
    path="s3://" + bucket_name + '/' + table_name,
    dataset=True,
    compression="snappy",
    use_threads=True,
    database=database_name,
    table=table_name,
    pyarrow_additional_kwargs={"coerce_timestamps": "us"},
)
print("Starting Second Attempt")
wr.s3.to_parquet(
    df=df,
    path="s3://" + bucket_name + '/' + table_name,
    dataset=True,
    compression="snappy",
    use_threads=True,
    database=database_name,
    table=table_name,
    pyarrow_additional_kwargs={"coerce_timestamps": "us"},
)
Expected behavior
One of two outcomes:
It is not possible to save any datetime that is out of bounds of timestamp[ns], i.e. both write attempts fail.
It is possible to save timestamp[us], timestamp[ms], and timestamp[s] data, not only timestamp[ns] (preferred).
Possible solution approach
In general, I am not aware of any reason not to respect the actual data type in the DataFrame, as long as it is also a timestamp type; this can be validated with an existing function such as is_timestamp.
I can take on the task of fixing the bug, either with the relatively simple solution described above or with any other solution we come up with as part of a discussion.
Your project
No response
Screenshots
No response
OS
Mac / Linux
Python version
3.11, 3.12
AWS SDK for pandas version
3.9.1
Additional context
Looking at the code and doing some local tests, I think the issue is in the following line of code: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/_data_types.py#L704 , and subsequently this line: https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/_data_types.py#L332 , which has a one-to-one translation from the Athena timestamp type to timestamp[ns], regardless of the actual type in the DataFrame.
Also probably somewhat related: https://github.com/aws/aws-sdk-pandas/issues/2926