aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

to_parquet does not include timestamp logical type information #2972

Closed mattinbits closed 2 months ago

mattinbits commented 2 months ago

Describe the bug

When writing a dataframe to Parquet using AWS SDK for pandas, date and timestamp columns in the dataframe do not have logical types included in the resulting Parquet files. This is in contrast to the pandas to_parquet behaviour.

How to Reproduce

The setup:

from datetime import datetime
import awswrangler as wr
import pandas as pd

ts = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S.%f")
dt = lambda x: datetime.strptime(x, "%Y-%m-%d").date()

df = pd.DataFrame({
    "date": [dt("2020-01-01"), dt("2020-01-02")], 
    "timestamp": [ts("2020-01-01 00:00:00.0"), ts("2020-01-02 00:00:00.0")]
})

When writing with awswrangler:

wr.s3.to_parquet(df, "s3://mybucket/test.parquet")

When inspecting this file with Parquet tools:

############ file meta data ############
created_by: parquet-cpp-arrow version 17.0.0
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 1448

############ Columns ############
date
timestamp

############ Column(date) ############
name: date
path: date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: SNAPPY (space_saved: -5%)

############ Column(timestamp) ############
name: timestamp
path: timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT96
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

When writing using pandas:

df.to_parquet("local.parquet")

When inspecting with parquet tools:

############ file meta data ############
created_by: parquet-cpp-arrow version 17.0.0
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 2.6
serialized_size: 1935

############ Columns ############
date
timestamp

############ Column(date) ############
name: date
path: date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: SNAPPY (space_saved: -5%)

############ Column(timestamp) ############
name: timestamp
path: timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

Expected behavior

The parquet file written by AWS Wrangler preserves date and timestamp logical type information.

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.11.9

AWS SDK for pandas version

3.9.0

Additional context

No response

jaidisido commented 2 months ago

This is caused by the default pyarrow arguments in the ParquetWriter class. Specifically, we set the flavor to "spark" to maximize compatibility with various systems. However, this comes at the expense of losing some features, such as logical type annotations.

Switching off the flavor argument:

wr.s3.to_parquet(
    df,
    "s3://my-bucket/test_flavor.parquet",
    pyarrow_additional_kwargs={"flavor": None},
)

cancels that effect:

############ Column(timestamp) ############
name: timestamp
path: timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)