This is caused by the default pyarrow arguments in the ParquetWriter class. Specifically, we set the flavor to spark
to maximize compatibility with various systems. However, this comes at the expense of losing some features, such as logical types.
Switching off the flavor argument cancels that effect:

import awswrangler as wr

wr.s3.to_parquet(
    df,
    "s3://my-bucket/test_flavor.parquet",
    pyarrow_additional_kwargs={"flavor": None},
)

Inspecting the resulting file with parquet-tools then shows the timestamp logical type preserved:
############ Column(timestamp) ############
name: timestamp
path: timestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)
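
For reference, a minimal pyarrow-level sketch of the same toggle (the table contents and file names here are assumptions, not from the original report):

import pyarrow as pa
import pyarrow.parquet as pq

# a single timestamp[ms] column, matching the column shown above
table = pa.table({"timestamp": pa.array([0], type=pa.timestamp("ms"))})

# flavor="spark" turns on Spark-compatibility options (including the
# deprecated INT96 timestamp encoding), which is where the logical-type
# annotation is lost; leaving flavor unset keeps it
pq.write_table(table, "with_flavor.parquet", flavor="spark")
pq.write_table(table, "without_flavor.parquet")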
Describe the bug
When writing a dataframe to parquet using AWS Wrangler, date and timestamp columns in the dataframe do not have logical types included in the resulting parquet files. This is in contrast to pandas' to_parquet behaviour.
How to Reproduce
The setup:
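
A minimal sketch of a suitable setup (the column names and values are assumptions, apart from "timestamp", which appears in the output above):

import pandas as pd

df = pd.DataFrame(
    {
        "date": [pd.Timestamp("2024-01-01").date()],
        "timestamp": [pd.Timestamp("2024-01-01 12:00:00")],
    }
)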
When writing with AWS Wrangler:
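
A sketch, assuming default arguments and a placeholder bucket path:

import awswrangler as wr

wr.s3.to_parquet(df, "s3://my-bucket/test.parquet")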
When inspecting this file with parquet-tools:
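
The original output is not shown here; per the report, the timestamp column lacks a logical_type annotation. A sketch of the inspection step (paths are assumptions):

aws s3 cp s3://my-bucket/test.parquet .
parquet-tools inspect test.parquet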
When writing using pandas:
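
A sketch of the pandas equivalent, which writes through pyarrow without the spark flavor (the filename is an assumption):

df.to_parquet("test_pandas.parquet")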
When inspecting with parquet-tools:
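
Again a sketch; per the report, this file does carry the logical-type annotations:

parquet-tools inspect test_pandas.parquet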
Expected behavior
The parquet file written by AWS Wrangler preserves date and timestamp logical type information.
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.11.9
AWS SDK for pandas version
3.9.0
Additional context
No response