aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: parquet containing ConvertedType TIMESTAMP_MILLIS may throw IndexOutOfRangeException while reading. #542

Closed artnim closed 2 months ago

artnim commented 3 months ago

Library Version

4.24.0

OS

Ubuntu Linux

OS Architecture

64 bit

How to reproduce?

  1. Create a parquet file with a time series stored as datetime64[ms]:
import pandas as pd

pd.set_option('io.parquet.engine', 'fastparquet')

start_time = pd.Timestamp('2024-08-21T00:00:00+02:00')
end_time = pd.Timestamp('2024-08-22T00:00:00+02:00')
series = pd.date_range(start_time, end_time, freq=pd.Timedelta(minutes=15))[:-1]

series = series.tz_convert('UTC').tz_localize(None).astype("datetime64[ms]")

df = pd.DataFrame({'datetime': series})

print(df)
print(df.dtypes)

df.to_parquet('./test.parquet')
  2. Open the resulting test.parquet using Floor.

Reading a parquet file created with the library itself works like a charm, so I suspect the problem depends on the fastparquet engine used on my side.
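For anyone turning the repro into a failing test, here is a minimal standard-library sketch of the values the `datetime` column should hold. It relies on the Parquet definition of TIMESTAMP_MILLIS as an int64 count of milliseconds since the Unix epoch. The helper names `to_millis` and `from_millis` are hypothetical and are not part of pandas, fastparquet, or parquet-dotnet.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(ts):
    """Encode a timezone-aware datetime as integer milliseconds since the Unix epoch."""
    return (ts - EPOCH) // timedelta(milliseconds=1)


def from_millis(ms):
    """Decode integer milliseconds since the Unix epoch into a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# Same range as the repro: one day at 15-minute steps, starting at
# 2024-08-21T00:00:00+02:00, with the end point excluded.
start = datetime(2024, 8, 21, tzinfo=timezone(timedelta(hours=2)))
expected_millis = [to_millis(start + timedelta(minutes=15 * i)) for i in range(96)]

# Decoded back to UTC, which is what the repro stores after tz_convert('UTC').
expected_utc = [from_millis(ms) for ms in expected_millis]
```

A reader that handles the file correctly should return 96 values matching `expected_utc`. Comparing against `expected_millis` instead may help show whether the raw int64 values or only their interpretation differ between the fastparquet and pyarrow files.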

Failing test

No response

artnim commented 3 months ago

Meanwhile, I can confirm that the problem is caused by what fastparquet writes. The problem does not exist with pyarrow.