dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

The parquet format specification is not followed for Interval type (i.e. timedeltas) #937

Open mgab opened 1 month ago

mgab commented 1 month ago

Describe the issue:

The way timedelta values (a.k.a. durations, intervals...) are stored in parquet does not follow the file format specification. According to the parquet specification, the logical type Interval should be stored as:

INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date. (...)
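To make the quoted layout concrete, here is a minimal sketch of packing and unpacking such a 12-byte value using only the Python standard library. The helper names `encode_interval` and `decode_interval` are hypothetical and are not part of fastparquet or any other parquet library. The sketch assumes the months field is supplied separately (a timedelta carries no month component), truncates anything finer than a millisecond, and rejects negative durations because the three fields are unsigned.

```python
import struct
from datetime import timedelta


def encode_interval(td, months=0):
    """Pack a non-negative timedelta into a 12-byte INTERVAL value.

    Hypothetical helper: three little-endian unsigned 32-bit integers
    holding months, days and milliseconds, in that order.
    """
    if td < timedelta(0):
        raise ValueError("INTERVAL fields are unsigned; negative durations do not fit")
    days = td.days
    # Sub-millisecond precision is truncated, since the finest unit is ms.
    millis = td.seconds * 1000 + td.microseconds // 1000
    return struct.pack("<III", months, days, millis)


def decode_interval(raw):
    """Unpack a 12-byte INTERVAL value into (months, days, millis)."""
    if len(raw) != 12:
        raise ValueError("INTERVAL must be exactly 12 bytes")
    return struct.unpack("<III", raw)


raw = encode_interval(timedelta(seconds=30))
print(raw.hex())             # 000000000000000030750000
print(decode_interval(raw))  # (0, 0, 30000)
```

The round trip is lossy by construction: months and days are kept as separate counts rather than being folded into a single duration, which is part of why mapping this type onto a pandas timedelta is not straightforward.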

Currently, fastparquet does not follow the format specification for this type. This affects interoperability in both directions whenever a file contains a field of this type: reading files written by other tools with fastparquet, and reading files written by fastparquet with other tools.

I guess it might be a known issue rather than a bug, but I couldn't find info about it.

Minimal Complete Verifiable Example:

import pandas as pd
from fastparquet import write

df = pd.DataFrame([{'seconds': 30, 'duration': pd.to_timedelta(30, unit='seconds')}])

write('/test/test.parquet', df)

Then use either hangxie/parquet-tools, ktrueda/parquet-tools or any similar tool to inspect the schema to find that it looks like:

{"Tag":"name=Schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=INT64, convertedtype=TIME_MICROS, repetitiontype=OPTIONAL"}
]}

instead of something along the lines of

{"Tag":"name=Duckdb_schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT32, convertedtype=INT_32, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=FIXED_LEN_BYTE_ARRAY, convertedtype=INTERVAL, length=12, repetitiontype=OPTIONAL"}
]}

Anything else we need to know?:

There's a bit more context on this StackOverflow question

Environment:

martindurant commented 1 month ago

It isn't "known" in the sense that anyone has raised this before, but the INTERVAL type is a particularly unwieldy encoding, as you can see. pyarrow does not use it, but stores the data as INT64 like fastparquet.
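For comparison, a single-integer encoding is much simpler to handle. The sketch below shows the idea with the standard library only; it assumes the INT64 column holds a count of microseconds, as the TIME_MICROS annotation in the schema above suggests. The actual unit and annotation each library writes is not confirmed here, and the helper names are hypothetical.

```python
from datetime import timedelta

MICROSECOND = timedelta(microseconds=1)


def timedelta_to_micros(td):
    """Return the duration as a signed integer count of microseconds.

    Assumption: microseconds are the stored unit; this is illustrative
    and not taken from fastparquet or pyarrow source code.
    """
    return td // MICROSECOND


def micros_to_timedelta(count):
    """Rebuild a timedelta from a signed microsecond count."""
    return timedelta(microseconds=count)


print(timedelta_to_micros(timedelta(seconds=30)))  # 30000000
print(micros_to_timedelta(-1_000_000))             # -1 day, 23:59:59
```

Unlike the 12-byte INTERVAL layout, a signed integer count can represent negative durations and round-trips exactly at the chosen resolution.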

mgab commented 1 month ago

Fair, and yet fastparquet and pyarrow do not seem to be compatible when writing and reading this type on a parquet file:

The timedelta type is preserved only when the file is read with the same tool that wrote it (either of the two).

In any case, what would be the proper solution? Would a PR that implements the format specification for the INTERVAL type be desirable? Would there be any concern about the compatibility against pyarrow?

martindurant commented 1 month ago

Would a PR that implements the format specification for the INTERVAL type be desirable?

You are welcome to try, but I think it might be a bit of work. It is not a high priority for me (we have had this model for a long time!). Fixing the reading of arrow-written files that use the INT encoding is perhaps more important.