Open mgab opened 1 month ago
It isn't "known" in the sense that anyone has raised this before, but the INTERVAL type is a particularly unwieldy encoding, as you can see. pyarrow does not use it, but stores the data as INT64 like fastparquet.
Fair, and yet fastparquet and pyarrow do not seem to be compatible when writing this type to a parquet file and reading it back:
- Writing a timedelta with fastparquet and loading it with pyarrow transforms it to a datetime.time.
- Writing a timedelta with pyarrow and loading it with fastparquet transforms it to an int.
Only when reading the file with the same tool that wrote it (either of the two) do you end up preserving the timedelta type.
In any case, what would be the proper solution? Would a PR that implements the format specification for the INTERVAL type be desirable? Would there be any concern about the compatibility against pyarrow?
> Would a PR that implements the format specification for the INTERVAL type be desirable?
You are welcome to try, but I think it might be a little work. It is not a high priority for me (we have had this model for a long time!). Fixing reading arrow with the INT encoding is perhaps more important.
Describe the issue:
The way timedelta values (a.k.a. durations, intervals...) are stored in parquet does not follow the file format specification. According to the parquet specification, the logical type Interval should be stored as:

Currently, fastparquet does not follow the format specification on this type. This affects the ability to read parquet files written with other tools, or to read with other tools parquet files written with fastparquet, if there is any field with this type.

I guess it might be a known issue rather than a bug, but I couldn't find info about it.
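For reference, the Parquet format specification describes INTERVAL as an annotation on a fixed_len_byte_array of length 12 holding three little-endian unsigned 32-bit integers: months, days and milliseconds. Below is a minimal sketch of that layout using only the Python standard library. The helper names `encode_interval` and `decode_interval` are hypothetical and are not part of fastparquet or pyarrow, and how a writer should handle negative or sub-millisecond values is an assumption here, not something the thread settles.

```python
import struct
from datetime import timedelta
from typing import Tuple


def encode_interval(td: timedelta, months: int = 0) -> bytes:
    """Pack a non-negative timedelta into the 12-byte INTERVAL layout.

    Hypothetical helper: three little-endian unsigned 32-bit integers
    (months, days, milliseconds).
    """
    if td < timedelta(0):
        # All three fields are unsigned, so a negative duration has no encoding.
        raise ValueError("INTERVAL cannot represent negative durations")
    # timedelta normalises itself to (days, seconds, microseconds);
    # anything finer than a millisecond is dropped here (an assumption).
    millis = td.seconds * 1000 + td.microseconds // 1000
    return struct.pack("<III", months, td.days, millis)


def decode_interval(raw: bytes) -> Tuple[int, timedelta]:
    """Unpack 12 INTERVAL bytes into (months, timedelta of days + millis)."""
    if len(raw) != 12:
        raise ValueError("INTERVAL values are exactly 12 bytes long")
    months, days, millis = struct.unpack("<III", raw)
    # Months have no fixed length, so they are returned separately
    # rather than folded into the timedelta.
    return months, timedelta(days=days, milliseconds=millis)


raw = encode_interval(timedelta(days=2, hours=1))
print(raw.hex())
print(decode_interval(raw))
```

The separate months field, the lack of a sign, and the millisecond resolution are the parts that do not map cleanly onto a pandas timedelta, which stores a signed 64-bit count at up to nanosecond resolution.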
Minimal Complete Verifiable Example:
Then use either hangxie/parquet-tools, ktrueda/parquet-tools or any similar tool to inspect the schema to find that it looks like:
instead of something along the lines of
Anything else we need to know?:
There's a bit more context on this StackOverflow question
Environment: