JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

DateTime reader support #133

Open nickrobinson251 opened 3 years ago

nickrobinson251 commented 3 years ago

Is it currently possible to read data in as a DateTime? If not, what would need to be done to add this? (This is a sort of partner to #108, although I don't know how closely related reader and writer support are.)

Current behaviour seems to be to read datetimes in as Int64 values.

For example, generating some data in Python:

>>> import pandas as pd
>>> 
>>> t1 = pd.Timestamp('2018-01-01 06:00:00+0000', tz='UTC')
>>> t2 = pd.Timestamp('2018-01-01 07:00:00+0000', tz='UTC')
>>> df = pd.DataFrame([t1, t2], columns=["datetime_utc"])
>>> df["datetime_utc"].dtype
datetime64[ns, UTC]
>>>
>>> df.to_parquet("datetimes.parquet")

and then reading it in Julia:

julia> pq_file = Parquet.File("datetimes.parquet")
Parquet file: datetimes.parquet
    version: 1
    nrows: 2
    created by: parquet-cpp version 1.5.1-SNAPSHOT
    cached: 0 column chunks

julia> schema(pq_file)
Schema:
    required schema {
      optional INT64 datetime_utc # (from TIMESTAMP_MICROS)
    }

I've tried to use the map_logical_types keyword, for example Dict(["datetime_utc"] => (DateTime, Parquet.logical_timestamp)), but this fails with ERROR: unsupported storage type 2 for DateTime.

oxinabox commented 3 years ago

I think this might just be a bug on this line, which lists a wrong or incomplete set of storage types: https://github.com/JuliaIO/Parquet.jl/blob/a21df68a57add5b6c48902f4ec775146fe0ef3a1/src/codec.jl#L227

The INT96 is defined here https://github.com/JuliaIO/Parquet.jl/blob/a21df68a57add5b6c48902f4ec775146fe0ef3a1/src/PAR2/PAR2_types.jl#L11

From the same file: type 2 is INT32. I suspect a branch for that needs to be added. Maybe for INT64 also?

tanmaykm commented 3 years ago

We need to have an implementation that can decode Int64 logical timestamps and then plug it in there.

This is the format specification: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp.

The Parquet.logical_timestamp method currently handles only the Int96 format and can't be used to decode the Int64-encoded format.
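
For illustration, here is a minimal sketch of what decoding the Int64 form could look like, assuming the column is annotated TIMESTAMP_MICROS (microseconds since the Unix epoch, per the format specification linked above). The function name timestamp_micros_to_datetime is hypothetical and not part of Parquet.jl, and the sample values are simply the two timestamps from the Python example expressed in microseconds. Julia's DateTime only has millisecond resolution, so the sub-millisecond part is dropped here.

```julia
using Dates

# Hypothetical helper (not part of Parquet.jl): decode an INT64 value annotated
# as TIMESTAMP_MICROS, i.e. microseconds since 1970-01-01T00:00:00 UTC.
# DateTime stores milliseconds, so the sub-millisecond digits are discarded.
function timestamp_micros_to_datetime(micros::Int64)
    # fld floors toward negative infinity, so pre-epoch values round the same
    # way as post-epoch ones.
    millis = fld(micros, 1000)
    return DateTime(1970, 1, 1) + Millisecond(millis)
end

# The two timestamps from the pandas example above, in microseconds since epoch.
raw = Int64[1514786400000000, 1514790000000000]
decoded = timestamp_micros_to_datetime.(raw)
println(decoded)
```

A real implementation inside the package would also need to cover TIMESTAMP_MILLIS and decide how to treat the UTC-adjustment flag; this sketch only shows the unit conversion.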