adjust / parquet_fdw

Parquet foreign data wrapper for PostgreSQL
PostgreSQL License
351 stars 38 forks source link

Parquet timestamps are read with a 1-second precision #52

Closed maxime-jolliet-airseas closed 1 year ago

maxime-jolliet-airseas commented 1 year ago

Hi,

I observed that parquet microsecond or millisecond-precision timestamps are read by the second in postgres.

Steps to reproduce, in python :

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

n = 100
df = pd.DataFrame({"time": [pd.Timestamp(datetime.datetime(2000,1,1)+datetime.timedelta(seconds=0.101*i)) for i in range(n)],
                  "value_float": [0+i/100 for i in range(n)]})
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, '/path/to/test.parquet", coerce_timestamps='ms')

Then, in postgres:

CREATE FOREIGN TABLE test (time TIMESTAMP, value_float DOUBLE PRECISION) server parquet options (filename '/path/to/test.parquet');
select * from test;

> "2000-01-01 00:00:00" 0
> "2000-01-01 00:00:00" 0.01
> "2000-01-01 00:00:00" 0.02
> ...

After analysis from @ahoy-jon the problem comes from https://github.com/adjust/parquet_fdw/blob/master/src/common.hpp#L27 This solves the problem for milliseconds:

case arrow::TimeUnit::MILLI:                            \
            ts = time_t_to_timestamptz((i) / 1000);             \
            ts = TimestampTzPlusMilliseconds(ts, i % 1000);     \
            break;                                              \

If interested, I can provide a patch for other time unit as well.

Regards,