apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

Add new optional type parameters Offset to TIMESTAMP #458

Open ryancasburn-KAI opened 1 week ago

ryancasburn-KAI commented 1 week ago

Describe the enhancement requested

Hi, I'm new around here, so please let me know if this request would be better placed elsewhere.

I'd like to propose an optional type parameter called Offset for TIMESTAMP logical types.

In my typical use of Parquet files, the data is a running log with many rows, so any one row group is unlikely to span more than a few days.

The idea of the Offset parameter is to store, for each row group, an Int64 offset from the Unix epoch, and then to store the data relative to that offset.

This provides a couple of benefits:

  1. Row groups could be selectively downsized (when possible) to the INT32 physical type. If I understand correctly, this could reduce file size significantly. At millisecond precision, INT32 could cover row groups spanning up to ~49 days.[^1]
  2. The docs note that all TIMESTAMPs, particularly those with NANOS precision, have a limited range because they are stored as INT64. Adding an Offset would allow a practically unlimited range for TIMESTAMPs.

[^1]: Assuming the offset is set in the middle of the row group's values, so that both the negative and positive halves of the signed INT32 range are used.
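
To make the proposal concrete, here is a minimal Python sketch of the idea, assuming millisecond timestamps and a midpoint offset as described in the footnote. The function names (`encode_row_group`, `decode_row_group`, `fits_int32`) are hypothetical and only for illustration; they are not part of the Parquet format or of any existing library.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
MS_PER_DAY = 86_400_000


def encode_row_group(timestamps_ms):
    """Return (offset, relative_values) for one row group.

    The offset is the midpoint between the smallest and largest value,
    so the relative values are centered on zero and use both halves of
    the signed range. This is an assumed policy, not a specified one.
    """
    lo, hi = min(timestamps_ms), max(timestamps_ms)
    offset = lo + (hi - lo) // 2
    return offset, [t - offset for t in timestamps_ms]


def decode_row_group(offset, relative):
    """Recover the absolute timestamps from the offset and relative values."""
    return [offset + r for r in relative]


def fits_int32(relative):
    """True if every relative value is representable as a signed INT32."""
    return all(INT32_MIN <= r <= INT32_MAX for r in relative)


# A row group covering three days of log entries (milliseconds since epoch).
start = 1_700_000_000_000
timestamps = [start, start + MS_PER_DAY, start + 3 * MS_PER_DAY]

offset, relative = encode_row_group(timestamps)
assert fits_int32(relative)
assert decode_row_group(offset, relative) == timestamps

# Total span coverable by INT32 at millisecond precision: 2**32 ms.
int32_span_days = 2**32 / MS_PER_DAY  # about 49.7 days
```

A row group whose values span more than `2**32` milliseconds would fail the `fits_int32` check and would have to stay on INT64, which is why the downsizing is described as selective.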

wgtmac commented 3 days ago

Thanks for opening the issue! I don't think the current file size is a problem, since we already have delta encoding. The problems I can see so far with adding an offset to row group metadata are: