Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[IO] Ensure that native Parquet downloader supports deprecated INT96 timestamps #1215

Closed · jaychia closed this issue 8 months ago

jaychia commented 10 months ago

Is your feature request related to a problem? Please describe.

Some old Parquet files may contain INT96 timestamps (typically written by Spark).

Daft should support reading this data in its native Parquet reader.

UAC (user acceptance criteria):

  1. Unit tests for reading such files
  2. Ensure that int96 values are correctly read into timestamps with the appropriate resolution (see PyArrow's read_table with the coerce_int96_timestamp_unit argument; a short example follows below)
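
For reference, a minimal example of the PyArrow behavior mentioned above (the file path is a placeholder):

```python
import pyarrow.parquet as pq

# Default: int96 values are read as timestamp(ns); values outside the
# representable range overflow silently.
table = pq.read_table("legacy_spark.parquet")

# Workaround: coerce int96 values to a coarser unit, trading sub-millisecond
# precision for a much wider representable date range.
table_ms = pq.read_table("legacy_spark.parquet", coerce_int96_timestamp_unit="ms")
```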
jaychia commented 10 months ago

Findings:

  1. Int96 Parquet values are essentially a Julian day number (comparable to a date32) plus nanoseconds-after-midnight (comparable to a time64) (see: Parquet documentation); a decoding sketch follows after this list
    1. This means it supports the same date range as a date32 (~5.8M years)
    2. Int96 timestamps are always at nanosecond resolution
  2. The Daft/Arrow timestamp(ns) and timestamp(us) int64-based types have date ranges of ~584 years and ~584,000 years respectively, which makes conversions of Int96 values to these types potentially dangerous
    1. By default, both Daft and Arrow read Parquet int96 into timestamp(ns)
    2. If timestamps fall within range (years 1678–2262 for timestamp(ns)), there is no problem
    3. However, if they do not, an overflow occurs and both Daft and PyArrow silently present wrong results without throwing an error
  3. PyArrow provides a coerce_int96_timestamp_unit kwarg that overrides the default parsing of int96 timestamps as timestamp(ns) and coerces them to a coarser resolution such as ms/us instead. This loses sub-unit precision but widens the representable date range, working around the overflow issue
    1. Note that this has no bearing on the interpretation of the int96 values themselves -- it exists only as a workaround for the overflow issue
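
For concreteness, here is a minimal sketch of the int96 decoding math described above. The byte layout follows the Parquet spec; the function name and the explicit int64 range check are illustrative, not Daft's actual implementation:

```python
import struct

# Julian Day Number of the Unix epoch (1970-01-01).
JULIAN_EPOCH_DAY = 2_440_588
NANOS_PER_DAY = 86_400 * 1_000_000_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def int96_to_timestamp_ns(raw: bytes) -> int:
    """Decode a 12-byte Parquet int96 into nanoseconds since the Unix epoch.

    Layout (little-endian): 8 bytes of nanoseconds-within-day followed by
    4 bytes of Julian day number.
    """
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days_since_epoch = julian_day - JULIAN_EPOCH_DAY
    ts_ns = days_since_epoch * NANOS_PER_DAY + nanos_of_day
    # An int96 can express ~5.8M years, but timestamp(ns) is int64-backed
    # (~584 years total), so out-of-range values should surface as an error
    # rather than silently wrapping (see proposed action 1 below).
    if not (INT64_MIN <= ts_ns <= INT64_MAX):
        raise OverflowError(f"int96 value {ts_ns} ns does not fit in timestamp(ns)")
    return ts_ns


# Example: 2023-01-01T00:00:00 is Julian day 2_459_946 with 0 ns of day.
raw = struct.pack("<qI", 0, 2_459_946)
assert int96_to_timestamp_ns(raw) == (2_459_946 - 2_440_588) * NANOS_PER_DAY
```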

Proposed actions:

  1. Daft can throw an error instead of silently overflowing when parsing int96 timestamps into timestamp(ns)
  2. Daft can add an argument similar to PyArrow's coerce_int96_timestamp_unit to override the default behavior when it encounters dates that fall outside the representable range of a timestamp(ns) (see the sketch below)
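
A hypothetical sketch of what action 2 could look like at the Daft API level; the kwarg name simply mirrors PyArrow's and is an assumption here, not a shipped interface:

```python
import daft

# Hypothetical kwarg mirroring PyArrow's coerce_int96_timestamp_unit:
# reading legacy Spark-written int96 data at millisecond resolution trades
# sub-millisecond precision for a ~1000x wider representable date range.
df = daft.read_parquet(
    "s3://bucket/legacy_spark_data/*.parquet",
    coerce_int96_timestamp_unit="ms",
)
```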
raghumdani commented 10 months ago

Great explanation. Option 2 sounds good to me, as we want to avoid both read failures and data loss due to precision errors.

pdames commented 10 months ago

> Great explanation. Option 2 sounds good to me, as we want to avoid both read failures and data loss due to precision errors.

Agreed - IIRC, we also have INT96 timestamps in multiple datasets that are known to have been written with non-nanosecond precision (e.g. millisecond precision), so we need an explicit way to override the assumed precision of these timestamps for correctness.

raghumdani commented 8 months ago

Looks like this has been addressed. Can we close this issue?

jaychia commented 8 months ago

Thanks @raghumdani !