Closed: jaychia closed this issue 8 months ago
Findings:

- Int96 timestamps are a deprecated Parquet type composed of a Julian day (`date32`) + nanoseconds-after-midnight (`time64`) (see: Parquet documentation)
- `date32` supports a huge range of dates (~5.8M years), but the int64-based `timestamp(ns)` and `timestamp(us)` types have date ranges of 584 years and 584,000 years respectively, which makes conversions of Int96 values to these types potentially dangerous (see the sketch after this list)
- By default Int96 values are parsed as `timestamp(ns)`; if the data falls within the representable range (of a `timestamp(ns)`) then there is no problem
- PyArrow provides a `coerce_int96_timestamp_unit` kwarg to override the default behavior of parsing int96 timestamps as `timestamp(ns)`, coercing to different resolutions such as `ms`/`us`, losing precision but potentially working around the overflow issues by expanding the range of timestamps
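The INT96 layout and the range arithmetic above can be sketched in plain Python. This is purely illustrative (not Daft's implementation; `decode_int96` is a name made up here), but the byte layout matches what Spark/Impala write: 8 little-endian bytes of nanoseconds-of-day followed by a 4-byte Julian day number.

```python
import struct
from datetime import datetime, timedelta

JULIAN_DAY_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01

def decode_int96(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp (illustrative only)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    # Python datetimes are microsecond-precision, so this decode itself
    # drops sub-microsecond digits.
    return datetime(1970, 1, 1) + timedelta(
        days=julian_day - JULIAN_DAY_UNIX_EPOCH,
        microseconds=nanos_of_day // 1_000,
    )

# 2023-01-01T12:30:00 encoded as INT96 (45,000s of nanos + Julian day):
raw = struct.pack("<qi", 45_000_000_000_000, 2_459_946)
assert decode_int96(raw) == datetime(2023, 1, 1, 12, 30)

# Why conversion to int64-based timestamps is dangerous: the total span an
# int64 tick counter can represent shrinks 1000x per step of resolution.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def span_years(ticks_per_second: float) -> float:
    return 2 * (2**63 - 1) / ticks_per_second / SECONDS_PER_YEAR

print(f"timestamp(ns): ~{span_years(1e9):.0f} years")   # ~584.5
print(f"timestamp(us): ~{span_years(1e6):,.0f} years")  # ~584,000
```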
Proposed actions:

1. Parse int96 timestamps as `timestamp(ns)` only
2. Parse as `timestamp(ns)` by default, but provide a `coerce_int96_timestamp_unit` kwarg for overriding the default behavior in case it encounters dates that fall outside of the expressible range of a `timestamp(ns)`.
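For context, PyArrow already exposes this escape hatch, and the proposed Daft kwarg would presumably mirror it. The `daft.read_parquet` call below is a sketch of the proposal (not a shipped API at the time of writing), and the file path is made up:

```python
import pyarrow.parquet as pq

# Existing PyArrow behavior: coerce INT96 to millisecond resolution instead
# of the default nanoseconds, trading precision for a wider date range.
table = pq.read_table("old_spark_data.parquet", coerce_int96_timestamp_unit="ms")

# Sketch of the proposed Daft equivalent (argument name taken from this
# issue; exact signature is the proposal, not a confirmed API):
import daft
df = daft.read_parquet("old_spark_data.parquet", coerce_int96_timestamp_unit="ms")
```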
Great explanation. Option 2 sounds good to me as we want to avoid read failures as well as data loss due to the precision errors.
Agreed - IIRC, we also have INT96 timestamps in multiple datasets that are known to have been written with non-nanosecond precision (e.g. millisecond precision), and thus we must explicitly have the ability to override the assumed precision of these timestamps for correctness.
Looks like this is closed. Can we close this issue?
Thanks @raghumdani !
Is your feature request related to a problem? Please describe.
Some old Parquet files may contain int96 timestamps (written by Spark).
Daft should support reading this data in its native Parquet reader.
UAC:

- Daft can read int96 timestamps in its native Parquet reader (with the ability to override the default coercion via a `coerce_int96_timestamp_unit` argument)
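For testing, a file containing INT96 timestamps can be produced with Spark itself. `spark.sql.parquet.outputTimestampType` is a real Spark config (INT96 is its historical default); the output path here is illustrative:

```python
from datetime import datetime
from pyspark.sql import SparkSession

# Force Spark to write timestamps as the legacy INT96 physical type.
spark = (
    SparkSession.builder
    .appName("int96-fixture")
    .config("spark.sql.parquet.outputTimestampType", "INT96")
    .getOrCreate()
)

df = spark.createDataFrame([(datetime(2023, 1, 1, 12, 30),)], ["ts"])
df.write.mode("overwrite").parquet("/tmp/int96_example.parquet")
```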