datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0

TimestampNanosecond may not be able to represent whole ORC timestamp range #75

Closed · Jefffrey closed this 2 weeks ago

Jefffrey commented 3 months ago

ORC encodes timestamps as seconds since ORC epoch, and nanoseconds since last second.

We translate this into the TimestampNanosecond type.

However, this may limit the range that ORC timestamps can actually represent.

Do we need to handle this?
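For context, Arrow's TimestampNanosecond stores an `i64` count of nanoseconds since the Unix epoch, which covers only about ±292 years around 1970 (roughly 1677 to 2262), while ORC's (seconds, nanoseconds) encoding can go well beyond that. A self-contained sketch of the limit (the year-3000 value is just an illustration):

```rust
fn main() {
    const NANOS_PER_SEC: i64 = 1_000_000_000;
    const SECS_PER_YEAR: i64 = 31_556_952; // average Gregorian year

    // How many years of nanoseconds fit on either side of the epoch in an i64.
    let years = i64::MAX / NANOS_PER_SEC / SECS_PER_YEAR;
    println!("TimestampNanosecond covers ~±{years} years around 1970");

    // A timestamp around the year 3000 (expressed in seconds) overflows i64
    // when widened to nanoseconds, so it cannot be represented at all.
    let secs_year_3000: i64 = 32_503_680_000;
    assert!(secs_year_3000.checked_mul(NANOS_PER_SEC).is_none());
}
```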

waynexia commented 1 month ago

https://github.com/datafusion-contrib/datafusion-orc/pull/91 adds a validation on converting timestamp data. It would be more generic if we could support all four types of timestamp precision.

progval commented 1 month ago

The options I see:

  1. parse them as null
  2. try to parse whole batches as nanoseconds. If it overflows, try milliseconds; if that overflows again, seconds. If the coarser unit would lose precision for any value, return an error
  3. Use statistics (if available) to find the min and max values and pick a precision based on that, then decode. While decoding, if we see we are losing precision for any of the values, return an error.
  4. allow the user to configure what precision to use somehow, and stick to that

Variant: allow users to opt in to losing precision instead of returning an error, or maybe to returning null or saturating.
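Option 2 above could be sketched roughly like this, extended to all four units. The `widen` function and `TimeUnit` enum are illustrative names for this sketch, not the crate's API:

```rust
#[derive(Debug, PartialEq)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

/// Widen an ORC (seconds, nanos-within-second) pair to the finest unit that
/// fits in an i64. Falling back to a coarser unit is only allowed when no
/// precision is lost; otherwise return None (i.e. surface an error).
fn widen(seconds: i64, nanos: u32) -> Option<(i64, TimeUnit)> {
    let n = nanos as i64;
    if let Some(v) = seconds.checked_mul(1_000_000_000).and_then(|v| v.checked_add(n)) {
        return Some((v, TimeUnit::Nanosecond));
    }
    if n % 1_000 == 0 {
        if let Some(v) = seconds.checked_mul(1_000_000).and_then(|v| v.checked_add(n / 1_000)) {
            return Some((v, TimeUnit::Microsecond));
        }
    }
    if n % 1_000_000 == 0 {
        if let Some(v) = seconds.checked_mul(1_000).and_then(|v| v.checked_add(n / 1_000_000)) {
            return Some((v, TimeUnit::Millisecond));
        }
    }
    if n == 0 {
        return Some((seconds, TimeUnit::Second));
    }
    None // falling back would silently drop sub-unit digits
}

fn main() {
    // In range: stays at nanoseconds.
    assert_eq!(widen(1, 500), Some((1_000_000_500, TimeUnit::Nanosecond)));
    // Year ~3000: too far for nanoseconds, representable as microseconds.
    assert_eq!(
        widen(32_503_680_000, 0),
        Some((32_503_680_000_000_000, TimeUnit::Microsecond))
    );
}
```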

What do you think?

waynexia commented 1 month ago

> allow the user to configure what precision to use somehow, and stick to that

Option 4 looks the most concrete to me. ORC's spec says all timestamps have nanosecond precision. To avoid losing precision while reading an ORC file, we would have to check whether all the timestamps in the file end with 000 or 000000, etc., which would suggest the writer was probably using milli- or microseconds. But this check might be expensive. So providing a way for the caller to specify the expected precision looks easier to achieve and to use.
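The trailing-zeros probe described here could look something like the following hypothetical helper (not the crate's API). The expense comes from having to scan every value's nanosecond component:

```rust
/// Report the coarsest time unit that represents every value exactly, given
/// the nanoseconds-within-second component of each timestamp in a column.
fn effective_unit(nanos: &[u32]) -> &'static str {
    if nanos.iter().any(|&n| n % 1_000 != 0) {
        "nanosecond" // some value has sub-microsecond digits
    } else if nanos.iter().any(|&n| n % 1_000_000 != 0) {
        "microsecond" // all end with 000 but not all with 000000
    } else if nanos.iter().any(|&n| n != 0) {
        "millisecond" // all end with 000000 but some are nonzero
    } else {
        "second"
    }
}

fn main() {
    assert_eq!(effective_unit(&[0, 123_000_000]), "millisecond");
    assert_eq!(effective_unit(&[123_000]), "microsecond");
    assert_eq!(effective_unit(&[1]), "nanosecond");
    assert_eq!(effective_unit(&[0, 0]), "second");
}
```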

Jefffrey commented 1 month ago

Option 4 makes sense to me too. The reader builder options could let the user configure second/milli/micro precision explicitly, and maybe also choose whether it is strict or not: strict = fail if there is a value with higher precision than configured (e.g. it has nanoseconds when expecting TimestampSecond), non-strict = truncate the extra precision (just lose the nanoseconds).

By default we would use TimestampNanosecond (strict or not doesn't really matter there, since we can't get any more precise anyway).
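Sketched as a reader option, the strict/non-strict proposal might look like this. All names here (`TimestampPrecision`, `ReaderOptions`, `convert_nanos`) are hypothetical, not the crate's actual API:

```rust
#[derive(Clone, Copy)]
enum TimestampPrecision { Second, Millisecond, Microsecond, Nanosecond }

struct ReaderOptions {
    precision: TimestampPrecision,
    /// strict: error on values finer than `precision`;
    /// non-strict: silently truncate the extra digits.
    strict: bool,
}

impl Default for ReaderOptions {
    fn default() -> Self {
        // Nanoseconds by default: nothing finer exists, so strictness is moot.
        Self { precision: TimestampPrecision::Nanosecond, strict: true }
    }
}

/// Convert one ORC nanos-within-second component at the configured precision.
fn convert_nanos(n: u32, opts: &ReaderOptions) -> Result<u32, String> {
    let divisor = match opts.precision {
        TimestampPrecision::Second => 1_000_000_000,
        TimestampPrecision::Millisecond => 1_000_000,
        TimestampPrecision::Microsecond => 1_000,
        TimestampPrecision::Nanosecond => 1,
    };
    if opts.strict && n % divisor != 0 {
        return Err(format!("value has sub-unit precision: {n} ns"));
    }
    Ok(n / divisor) // non-strict: truncate
}

fn main() {
    let strict = ReaderOptions { precision: TimestampPrecision::Millisecond, strict: true };
    assert_eq!(convert_nanos(123_000_000, &strict), Ok(123)); // exact milliseconds
    assert!(convert_nanos(123_456_789, &strict).is_err()); // sub-milli digits: fail

    let lax = ReaderOptions { precision: TimestampPrecision::Millisecond, strict: false };
    assert_eq!(convert_nanos(123_456_789, &lax), Ok(123)); // truncated
}
```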

progval commented 1 month ago

How would you make it configurable per-column? A map from indices to policies?

Jefffrey commented 1 month ago

> How would you make it configurable per-column? A map from indices to policies?

I hadn't considered that. I guess it could be grouped into some wider configuration where the user provides an expected Arrow schema to decode the whole file into (which could also let them take advantage of dictionary encoding, for example), including the specific timestamp precision for each timestamp column.
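One possible shape for that per-column configuration, keyed by column name rather than index (again, hypothetical names, not the crate's API):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

/// Hypothetical reader configuration: the caller supplies the time unit they
/// expect per timestamp column; anything unlisted defaults to nanoseconds.
struct ReadOptions {
    timestamp_units: HashMap<String, TimeUnit>,
}

impl ReadOptions {
    fn unit_for(&self, column: &str) -> TimeUnit {
        self.timestamp_units
            .get(column)
            .copied()
            .unwrap_or(TimeUnit::Nanosecond)
    }
}

fn main() {
    let opts = ReadOptions {
        timestamp_units: HashMap::from([("created_at".to_string(), TimeUnit::Millisecond)]),
    };
    assert_eq!(opts.unit_for("created_at"), TimeUnit::Millisecond);
    // Unconfigured columns keep the current nanosecond default.
    assert_eq!(opts.unit_for("updated_at"), TimeUnit::Nanosecond);
}
```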