Closed Jefffrey closed 2 weeks ago
https://github.com/datafusion-contrib/datafusion-orc/pull/91 adds a validation on converting timestamp data. It would be more generic if we could support all four types of timestamp precision.
The options I see:
variant: allow users to opt in to losing precision instead of returning an error. or maybe of returning null or saturating.
What do you think?
allow the user to configure what precision to use somehow, and stick to that
4 looks more concrete to me. ORC's spec says all timestamps are with nanosecond precision. To not lose precision while reading an ORC file, we may have to check if all the timestamp in the file are end with 000
or 000000
etc, which means the writer is probably using milli or micro seconds. But this check might be expensive. So providing a way for the caller to specify the expected precision looks easier to achieve and use.
Option 4 also makes sense to me too. In reader builder options can allow user to configure second/milli/micro precision explicitly, and maybe allow it to be strict or not. Strict = fail if there is a value with higher precision than configured (e.g. has nanoseconds when expecting TimestampSeconds) and non-strict = truncate extra precision (just lose the nanoseconds).
By default we would have TimestampNanoseconds (doesn't really matter if strict or not as we can't get more precise anyway)
How would you make it configurable per-column? A map from indices to policies?
How would you make it configurable per-column? A map from indices to policies?
I didn't consider this. I guess it could be grouped as part of some wider configuration where the user can provide an expected Arrow schema to decode into for the whole file (which could allow them to take advantage of dictionary encoding for example), which includes specifying the specific timestamp precision.
ORC encodes timestamps as seconds since ORC epoch, and nanoseconds since last second.
We translate this into the TimestampNanoseconds type.
However this may limit the range that ORC timestamp in actuality can represent.
Need to consider this?