datafusion-contrib / datafusion-orc

Implementation of Apache ORC file format use Apache Arrow in-memory format
Apache License 2.0
28 stars 8 forks source link

Pass target arrow type to array_decoder_factory #92

Closed progval closed 3 weeks ago

progval commented 3 weeks ago

This is the prerequisite to support configurable timestamp precision mentioned in https://github.com/datafusion-contrib/datafusion-orc/issues/75#issuecomment-2130754391

Jefffrey commented 3 weeks ago

I plan to take a look at this later today :+1:

progval commented 3 weeks ago

One thing I'm concerning is we may also need to provide a method that can convert ORC type to Arrow type, the so-called "default" option (or there is one and I miss it?). Otherwise the external user of array_decoder_factory needs to maintain that mapping by themselves.

Yes, I'm planning to add it after #93 is merged. Probably .with_default_time_unit(TimeUnit) or .with_default_timestamp_type::<T: arrow::datatypes::ArrowTimestampType>() on the reader builder

progval commented 3 weeks ago

Do you think it's possible to store the Field inside Column itself, considering it's already storing DataType?

https://github.com/datafusion-contrib/datafusion-orc/blob/ac5a8ab29e0fb43629ca93ffff1e722412418b36/src/column.rs#L15-L21

Might cut down on having to pass field everywhere 🤔

It looked like a good idea, but it actually complexifies everything:

  1. matching both types was only in array_decoder/mod.rs while destructuring the ORC type, but is now spread between that module, column.rs, and array_decoder/{struct,list,map,union}.rs. In particular, the union decoder expects UnionFields which is an iterable of (i8, Field) unlike Fields that every other one would use, so it can't rely on Column::children() to match both types
  2. Column::children() now needs to return a Result
  3. We would need the Field when constructing the Column, so we need to pass Fields through Stripe::new, so also in StripeFactory
  4. Overall it makes the code larger and less readable. See my attempt: https://github.com/progval/datafusion-orc/compare/target-type...progval:datafusion-orc:target-type-in-column (which doesn't even include all of point 3, or moving validation that both types match from array_decoder_factory to Column::new to make sure the Column is self-consistent)