Open progval opened 3 months ago
In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding.
Oh this is a very good pickup :eyes:
Thanks for this, I'll try review soon :+1:
I'm thinking we'll have to read all String into String type array and forgo copying dictionary encoded string stripe columns directly into Arrow dictionary arrays:
So I think will need to change the logic for decoding dictionary encoded string stripes to just decode to regular StringArray
So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?
So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?
I haven't thought that far ahead yet honestly. The main takeaway is that we'll always decode to StringArrays. As for how that's done internally and how it can affect predicate pushdown, that remains to be seen.
So I realized there is an arrow kernel for casting/converting from dictionary to primitive so I used it as a quick fix: https://github.com/datafusion-contrib/datafusion-orc/commit/7f6655245d6b384e610e690d86484f12b17849bf
My understanding of arrow dictionary encoding and how it interacts with datafusion is still immature so I'm continuing to do some reading on this matter (some of my assumptions in above comments are probably incorrect).
I noticed parquet had a similar issue which I have been reading through: https://github.com/apache/arrow-rs/issues/171
I've created an issue here for tracking: https://github.com/datafusion-contrib/datafusion-orc/issues/72
I probably won't spend too much more time investigating this until the crate becomes more feature complete (will focus on correctness over performance for now), but will appreciate any further insights/contributions :+1:
In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding. This means we cannot get an Arrow schema for a whole ORC file, the Arrow schema is only valid per-stripe
Unfortunately, this breaks a feature of this crate (which I assume is important for
datafusion
), and I don't see a way out. Thoughts?This changes the error in
test1
from "Incorrect datatype" error to a difference in serialized output.(Note:
test1.orc
has a binary column, so you should apply #67 first if you want to see the change.)