Add support for dictionary stripes - Githubissues

datafusion-contrib / datafusion-orc

Implementation of Apache ORC file format use Apache Arrow in-memory format

Apache License 2.0

28 stars 8 forks source link

Add support for dictionary stripes #68

Open progval opened 3 months ago

progval commented 3 months ago

In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding. This means we cannot get an Arrow schema for a whole ORC file, the Arrow schema is only valid per-stripe

Unfortunately, this breaks a feature of this crate (which I assume is important for datafusion), and I don't see a way out. Thoughts?

This changes the error in test1 from "Incorrect datatype" error to a difference in serialized output.

(Note: test1.orc has a binary column, so you should apply #67 first if you want to see the change.)

Jefffrey commented 3 months ago

In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding.

Oh this is a very good pickup :eyes:

Thanks for this, I'll try review soon :+1:

Jefffrey commented 3 months ago

I'm thinking we'll have to read all String into String type array and forgo copying dictionary encoded string stripe columns directly into Arrow dictionary arrays:

We want to be able to read an entire file as consistent record batches (with same schema) since I think otherwise it could confuse the consumer (e.g. datafusion)
We don't want cases where if we read only a particular stripe (e.g. due to pruning) it can give a recordbatch with different schema depending on which stripe is read
If we consider multiple files being queried by datafusion, we wouldn't want some having dictionary type array whilst others have string type array

So I think will need to change the logic for decoding dictionary encoded string stripes to just decode to regular StringArray

progval commented 3 months ago

So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?

Jefffrey commented 3 months ago

So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?

I haven't thought that far ahead yet honestly. The main takeaway is that we'll always decode to StringArrays. As for how that's done internally and how it can affect predicate pushdown, that remains to be seen.

Jefffrey commented 3 months ago

So I realized there is an arrow kernel for casting/converting from dictionary to primitive so I used it as a quick fix: https://github.com/datafusion-contrib/datafusion-orc/commit/7f6655245d6b384e610e690d86484f12b17849bf

My understanding of arrow dictionary encoding and how it interacts with datafusion is still immature so I'm continuing to do some reading on this matter (some of my assumptions in above comments are probably incorrect).

I noticed parquet had a similar issue which I have been reading through: https://github.com/apache/arrow-rs/issues/171

Jefffrey commented 3 months ago

I've created an issue here for tracking: https://github.com/datafusion-contrib/datafusion-orc/issues/72

I probably won't spend too much more time investigating this until the crate becomes more feature complete (will focus on correctness over performance for now), but will appreciate any further insights/contributions :+1: