Short-term roadmap for this implementation

waynexia commented 7 months ago

Previous discussion: https://github.com/apache/arrow-datafusion/issues/4707

Though the ORC format is not as widely used as Parquet in arrow-rs and DataFusion-related projects, there is still some (growing, it seems to me) interest in and demand for this format. As @Jefffrey said here, a notable and viable milestone for this project is getting it merged into arrow-rs. This draft roadmap is meant to help us discuss, organize, and direct our efforts toward that milestone.

Although the ORC format is less complex than Parquet, there is still a lot of work to do in various areas. Here is a list of functionality that needs to be implemented if we consider making ORC files queryable from DataFusion the primary use case at this stage. Please feel free to add, remove, or prioritize items. We likely can't finish all of them in the short term, so marking what is going to be done is also important.

[x] primitive data types (ORC refs)
- [x] tiny int https://github.com/datafusion-contrib/datafusion-orc/issues/22
- [x] timestamp with local time zone https://github.com/datafusion-contrib/datafusion-orc/issues/13
- [x] decimal https://github.com/datafusion-contrib/datafusion-orc/issues/18
[x] common compression methods https://github.com/datafusion-contrib/datafusion-orc/issues/10
[x] user metadata
[x] other encodings https://github.com/datafusion-contrib/datafusion-orc/pull/24
[ ] Benchmark https://github.com/datafusion-contrib/datafusion-orc/issues/8

The items below are also related but have lower priority:

[x] compound data types https://github.com/datafusion-contrib/datafusion-orc/issues/14
- [x] struct https://github.com/datafusion-contrib/datafusion-orc/pull/26
- [x] list
- [x] map
- [x] union
[ ] file metadata and statistics
[ ] pruning https://github.com/datafusion-contrib/datafusion-orc/issues/15
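
As a rough illustration of what the statistics and pruning items could enable, here is a minimal sketch in Rust of min/max-based stripe pruning. The names `StripeStats`, `Predicate`, `may_match`, and `stripes_to_read` are hypothetical and not part of this crate. It assumes a single integer column with per-stripe min/max values; real ORC statistics cover more types and, as far as I understand the format, also exist at file and row-index level.

```rust
/// Hypothetical per-stripe statistics for one integer column.
#[derive(Debug, Clone, Copy)]
struct StripeStats {
    min: i64,
    max: i64,
}

/// Hypothetical, simplified predicate on that column.
enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns false only when the statistics prove that no row in the stripe
/// can satisfy the predicate. A `true` result means "must read and check".
fn may_match(stats: StripeStats, pred: &Predicate) -> bool {
    match *pred {
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        Predicate::Gt(v) => stats.max > v,
        Predicate::Lt(v) => stats.min < v,
    }
}

/// Indices of the stripes that cannot be skipped for this predicate.
fn stripes_to_read(stats: &[StripeStats], pred: &Predicate) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| may_match(**s, pred))
        .map(|(i, _)| i)
        .collect()
}
```

With three stripes covering 0..=10, 11..=20, and 21..=30, a predicate like `Gt(15)` would leave only the last two stripes to read. The check is conservative by design: it can keep a stripe that turns out to have no matching rows, but it never drops one that does.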

Long term items:

[ ] encryption

~~Then something I'm not sure about. Looking for more information. Also feel free to change previous two lists.~~

Jefffrey commented 7 months ago

Thanks for writing this here.

Just to preface, I'm no expert in ORC, nor do I technically have a use case for it, so take my thoughts with a grain of salt. With that said:

- Encryption can probably be given the lowest priority and moved into the longer-term roadmap. Even parquet in arrow-rs doesn't yet support encryption.
- We can probably bump the encodings to the highest priority. I assume you're referring to the V1 encodings, which should be simpler to implement than V2, which already seems to be present.
- I haven't looked into statistics and indexes much, but they do seem important for things like predicate pushdown, so they can be medium priority or so.
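
For context on why the V1 encodings should be simple, here is a minimal sketch in Rust of decoding the unsigned integer run-length encoding version 1, based on my reading of the ORC specification. The function names `read_varint` and `decode_int_rle_v1` are made up for this example and are not this crate's API. It handles only the unsigned case; signed columns would additionally need zigzag decoding of the varints.

```rust
/// Reads one base-128 varint (little-endian groups of 7 bits) starting at
/// `*pos`, advancing `*pos`. Returns None on truncated or oversized input.
fn read_varint(input: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *input.get(*pos)?;
        *pos += 1;
        if shift >= 64 {
            return None; // more groups than fit in a u64
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Decodes unsigned integer RLE v1 (hypothetical helper, unsigned only).
/// Returns None if the input ends in the middle of a run or literal group.
fn decode_int_rle_v1(input: &[u8]) -> Option<Vec<u64>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let control = input[pos] as i8;
        pos += 1;
        if control >= 0 {
            // Run: (control + 3) values, a signed one-byte delta, a varint base.
            let run_len = control as usize + 3;
            let delta = *input.get(pos)? as i8;
            pos += 1;
            let base = read_varint(input, &mut pos)?;
            for i in 0..run_len {
                let offset = (i as i64) * (delta as i64);
                out.push(base.wrapping_add(offset as u64));
            }
        } else {
            // Literals: (-control) varints follow.
            let count = -(control as i16) as usize;
            for _ in 0..count {
                out.push(read_varint(input, &mut pos)?);
            }
        }
    }
    Some(out)
}
```

For instance, the three bytes `[0x61, 0x00, 0x07]` describe a run of 100 sevens (control 0x61 = 97, plus 3), and `[0xfb, 0x02, 0x03, 0x06, 0x07, 0x0b]` holds the five literals 2, 3, 6, 7, 11. There is no bit packing or sub-encoding selection, which is what makes V2 more involved.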

I'll create more issues based on this roadmap

Also I assume all our focus will be on a read implementation first, with write coming much later

Another question I have is whether we'll focus solely on arrow interop (that is, only on reading from ORC directly into arrow arrays). The parquet crate in arrow-rs seems to support a more generic ColumnReader API for users who don't need arrow. If we focus only on arrow then we can optimize the read behaviour accordingly, whereas a more generic API might require a separate read implementation.

alamb commented 7 months ago

BTW some potentially relevant documents in case anyone is interested:

A Deep Dive into Common Open Formats for Analytical

An Empirical Evaluation of Columnar Storage Formats

Jefffrey commented 7 months ago

> BTW some potentially relevant documents in case anyone is interested:
>
> A Deep Dive into Common Open Formats for Analytical
>
> An Empirical Evaluation of Columnar Storage Formats

Thanks for these, will definitely give a read!

klangner commented 2 months ago

What is missing from this roadmap that is required for this library to be added to datafusion (and arrow-rs, polars?)? I would be interested in helping with this effort, as we have ORC files which we would like to query.

Jefffrey commented 2 months ago

hey @klangner thanks for the interest!

For DataFusion there is an issue for it: https://github.com/datafusion-contrib/datafusion-orc/issues/63

Right now it lacks support for projection, not to mention that the code is sequestered in an example instead of being part of the library.
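
To make the projection point concrete, here is a minimal sketch in Rust of what resolving a projection means. The `resolve_projection` and `project_columns` helpers are hypothetical, not this crate's or DataFusion's API, and the columns are modeled as plain `Vec<i64>` instead of arrow arrays.

```rust
/// Maps requested column names to their indices in the file schema,
/// preserving the requested order. Fails on an unknown column name.
fn resolve_projection(schema: &[&str], wanted: &[&str]) -> Result<Vec<usize>, String> {
    wanted
        .iter()
        .map(|name| {
            schema
                .iter()
                .position(|col| col == name)
                .ok_or_else(|| format!("unknown column: {}", name))
        })
        .collect()
}

/// Keeps only the projected columns. In this toy version the columns are
/// already decoded; a real reader would instead skip decoding the streams
/// of every column that is not projected.
fn project_columns(columns: &[Vec<i64>], indices: &[usize]) -> Vec<Vec<i64>> {
    indices.iter().map(|&i| columns[i].clone()).collect()
}
```

A query that selects only `ts` and `id` from a file with columns `id`, `name`, `ts` would resolve to indices `[2, 0]`, and the reader would never need to touch `name`.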

For arrow-rs it's basically just... all features needed for supporting read use cases (sorry if this is too vague :sweat_smile: )

I'm not familiar with polars so I can't say on that front.

For now I'm imagining enhancing the API for RecordBatch reading (akin to what parquet provides in arrow-rs) and also creating the necessary impls to allow DataFusion to read from ORC files using this library.

datafusion-contrib / datafusion-orc

Short-term roadmap for this implementation #7