datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0

Short-term roadmap for this implementation #7

Open waynexia opened 7 months ago

waynexia commented 7 months ago

Previous discussion: https://github.com/apache/arrow-datafusion/issues/4707

Though the ORC format is not as widely used as Parquet in arrow-rs and DataFusion related projects, there is still some (growing, it seems to me) interest in and demand for this format. As @Jefffrey said here, a notable and viable milestone for this project is getting it merged into arrow-rs. This draft roadmap is meant to help us discuss, organize, and direct our efforts toward that milestone.

Although the ORC format is less complex than Parquet, there is still a lot of work to do in various areas. Here is a list of functionality that needs to be implemented if we treat making ORC files queryable from DataFusion as the primary use case at this stage. Please feel free to add, remove, or prioritize items. We likely can't finish all of them in the short term, so marking which ones we are going to do is also important.

The items below are also related but have lower priority:

Long term items:

Finally, some things I'm not sure about and would like more information on. Also feel free to change the previous two lists.

Jefffrey commented 7 months ago

Thanks for writing this here.

Just to preface, I'm no expert in ORC, nor do I technically have a use case for it, so take my thoughts with a grain of salt. With that said:

I'll create more issues based on this roadmap.

Also, I assume all our focus will be on a read implementation first, with write support coming much later.

Another question I have is whether we'll focus solely on Arrow interop (that is, only on reading from ORC directly into Arrow arrays). The parquet crate in arrow-rs seems to support a more generic ColumnReader API for users who don't need Arrow. If we focus only on Arrow, we can optimize the read behaviour accordingly, whereas a more generic API might require a separate read implementation.

alamb commented 7 months ago

BTW some potentially relevant documents in case anyone is interested:

A Deep Dive into Common Open Formats for Analytical

An Empirical Evaluation of Columnar Storage Formats

Jefffrey commented 7 months ago

Thanks for these, will definitely give them a read!

klangner commented 2 months ago

What is missing from this roadmap that is required for this library to be added to DataFusion (and arrow-rs, polars)? I would be interested in helping with this effort, as we have ORC files that we would like to query.

Jefffrey commented 2 months ago

hey @klangner thanks for the interest!

For DataFusion there is an issue for it: https://github.com/datafusion-contrib/datafusion-orc/issues/63

Right now it lacks support for projection, not to mention that the code is sequestered in an example instead of being part of the library.

For arrow-rs it's basically just... all features needed for supporting read use cases (sorry if this is too vague :sweat_smile: )

I'm not familiar with polars so I can't say on that front.

For now I'm imagining enhancing the API for RecordBatch reading (akin to what the parquet crate provides in arrow-rs) and also creating the necessary impls to allow DataFusion to read from ORC files using this library.