Open waynexia opened 7 months ago
Thanks for writing this here.
Just to preface, I'm no expert in ORC, nor do I technically have a use case for it, so take my thoughts with a grain of salt. With that said:
I'll create more issues based on this roadmap
Also I assume all our focus will be on a read implementation first, with write coming much later
Another question I have is whether we'll focus solely on arrow interop (that is, only on reading from ORC directly into arrow arrays). The parquet crate in arrow-rs seems to support a more generic ColumnReader API for users who don't need arrow. If we focus only on arrow then we can optimize the read behaviour accordingly, whereas a more generic API might require a separate read implementation.
BTW some potentially relevant documents in case anyone is interested:
Thanks for these, will definitely give a read!
What is missing from this roadmap that is required for this library to be added to DataFusion (and arrow-rs, polars)? I would be interested in helping with this effort, as we have ORC files which we would like to query.
hey @klangner thanks for the interest!
For DataFusion there is an issue for it: https://github.com/datafusion-contrib/datafusion-orc/issues/63
Right now it lacks support for projection, not to mention the code is sequestered in an example instead of being part of the library.
For arrow-rs it's basically just... all features needed for supporting read use cases (sorry if this is too vague :sweat_smile: )
I'm not familiar with polars so I can't say on that front.
For now I'm imagining enhancing the API for RecordBatch reading (akin to what parquet provides in arrow-rs) and also creating the necessary impl's to allow DataFusion to read from ORC files using this library.
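To make the missing projection piece a bit more concrete, here is a minimal sketch of resolving a requested set of column names against a file's top-level columns, so that only those columns would need to be decoded. Everything here (`FileSchema`, `resolve_projection`, `project_columns`) is a hypothetical name made up for illustration and uses only the standard library; it is not the API of this crate, arrow-rs, or DataFusion.

```rust
// Hypothetical sketch only: these types and functions are assumptions for
// illustration and do not exist in datafusion-orc, arrow-rs, or DataFusion.

/// Simplified stand-in for a file schema: top-level column names in file order.
#[derive(Debug, Clone, PartialEq)]
pub struct FileSchema {
    pub columns: Vec<String>,
}

/// Resolve requested column names into indices into the file schema,
/// preserving the order in which the caller requested them.
/// Returns an error naming the first requested column that is not in the file.
pub fn resolve_projection(
    schema: &FileSchema,
    requested: &[&str],
) -> Result<Vec<usize>, String> {
    requested
        .iter()
        .map(|name| {
            schema
                .columns
                .iter()
                .position(|c| c.as_str() == *name)
                .ok_or_else(|| format!("column not found: {}", name))
        })
        .collect()
}

/// Apply a resolved projection to already-decoded columns.
/// A real reader would instead skip decoding the unselected streams entirely;
/// this only shows the shape of the output.
/// Panics if an index is out of bounds for `columns`.
pub fn project_columns<T: Clone>(columns: &[Vec<T>], indices: &[usize]) -> Vec<Vec<T>> {
    indices.iter().map(|&i| columns[i].clone()).collect()
}
```

A caller would first resolve names to indices once per file, then reuse those indices for every stripe or batch. Nested types (structs, lists, maps) would need a richer selection than flat indices, which this sketch does not attempt.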
Previous discussion: https://github.com/apache/arrow-datafusion/issues/4707
Though the ORC format is not as widely used as Parquet in arrow-rs and DataFusion related projects, there is still some (growing, it seems to me) interest in and demand for this format. As @Jefffrey said here, a notable and viable milestone for this project is getting it merged into arrow-rs. This draft roadmap is meant to help us discuss, organize, and direct our efforts toward that milestone.
Even though the ORC format is less complex than Parquet, there is still a lot of work to do in various areas. Here is a list of functionality that needs to be implemented if we treat making ORC files queryable from DataFusion as the primary use case at this stage. Please feel free to add, remove, or prioritize items. We likely can't finish all of them in the short term, so marking what is going to be done is also important.
The items below are also related but have lower priority:
Long term items:
Then some things I'm not sure about and am looking for more information on. Also feel free to change the previous two lists.