datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0

Refactor top-level interface #42

Closed Jefffrey closed 9 months ago

Jefffrey commented 10 months ago

(Here, "top-level interface" refers to how DataFusion will use this library to read ORC files, as that is the main intention of the crate.)

Since we want this library to integrate with DataFusion, we should try to provide a cleaner interface for reading ORC files as record batches.

The current way:

https://github.com/datafusion-contrib/datafusion-orc/blob/cbeb8bdb3e90bc1d8c1d8a12df9d07baec905617/tests/basic/main.rs#L14-L22

The same can be said for the async version.

We can take inspiration from how parquet does it:
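
For reference, parquet's Arrow reader is constructed through a builder (`ParquetRecordBatchReaderBuilder::try_new(file)?.with_batch_size(n).build()?`), and the result is iterated to get record batches. A rough, std-only sketch of what the same shape might look like here is below. `OrcBatchReaderBuilder`, `OrcBatchReader` and `Batch` are hypothetical names for illustration only, not this crate's API, and the "decoding" just chunks row indices instead of parsing ORC.

```rust
use std::io::{Read, Seek};

/// Stand-in for an Arrow `RecordBatch`: only the row indices it would hold.
#[derive(Debug, PartialEq)]
pub struct Batch {
    pub rows: Vec<u64>,
}

/// Hypothetical builder: collects options first, then produces a reader.
pub struct OrcBatchReaderBuilder<R> {
    source: R,
    total_rows: u64,
    batch_size: usize,
}

impl<R: Read + Seek> OrcBatchReaderBuilder<R> {
    /// A real implementation would parse the file tail here to learn the
    /// row count and schema; this sketch takes the row count as an argument.
    pub fn new(source: R, total_rows: u64) -> Self {
        Self { source, total_rows, batch_size: 8192 }
    }

    /// Sets the maximum number of rows per batch (clamped to at least 1).
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size.max(1);
        self
    }

    /// Consumes the builder and returns an iterator over batches.
    pub fn build(self) -> OrcBatchReader<R> {
        OrcBatchReader {
            _source: self.source,
            next_row: 0,
            total_rows: self.total_rows,
            batch_size: self.batch_size,
        }
    }
}

/// Hypothetical reader: yields batches until all rows are consumed.
pub struct OrcBatchReader<R> {
    _source: R, // unused in this sketch; a real reader would decode from it
    next_row: u64,
    total_rows: u64,
    batch_size: usize,
}

impl<R: Read + Seek> Iterator for OrcBatchReader<R> {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        if self.next_row >= self.total_rows {
            return None;
        }
        let end = (self.next_row + self.batch_size as u64).min(self.total_rows);
        let rows = (self.next_row..end).collect();
        self.next_row = end;
        Some(Batch { rows })
    }
}
```

With something like this, a caller would only write `OrcBatchReaderBuilder::new(file, n).with_batch_size(1024).build()` and loop over the result, without touching the Reader/Cursor types directly.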

Jefffrey commented 10 months ago

I will work on trying to simplify the Reader/Cursor part a bit, maybe try to replicate what parquet does here: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/reader.rs#L40-L68
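
As a rough sketch of that idea, assuming the goal is a small source trait that reports its length and hands out readers at a given offset: the trait name `ChunkSource` and its methods below are illustrative, not copied from parquet or from this crate. The `last_byte` helper shows one use, since the final byte of an ORC file holds the postscript length.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical abstraction over a byte source that supports ranged reads.
pub trait ChunkSource {
    type Reader: Read;

    /// Total length of the source in bytes.
    fn byte_len(&self) -> io::Result<u64>;

    /// Returns a reader positioned at `offset`.
    fn read_at(&self, offset: u64) -> io::Result<Self::Reader>;

    /// Reads exactly `length` bytes starting at `offset`.
    fn read_bytes(&self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; length];
        self.read_at(offset)?.read_exact(&mut buf)?;
        Ok(buf)
    }
}

impl ChunkSource for File {
    type Reader = File;

    fn byte_len(&self) -> io::Result<u64> {
        Ok(self.metadata()?.len())
    }

    fn read_at(&self, offset: u64) -> io::Result<File> {
        // The cloned handle shares its file offset with the original,
        // so this is not safe for readers used concurrently.
        let mut f = self.try_clone()?;
        f.seek(SeekFrom::Start(offset))?;
        Ok(f)
    }
}

impl ChunkSource for Vec<u8> {
    type Reader = io::Cursor<Vec<u8>>;

    fn byte_len(&self) -> io::Result<u64> {
        Ok(self.as_slice().len() as u64)
    }

    fn read_at(&self, offset: u64) -> io::Result<Self::Reader> {
        // Copies the tail for simplicity; a real implementation would
        // share the buffer instead of copying it.
        let start = usize::try_from(offset)
            .unwrap_or(usize::MAX)
            .min(self.as_slice().len());
        Ok(io::Cursor::new(self[start..].to_vec()))
    }
}

/// Reads the final byte of the source (in ORC, the postscript length).
pub fn last_byte<S: ChunkSource>(source: &S) -> io::Result<u8> {
    let len = source.byte_len()?;
    if len == 0 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "empty source"));
    }
    Ok(source.read_bytes(len - 1, 1)?[0])
}
```

Code that parses the file tail can then be written once against the trait and reused for files and in-memory buffers alike.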