datafusion-contrib / datafusion-orc

Implementation of Apache ORC file format use Apache Arrow in-memory format
Apache License 2.0
40 stars 10 forks source link

discuss: Move into the Apache ORC PMC and develop as `apache/orc-rust` #120

Open Xuanwo opened 1 month ago

Xuanwo commented 1 month ago

Hello, everyone. I am initiating this discussion to explore the possibility of moving into the Apache ORC PMC and developing apache/orc-rust.

By developing apache/orc-rust, we will establish this implementation as the official Rust version of ORC, thereby creating a larger and more cohesive community for those interested in a Rust ORC implementation. This will make it much easier for us to build a community around this project.

What are your thoughts? I plan to discuss this with the orc community if contributors are satisfied with it.


cc @Jefffrey @WenyXu @progval @waynexia @klangner @alamb @v0y4g3r @youngsofun @harveyyue

waynexia commented 1 month ago

About the place to move in, we can also consider Arrow or Datafusion, given this repo is deeply involved with Apache Arrow and Datafusion at the API level. Like the fact that parquet-rs has been maintained in arrow-rs for a long time.

progval commented 1 month ago

given this repo is deeply involved with Apache Arrow and Datafusion at the API level

To be honest I find it a bit surprising to have integration with a query engine in an ORC library. Would it make sense to split the Datafusion-related bits into either their own crate (with all the fun of keeping versions in sync), or move them to Datafusion (like it already does with Parquet)?

waynexia commented 1 month ago

Cross reference: https://lists.apache.org/thread/zrwnhwojf9v5c58hov8hcnpt03ftf3ql

Xuanwo commented 1 month ago

Hi, I wasn't involved in the datafusion side of development, so I'm not familiar with the ORC and Datafusion integration. From my perspective as a passerby, datafusion-orc seems more like Datafusion integration rather than focusing solely on the ORC format. I'm a bit worried that this might reduce the likelihood of potential users or contributors finding us.

I agree with @progval that it would be better to separate the ORC support component directly into DataFusion, similar to how parquet is handled.

Xuanwo commented 1 month ago

We have three possible options:

wgtmac commented 1 month ago

Chiming in from the Apache ORC community. I'm very excited for the discussion! Sorry that I'm not familiar with rust. For the approach of apache/orc-rust, I'd like to know what's the current dependency and what's the estimated amount of work to spilt the repository to remove the datafusion integration part?

Xuanwo commented 1 month ago

I'd like to know what's the current dependency

I created a dependency list, and I believe it meets the requirements of the ASF.

please checkout the details here:

Details

0BSD (1): adler@1.0.2 Apache-2.0 (114): adler@1.0.2, ahash@0.8.11, android-tzdata@0.1.1, android_system_properties@0.1.5, anyhow@1.0.86, arrow@52.0.0, arrow-arith@52.0.0, arrow-array@52.0.0, arrow-buffer@52.0.0, arrow-cast@52.0.0, arrow-csv@52.0.0, arrow-data@52.0.0, arrow-ipc@52.0.0, arrow-json@52.0.0, arrow-ord@52.0.0, arrow-row@52.0.0, arrow-schema@52.0.0, arrow-select@52.0.0, arrow-string@52.0.0, autocfg@1.3.0, base64@0.22.1, bitflags@1.3.2, bumpalo@3.16.0, cc@1.0.104, cfg-if@1.0.0, chrono@0.4.38, chrono-tz@0.9.0, chrono-tz-build@0.3.0, const-random@0.1.18, const-random-macro@0.1.16, core-foundation-sys@0.8.6, crc32fast@1.4.2, either@1.13.0, equivalent@1.0.1, fallible-streaming-iterator@0.1.9, flatbuffers@24.3.25, flate2@1.0.30, getrandom@0.2.15, half@2.4.1, hashbrown@0.14.5, heck@0.5.0, iana-time-zone@0.1.60, iana-time-zone-haiku@0.1.2, indexmap@2.2.6, itertools@0.12.1, itoa@1.0.11, jobserver@0.1.31, js-sys@0.3.69, lazy_static@1.5.0, lexical-core@0.8.5, lexical-parse-float@0.8.5, lexical-parse-integer@0.8.6, lexical-util@0.8.5, lexical-write-float@0.8.5, lexical-write-integer@0.8.5, libc@0.2.155, libm@0.2.8, log@0.4.22, miniz_oxide@0.7.4, num@0.4.3, num-bigint@0.4.6, num-complex@0.4.6, num-integer@0.1.46, num-iter@0.1.45, num-rational@0.4.2, num-traits@0.2.19, once_cell@1.19.0, orc-rust@0.3.1, pkg-config@0.3.30, proc-macro2@1.0.86, prost@0.12.6, prost-derive@0.12.6, quote@1.0.36, rand@0.8.5, rand_core@0.6.4, regex@1.10.5, regex-automata@0.4.7, regex-syntax@0.8.4, rustc_version@0.4.0, rustversion@1.0.17, ryu@1.0.18, semver@1.0.23, serde@1.0.203, serde_derive@1.0.203, serde_json@1.0.119, siphasher@0.3.11, snafu@0.8.3, snafu-derive@0.8.3, static_assertions@1.1.0, syn@2.0.68, thiserror@1.0.61, thiserror-impl@1.0.61, unicode-ident@1.0.12, unicode-width@0.1.13, version_check@0.9.4, wasi@0.11.0+wasi-snapshot-preview1, wasm-bindgen@0.2.92, wasm-bindgen-backend@0.2.92, wasm-bindgen-macro@0.2.92, wasm-bindgen-macro-support@0.2.92, wasm-bindgen-shared@0.2.92, windows-core@0.52.0, windows-targets@0.52.5, windows_aarch64_gnullvm@0.52.5, windows_aarch64_msvc@0.52.5, windows_i686_gnu@0.52.5, windows_i686_gnullvm@0.52.5, windows_i686_msvc@0.52.5, windows_x86_64_gnu@0.52.5, windows_x86_64_gnullvm@0.52.5, windows_x86_64_msvc@0.52.5, zerocopy@0.7.34, zstd-safe@6.0.6, zstd-sys@2.0.9+zstd.1.5.5 Apache-2.0 WITH LLVM-exception (1): wasi@0.11.0+wasi-snapshot-preview1 BSD-2-Clause (1): zerocopy@0.7.34 BSD-3-Clause (1): snap@1.1.1 BSL-1.0 (1): ryu@1.0.18 CC0-1.0 (1): tiny-keccak@2.0.2 MIT (115): adler@1.0.2, ahash@0.8.11, aho-corasick@1.1.3, android-tzdata@0.1.1, android_system_properties@0.1.5, anyhow@1.0.86, atoi@2.0.0, autocfg@1.3.0, base64@0.22.1, bitflags@1.3.2, bumpalo@3.16.0, byteorder@1.5.0, bytes@1.6.0, cc@1.0.104, cfg-if@1.0.0, chrono@0.4.38, chrono-tz@0.9.0, chrono-tz-build@0.3.0, comfy-table@7.1.1, const-random@0.1.18, const-random-macro@0.1.16, core-foundation-sys@0.8.6, crc32fast@1.4.2, crunchy@0.2.2, csv@1.3.0, csv-core@0.1.11, either@1.13.0, equivalent@1.0.1, fallible-streaming-iterator@0.1.9, flate2@1.0.30, getrandom@0.2.15, half@2.4.1, hashbrown@0.14.5, heck@0.5.0, iana-time-zone@0.1.60, iana-time-zone-haiku@0.1.2, indexmap@2.2.6, itertools@0.12.1, itoa@1.0.11, jobserver@0.1.31, js-sys@0.3.69, lazy_static@1.5.0, lexical-core@0.8.5, lexical-parse-float@0.8.5, lexical-parse-integer@0.8.6, lexical-util@0.8.5, lexical-write-float@0.8.5, lexical-write-integer@0.8.5, libc@0.2.155, libm@0.2.8, log@0.4.22, lz4_flex@0.11.3, lzokay-native@0.1.0, memchr@2.7.4, miniz_oxide@0.7.4, num@0.4.3, num-bigint@0.4.6, num-complex@0.4.6, num-integer@0.1.46, num-iter@0.1.45, num-rational@0.4.2, num-traits@0.2.19, once_cell@1.19.0, parse-zoneinfo@0.3.1, phf@0.11.2, phf_codegen@0.11.2, phf_generator@0.11.2, phf_shared@0.11.2, pkg-config@0.3.30, proc-macro2@1.0.86, quote@1.0.36, rand@0.8.5, rand_core@0.6.4, regex@1.10.5, regex-automata@0.4.7, regex-syntax@0.8.4, rustc_version@0.4.0, rustversion@1.0.17, semver@1.0.23, serde@1.0.203, serde_derive@1.0.203, serde_json@1.0.119, siphasher@0.3.11, snafu@0.8.3, snafu-derive@0.8.3, static_assertions@1.1.0, strum@0.26.3, strum_macros@0.26.4, syn@2.0.68, thiserror@1.0.61, thiserror-impl@1.0.61, twox-hash@1.6.3, unicode-ident@1.0.12, unicode-width@0.1.13, version_check@0.9.4, wasi@0.11.0+wasi-snapshot-preview1, wasm-bindgen@0.2.92, wasm-bindgen-backend@0.2.92, wasm-bindgen-macro@0.2.92, wasm-bindgen-macro-support@0.2.92, wasm-bindgen-shared@0.2.92, windows-core@0.52.0, windows-targets@0.52.5, windows_aarch64_gnullvm@0.52.5, windows_aarch64_msvc@0.52.5, windows_i686_gnu@0.52.5, windows_i686_gnullvm@0.52.5, windows_i686_msvc@0.52.5, windows_x86_64_gnu@0.52.5, windows_x86_64_gnullvm@0.52.5, windows_x86_64_msvc@0.52.5, zerocopy@0.7.34, zstd@0.12.4, zstd-safe@6.0.6, zstd-sys@2.0.9+zstd.1.5.5 Unicode-DFS-2016 (1): unicode-ident@1.0.12 Unlicense (5): aho-corasick@1.1.3, byteorder@1.5.0, csv@1.3.0, csv-core@0.1.11, memchr@2.7.4 Zlib (1): miniz_oxide@0.7.4

what's the estimated amount of work to spilt the repository to remove the datafusion integration part?

I believe it should be simple since it's just a mod of orc-rust. I'm willing to take on this part of the work.

waynexia commented 1 month ago

I'm not familiar with the ORC and Datafusion integration. From my perspective as a passerby, datafusion-orc seems more like Datafusion integration rather than focusing solely on the ORC format. I'm a bit worried that this might reduce the likelihood

I find a similar question about the relationship between arrow-rs and parquet-rs https://github.com/apache/arrow-rs/issues/1715. I believe this repo was developed and maintained for the same purpose.

However if we are going to implement features that are not a strong demand from Datafusion side (like ORC writer https://github.com/apache/orc/issues/1507) or integrate it with other consumers (like Databend https://github.com/datafuselabs/databend/issues/8016), having a dedicated repo would both reduce the maintenance burden of Datafusion and make the lib itself easier to use.

I agree with the opinion of separating this into two parts. The ORC format resides in a dedicated repo like apache/orc-rust with maintenance from both current contributors and the ORC community. And Datafusion uses it as a downstream user to implement ORC data source. I would like to help with both code work like splitting this code base and non-code work like IP clearance.

alamb commented 1 month ago

I agree with the opinion of separating this into two parts. The ORC format resides in a dedicated repo like apache/orc-rust with maintenance from both current contributors and the ORC community. And Datafusion uses it as a downstream user to implement ORC data source. I would like to help with both code work like splitting this code base and non-code work like IP clearance.

I agree with @waynexia and @progval that the following split makes a lot of sense to me

  1. something like apache/orc-rs (similar to parquet in parquet-rs) that has no datafusion dependencies
  2. this crate datafusion-contrib/datafusion-orc that has the DataFusion table provider and depeneds on apache/orc-rs as well as DataFusion and does the datafusion integration
alamb commented 1 month ago

Like the fact that parquet-rs has been maintained in arrow-rs for a long time.

FWIW I think this was partly an artifiact of history:

  1. there was a time when the parquet PMC was largely focused on java
  2. arrow needed a persistence format and parquet was an obvious choice
  3. so parquet-cpp got made in the arrow repo
  4. we basically followed the same pattern with arrow-rs / parquet-rs (in the arrow repo)

Given the current state of the code, I think it would be plausible to split parquet out of arrow-rs, but I also think unless there is some substantially larger group of maintainers that aren't also maintainers of arrow-rs it is likely easier to leave it there

wgtmac commented 1 month ago

cc @dongjoon-hyun @guiyanakuang @williamhyun @omalley from Apache ORC PMC

Xuanwo commented 1 month ago

Given the current state of the code, I think it would be plausible to split parquet out of arrow-rs, but I also think unless there is some substantially larger group of maintainers that aren't also maintainers of arrow-rs it is likely easier to leave it there

Agreed. I have thought about this before but haven't taken any action yet. I mean, it looks appealing to have apache/parquet-rs, but we need to consider the current project status.

I'm starting this thread because I believe it's beneficial for orc-rs to build a community by developing at upstream, but it doesn't seem applicable to parquet-rs at the moment.

mapleFU commented 1 month ago

Previously we discussed split parquet-cpp out of arrow-c++. However the dependency would be weird since there're:

arrow-dataset -> parquet-arrow -> parquet-core -> some arrow core libs
Xuanwo commented 1 month ago

arrow-dataset -> parquet-arrow -> parquet-core -> some arrow core libs

I believe the situation is different in parquet-rs since it depends on arrow-rs but not reverse. However, this is not the focus of our discussion. We can start another thread for this if interested.

Xuanwo commented 1 month ago

Thank you for the discussion. It looks like we can move forward! I think we can:

cc @alamb @waynexia @wgtmac for comments.

waynexia commented 1 month ago

I'll start preparing a PR to split the current repo.

Do you have something like guidance for IP clearance? I have attended it before but have not prepared one.

Xuanwo commented 1 month ago

I'll start preparing a PR to split the current repo.

Thanks!

Do you have something like guidance for IP clearance? I have attended it before but have not prepared one.

I think we can follow https://incubator.apache.org/ip-clearance/

Here's an example from https://github.com/apache/arrow-rs/issues/2096. We can reach out to @alamb if we encounter any problems.

waynexia commented 1 month ago

Hi @progval @klangner, as part of the IP clearance process, could you please submit an ICLA (Individual Contributor Licence Agreement) following the follow the instructions at https://www.apache.org/licenses/contributor-agreements.html if you do not already have one on file? Thanks in advance for helping with this! If you already have filed one, please let me know the email address associated with your account.

Jefffrey commented 1 month ago

I would like to chime in my thoughts. I do apologize for being inactive, and have been meaning to pickup the work I left off on this repository (specifically the basic write functionality).

The way I see it, the primary focus of this repository is to serve as an integration with DataFusion to allow querying ORC files. Naturally this required first implementing a layer to read ORC files to Arrow, before then being able to integrate into DataFusion itself (similar to how there is parquet-rs, then the actual parquet integration code in DataFusion).

I can see the merit to splitting up this repository, but perhaps still be too early to do so? One benefit of having both the integration with Arrow and integration with DataFusion in a single repository is that it allows easier development, as these interfaces will be interacting with each other. Splitting across different repositories might make it harder to experiment with the interface for each respective integration, which can slow down development.

Furthermore, I don't think there were any immediate plans to develop a native ORC interface; that is, being able to read ORC in Rust without reading it to Arrow (similar to how parquet-rs has a low level column reader/writer API). From my point of view then, it might seem odd to donate a primarily Arrow <-> ORC interface library to ORC.

klangner commented 1 month ago

I think I have already signed it some time ago while doing some other work.

Xuanwo commented 1 month ago

I would like to chime in my thoughts. I do apologize for being inactive, and have been meaning to pickup the work I left off on this repository (specifically the basic write functionality).

Thank you very much for your contribution!

I can see the merit to splitting up this repository, but perhaps still be too early to do so?

From my perspective (as a committer on some Apache projects), it's already late for us to do so.

Developing at upstream can create a solid foundation for our entire community to build upon, making it easier for those interested in using ORC in Rust to find this project. Additionally, we can garner more support from the ORC community. Building a strong community is the key to our success. For example, we started iceberg-rust as a very basic project that could only read tables, but it has now grown to 53 contributors with full catalog support. By donating this to ORC, I expect to build a community around it, similar to what we've done with iceberg-rust.

Therefore, instead of waiting for our project to mature and gain full support, I prefer to start and attract more people to join now. I believe it's fine for us to use the existing orc -> arrow code base as a starting point.

Furthermore, I don't think there were any immediate plans to develop a native ORC interface;

I agree, but it depends on the community's feature requests. I would be happy to work with the community if someone wants to collaborate on this.

alamb commented 1 month ago

Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.

I do not know how likely that is at this point, though

Xuanwo commented 1 month ago

Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.

I have three such cases on my tables:

Aitozi commented 1 month ago

Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.

I have three such cases on my tables:

  • I (of course!) want to build both orc-rs and datafusion-orc seperately.
  • @youngsofun from databend wants native ORC support but not datafusion.
  • @wgtmac from the Apache ORC PMC is interested in a Rust ORC implementation.

From paimon-rust may also need a native ORC support but not datafusion

Xuanwo commented 1 month ago

From paimon-rust may also need a native ORC support but not datafusion

This case is interesting since paimon-rust will need datafusion but not require orc with datafusion. Paimon requires orc to read data but provides datafusion integration on its own.

XuQianJin-Stars commented 1 month ago

Is there still the parquet-rust project?

alamb commented 1 month ago

Is there still the parquet-rust project?

I do not know what parquet-rust refers to

https://parquet.apache.org/docs/contribution-guidelines/sub-projects/ has a list of open source rust implementations

parquet-rs refers to https://github.com/apache/arrow-rs/tree/master/parquet

progval commented 1 month ago

Hi @progval [...], as part of the IP clearance process, could you please submit an ICLA (Individual Contributor Licence Agreement) following the follow the instructions at https://www.apache.org/licenses/contributor-agreements.html if you do not already have one on file?

I believe I also need a CCLA from my previous employer, as well as the current one (since 2024-08-01). This may take a few weeks. I'll tell you when it's done.

Xuanwo commented 1 month ago

I believe I also need a CCLA from my previous employer, as well as the current one (since 2024-08-01). This may take a few weeks. I'll tell you when it's done.

Hi, this project does not belong to your employer (please correct me if I'm wrong). This donation will be sent from datafusion-contrib to orc. I believe an ICLA is sufficient.

Xuanwo commented 1 month ago

Hi, @alamb. This reminds me that we should establish the CLA for all projects in the datafusion-contrib organization. All contributors should agree that contributions to projects under datafusion-contrib will grant the license to the ASF. Please correct me if this isn't meant for datafusion-contrib.

alamb commented 1 month ago

Hi, @alamb. This reminds me that we should establish the CLA for all projects in the datafusion-contrib organization. All contributors should agree that contributions to projects under datafusion-contrib will grant the license to the ASF. Please correct me if this isn't meant for datafusion-contrib.

I think the idea with datafusion-contrib is to minimize process overhead (such as apache CLAs) and mostly serve as a very disparate set of crates. As they mature, we can then apply more process (as we are doing in this case)

The rationale is that many of the crates in datafusion-contrib will likely never get to the stage where they would be donated to the Apache foundation and thus any up-front cost to prepare for that is wasted effort (and thus reduces contributions)

Xuanwo commented 1 month ago

The rationale is that many of the crates in datafusion-contrib will likely never get to the stage where they would be donated to the Apache foundation and thus any up-front cost to prepare for that is wasted effort (and thus reduces contributions)

Understood, thank you. This design makes sense to me.

Jefffrey commented 3 weeks ago

I see mention of not needing the DataFusion integration code as motivation, but could this be addressed by splitting the current project to have two subcrates, one for pure Arrow-ORC and the other for DataFusion integration?

I wanted to do this initially but kept DataFusion as a feature to make it easier to develop with, especially since the DataFusion integration code is currently quite small (though I guess the dependency footprint isn't :sweat_smile: )

Xuanwo commented 4 days ago

I believe I also need a CCLA from my previous employer, as well as the current one (since 2024-08-01). This may take a few weeks. I'll tell you when it's done.

Hi, @progval. Sorry for the interruption. I wanted to check if it works well.

progval commented 4 days ago

Hi, this project does not belong to your employer (please correct me if I'm wrong). This donation will be sent from datafusion-contrib to orc. I believe an ICLA is sufficient.

https://www.apache.org/licenses/contributor-agreements.html says CCLAs are for "For a corporation that assigns employees to work on an Apache project", and I was an employee assigned to work on the project.

Either way, I need my ex-employer's permission for the ICLA

Hi, @progval. Sorry for the interruption. I wanted to check if it works well.

My ex-employer's staff came back from summer vacation this week, and they are about to start processing my request. Current employer won't be an issue.

Sorry for the delay.