Open Xuanwo opened 1 month ago
About the place to move in, we can also consider Arrow or Datafusion, given this repo is deeply involved with Apache Arrow and Datafusion at the API level. Like the fact that parquet-rs
has been maintained in arrow-rs
for a long time.
given this repo is deeply involved with Apache Arrow and Datafusion at the API level
To be honest I find it a bit surprising to have integration with a query engine in an ORC library. Would it make sense to split the Datafusion-related bits into either their own crate (with all the fun of keeping versions in sync), or move them to Datafusion (like it already does with Parquet)?
Cross reference: https://lists.apache.org/thread/zrwnhwojf9v5c58hov8hcnpt03ftf3ql
Hi, I wasn't involved in the datafusion side of development, so I'm not familiar with the ORC and Datafusion integration. From my perspective as a passerby, datafusion-orc
seems more like Datafusion integration rather than focusing solely on the ORC format. I'm a bit worried that this might reduce the likelihood of potential users or contributors finding us.
I agree with @progval that it would be better to separate the ORC support component directly into DataFusion, similar to how parquet
is handled.
We have three possible options:
apache/orc-rust
apache/arrow-rs/orc
apache/datafusion-orc
Chiming in from the Apache ORC community. I'm very excited for the discussion! Sorry that I'm not familiar with rust. For the approach of apache/orc-rust
, I'd like to know what's the current dependency and what's the estimated amount of work to spilt the repository to remove the datafusion integration part?
I'd like to know what's the current dependency
I created a dependency list, and I believe it meets the requirements of the ASF.
please checkout the details here:
0BSD (1): adler@1.0.2 Apache-2.0 (114): adler@1.0.2, ahash@0.8.11, android-tzdata@0.1.1, android_system_properties@0.1.5, anyhow@1.0.86, arrow@52.0.0, arrow-arith@52.0.0, arrow-array@52.0.0, arrow-buffer@52.0.0, arrow-cast@52.0.0, arrow-csv@52.0.0, arrow-data@52.0.0, arrow-ipc@52.0.0, arrow-json@52.0.0, arrow-ord@52.0.0, arrow-row@52.0.0, arrow-schema@52.0.0, arrow-select@52.0.0, arrow-string@52.0.0, autocfg@1.3.0, base64@0.22.1, bitflags@1.3.2, bumpalo@3.16.0, cc@1.0.104, cfg-if@1.0.0, chrono@0.4.38, chrono-tz@0.9.0, chrono-tz-build@0.3.0, const-random@0.1.18, const-random-macro@0.1.16, core-foundation-sys@0.8.6, crc32fast@1.4.2, either@1.13.0, equivalent@1.0.1, fallible-streaming-iterator@0.1.9, flatbuffers@24.3.25, flate2@1.0.30, getrandom@0.2.15, half@2.4.1, hashbrown@0.14.5, heck@0.5.0, iana-time-zone@0.1.60, iana-time-zone-haiku@0.1.2, indexmap@2.2.6, itertools@0.12.1, itoa@1.0.11, jobserver@0.1.31, js-sys@0.3.69, lazy_static@1.5.0, lexical-core@0.8.5, lexical-parse-float@0.8.5, lexical-parse-integer@0.8.6, lexical-util@0.8.5, lexical-write-float@0.8.5, lexical-write-integer@0.8.5, libc@0.2.155, libm@0.2.8, log@0.4.22, miniz_oxide@0.7.4, num@0.4.3, num-bigint@0.4.6, num-complex@0.4.6, num-integer@0.1.46, num-iter@0.1.45, num-rational@0.4.2, num-traits@0.2.19, once_cell@1.19.0, orc-rust@0.3.1, pkg-config@0.3.30, proc-macro2@1.0.86, prost@0.12.6, prost-derive@0.12.6, quote@1.0.36, rand@0.8.5, rand_core@0.6.4, regex@1.10.5, regex-automata@0.4.7, regex-syntax@0.8.4, rustc_version@0.4.0, rustversion@1.0.17, ryu@1.0.18, semver@1.0.23, serde@1.0.203, serde_derive@1.0.203, serde_json@1.0.119, siphasher@0.3.11, snafu@0.8.3, snafu-derive@0.8.3, static_assertions@1.1.0, syn@2.0.68, thiserror@1.0.61, thiserror-impl@1.0.61, unicode-ident@1.0.12, unicode-width@0.1.13, version_check@0.9.4, wasi@0.11.0+wasi-snapshot-preview1, wasm-bindgen@0.2.92, wasm-bindgen-backend@0.2.92, wasm-bindgen-macro@0.2.92, wasm-bindgen-macro-support@0.2.92, wasm-bindgen-shared@0.2.92, windows-core@0.52.0, windows-targets@0.52.5, windows_aarch64_gnullvm@0.52.5, windows_aarch64_msvc@0.52.5, windows_i686_gnu@0.52.5, windows_i686_gnullvm@0.52.5, windows_i686_msvc@0.52.5, windows_x86_64_gnu@0.52.5, windows_x86_64_gnullvm@0.52.5, windows_x86_64_msvc@0.52.5, zerocopy@0.7.34, zstd-safe@6.0.6, zstd-sys@2.0.9+zstd.1.5.5 Apache-2.0 WITH LLVM-exception (1): wasi@0.11.0+wasi-snapshot-preview1 BSD-2-Clause (1): zerocopy@0.7.34 BSD-3-Clause (1): snap@1.1.1 BSL-1.0 (1): ryu@1.0.18 CC0-1.0 (1): tiny-keccak@2.0.2 MIT (115): adler@1.0.2, ahash@0.8.11, aho-corasick@1.1.3, android-tzdata@0.1.1, android_system_properties@0.1.5, anyhow@1.0.86, atoi@2.0.0, autocfg@1.3.0, base64@0.22.1, bitflags@1.3.2, bumpalo@3.16.0, byteorder@1.5.0, bytes@1.6.0, cc@1.0.104, cfg-if@1.0.0, chrono@0.4.38, chrono-tz@0.9.0, chrono-tz-build@0.3.0, comfy-table@7.1.1, const-random@0.1.18, const-random-macro@0.1.16, core-foundation-sys@0.8.6, crc32fast@1.4.2, crunchy@0.2.2, csv@1.3.0, csv-core@0.1.11, either@1.13.0, equivalent@1.0.1, fallible-streaming-iterator@0.1.9, flate2@1.0.30, getrandom@0.2.15, half@2.4.1, hashbrown@0.14.5, heck@0.5.0, iana-time-zone@0.1.60, iana-time-zone-haiku@0.1.2, indexmap@2.2.6, itertools@0.12.1, itoa@1.0.11, jobserver@0.1.31, js-sys@0.3.69, lazy_static@1.5.0, lexical-core@0.8.5, lexical-parse-float@0.8.5, lexical-parse-integer@0.8.6, lexical-util@0.8.5, lexical-write-float@0.8.5, lexical-write-integer@0.8.5, libc@0.2.155, libm@0.2.8, log@0.4.22, lz4_flex@0.11.3, lzokay-native@0.1.0, memchr@2.7.4, miniz_oxide@0.7.4, num@0.4.3, num-bigint@0.4.6, num-complex@0.4.6, num-integer@0.1.46, num-iter@0.1.45, num-rational@0.4.2, num-traits@0.2.19, once_cell@1.19.0, parse-zoneinfo@0.3.1, phf@0.11.2, phf_codegen@0.11.2, phf_generator@0.11.2, phf_shared@0.11.2, pkg-config@0.3.30, proc-macro2@1.0.86, quote@1.0.36, rand@0.8.5, rand_core@0.6.4, regex@1.10.5, regex-automata@0.4.7, regex-syntax@0.8.4, rustc_version@0.4.0, rustversion@1.0.17, semver@1.0.23, serde@1.0.203, serde_derive@1.0.203, serde_json@1.0.119, siphasher@0.3.11, snafu@0.8.3, snafu-derive@0.8.3, static_assertions@1.1.0, strum@0.26.3, strum_macros@0.26.4, syn@2.0.68, thiserror@1.0.61, thiserror-impl@1.0.61, twox-hash@1.6.3, unicode-ident@1.0.12, unicode-width@0.1.13, version_check@0.9.4, wasi@0.11.0+wasi-snapshot-preview1, wasm-bindgen@0.2.92, wasm-bindgen-backend@0.2.92, wasm-bindgen-macro@0.2.92, wasm-bindgen-macro-support@0.2.92, wasm-bindgen-shared@0.2.92, windows-core@0.52.0, windows-targets@0.52.5, windows_aarch64_gnullvm@0.52.5, windows_aarch64_msvc@0.52.5, windows_i686_gnu@0.52.5, windows_i686_gnullvm@0.52.5, windows_i686_msvc@0.52.5, windows_x86_64_gnu@0.52.5, windows_x86_64_gnullvm@0.52.5, windows_x86_64_msvc@0.52.5, zerocopy@0.7.34, zstd@0.12.4, zstd-safe@6.0.6, zstd-sys@2.0.9+zstd.1.5.5 Unicode-DFS-2016 (1): unicode-ident@1.0.12 Unlicense (5): aho-corasick@1.1.3, byteorder@1.5.0, csv@1.3.0, csv-core@0.1.11, memchr@2.7.4 Zlib (1): miniz_oxide@0.7.4
what's the estimated amount of work to spilt the repository to remove the datafusion integration part?
I believe it should be simple since it's just a mod
of orc-rust
. I'm willing to take on this part of the work.
I'm not familiar with the ORC and Datafusion integration. From my perspective as a passerby,
datafusion-orc
seems more like Datafusion integration rather than focusing solely on the ORC format. I'm a bit worried that this might reduce the likelihood
I find a similar question about the relationship between arrow-rs
and parquet-rs
https://github.com/apache/arrow-rs/issues/1715. I believe this repo was developed and maintained for the same purpose.
However if we are going to implement features that are not a strong demand from Datafusion side (like ORC writer https://github.com/apache/orc/issues/1507) or integrate it with other consumers (like Databend https://github.com/datafuselabs/databend/issues/8016), having a dedicated repo would both reduce the maintenance burden of Datafusion and make the lib itself easier to use.
I agree with the opinion of separating this into two parts. The ORC format resides in a dedicated repo like apache/orc-rust
with maintenance from both current contributors and the ORC community. And Datafusion uses it as a downstream user to implement ORC data source. I would like to help with both code work like splitting this code base and non-code work like IP clearance.
I agree with the opinion of separating this into two parts. The ORC format resides in a dedicated repo like apache/orc-rust with maintenance from both current contributors and the ORC community. And Datafusion uses it as a downstream user to implement ORC data source. I would like to help with both code work like splitting this code base and non-code work like IP clearance.
I agree with @waynexia and @progval that the following split makes a lot of sense to me
apache/orc-rs
(similar to parquet in parquet-rs) that has no datafusion dependenciesdatafusion-contrib/datafusion-orc
that has the DataFusion table provider and depeneds on apache/orc-rs
as well as DataFusion and does the datafusion integrationLike the fact that parquet-rs has been maintained in arrow-rs for a long time.
FWIW I think this was partly an artifiact of history:
Given the current state of the code, I think it would be plausible to split parquet out of arrow-rs, but I also think unless there is some substantially larger group of maintainers that aren't also maintainers of arrow-rs it is likely easier to leave it there
cc @dongjoon-hyun @guiyanakuang @williamhyun @omalley from Apache ORC PMC
Given the current state of the code, I think it would be plausible to split parquet out of arrow-rs, but I also think unless there is some substantially larger group of maintainers that aren't also maintainers of arrow-rs it is likely easier to leave it there
Agreed. I have thought about this before but haven't taken any action yet. I mean, it looks appealing to have apache/parquet-rs
, but we need to consider the current project status.
I'm starting this thread because I believe it's beneficial for orc-rs to build a community by developing at upstream, but it doesn't seem applicable to parquet-rs at the moment.
Previously we discussed split parquet-cpp out of arrow-c++. However the dependency would be weird since there're:
arrow-dataset -> parquet-arrow -> parquet-core -> some arrow core libs
arrow-dataset -> parquet-arrow -> parquet-core -> some arrow core libs
I believe the situation is different in parquet-rs
since it depends on arrow-rs
but not reverse. However, this is not the focus of our discussion. We can start another thread for this if interested.
Thank you for the discussion. It looks like we can move forward! I think we can:
datafusion-contrib/orc-rs
first and split the ORC-related code into it (it's better to retain all the history).datafusion-contrib/orc-rs
.datafusion-contrib/orc-rs
to apache/orc-rs
.cc @alamb @waynexia @wgtmac for comments.
I'll start preparing a PR to split the current repo.
Do you have something like guidance for IP clearance? I have attended it before but have not prepared one.
I'll start preparing a PR to split the current repo.
Thanks!
Do you have something like guidance for IP clearance? I have attended it before but have not prepared one.
I think we can follow https://incubator.apache.org/ip-clearance/
Here's an example from https://github.com/apache/arrow-rs/issues/2096. We can reach out to @alamb if we encounter any problems.
Hi @progval @klangner, as part of the IP clearance process, could you please submit an ICLA (Individual Contributor Licence Agreement) following the follow the instructions at https://www.apache.org/licenses/contributor-agreements.html if you do not already have one on file? Thanks in advance for helping with this! If you already have filed one, please let me know the email address associated with your account.
I would like to chime in my thoughts. I do apologize for being inactive, and have been meaning to pickup the work I left off on this repository (specifically the basic write functionality).
The way I see it, the primary focus of this repository is to serve as an integration with DataFusion to allow querying ORC files. Naturally this required first implementing a layer to read ORC files to Arrow, before then being able to integrate into DataFusion itself (similar to how there is parquet-rs, then the actual parquet integration code in DataFusion).
I can see the merit to splitting up this repository, but perhaps still be too early to do so? One benefit of having both the integration with Arrow and integration with DataFusion in a single repository is that it allows easier development, as these interfaces will be interacting with each other. Splitting across different repositories might make it harder to experiment with the interface for each respective integration, which can slow down development.
Furthermore, I don't think there were any immediate plans to develop a native ORC interface; that is, being able to read ORC in Rust without reading it to Arrow (similar to how parquet-rs has a low level column reader/writer API). From my point of view then, it might seem odd to donate a primarily Arrow <-> ORC interface library to ORC.
I think I have already signed it some time ago while doing some other work.
I would like to chime in my thoughts. I do apologize for being inactive, and have been meaning to pickup the work I left off on this repository (specifically the basic write functionality).
Thank you very much for your contribution!
I can see the merit to splitting up this repository, but perhaps still be too early to do so?
From my perspective (as a committer on some Apache projects), it's already late for us to do so.
Developing at upstream can create a solid foundation for our entire community to build upon, making it easier for those interested in using ORC in Rust to find this project. Additionally, we can garner more support from the ORC community. Building a strong community is the key to our success. For example, we started iceberg-rust as a very basic project that could only read tables, but it has now grown to 53 contributors with full catalog support. By donating this to ORC, I expect to build a community around it, similar to what we've done with iceberg-rust.
Therefore, instead of waiting for our project to mature and gain full support, I prefer to start and attract more people to join now. I believe it's fine for us to use the existing orc -> arrow
code base as a starting point.
Furthermore, I don't think there were any immediate plans to develop a native ORC interface;
I agree, but it depends on the community's feature requests. I would be happy to work with the community if someone wants to collaborate on this.
Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.
I do not know how likely that is at this point, though
Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.
I have three such cases on my tables:
orc-rs
and datafusion-orc
seperately.Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.
I have three such cases on my tables:
- I (of course!) want to build both
orc-rs
anddatafusion-orc
seperately.- @youngsofun from databend wants native ORC support but not datafusion.
- @wgtmac from the Apache ORC PMC is interested in a Rust ORC implementation.
From paimon-rust may also need a native ORC support but not datafusion
From paimon-rust may also need a native ORC support but not datafusion
This case is interesting since paimon-rust will need datafusion but not require orc with datafusion. Paimon requires orc to read data but provides datafusion integration on its own.
Is there still the parquet-rust project?
Is there still the parquet-rust project?
I do not know what parquet-rust
refers to
https://parquet.apache.org/docs/contribution-guidelines/sub-projects/ has a list of open source rust implementations
parquet-rs
refers to https://github.com/apache/arrow-rs/tree/master/parquet
Hi @progval [...], as part of the IP clearance process, could you please submit an ICLA (Individual Contributor Licence Agreement) following the follow the instructions at https://www.apache.org/licenses/contributor-agreements.html if you do not already have one on file?
I believe I also need a CCLA from my previous employer, as well as the current one (since 2024-08-01). This may take a few weeks. I'll tell you when it's done.
I believe I also need a CCLA from my previous employer, as well as the current one (since 2024-08-01). This may take a few weeks. I'll tell you when it's done.
Hi, this project does not belong to your employer (please correct me if I'm wrong). This donation will be sent from datafusion-contrib
to orc
. I believe an ICLA is sufficient.
Hi, @alamb. This reminds me that we should establish the CLA for all projects in the datafusion-contrib
organization. All contributors should agree that contributions to projects under datafusion-contrib
will grant the license to the ASF. Please correct me if this isn't meant for datafusion-contrib
.
Hi, @alamb. This reminds me that we should establish the CLA for all projects in the
datafusion-contrib
organization. All contributors should agree that contributions to projects underdatafusion-contrib
will grant the license to the ASF. Please correct me if this isn't meant fordatafusion-contrib
.
I think the idea with datafusion-contrib is to minimize process overhead (such as apache CLAs) and mostly serve as a very disparate set of crates. As they mature, we can then apply more process (as we are doing in this case)
The rationale is that many of the crates in datafusion-contrib will likely never get to the stage where they would be donated to the Apache foundation and thus any up-front cost to prepare for that is wasted effort (and thus reduces contributions)
The rationale is that many of the crates in datafusion-contrib will likely never get to the stage where they would be donated to the Apache foundation and thus any up-front cost to prepare for that is wasted effort (and thus reduces contributions)
Understood, thank you. This design makes sense to me.
I see mention of not needing the DataFusion integration code as motivation, but could this be addressed by splitting the current project to have two subcrates, one for pure Arrow-ORC and the other for DataFusion integration?
I wanted to do this initially but kept DataFusion as a feature to make it easier to develop with, especially since the DataFusion integration code is currently quite small (though I guess the dependency footprint isn't :sweat_smile: )
I believe I also need a CCLA from my previous employer, as well as the current one (since 2024-08-01). This may take a few weeks. I'll tell you when it's done.
Hi, @progval. Sorry for the interruption. I wanted to check if it works well.
Hi, this project does not belong to your employer (please correct me if I'm wrong). This donation will be sent from
datafusion-contrib
toorc
. I believe an ICLA is sufficient.
https://www.apache.org/licenses/contributor-agreements.html says CCLAs are for "For a corporation that assigns employees to work on an Apache project", and I was an employee assigned to work on the project.
Either way, I need my ex-employer's permission for the ICLA
Hi, @progval. Sorry for the interruption. I wanted to check if it works well.
My ex-employer's staff came back from summer vacation this week, and they are about to start processing my request. Current employer won't be an issue.
Sorry for the delay.
Hello, everyone. I am initiating this discussion to explore the possibility of moving into the Apache ORC PMC and developing
apache/orc-rust
.By developing
apache/orc-rust
, we will establish this implementation as the official Rust version of ORC, thereby creating a larger and more cohesive community for those interested in a Rust ORC implementation. This will make it much easier for us to build a community around this project.What are your thoughts? I plan to discuss this with the orc community if contributors are satisfied with it.
cc @Jefffrey @WenyXu @progval @waynexia @klangner @alamb @v0y4g3r @youngsofun @harveyyue