Open asfimport opened 4 years ago
Neal Richardson / @nealrichardson:
Codewise, it probably wouldn't be too bad to add R bindings to cpp/src/arrow/adapters/orc, and it would probably look a lot like the parquet bindings. One bit of complexity is that you'd need to make it conditionally available based on whether Arrow C++ was built with ORC support enabled. AFAICT ORC is not available on Windows.
The bigger challenge (or at least effort required) would probably be the C++ dependency building, if we wanted this to be a feature generally available to all (non-Windows) users.
Dyfan Jones: To my knowledge, R doesn't have any maintained packages that create orc files without the help of spark. There is one package (https://github.com/vertica/r-dataconnector) that writes R data.frames to orc; however, it doesn't appear to be actively maintained.
To give a bit of context around this feature request: I am currently developing two R packages (https://github.com/DyfanJones/RAthena and https://github.com/DyfanJones/noctua) that connect to AWS Athena. arrow is used to upload parquet files to AWS S3, which are then registered with AWS Athena. I would like to do the same with orc files without having to spin up a local spark cluster.
Neal Richardson / @nealrichardson: Sounds like a reasonable objective.
Since your packages are already using reticulate, I checked to see if pyarrow wheels are built with ORC support, and it appears they are not currently due to an issue with protobuf: https://github.com/apache/arrow/pull/5627
So, yeah, sadly the biggest challenge would be in C++ build support.
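For anyone who wants to repeat that check on their own machine, here is a minimal sketch. It assumes that a pyarrow build without ORC support fails at import time for the `pyarrow.orc` module; the helper name `has_orc_support` is made up for illustration and is not part of pyarrow.

```python
import importlib


def has_orc_support() -> bool:
    """Return True if the installed pyarrow can import its ORC module.

    Hypothetical helper: it returns False both when pyarrow is missing
    and when pyarrow is installed but was built without ORC.
    """
    try:
        importlib.import_module("pyarrow.orc")
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("pyarrow ORC support:", has_orc_support())
```

The same idea (probe for the optional feature, then enable or skip the binding) is roughly what "conditionally available" would mean for an R binding, though the R-side mechanism would differ.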
What is the advantage of using ORC over parquet in Athena?
Dyfan Jones: Thanks for checking that out for me.
AWS actually doesn't recommend one over the other; it just says to use orc or parquet (https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/):
"For Athena, we recommend using either Apache Parquet or Apache ORC, which compress data by default and are splittable."
For the time being I am happy to push users to parquet as this is fully supported in the current version of arrow. :)
Neal Richardson / @nealrichardson: Sounds good. For the future, we can leave this open and see if more ORC advocates turn up.
Ian Alexander Joiner / @iajoiner:
@nealrichardson [~larefly]
We already have an ORC reader and an ORC writer in C++. Right now what we need is really just an R binding for both.
Neal Richardson / @nealrichardson: It's not just that, we also would need to add orc and protobuf to the R package builds. Not necessarily a reason not to do it, but it does factor into the cost/benefit equation.
Ian Cook / @ianmcook: ARROW-7906 added ORC write functionality to the Python bindings, and I believe it also improved ORC support in the C++ library.
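As a rough illustration of what the Python side looks like, the sketch below writes a small table to ORC and reads it back. It assumes a pyarrow build with ORC enabled and a version that provides `pyarrow.orc.write_table`; the function name `orc_roundtrip` and the file name are invented for the example, and the function returns `None` when ORC support is unavailable.

```python
import os
import tempfile


def orc_roundtrip(columns):
    """Write `columns` (a dict of lists) to an ORC file and read it back.

    Hypothetical helper: returns a dict of lists, or None when pyarrow
    or its ORC writer is not available in this environment.
    """
    try:
        import pyarrow as pa
        from pyarrow import orc
    except ImportError:
        return None
    if not hasattr(orc, "write_table"):
        # Older pyarrow releases could read ORC but not write it.
        return None

    table = pa.table(columns)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.orc")
        # Keyword arguments avoid depending on the positional order.
        orc.write_table(table=table, where=path)
        return orc.ORCFile(path).read().to_pydict()


if __name__ == "__main__":
    print(orc_roundtrip({"id": [1, 2, 3], "name": ["a", "b", "c"]}))
```

An R binding would presumably expose something similar on top of the same C++ adapter, but that interface is not defined in this thread.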
Ian Alexander Joiner / @iajoiner: @nealrichardson Protobuf does have an R interface. As for ORC it looks like there is some work to do.
Ian Alexander Joiner / @iajoiner: @ianmcook Are you actually going to do this one? If not I will.
Arghya Saha: Hi @iajoiner , any update on ORC support for arrow in R?
Ian Alexander Joiner / @iajoiner:
[~arghya18]
Sorry I have been busy with work, personal life and two other projects. I will do this one after getting the options and documentation for C++ and Python done.
Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
Currently the R package can read/write arrow, feather, parquet, etc. How feasible is it for the orc file format to be supported with read/write capabilities?
Reporter: Dyfan Jones
Note: This issue was originally created as ARROW-8056. Please see the migration documentation for further details.