apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] How to pass Arrow objects like Table between C++ and R? #43675

Closed ajinkya-k closed 2 weeks ago

ajinkya-k commented 2 months ago

Describe the usage question you have. Please include as many useful details as possible.

I was curious how I can pass arrow objects from R to C++ (kind of like R vectors via Rcpp::NumericVector). Here's an example of what I am looking for:

Say I have a function sample_post in R that takes in an arrow table and some parameters:

# R code
sample_post <- function(arrow_tbl, some_vec, params) {
    # do some clean up
    post_object <- .some_rcpp_fn(arrow_tbl, some_vec, params)

    # do some processing on the post object
    return(post_object)
}

For a little more concreteness, let's say .some_rcpp_fn is doing group_by summaries in each iteration of a loop. In .some_rcpp_fn, what should be the type of the first argument, the one that corresponds to arrow_tbl?

NOTE: I would prefer using Rcpp but not tied to it. I am okay using something else.

Component(s)

R

assignUser commented 2 months ago

Do you want to operate on the arrow table with libarrow or some custom C++ code (potentially using other libraries)?

ajinkya-k commented 2 months ago

I want to use custom C++ code that will need other libraries

assignUser commented 2 months ago

Sorry for the late reply. I am not sure that's possible with cpp11 (which the arrow package uses), but that's not my speciality. I found this related issue: https://github.com/apache/arrow/issues/36274

ajinkya-k commented 1 month ago

So is this an inherent limitation of cpp11?

assignUser commented 1 month ago

I don't really know, sorry. Maybe @jonkeane or @paleolimbot can chime in?

amoeba commented 1 month ago

It seems like this should be possible @ajinkya-k, see https://github.com/apache/arrow/issues/36274#issuecomment-1607431346 and let us know if you think you could adapt that code to your use case. Also note the caveats in that thread.

ajinkya-k commented 1 month ago

Hi @amoeba, thanks for sharing the thread. As is clear in the thread, there is no guarantee of stability, which means I cannot roll it up into a package. I was hoping there would be a more stable and permanent way to do this. If not, it might be worth putting in a feature request.

I think being able to access the exact same Arrow object from both R and C++ would be very important to enable more scalable Bayesian analyses that have to rely on C++ code out of necessity. In some of the applications that I am thinking of, summary statistics of specific subsets of the data have to be computed in C++. This can be achieved very efficiently using filter and group_by + summarize in C++. But in every iteration of the MCMC loop, the subset of units to be filtered on or grouped will differ. This is why the Arrow object must be available in C++.
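
For illustration, the per-iteration computation described above (keep a subset of rows, then summarise by group) can be sketched in plain standard-library C++. This is not Arrow's compute API: `GroupStats` and `FilteredGroupStats` are hypothetical names, and the columns are assumed to already be available as `std::vector`s rather than Arrow arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Running sum and row count for one group (hypothetical helper type).
struct GroupStats {
  double sum = 0.0;
  std::int64_t count = 0;
  double mean() const {
    return count == 0 ? 0.0 : sum / static_cast<double>(count);
  }
};

// Hypothetical helper: per-group statistics of `value`, restricted to the
// rows where keep[i] is true. In an MCMC loop, `keep` (or `group`) would
// change on every iteration while `value` stays fixed.
std::unordered_map<int, GroupStats> FilteredGroupStats(
    const std::vector<int>& group, const std::vector<double>& value,
    const std::vector<bool>& keep) {
  // All three columns must describe the same rows.
  assert(group.size() == value.size() && group.size() == keep.size());

  std::unordered_map<int, GroupStats> stats;
  for (std::size_t i = 0; i < value.size(); ++i) {
    if (!keep[i]) continue;           // filter step
    GroupStats& s = stats[group[i]];  // group_by step
    s.sum += value[i];                // summarise step
    s.count += 1;
  }
  return stats;
}
```

Because only `keep` and `group` change between iterations, the data itself never needs to be copied back to R inside the loop, which is the reason for wanting the table on the C++ side.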

amoeba commented 1 month ago

The examples given in https://github.com/apache/arrow/issues/36274 should be stable because they use the Arrow C Data Interface, with the help of the nanoarrow package, to pass the arrow::Table between C++ and R. My interpretation of @paleolimbot's comment was that it's specifically passing pointers to arrow::Tables that's not considered stable. But going through the C Data Interface is stable and is even the Arrow project's recommended way of doing this kind of thing.
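
As a minimal sketch of what "going through the C Data Interface" means on the C++ side: data crosses the boundary as two plain C structs, `ArrowSchema` and `ArrowArray`, and the receiver must call their `release` callbacks when done. The struct layouts below follow the published Arrow C Data Interface specification, which permits copying them instead of linking against libarrow. Everything else is an assumption made for illustration: `Int32Holder`, `ExportInt32`, and `SumInt32AndRelease` are hypothetical names, the producer here is written in C++ (in a real package, arrow or nanoarrow would fill in the structs on the R side), and only a single non-nullable int32 column is handled. How the struct pointers actually reach C++ from R is not shown; see the linked issue for that part.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Struct layouts as published in the Arrow C Data Interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  std::int64_t flags;
  std::int64_t n_children;
  ArrowSchema** children;
  ArrowSchema* dictionary;
  void (*release)(ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  std::int64_t length;
  std::int64_t null_count;
  std::int64_t offset;
  std::int64_t n_buffers;
  std::int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// Hypothetical producer-side state: owns the values and the buffer table.
struct Int32Holder {
  std::vector<std::int32_t> values;
  const void* buffers[2];
};

// Release callbacks must mark the struct as released by nulling `release`.
static void ReleaseSchema(ArrowSchema* schema) { schema->release = nullptr; }

static void ReleaseArray(ArrowArray* array) {
  delete static_cast<Int32Holder*>(array->private_data);
  array->release = nullptr;
}

// Hypothetical producer: exports a copy of `values` as a non-nullable int32 array.
void ExportInt32(const std::vector<std::int32_t>& values, ArrowSchema* schema,
                 ArrowArray* array) {
  *schema = ArrowSchema{};  // zero-initialise every field
  schema->format = "i";     // "i" is the format string for int32
  schema->name = "x";
  schema->release = &ReleaseSchema;

  auto* holder = new Int32Holder();
  holder->values = values;
  holder->buffers[0] = nullptr;                // no validity bitmap (no nulls)
  holder->buffers[1] = holder->values.data();  // data buffer

  *array = ArrowArray{};
  array->length = static_cast<std::int64_t>(holder->values.size());
  array->null_count = 0;
  array->n_buffers = 2;
  array->buffers = holder->buffers;
  array->release = &ReleaseArray;
  array->private_data = holder;
}

// Hypothetical consumer: sums an int32 array, then releases both structs.
std::int64_t SumInt32AndRelease(ArrowSchema* schema, ArrowArray* array) {
  assert(schema->release != nullptr && array->release != nullptr);
  const bool supported = std::strcmp(schema->format, "i") == 0 &&
                         array->null_count == 0 && array->n_buffers == 2;
  std::int64_t total = 0;
  if (supported) {
    const auto* data = static_cast<const std::int32_t*>(array->buffers[1]);
    for (std::int64_t i = 0; i < array->length; ++i) {
      total += data[array->offset + i];
    }
  }
  // The consumer owns the structs and must release them exactly once.
  array->release(array);
  schema->release(schema);
  if (!supported) throw std::invalid_argument("expected a non-null int32 array");
  return total;
}
```

Calling `ExportInt32` on a vector and passing the two structs to `SumInt32AndRelease` exercises the whole round trip. A full table would typically travel as a struct-typed array or as a stream of record batches, but the ownership rule is the same: whoever receives the structs calls `release` once.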

ajinkya-k commented 1 month ago

Thanks! I will give it a try

amoeba commented 2 weeks ago

Hi @ajinkya-k, I'm going to close this for now but please feel free to re-open and/or comment here. I'm curious if you were able to get something to work.