apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

[R] nanoarrow as an interchange class? #331

Open eitsupi opened 11 months ago

eitsupi commented 11 months ago

nanoarrow looks very promising for converting between classes that use the Arrow format internally.

For such applications, I imagine that if we define a default S3 method like the following, we can omit per-class method definitions and only define the conversion from nanoarrow:

as_foo.default <- function(x) {
  # Route any object through the Arrow C Stream interface: anything that
  # can become a nanoarrow_array_stream gets as_foo() support for free
  as_foo.nanoarrow_array_stream(as_nanoarrow_array_stream(x))
}
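
For example, since data.frame already converts to a nanoarrow_array_stream, the default method above would cover it with no extra code (a sketch, assuming the hypothetical as_foo() generic and its nanoarrow_array_stream method exist):

library(nanoarrow)

df <- data.frame(x = 1:3)
as_foo(df)  # dispatches to as_foo.default(), which routes through nanoarrow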

Is this a good idea? (Of course, it may need to wait for a later version of nanoarrow.)

paleolimbot commented 11 months ago

I think that is a good idea, although I might need a more specific example to comment more concretely. Today you might have to use as_nanoarrow_array() as a backup. It may not be quite the same, but in the geos, s2, and geoarrow packages I fall back on wk_handle()'s generic (which avoids most of those packages having to know about any of the others). I'd like to move those to start using as_nanoarrow_array_stream() in the way you described.
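
A sketch of that backup idea (hypothetical as_foo() consumer; nanoarrow's actual default behavior may already do something similar):

as_foo.default <- function(x, ...) {
  stream <- tryCatch(
    nanoarrow::as_nanoarrow_array_stream(x),
    error = function(e) {
      # Back up on as_nanoarrow_array() and wrap the result
      # as a single-batch stream
      nanoarrow::basic_array_stream(list(nanoarrow::as_nanoarrow_array(x)))
    }
  )
  as_foo.nanoarrow_array_stream(stream, ...)
}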

In general, the pattern I envisioned for interchange is to convert through as_nanoarrow_array() or as_nanoarrow_array_stream() and dispatch on the result.
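
Roughly, with a hypothetical as_foo() consumer:

as_foo(nanoarrow::as_nanoarrow_array(x))         # array version: one in-memory array
as_foo(nanoarrow::as_nanoarrow_array_stream(x))  # stream version: batches pulled lazily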

The as_nanoarrow_array_stream() version is slightly more generic (e.g., it can accommodate streaming results and/or chunked arrays), but isn't quite where it should be (e.g., if you call as_nanoarrow_array_stream() on an arrow::ChunkedArray today, I think you would get an error, even though it should work).

eitsupi commented 11 months ago

Thanks for your reply. What I had in mind was something like the as_polars_df() function that I recently added to the polars package. Functions that deal with things like data.frame (e.g. tibble::as_tibble(), arrow::as_arrow_table(), etc.) naturally call as.data.frame() inside the default method, but I was wondering if it would be cheaper to use as_nanoarrow_*() instead when Arrow is behind these classes.

I didn't know about wk_handle(). I think it would be great if we could do something similar with data.frame.

paleolimbot commented 11 months ago

I think as_polars_df() would probably be the perfect candidate for using as_nanoarrow_array_stream() in the default method! If for some reason the default method is slow for some object type, you (or the owner of the S3 class) can add a dedicated S3 method.
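
A minimal sketch of what that default method might look like (hypothetical; not necessarily polars' actual implementation):

as_polars_df.default <- function(x, ...) {
  # Any object that can export the Arrow C Stream interface is covered here;
  # classes for which this is slow can still register a dedicated method
  as_polars_df(nanoarrow::as_nanoarrow_array_stream(x), ...)
}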

eitsupi commented 5 months ago

I am wondering if it would be possible to recommend that a class carry an attribute indicating that it can be converted to a nanoarrow_array_stream at low cost. In Python, we can do the conversion through the C Stream interface for objects that implement the interchange protocol, but in R we don't know in advance whether as_nanoarrow_array_stream() will be cheap. So I am not sure that always forcing a conversion via as_nanoarrow_array_stream() in a default S3 method, for example, is a good idea.
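
A sketch of what such a marker could look like (entirely hypothetical; no such convention exists today, and the attribute name is made up):

# Producers that can export a stream cheaply would set the attribute...
mark_cheap_stream <- function(x) {
  attr(x, "nanoarrow_cheap_stream") <- TRUE
  x
}

# ...and consumers would check it before forcing a conversion
exports_stream_cheaply <- function(x) {
  isTRUE(attr(x, "nanoarrow_cheap_stream"))
}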

paleolimbot commented 5 months ago

It's a great point that, given an arbitrary nanoarrow_array_stream, there's no way to know how expensive it will be to consume it (or whether it supports being consumed from another thread, or maybe other things). I think this is true in Python, too (although you're right that you can check for the __arrow_c_stream__ attribute, whereas in R it's a bit more awkward to wire up hasS3Method() and I forget whether that worked the last time I tried it).
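
One way to approximate that check in R (a sketch using utils::getS3method(); it only finds methods visible from the calling environment, so it is not bulletproof):

has_stream_method <- function(x) {
  # TRUE if any class in x's class vector registers a dedicated
  # as_nanoarrow_array_stream() method
  any(vapply(class(x), function(cls) {
    !is.null(utils::getS3method("as_nanoarrow_array_stream", cls, optional = TRUE))
  }, logical(1)))
}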

I am not sure this can be added to the nanoarrow_array_stream itself... the object is a sort of "safe home" for the underlying stream, and there is quite a lot of nanoarrow/R code that moves the C structures from one home to another. Ensuring that the attribute stayed up to date would be tricky (but possible if this is important).

Another thing that could be done is to add an argument to as_nanoarrow_array_stream() so that one could write as_nanoarrow_array_stream(something, only_consume_if_this_will_be_fast = TRUE) (obviously with a more compact name). I'm not sure exactly how that would be implemented everywhere, though... often the database, or Acero, or whatever object is being exported doesn't have a way to query this, either.

...or maybe other ideas?

I think a user has some context when typing these things, though: if a user types some_arrow_dplyr_query |> as_polars_df(), I am not sure they will be surprised that it takes a while when they have just typed a big query (you might be able to compensate for that in as_polars_df() by checking for user interrupts while consuming the stream).
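
A sketch of consuming a stream batch-by-batch so that interrupts can be processed between batches (stream$get_next() is the nanoarrow stream method, which returns NULL when the stream is exhausted):

collect_batches <- function(stream) {
  batches <- list()
  # Returning to the R interpreter between batches gives R a chance
  # to notice a user interrupt (Ctrl+C) during a long conversion
  while (!is.null(batch <- stream$get_next())) {
    batches[[length(batches) + 1L]] <- batch
  }
  batches
}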

eitsupi commented 5 months ago

Thanks for your reply. I imagined that Apache Arrow's documentation would recommend the use of a certain attribute; that is, objects that can export streams at low cost (such as an arrow Table or a polars DataFrame) would carry it.

I was thinking about this because I was wondering whether I should allow conversion from data.frame to nanoarrow_array_stream in the following place. (The code searches the environment for names referenced in a GlareDB query, converts each match to a DataFusion in-memory table, registers it, and then executes the query.) https://github.com/eitsupi/r-glaredb/blob/142bcc1d3a91229a6cd3fda7711ac3732f34c1bb/src/rust/src/environment.rs#L68-L92

Given that duckdb's Python package converts a polars.LazyFrame to a pyarrow.Table when the name of the LazyFrame instance appears in the query (which can obviously take a long time), it may not be a problem to allow this, though.