eitsupi opened this issue 11 months ago
I think that is a good idea, although I might need a more specific example to comment more concretely. Today you might have to use `as_nanoarrow_array()` as a backup. It may not be quite the same, but in the geos, s2, and geoarrow packages I fall back on `wk_handle()`'s generic (which avoids most of those packages having to know about any of the others). I'd like to move those to start using `as_nanoarrow_array_stream()` in the way you described.
In general, the pattern I envisioned for interchange is:

- an `as_nanoarrow_array()` S3 method on the class being exported (e.g., `arrow::Array` has a method for `as_nanoarrow_array()`), and
- a call to `as_nanoarrow_array()` in the consuming function to sanitize the input (see the sketch below).

The `as_nanoarrow_array_stream()` version is slightly more generic (e.g., it can accommodate streaming results and/or chunked arrays), but isn't quite where it should be (e.g., if you call `as_nanoarrow_array_stream()` on an `arrow::ChunkedArray` today, I think you would get an error, even though it should work).
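A minimal sketch of that two-sided pattern, assuming a hypothetical classed vector `"quux"`; only the two generics are nanoarrow's real API, the rest is illustrative:

```r
library(nanoarrow)

# Producer side: the package that owns the "quux" class registers an S3
# method so its objects can be exported as Arrow data.
as_nanoarrow_array.quux <- function(x, ..., schema = NULL) {
  # A real method would hand over existing Arrow buffers; converting
  # the underlying vector is the copying fallback shown here.
  as_nanoarrow_array(unclass(x), schema = schema)
}

# Consumer side: a function that needs Arrow data sanitizes its input
# via the generic and never has to know about "quux" specifically.
consume_arrow <- function(x) {
  array <- as_nanoarrow_array(x)
  array$length
}

consume_arrow(structure(c(1, 2, 3), class = "quux"))
#> [1] 3
```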
Thanks for your reply.

What I had in mind was something like the `as_polars_df()` function that I recently added to the polars package. Functions that deal with things like data.frame (e.g. `tibble::as_tibble()`, `arrow::as_arrow_table()`, etc.) naturally execute `as.data.frame()` inside the default method, but I was wondering if it would be cheaper to do `as_nanoarrow_*` instead when Arrow data is behind these.
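For concreteness, the pattern being described might be sketched like this (illustrative only; this is not the actual source of any of these packages):

```r
# Today's common fallback: the default method round-trips through a
# plain data.frame, which always materializes a full copy.
as_polars_df.default <- function(x, ...) {
  as_polars_df(as.data.frame(x), ...)
}
# The question above: when x is Arrow-backed, replacing as.data.frame(x)
# with an as_nanoarrow_*() conversion could skip that copy.
```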
I didn't know about `wk_handle()`.

I think it would be great if we could do something similar with data.frame.
I think `as_polars_df()` would probably be the perfect candidate for using `as_nanoarrow_array_stream()` in the default method! If for some reason the default method is slow for some object type, you (or the owner of the S3 class) can add a dedicated S3 method.
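That escape hatch might look something like this sketch (the class `"slowclass"` and its export helper are hypothetical):

```r
# A class owner can bypass the stream-based default with a dedicated
# method when they know a cheaper conversion exists for their class.
as_polars_df.slowclass <- function(x, ...) {
  # hypothetical zero-copy export specific to this class
  slowclass_export_to_polars(x, ...)
}
```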
I am wondering if it would be possible to recommend that a class be given an attribute indicating that it can perform the conversion to `nanoarrow_array_stream` at a low cost. In other words, in Python we can do the conversion via the C Stream interface for objects that implement the interchange protocol, but in R we currently have no way to know in advance whether `as_nanoarrow_array_stream()` will be cheap. So I am not sure that always forcing a conversion with `as_nanoarrow_array_stream()` in a default S3 method, for example, is a good idea.
It's a great point that, given an arbitrary `nanoarrow_array_stream`, there's no way to know how expensive it will be to consume it (or whether it supports being consumed from another thread, or maybe other things). I think this is true in Python, too (although you're right that you can check for the `__arrow_c_stream__` attribute, whereas in R it's a bit more awkward to wire up `hasS3Method()`, and I forget if that worked the last time I tried it).
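That method lookup could be approximated like this (a sketch using `utils::getS3method()`; it only inspects the object's first class and deliberately ignores the default method):

```r
# TRUE if a class-specific as_nanoarrow_array_stream() method is
# registered for x's first class; the existence of a default method is
# not evidence that the conversion will be cheap.
has_stream_method <- function(x) {
  !is.null(utils::getS3method(
    "as_nanoarrow_array_stream",
    class(x)[[1]],
    optional = TRUE
  ))
}
```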
I am not sure this can be added to the `nanoarrow_array_stream` itself... the object itself is a sort of "safe home" for the underlying stream, and there is quite a lot of nanoarrow/R code that moves the C structures from one home to another. Ensuring that the attribute stayed up to date would be tricky (but possible if this is important).
Another thing that could be done is to add an argument to `as_nanoarrow_array_stream()` such that one could do `as_nanoarrow_array_stream(something, only_consume_if_this_will_be_fast = TRUE)` (obviously with a more compact name). I'm not sure exactly how that would be implemented everywhere, though (often the database, or Acero, or the object that is being exported doesn't have a way to query this, either).

...or maybe other ideas?
I think a user has some context when typing these things, though: if a user types `some_arrow_dplyr_query |> as_polars_df()`, I am not sure they will be surprised that it takes a while if they just typed a big query (you might be able to compensate for that in `as_polars_df()` by checking for user interrupts when consuming the stream).
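Consuming the stream in R-level chunks is one way to keep that interruptible, sketched here with the stream's `$get_next()` method (the collector function name is hypothetical):

```r
# Collect a nanoarrow_array_stream chunk by chunk. Because control
# returns to R between $get_next() calls, a user's interrupt (Ctrl-C)
# can be processed between chunks rather than being blocked inside one
# long C call that drains the whole stream.
collect_all_chunks <- function(stream) {
  chunks <- list()
  while (!is.null(chunk <- stream$get_next())) {
    chunks[[length(chunks) + 1L]] <- chunk
  }
  chunks
}
```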
Thanks for your reply. What I imagined was that Apache Arrow's documentation could recommend the use of a particular attribute: objects that can export streams at low cost (such as an arrow Table or a polars DataFrame) would carry that attribute.

I was thinking about this because I was wondering whether I should allow conversion from `data.frame` to `nanoarrow_array_stream` in the following place (it searches the environment for names referenced in a GlareDB query, converts each match to a DataFusion memory table, registers it, and then executes the query):

https://github.com/eitsupi/r-glaredb/blob/142bcc1d3a91229a6cd3fda7711ac3732f34c1bb/src/rust/src/environment.rs#L68-L92

Given that duckdb on Python converts a `polars.LazyFrame` to a `pyarrow.Table` when the name of the LazyFrame instance appears in a query (which can obviously take a long time), it may not be a problem to allow this, though.
nanoarrow looks very promising for converting between classes that use the Arrow format internally. For such applications, I imagine that if we define a default S3 method like the following, we can omit a method definition for each class by defining only the conversion from nanoarrow. Is this a good idea? (Although of course it may need to wait for a later version of nanoarrow.)
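A hedged sketch of the shape such a default method might take (reusing `as_polars_df()` as the example generic; both bodies are illustrative, not existing package code):

```r
# Default method: funnel every unknown class through nanoarrow. Any
# object that can produce an as_nanoarrow_array_stream() is then
# supported without polars knowing about its class.
as_polars_df.default <- function(x, ...) {
  as_polars_df(nanoarrow::as_nanoarrow_array_stream(x), ...)
}

# The single conversion polars itself would need to implement:
as_polars_df.nanoarrow_array_stream <- function(x, ...) {
  # ...construct the polars DataFrame from the C stream...
}
```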